This quarter, IT-Conductor focused on increasing the ROI of our customers' daily SAP operations by automating the management of critical areas, minimizing service disruptions and maximizing system availability and performance:
- Pacemaker Cluster Resource Management
- Linux Kernel Resource Management
- HANA Memory Management
- HANA Automated Backup
- SAP qRFC Inbound and Outbound Queue Management
- Printer Queue Management
- SAP Jobs Management
- SAP System Performance Reporting
- Daily Recovery Reports
- Service Operations for SAP/DB/OS/VM stop/start
Here is a closer look at these newly added features:
In Q2-2019, ITC introduced Pacemaker HA Cluster Monitoring, which has since been enhanced to detect many types of cluster events and log entries that can be matched to specific resource or engine errors. In Pacemaker, the resource failcount keeps track of errors in operations such as start, stop, and monitor for every resource managed by the cluster. These failcounts are often the result of soft errors such as a timeout during resource startup, which can take longer depending on system size, e.g. HANA startup can take longer on a large system. It is therefore not desirable to trigger resource recovery or failover for an operation that can simply be retried. ITC can monitor these resource failures and trigger auto-recovery to reset the failcount (back to zero) so that the cluster can retry the operation, which often succeeds; otherwise a failover may occur. Meanwhile, an alert and notification can still be raised for the original failure so the underlying cause can be investigated further.
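The failcount handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not ITC's implementation: the `ResourceFailure` and `RecoveryPlan` types, the set of "soft" operations, and the `max_resets` limit are all hypothetical. The sketch only builds a Pacemaker `crm_resource --cleanup` command line and does not execute it.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: operations whose timeouts are treated as retryable soft errors.
SOFT_OPERATIONS = {"start", "monitor"}


@dataclass
class ResourceFailure:
    resource: str    # cluster resource ID (hypothetical example values below)
    operation: str   # e.g. "start", "stop", "monitor"
    failcount: int   # current failcount reported by the cluster
    timed_out: bool  # True if the failure was a timeout


@dataclass
class RecoveryPlan:
    alert: bool                       # always notify when a failure was recorded
    cleanup_cmd: Optional[List[str]]  # command that resets the failcount, if any


def plan_recovery(failure: ResourceFailure,
                  resets_so_far: int = 0,
                  max_resets: int = 3) -> RecoveryPlan:
    """Decide whether to reset the failcount so the cluster can retry."""
    retryable = (
        failure.failcount > 0
        and failure.timed_out
        and failure.operation in SOFT_OPERATIONS
        and resets_so_far < max_resets
    )
    cmd = None
    if retryable:
        # Built but not executed here; running it requires cluster privileges.
        cmd = ["crm_resource", "--cleanup", "--resource", failure.resource]
    return RecoveryPlan(alert=failure.failcount > 0, cleanup_cmd=cmd)
```

Capping the number of resets is a design choice in this sketch: a resource that keeps failing after several retries probably has a real problem, and at that point a failover is the safer outcome.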
In Q2-2019, ITC also introduced Linux Monitoring Enhancements, including kernel usage monitoring of critical resources such as processes, threads, and memory usage. While these are useful for capacity planning and tuning, there is often a need to proactively manage kernel resources to prevent system or workload failures due to bottlenecks. One common example on Linux systems running HANA is a Soft CPU Lockup caused by the unused directory entry cache (dentunusd, as reported by 'sar -v') growing very large, possibly to billions of entries. When the system runs low on available memory, it triggers a reclamation of the dentunusd cache using all available CPU threads, locking up the system, and HANA can be unresponsive for many seconds or even minutes.
ITC can now monitor these resources and, based on a threshold, trigger automatic reclamation via a recovery action, so the system avoids lockup when resource utilization gets too high.
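As a rough illustration of threshold-based reclamation, the sketch below reads the unused dentry count from `/proc/sys/fs/dentry-state` (the second field, which is what `sar -v` reports as dentunusd) and, above a threshold, writes `2` to `/proc/sys/vm/drop_caches` to free reclaimable dentries and inodes. The threshold value, the function names, and the choice of `drop_caches` as the recovery action are assumptions made for this example; the post does not describe how ITC performs the reclamation.

```python
from pathlib import Path

DENTRY_STATE = Path("/proc/sys/fs/dentry-state")
DROP_CACHES = Path("/proc/sys/vm/drop_caches")


def parse_unused_dentries(dentry_state_text: str) -> int:
    """Return nr_unused, the second field of /proc/sys/fs/dentry-state."""
    return int(dentry_state_text.split()[1])


def should_reclaim(nr_unused: int, threshold: int) -> bool:
    """True once the unused dentry count reaches the threshold."""
    return nr_unused >= threshold


def reclaim_dentries() -> None:
    """Ask the kernel to free reclaimable dentries and inodes (needs root)."""
    DROP_CACHES.write_text("2\n")


def check_and_reclaim(threshold: int = 100_000_000, dry_run: bool = True) -> bool:
    """Check the live system; reclaim only when dry_run is False.

    The default threshold is a hypothetical value, not a recommendation.
    """
    nr_unused = parse_unused_dentries(DENTRY_STATE.read_text())
    if not should_reclaim(nr_unused, threshold):
        return False
    if not dry_run:
        reclaim_dentries()
    return True
```

Reclaiming early, while the cache is still moderately sized, keeps each reclamation short; waiting until memory pressure forces it is what produces the long lockups described above.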
With the recent SAP TechEd 2019 Microsoft Azure announcement of a 12 TB VM, it's hard to believe that HANA can still run out of memory and start column store unloads, which impact performance. We know that proactive performance management can help avoid these situations by tracking usage and, where appropriate, managing it within defined thresholds. In HANA, many operations can balloon resident memory usage, such as delta merges, a large volume of row store tables (which by design load on startup), and long-running uncommitted transactions. HANA uses a very complex algorithm to manage resident memory, which can sometimes allow it to grow for a long time without efficient reclamation. In these situations, when the memory actually used by HANA services within the resident memory is low, it is often best to reclaim it via garbage collection without affecting long-running uncommitted transactions. ITC can use one or more memory KPIs to trigger on-demand garbage collection and free up memory so that system utilization reflects actual resource usage, helping overall performance and the right-sizing of VM instances to save cost.
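To make the "one or more memory KPIs" idea concrete, here is a minimal Python sketch of a trigger that fires only when resident memory is high while the memory actually used by HANA services is comparatively low. The `MemoryKpis` type, the field names, and both threshold values are hypothetical; how the KPIs are collected and how garbage collection is invoked are outside the scope of this example.

```python
from dataclasses import dataclass


@dataclass
class MemoryKpis:
    resident_gb: float     # resident memory held by HANA processes
    used_gb: float         # memory actually used by HANA services
    host_memory_gb: float  # physical memory of the host or VM


def should_collect_garbage(kpis: MemoryKpis,
                           resident_pct_threshold: float = 80.0,
                           used_ratio_threshold: float = 0.6) -> bool:
    """Trigger GC when resident memory is high but mostly not in use."""
    resident_pct = 100.0 * kpis.resident_gb / kpis.host_memory_gb
    # Guard against division by zero when nothing is resident.
    used_ratio = kpis.used_gb / kpis.resident_gb if kpis.resident_gb else 1.0
    return (resident_pct >= resident_pct_threshold
            and used_ratio <= used_ratio_threshold)
```

Combining two KPIs avoids pointless collections: if resident memory is high because it is genuinely in use, garbage collection would free little and the system more likely needs resizing.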
ITC can already automate HANA Backup and Cleanup. Now, by monitoring the database backup age, we can use either a script or SQL to back up the database when the time since the last successful backup reaches a desired threshold. This is ideal because the threshold can be set differently for weekdays versus weekends, or for any day of the week, to back up more or less frequently and meet specific database RPO/RTO targets. This centralizes the monitoring and backup of databases, avoids the need for local cron schedules, and moves towards an enterprise solution.
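The backup-age trigger with per-day thresholds might look like the following sketch. The threshold table (12 hours on weekdays, 24 hours on weekends) and the function name are illustrative assumptions, not ITC defaults.

```python
from datetime import datetime, timedelta

# Hypothetical thresholds in hours, keyed by datetime.weekday()
# (Monday = 0 ... Sunday = 6): tighter on weekdays, looser on weekends.
MAX_BACKUP_AGE_HOURS = {0: 12, 1: 12, 2: 12, 3: 12, 4: 12, 5: 24, 6: 24}


def backup_due(last_success: datetime, now: datetime) -> bool:
    """True when the age of the last successful backup reaches today's threshold."""
    threshold = timedelta(hours=MAX_BACKUP_AGE_HOURS[now.weekday()])
    return now - last_success >= threshold
```

Because the trigger is driven by backup age rather than a fixed clock time, a backup that failed or was skipped is retried as soon as the threshold is crossed, which a plain cron schedule would not do.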
Most SAP customers have a complex multi-system landscape where data flows between various systems, such as to and from ERP, CRM, and SCM/APO. They rely on Core SAP Basis Monitoring to deal with one of their biggest pain points: managing the qRFC queues. Specifically, they need to monitor and manage inbound/outbound processing and ensure failed or stuck queues are recovered in a timely manner to avoid business disruptions. Some of these systems may transact thousands of these queue objects daily as part of time-sensitive business processes, such as orders and supply chain jobs. ITC can now detect specific inbound and outbound queue conditions and automatically re-process individual queues that have errors or are stuck. Customers have found that this automation resolves more than 70% of the issues that would otherwise have needed manual intervention.
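Selecting which queues to re-process can be pictured with a small filter like the one below. This is a sketch under assumptions: the `QueueSnapshot` type and the 30-minute "stuck" limit are invented for illustration, and treating the qRFC statuses `SYSFAIL` and `CPICERR` as errors and a long-lived `READY` as stuck is one plausible rule set rather than the conditions ITC actually uses.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumption: qRFC statuses treated as errors worth re-processing.
ERROR_STATES = {"SYSFAIL", "CPICERR"}


@dataclass
class QueueSnapshot:
    name: str               # queue name (hypothetical examples in the usage)
    direction: str          # "inbound" or "outbound"
    status: str             # current queue status
    minutes_in_status: int  # how long the queue has been in that status


def queues_to_reprocess(queues: Iterable[QueueSnapshot],
                        stuck_after_minutes: int = 30) -> List[str]:
    """Return names of queues that failed or have not moved for too long."""
    selected = []
    for q in queues:
        failed = q.status in ERROR_STATES
        stuck = q.status == "READY" and q.minutes_in_status >= stuck_after_minutes
        if failed or stuck:
            selected.append(q.name)
    return selected
```

Working on individual queues rather than restarting all processing keeps healthy queues untouched and makes each recovery action easy to audit.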
Although many customers have moved to online printing from SAP using ADS (Adobe Document Services), there are still many use cases where physical printing from SAP is needed. Some examples involve the manufacturing floor, where printouts are part of the production fulfillment process. When printing fails due to device errors, it can stop production entirely. ITC already monitors Printing and Spool Administration within SAP and the underlying spool subsystem, and we have now introduced remote printer restart, which can resolve most of the print queue issues detected by ITC, except ink and paper (obviously).
Most jobs in SAP that fail eventually require a manual restart. When ITC monitors the SAP Background Processing environment, batch job failures can now trigger a recovery action for specific jobs, which attempts a restart and/or sends the job log as part of the notification. We understand that not all jobs qualify for an auto-restart, such as those that have dependencies or require manual data reconciliation; however, a large number of customer jobs, if designed well, should be safe to restart. In such cases, the ITC recovery action can automatically copy the job to a new job and schedule it to run. Delivering the failed job's log to the customer's inbox expedites RCA and recovery.
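One simple way to express "only specific jobs qualify for auto-restart" is an allow-list of job name patterns, as in this sketch. The pattern list, the job names, and the action labels `send_job_log` and `copy_and_schedule` are hypothetical placeholders for whatever the monitoring configuration actually defines.

```python
from fnmatch import fnmatchcase
from typing import List, Sequence

# Hypothetical allow-list: jobs considered safe to restart automatically.
RESTARTABLE_PATTERNS = ["Z_REPORT_*", "Z_EXTRACT_DAILY"]


def actions_for_failed_job(job_name: str,
                           patterns: Sequence[str] = RESTARTABLE_PATTERNS) -> List[str]:
    """Always send the job log; restart only jobs on the allow-list."""
    actions = ["send_job_log"]
    if any(fnmatchcase(job_name, pattern) for pattern in patterns):
        # Copy the failed job to a new job and schedule it to run.
        actions.append("copy_and_schedule")
    return actions
```

An allow-list is the conservative choice here: a job with dependencies or manual reconciliation steps is never restarted unless someone has explicitly marked it as safe.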
ITC provides dynamic dashboards and charts, but we also recognize that customers want periodic reporting sent to their inbox with an overview of their system's health. ITC supports report templates which can be easily scheduled and sent to a user or distribution list showing performance overview of an entire SAP system and individual application servers, for the last 24 hours (configurable). Charts can include: Dialog utilization, Background utilization, CPU utilization, Response time, Connections, Memory utilization, Users logged in.
For compliance and audit logging, ITC can not only alert customers on specific issues and run automatic recovery, but also send, on a configurable schedule such as daily, a Recovery Report of the actions taken on the customer's behalf. These reports can show when, where, and what recovery action was taken, and how successful it was.
More than 2.5 years ago, ITC introduced SAP on Azure monitoring and operations. Over the last year, ITC has monitored and managed the largest SAP on Azure environment and continued to innovate our SysDevOps capability, especially in operations automation. In virtualized or physical environments, ITC can monitor and connect the relationships between services, applications, databases, hosts, and infrastructure components in order to facilitate service-oriented actions (e.g. stop, start). In a service context, ITC can propagate actions up or down the hierarchy to allow centralized control and scheduled operations for stopping, starting, or snoozing systems for maintenance or cost savings.
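Propagating an action through the hierarchy can be illustrated with a small tree walk: stopping proceeds from the service down to the infrastructure, and starting runs in the reverse order so that each component comes up only after what it depends on. The tree layout and component names below are made up for the example, and the sketch assumes a simple tree (each component has one parent).

```python
from typing import Dict, List

# Hypothetical hierarchy: each component maps to the components it runs on.
LANDSCAPE: Dict[str, List[str]] = {
    "service:ERP": ["app:PRD"],
    "app:PRD": ["db:HDB"],
    "db:HDB": ["vm:hana01"],
}


def stop_order(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Stop top-down: a component stops before the components beneath it."""
    order = [root]
    for child in tree.get(root, []):
        order.extend(stop_order(tree, child))
    return order


def start_order(tree: Dict[str, List[str]], root: str) -> List[str]:
    """Start bottom-up: the exact reverse of the stop order."""
    return list(reversed(stop_order(tree, root)))
```

For example, `stop_order(LANDSCAPE, "service:ERP")` lists the service first and the VM last, while `start_order` brings the VM up first. Starting the walk at a lower node, such as `"db:HDB"`, scopes the action to just that part of the hierarchy.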
Go ahead and give these new features a try in your account, or if you would like to try IT-Conductor