7 Signs Your IT Operations Need Better Monitoring

Authored by Danica Esteban
  

How organizations monitor IT operations has evolved since the early years of computing—from having small IT teams manage everything to splitting responsibilities across different technical teams. If you have been in the industry since the 1970s, you might have heard about SAP Basis being able to deliver a full range of services, including all technical aspects of managing SAP, databases, operating systems, servers, storage, and even network and security. After all, it was designed to work as a bridge between operating systems and various business applications. This is why Basis teams are equipped with skills to design, build, and manage all things related to technical SAP, as well as monitor the performance of its underlying infrastructure.

For large enterprises, the days of Basis teams running all of operations may be long gone, replaced by segregation of duties and specialization driven by increased complexity and compliance requirements. Still, we sometimes wonder whether those were the good old days, when monitoring IT operations was more efficient.

Today, monitoring IT operations has branched out to different teams:

  • Some organizations have one centralized monitoring system handled by the Network Operations Center (NOC), which shares responsibilities with the IT Helpdesk as Level 1 support and provides incident management to end-users.

  • Other organizations transfer accountability for application performance to the application owners, leaving the NOC to monitor the network and the IT Helpdesk to focus on low-impact end-user issues.

  • Now, with hybrid environments that include outsourcing and managed service providers (MSPs), some organizations hand monitoring off to other companies.

While this scenario seems ideal for large enterprises, there may still be some consequences of having several teams monitor IT operations separately.

Is Your IT Operations Operational?

IT Service Management (ITSM) specialists know that people, processes, and products are all needed to make IT operations operational. It is like a three-legged stool: all three legs must be present for it to function.

Having been on both sides of Basis and operations, we have identified seven signs that your IT operations may not be operational, or at least not fully efficient, along with why we think so and how to fix them.

The following are listed in no particular order of importance.

1. The monitoring plan is unclear about WHAT services are monitored, HOW they are monitored, and how responses to exceptions are documented.

Monitoring IT operations is all about managing the availability and performance of both the hardware and the software components in the entire IT infrastructure. Without a clear monitoring plan, it would be difficult for technical teams to discern what they need to monitor, how they would be notified when an issue occurs, how they would address those issues, and if those issues even deserve their attention. Especially when managing SAP system landscapes, you can easily fall into the trap of paying too much attention to the surge of alerts rather than focusing on issues that must be prioritized.

How to address this?

Organizations must have a configuration management database (CMDB) or a standard source of all the hardware and software components in the entire IT infrastructure. This database should preferably be integrated into an ITSM (IT Service Management) platform or at least updated daily. This would make it easier for technical teams to check whether they manage a particular component.

It is also necessary to clearly define how these components will be monitored and their respective service level objectives (SLOs). Do you need to monitor them 24/7 or only during business hours? For CPU and memory utilization, would you prefer getting notifications when utilization is nearly at its highest value or being warned only when it reaches a certain threshold? Those are just a few examples of what you need to consider when monitoring IT operations.

Lastly, technical teams must provide the requirements for monitoring critical services. In monitoring SAP systems, metrics and key performance indicators (KPIs) must be clearly defined. The list of critical services to monitor, along with their monitoring frequencies and thresholds, should align with these metrics and KPIs so that monitoring is both effective and efficient. Whatever the capabilities of your application monitoring tool, make sure to weigh all of these factors so that you can achieve better monitoring.
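To make the idea of a documented monitoring plan concrete, here is a minimal sketch in Python. It is not taken from any particular monitoring tool: the `MonitoringRule` class, the component names, and the threshold values are all hypothetical, chosen only to show how the WHAT (component and metric), the HOW (thresholds), and the WHEN (24/7 or business hours) could be written down in one place.

```python
from dataclasses import dataclass
from datetime import datetime, time
from typing import Optional, Tuple


@dataclass(frozen=True)
class MonitoringRule:
    """One line of a monitoring plan (hypothetical structure)."""
    component: str                              # CI name, e.g. as listed in the CMDB
    metric: str                                 # what is measured
    warning: float                              # warn at or above this value
    critical: float                             # critical at or above this value
    hours: Optional[Tuple[time, time]] = None   # None means monitored 24/7
    weekdays_only: bool = False

    def is_active(self, when: datetime) -> bool:
        """Return True if this rule applies at the given moment."""
        if self.weekdays_only and when.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
            return False
        if self.hours is None:
            return True
        start, end = self.hours
        return start <= when.time() < end

    def evaluate(self, value: float, when: datetime) -> str:
        """Classify a reading as not_monitored, ok, warning, or critical."""
        if not self.is_active(when):
            return "not_monitored"
        if value >= self.critical:
            return "critical"
        if value >= self.warning:
            return "warning"
        return "ok"


# Assumed example plan: production is watched around the clock,
# development only during weekday business hours.
PLAN = [
    MonitoringRule("sap-prd-app01", "cpu_percent", warning=80.0, critical=95.0),
    MonitoringRule("sap-dev-app01", "cpu_percent", warning=90.0, critical=98.0,
                   hours=(time(8, 0), time(18, 0)), weekdays_only=True),
]
```

With a plan like this, a 96% CPU reading on the production host at 3 a.m. on a Saturday is still critical, while the same reading on the development host at that hour is simply outside the agreed monitoring window.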

2. Your monitoring tool may lack this important availability monitoring feature—maintenance mode.

Monitoring availability, as mentioned above, is a top priority in monitoring IT operations. That said, monitoring tools should be able to notify you when a device is down. As long as the monitoring tool receives the heartbeat, expect to see a bunch of greens on your monitoring screen. However, when various teams perform maintenance activities and maintenance mode is not turned on for the configuration items (CIs) in scope, or there is no option to turn it on at all, the operations team will most likely struggle to distinguish false alerts from genuine ones as devices and services are rebooted or restarted.

How to address this?

This is where the ability of the monitoring tool to tag devices in maintenance mode serves its purpose. Putting devices and/or services in maintenance mode during maintenance activities or emergency changes saves you from having to work out which alerts are genuine and which are false. In turn, you’d be more confident that the monitoring tool will notify you when unplanned downtime occurs, provided that predetermined criteria such as percentage and time window are configured properly.
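The logic behind maintenance mode can be sketched in a few lines of Python. This is an illustration only, assuming a simple list of planned windows per CI; `MaintenanceWindow`, `in_maintenance`, and `should_notify` are hypothetical names, not the API of any real monitoring product.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned period during which a CI is expected to go down."""
    ci: str
    start: datetime
    end: datetime


def in_maintenance(ci: str, when: datetime,
                   windows: Iterable[MaintenanceWindow]) -> bool:
    """Return True if the CI has a maintenance window covering this moment."""
    return any(w.ci == ci and w.start <= when < w.end for w in windows)


def should_notify(ci: str, when: datetime,
                  windows: Iterable[MaintenanceWindow]) -> bool:
    """A 'down' event is worth a notification only outside maintenance."""
    return not in_maintenance(ci, when, windows)
```

For example, if `app01` has a window from Saturday 22:00 to Sunday 02:00, a missed heartbeat at 23:00 is suppressed, while the same event at 02:00 or later, or on any other CI, still reaches operations.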

3. You are complaining that you are receiving too many alerts.

When operations get flooded with alerts and don’t know why, they tend to ignore them. This is typical behavior; unfortunately, this does not do the business any good. What if the alerts ignored do present an underlying issue that would cost the business some monetary loss? Will the blame be targeted at the monitoring tool? Of course not. It is always the people who get reprimanded when problems like these happen. Then, when faced with that distressing situation, the management will either propose to look for a better monitoring tool, giving you some major work to do that you most probably wouldn’t like, or worse, propose to change the dynamics of the teams, which could lead to unexpected circumstances. We are sure you do not want that to happen either.

How to address this?

Having the proper monitoring tool would save you from this headache. It would be a great relief to have a monitoring tool that filters what goes to operations for monitoring and what can be ignored. The ability to configure how often an event must occur before it raises an alert is also beneficial: instead of being notified every time the problem occurs, you are notified only once it points to a potential problem, say after the event has happened more than N times. Lastly, the monitoring tool should be able to remove duplicates of repetitive events while still alerting on the main problem. Think of it as having one parent ticket and relating all the recurring events to it as child tickets.
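The occurrence threshold and the parent/child idea described above can be sketched as follows. This is a simplified illustration under our own assumptions: events are identified by a (source, check) pair, the threshold is a plain count with no time window, and the names `AlertReducer`, `ParentAlert`, and `min_occurrences` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, str]  # (source, check), e.g. ("app01", "disk_full")


@dataclass
class ParentAlert:
    """The single alert operations sees; repeats are attached as children."""
    key: Key
    children: List[str] = field(default_factory=list)


class AlertReducer:
    """Alert once an event has occurred min_occurrences times, then deduplicate."""

    def __init__(self, min_occurrences: int = 3) -> None:
        self.min_occurrences = min_occurrences
        self._pending: Dict[Key, List[str]] = defaultdict(list)
        self._parents: Dict[Key, ParentAlert] = {}

    def ingest(self, source: str, check: str, event_id: str) -> Optional[ParentAlert]:
        """Return a new ParentAlert when the threshold is first reached, else None."""
        key = (source, check)
        parent = self._parents.get(key)
        if parent is not None:
            # Already alerted: record the repeat as a child, stay quiet.
            parent.children.append(event_id)
            return None
        self._pending[key].append(event_id)
        if len(self._pending[key]) >= self.min_occurrences:
            # Threshold reached: raise one parent carrying all events so far.
            parent = ParentAlert(key, self._pending.pop(key))
            self._parents[key] = parent
            return parent
        return None
```

With `min_occurrences=3`, the first two `disk_full` events on a host produce nothing, the third raises one parent alert, and every later repeat is quietly linked to that parent instead of creating a new notification. A production version would also need a time window and a way to close parents once the problem is resolved.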

4. Trouble tickets are automatically assigned to the technical team without any first-level diagnosis.

This is like the helpdesk passing every user complaint on to the technical team without performing basic troubleshooting or going through a pre-defined checklist to decide whether an issue needs to be escalated. This is a very common scenario in organizations where the first level of support is not equipped with the right skills or lacks the proper process and tools to perform an initial diagnosis.

How to address this?

Operations, or whoever is at the first level of support, should have the right tools to perform basic troubleshooting, such as RCA (Root Cause Analysis). Aside from the tools, it is also of utmost importance that processes for first-level diagnosis be streamlined so that it is easier to distinguish issues that require the domain expertise of the technical teams. The alerts should also be easy to customize to provide more context and details such as when, what, where, and how.

5. Service levels are not being monitored or sufficiently managed.

Service level management (SLM) provides metrics or KPIs on service health and quality objectives. Without SLM, application monitoring would be too noisy.

How to address this?

Tools are essential to automate service-level monitoring, alerting, and reporting. They provide proactive SLM rather than just historical reporting of why service levels were missed. Additionally, automated reporting will enhance the awareness of how well the service complies with established service levels.
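As a rough illustration of what proactive means here, the sketch below turns an availability SLO into a downtime budget and flags a service as at risk before the objective is actually missed. The function name `slo_status`, the `at_risk_fraction` parameter, and the 75% warning point are our own assumptions for the example, not a standard.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloStatus:
    availability: float       # percent over the period, assuming no further downtime
    budget_minutes: float     # downtime the SLO allows for the whole period
    remaining_minutes: float  # budget left (negative once breached)
    state: str                # "ok", "at_risk", or "breached"


def slo_status(slo_percent: float, period_minutes: float,
               downtime_minutes: float,
               at_risk_fraction: float = 0.75) -> SloStatus:
    """Compare downtime so far against the downtime budget implied by the SLO."""
    budget = period_minutes * (1 - slo_percent / 100)
    availability = 100 * (1 - downtime_minutes / period_minutes)
    if downtime_minutes > budget:
        state = "breached"
    elif downtime_minutes >= at_risk_fraction * budget:
        state = "at_risk"  # warn while there is still time to act
    else:
        state = "ok"
    return SloStatus(availability, budget, budget - downtime_minutes, state)
```

For instance, a 99.9% objective over a 30-day month (43,200 minutes) allows roughly 43 minutes of downtime. With 35 minutes already used, `slo_status(99.9, 43200, 35)` reports `at_risk`, which is the moment to act rather than explaining the miss in next month's report.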

6. There is a lack of centralized enterprise monitoring.

Every technical team prefers to have its own set of tools, but it is essential to integrate them for centralized management and correlation. Especially in large enterprises where different applications in varying platforms are being used all throughout the organization, it would be easy to get lost in a sea of alerts. You wouldn’t want that to happen, as we described in item number three.

In addition, the lack of centralized enterprise monitoring would spark trouble among technical teams during root cause analysis (RCA) calls. With the built-up pressure of trying to find where the fault occurred, there may be a chance that major incidents (MIs) would last not just for hours but days or even weeks!

How to address this?

A 360-degree view of services across applications and infrastructure would provide better root cause analysis and top-down service management. With that view available, joining MI calls will cost you less time, effort, and unnecessary worry. It’s a win-win for everyone!

7. There is a separation of responsibilities—US and THEM.

Operations, IT Support, and SMEs are all critical parts of IT Service Management (ITSM). Operational excellence is a shared role and responsibility of all teams. However, with everyone trying to do their jobs, it would be a challenge for one to help another. This creates division among teams when it comes to taking accountability when, in fact, all should be held responsible one way or another because everything is cause and effect.

How to address this?

IT Service Management should incorporate cross-functional DevOps to automate the data flow between traditional silos to share information and intelligence rather than segregate workflows between systems administrators, DBAs, etc. Integrating workflows from one team to another makes it possible for automation to become part of the solution, such as the ability to execute recovery actions as part of the automated response to an alert.
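A recovery action wired into the alert response might look like the following sketch. Everything in it is hypothetical: the alert types, the `RUNBOOK` mapping, and the placeholder functions stand in for whatever orchestration or scripting tool the teams actually share. The point is only that a known alert type triggers a documented action, and anything else is escalated to a person.

```python
from typing import Callable, Dict, List

RecoveryAction = Callable[[str], str]


def restart_service(ci: str) -> str:
    # Placeholder: a real action would call the team's orchestration tool here.
    return f"restart requested for {ci}"


def clear_temp_files(ci: str) -> str:
    # Placeholder: a real action would run an agreed cleanup job on the host.
    return f"temp cleanup requested for {ci}"


# Assumed runbook: alert type -> automated recovery action.
RUNBOOK: Dict[str, RecoveryAction] = {
    "service_down": restart_service,
    "disk_full": clear_temp_files,
}


def respond(alert_type: str, ci: str, audit_log: List[str]) -> str:
    """Run the mapped recovery action if one exists, otherwise escalate."""
    action = RUNBOOK.get(alert_type)
    if action is None:
        audit_log.append(f"{alert_type} on {ci}: escalated to on-call")
        return "escalated"
    audit_log.append(f"{alert_type} on {ci}: {action(ci)}")
    return "recovery_attempted"
```

Keeping the runbook as shared data, and logging every automated response, is one way for operations, DBAs, and systems administrators to see and agree on what automation does on their behalf.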

Innovate, Automate, Deliver

Whether you choose to be philosophical about these seven signs or prefer to have your own, the premise is that IT Operations is a critical part of any enterprise IT organization. Without it, the services delivered to end-users will be poorly managed.

So, where do we go from here?

We can debate where the future of enterprise IT is heading, but one thing is for sure: with the proliferation of business services and applications, and with every kind of organization needing to innovate with software to be efficient, automation of IT Operations will be one of the significant keys to high-quality service delivery.

How is the health of your IT Operations?

Want to learn more about how we can help automate monitoring and service level management?