7 Signs Your IT Operations Need Better Monitoring


With contributions from Danica Esteban

How organizations monitor IT operations has evolved since the early years of computing, from having small IT teams manage everything to splitting responsibilities across different technical teams. If you have been in the industry since the 1970s, you might have heard of SAP Basis teams delivering a full range of services covering all technical aspects of managing SAP, databases, operating systems, servers, storage, and even network and security. After all, Basis was designed to work as a bridge between operating systems and various business applications. This is the very reason why Basis teams are equipped with the skills to design, build, and operate all things related to technical SAP, as well as to monitor the performance of its underlying infrastructure.

For large enterprises, those days of all-in-one Basis operations may be long gone: growing complexity and compliance requirements have brought segregation of duties and specialization. Still, we sometimes wonder whether those were the good old days, when monitoring IT operations was more efficient.

Today, monitoring IT operations has branched out to different teams:

  • Some organizations run one centralized monitoring function, handled by a Network Operations Center (NOC) that shares responsibilities with the IT Helpdesk; the Helpdesk acts as Level 1 support and provides incident management to end users.

  • Other organizations transfer accountability for application performance to the application owners, leaving the NOC to monitor only the network and the IT Helpdesk to focus on low-impact end-user issues.

  • With hybrid environments that include outsourcing and managed service providers (MSPs), some organizations hand monitoring off entirely to other companies.

While this may look ideal for large enterprises, having several teams monitor IT operations in a disconnected manner can still have consequences.

Is Your IT Operations Operational?

IT Service Management (ITSM) specialists know that people, processes, and products are all needed to make IT operations operational. It is like a three-legged stool: all three legs must be present for it to function.

Having worked on both the Basis and the operations sides, we have put together seven signs that your IT operations may not be operational, or at least not fully efficient, along with why we think so and how to fix each one.

[Infographic: IT Operations]

The signs below are listed in no particular order of importance.

1. The monitoring plan is unclear about WHAT is monitored and HOW, and about the documented responses to exceptions.

Monitoring IT operations is all about managing the availability and performance of both the hardware and the software components in the entire IT infrastructure. Without a clear monitoring plan, it is difficult for technical teams to discern what they need to monitor, how they will be notified when an issue occurs, how they will address those issues, and whether those issues even deserve their attention. Especially when managing SAP system landscapes, you can easily fall into the trap of paying too much attention to the surge of alerts rather than focusing on the issues that need to be prioritized.

How to address this?

Organizations must have a configuration management database (CMDB), or another standard source of record for all the hardware and software components in the entire IT infrastructure. Preferably it is integrated into an ITSM (IT Service Management) platform; failing that, it should be updated at least daily. This makes it easier for technical teams to check whether or not they manage a particular component.

It is also necessary to clearly define how these components will be monitored, along with their respective service level objectives (SLOs). Do you need to monitor them 24/7 or only during business hours? For CPU and memory utilization, would you prefer to be notified only when utilization approaches its maximum, or to be warned earlier, as soon as it crosses a defined threshold? These are just a few of the questions to consider when monitoring IT operations.
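
Questions like these can be written down as an explicit monitoring plan rather than left in people's heads. The Python sketch below is only an illustration: the `MonitorRule` class, its field names, the 80/95 percent thresholds, and the 08:00 to 18:00 weekday window are all assumptions made for the example, not values taken from any particular monitoring tool.

```python
from dataclasses import dataclass
from datetime import datetime, time


@dataclass(frozen=True)
class MonitorRule:
    """One entry of a hypothetical monitoring plan (all values are examples)."""
    metric: str
    warn_at: float            # utilization (%) that raises a warning
    critical_at: float        # utilization (%) that raises a critical alert
    always_on: bool = True    # True = 24/7, False = business hours only
    start: time = time(8, 0)  # assumed start of business hours
    end: time = time(18, 0)   # assumed end of business hours

    def is_active(self, now: datetime) -> bool:
        """Return True if this rule should be evaluated at `now`."""
        if self.always_on:
            return True
        is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
        return is_weekday and self.start <= now.time() < self.end

    def evaluate(self, value: float, now: datetime) -> str:
        """Classify a reading as ok, warning, critical, or not_monitored."""
        if not self.is_active(now):
            return "not_monitored"
        if value >= self.critical_at:
            return "critical"
        if value >= self.warn_at:
            return "warning"
        return "ok"


# A 24/7 rule and a business-hours-only rule.
cpu_rule = MonitorRule(metric="cpu_percent", warn_at=80.0, critical_at=95.0)
memory_rule = MonitorRule(metric="memory_percent", warn_at=85.0,
                          critical_at=97.0, always_on=False)
```

Keeping an early warning threshold separate from the critical one is what answers the "warned earlier or only near the maximum" question: each team can choose per metric which level actually pages someone.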

Lastly, technical teams need to provide the requirements for monitoring critical services. In the context of monitoring SAP systems, metrics and key performance indicators (KPIs) must be clearly defined. The list of critical services to monitor, the monitoring frequencies, and the thresholds should all be aligned with these metrics and KPIs for monitoring to be effective and efficient. Depending on the capabilities of your application monitoring tool, weigh all of the points above to achieve better monitoring.

2. Your monitoring tool may be lacking this important availability monitoring feature—maintenance mode.

As mentioned above, monitoring availability is a top priority in IT operations, so a monitoring tool should be able to notify you when a device is down. As long as the tool keeps receiving heartbeats, expect to see a screen full of green. However, when various teams perform maintenance and maintenance mode was not turned on for the configuration items (CIs) in scope, or the tool offers no such option, the operations team will most likely struggle to tell false alerts from genuine ones as devices and services are rebooted or restarted.

How to address this?

This is where the monitoring tool's ability to place devices in maintenance mode serves its purpose. Putting devices and services in maintenance mode during maintenance activities or emergency changes saves you the trouble of working out which alerts are genuine and which are false. In turn, you can be more confident that the tool will notify you when unplanned downtime occurs, provided that predetermined criteria such as the percentage and time window are configured properly.
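
To make the idea concrete, here is a minimal Python sketch of how maintenance mode and a "percentage within a time window" rule might combine. Everything in it is hypothetical: the `MaintenanceWindow` class, the `should_alert` function, the one-minute heartbeat interval, the ten-minute lookback, and the 80 percent cut-off are assumptions for illustration, and real tools expose this through their own configuration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class MaintenanceWindow:
    ci_name: str     # configuration item placed in maintenance mode
    start: datetime
    end: datetime

    def covers(self, ci_name: str, at: datetime) -> bool:
        """True if `ci_name` is in maintenance at time `at`."""
        return ci_name == self.ci_name and self.start <= at < self.end


def should_alert(ci_name, heartbeats, windows, now,
                 lookback=timedelta(minutes=10),
                 expected_interval=timedelta(minutes=1),
                 min_received_pct=80.0) -> bool:
    """Decide whether missing heartbeats should raise an availability alert."""
    # Planned work: suppress the alert instead of paging operations.
    if any(w.covers(ci_name, now) for w in windows):
        return False
    # Compare heartbeats received in the lookback window with the number expected.
    expected = lookback / expected_interval
    received = sum(1 for ts in heartbeats if now - lookback < ts <= now)
    return (received / expected) * 100.0 < min_received_pct
```

With the same silent device, `should_alert` returns `False` while a matching maintenance window is open and `True` once it is not, which is the distinction between a planned restart and unplanned downtime.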

3. You are complaining that you are receiving too many alerts.

When operations teams are flooded with alerts and do not know why, they tend to ignore them. This is common behavior and, unfortunately, it does the business no good. What if an ignored alert points to an underlying issue that costs the business money? Will the monitoring tool take the blame? Of course not. It is the people who get reprimanded when problems like these happen. Faced with that situation, management will either propose looking for a better monitoring tool, which means major work that you probably would not welcome, or worse, propose changing the dynamics of the teams, which could have unexpected consequences. We are sure you do not want that to happen either.

How to address this?

Having the right monitoring tool saves you from this headache. It is a relief to have a tool that lets you filter what goes to operations for monitoring and what can be ignored. The ability to configure how often an event must occur before it raises an alert is also valuable: you are notified not every time the event happens, but only once it points to a potential problem, say after it has occurred more than N times. Lastly, the tool should be able to remove duplicates of repetitive events while still alerting on the main problem. Think of it as having one parent ticket, with all the recurring events related to it as child tickets.
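
The three behaviors described here (filtering, alerting only after more than N occurrences, and parent/child de-duplication) can be sketched in a few lines of Python. The `AlertReducer` and `ParentAlert` names and the choice of `(source, check)` as the "same problem" key are assumptions for illustration; a production tool would also expire counts after a time window, which this sketch deliberately leaves out.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ParentAlert:
    key: tuple     # (source, check) identifies "the same problem"
    message: str   # message of the event that triggered the alert
    children: list = field(default_factory=list)  # later repeats attach here


class AlertReducer:
    """Hypothetical event filter: ignore list, occurrence threshold, de-duplication."""

    def __init__(self, n_before_alert: int = 2, ignored_checks=frozenset()):
        self.n_before_alert = n_before_alert  # alert once count exceeds this N
        self.ignored_checks = ignored_checks
        self._counts = defaultdict(int)
        self._open = {}                       # key -> open ParentAlert

    def ingest(self, source: str, check: str, message: str):
        """Return a ParentAlert only when a new alert should be raised, else None."""
        if check in self.ignored_checks:
            return None                       # filtered out before operations sees it
        key = (source, check)
        self._counts[key] += 1
        if key in self._open:
            # Duplicate of an open problem: record it as a child, do not re-alert.
            self._open[key].children.append(message)
            return None
        if self._counts[key] > self.n_before_alert:
            parent = ParentAlert(key=key, message=message)
            self._open[key] = parent
            return parent
        return None
```

Feeding five identical `disk_full` events into `AlertReducer(n_before_alert=2)` raises a single alert on the third event and attaches the fourth and fifth to it as children.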

4. Trouble tickets are automatically assigned to the technical team without any first-level diagnosis.

This is like the helpdesk passing every user complaint on to the technical team without performing basic troubleshooting or working through a predefined checklist to decide whether an issue needs to be escalated. It is a very common scenario in organizations where the first level of support is not equipped with the right skills, or lacks the right processes and tools to perform an initial diagnosis.

How to address this?

Operations, or whoever provides the first level of support, should have the right tools to perform basic troubleshooting, such as root cause analysis (RCA). Beyond the tools, it is just as important to streamline the processes for first-level diagnosis so that issues requiring the domain expertise of the technical teams are easier to identify. Alerts should also be easy to customize so that they provide more context and detail: when, what, where, and how.
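
As an illustration of what "more context" can look like, the Python sketch below bundles the when, what, where, and how of an alert together with a short first-level checklist. The `AlertContext` class, its field names, and the sample check and host names are hypothetical and are not taken from any specific product.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class AlertContext:
    """Hypothetical alert payload carrying the context first-level support needs."""
    when: datetime           # time the condition was detected
    what: str                # the symptom, in plain language
    where: str               # affected host, system, or service
    how: str                 # how it was detected (check name, threshold, ...)
    first_steps: tuple = ()  # checklist to work through before escalating

    def render(self) -> str:
        """Format the alert as plain text for a ticket or notification."""
        lines = [
            f"WHEN:  {self.when:%Y-%m-%d %H:%M}",
            f"WHAT:  {self.what}",
            f"WHERE: {self.where}",
            f"HOW:   {self.how}",
        ]
        for number, step in enumerate(self.first_steps, start=1):
            lines.append(f"STEP {number}: {step}")
        return "\n".join(lines)


alert = AlertContext(
    when=datetime(2024, 1, 3, 14, 30),
    what="Dialog response time above threshold",
    where="PRD / application server sapapp01 (example name)",
    how="Example check 'dialog_response_ms' exceeded 2000 ms",
    first_steps=(
        "Confirm the system is not in maintenance mode",
        "Check for a recent change on the affected host",
    ),
)
print(alert.render())
```

Attaching the checklist to the alert itself means the first level of support does not have to hunt for the procedure before deciding whether to escalate.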

5. Service levels are not being monitored or sufficiently managed.

Service level management (SLM) provides metrics or KPIs on service health and quality objectives. Without SLM, there is no agreed yardstick for deciding which alerts matter, so application monitoring becomes too noisy.

How to address this?

Tools are essential to automate service-level monitoring, alerting, and reporting. They make SLM proactive, rather than limited to historical reporting on why service levels were missed. Automated reporting also raises awareness of how well a service complies with its established service levels.
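
A small worked example shows the difference between historical and proactive reporting. Assuming a 30-day month (43,200 minutes) and an availability target of 99.9 percent, the allowed downtime is about 43.2 minutes. The Python sketch below computes that figure and flags a service as "at_risk" before the target is actually missed; the function names and the 25 percent at-risk margin are assumptions chosen for illustration.

```python
def availability_pct(total_minutes: float, downtime_minutes: float) -> float:
    """Availability over the period, as a percentage."""
    return (total_minutes - downtime_minutes) / total_minutes * 100.0


def allowed_downtime_minutes(total_minutes: float, target_pct: float) -> float:
    """Downtime the target permits over the period."""
    return total_minutes * (1.0 - target_pct / 100.0)


def slo_status(total_minutes: float, downtime_minutes: float,
               target_pct: float, at_risk_fraction: float = 0.25) -> str:
    """Return 'on_track', 'at_risk', or 'breached' for the period so far."""
    allowed = allowed_downtime_minutes(total_minutes, target_pct)
    remaining = allowed - downtime_minutes
    if remaining < 0:
        return "breached"   # historical view: the target is already missed
    if remaining <= allowed * at_risk_fraction:
        return "at_risk"    # proactive view: little allowance is left
    return "on_track"


MONTH_MINUTES = 30 * 24 * 60  # 43,200 minutes in a 30-day month

for downtime in (10, 35, 50):
    print(downtime, slo_status(MONTH_MINUTES, downtime, target_pct=99.9))
```

With these assumed numbers, 10 minutes of downtime is on track, 35 minutes leaves roughly 8 minutes of allowance and is flagged as at risk, and 50 minutes is a breach.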

6. There is a lack of centralized enterprise monitoring.

Every technical team prefers to have its own set of tools, but it is essential to integrate them for centralized management and correlation. In large enterprises especially, where different applications on varying platforms are used throughout the organization, it is easy to get lost in a sea of alerts. As described in item number three, you would not want that to happen.

In addition, a lack of centralized enterprise monitoring sparks trouble among technical teams during root cause analysis (RCA) calls. With pressure building while everyone tries to find where the fault occurred, major incidents (MIs) can last not just hours but days, or even weeks!

How to address this?

A 360-degree view of services across applications and infrastructure provides better root cause analysis and top-down service management. Not only will you save time and effort, but you will also spare yourself a good deal of unnecessary worry when you join MI calls. It's a win-win for everyone!
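
One simple way to picture centralized correlation is to group alerts from different teams' tools by business service and time. The Python sketch below is a simplified illustration under stated assumptions: the `Alert` fields, the tool names, the five-minute window, and the "lowest layer first" heuristic in `likely_origin` are all hypothetical, and real correlation engines typically rely on topology data from the CMDB rather than a fixed layer list.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumed ordering from infrastructure up to application.
LAYER_ORDER = ["network", "storage", "os", "database", "application"]


@dataclass(frozen=True)
class Alert:
    tool: str      # monitoring tool that raised the alert
    layer: str     # one of LAYER_ORDER
    service: str   # business service the component supports
    at: datetime
    summary: str


def correlate(alerts, window=timedelta(minutes=5)):
    """Group alerts for the same service raised within `window` of the group's first alert."""
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.at):
        by_service[alert.service].append(alert)

    groups = []
    for service_alerts in by_service.values():
        current = [service_alerts[0]]
        for alert in service_alerts[1:]:
            if alert.at - current[0].at <= window:
                current.append(alert)
            else:
                groups.append(current)
                current = [alert]
        groups.append(current)
    return groups


def likely_origin(group):
    """Heuristic only: suggest the lowest-layer alert as the first place to look."""
    return min(group, key=lambda a: LAYER_ORDER.index(a.layer))
```

If a network, a database, and an application alert for the same service arrive within a few minutes of each other, they land in one group and the network alert is suggested as the starting point, so the MI call begins from one shared picture instead of three separate consoles.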

7. There is a separation of responsibilities—US and THEM.

Operations, IT Support, and subject matter experts (SMEs) are all critical parts of IT Service Management (ITSM). Operational excellence is a shared responsibility of all teams. However, with everyone focused on their own jobs, it can be a challenge for one team to help another. This creates division among teams when it comes to taking accountability, when in fact all of them share responsibility one way or another, because everything is cause and effect.

How to address this?

IT Service Management should incorporate cross-functional DevOps practices to automate the flow of data between traditional silos and to share information and intelligence, rather than segregating workflows between system administrators, DBAs, and so on. Integrating workflows from one team to another makes it possible for automation to become part of the solution, for example by executing recovery actions as part of the automated response to an alert.
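
A minimal sketch of such an automated response, in Python, is a shared runbook that maps alert types to recovery actions contributed by different teams. The `RUNBOOK` table, the alert type names, and the two placeholder actions are hypothetical; the actions only return a description of what they would do, where a real implementation would call the platform's own tooling.

```python
from typing import Callable, Dict, List

RecoveryAction = Callable[[str], str]


def restart_service(target: str) -> str:
    # Placeholder: a real action would invoke the platform's own restart tooling.
    return f"restart requested for {target}"


def clear_temp_files(target: str) -> str:
    # Placeholder: a real action would run an agreed cleanup job.
    return f"temp cleanup requested for {target}"


# Shared, cross-team mapping of alert types to recovery actions (example entries).
RUNBOOK: Dict[str, List[RecoveryAction]] = {
    "service_down": [restart_service],
    "disk_full": [clear_temp_files],
}


def respond(alert_type: str, target: str) -> List[str]:
    """Run the registered recovery actions, or fall back to human escalation."""
    actions = RUNBOOK.get(alert_type)
    if not actions:
        return [f"no automated action for '{alert_type}'; escalate {target} to the owning team"]
    return [action(target) for action in actions]


print(respond("service_down", "sapapp01"))
print(respond("certificate_expiring", "web01"))
```

Keeping the runbook in one place that every team can extend is the point of the design: the alert, the recovery action, and the escalation path live in the same workflow instead of in separate silos.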

Innovate, Automate, Deliver

Whether you take these seven signs philosophically or prefer a list of your own, the premise is the same: IT Operations is a critical part of any enterprise IT organization. Without it, the services delivered to end users will be poorly managed.

So where do we go from here?

We can debate where the future of enterprise IT is heading, but one thing is certain: with business services and applications proliferating, and every kind of organization needing to innovate with software to be efficient, automation of IT Operations will be one of the major keys to high-quality service delivery.

How is the health of your IT Operations?

Want to find out more about how we can help automate monitoring and service level management?
