With most of today’s monitoring systems still focused on analysing on-premises deployments, most IT operations teams are increasingly unprepared to track the performance of their growing number of cloud-based systems and applications. The ability to perform in-depth responsiveness and error checking on internal sites, external customer-facing sites and major third-party applications from providers such as Microsoft, Google and Salesforce is fast becoming a critical requirement for today’s resource-challenged in-house IT teams.
For organisations that need to ensure 24×7 performance across both on-premises and cloud solutions, but are unsure whether their existing networking infrastructure is up to the task, I’ve put together an Infrastructure Monitoring checklist of five key principles to help achieve real-time issue resolution and prevent infrastructure issues from occurring in the first place. A clear picture of overall infrastructure performance is essential, particularly as cloud becomes the de facto model for organisations of every size. Ensuring that IT infrastructure performs optimally not only avoids missed revenue opportunities but can also significantly reduce user frustration.
The five key Infrastructure Monitoring principles are:
1. Keep Things Simple
Overly complex monitoring processes can degrade the performance of the systems being monitored, creating network bottlenecks and generating so many warnings that an inevitable ‘cry wolf’ effect sets in. Over-monitoring can also lead to excessive infrastructure adjustments as IT teams constantly retune parameters in an attempt to keep pace.
2. Focus On What’s Actually Going Wrong
Smarter infrastructure monitoring systems cut through the noise generated by major outages by alerting on the component that has actually failed, rather than over-reporting on every downstream system and process that may be indirectly affected. Look for solutions that let critical components be ranked by importance, and for a monitoring approach that can collate data from your existing enterprise monitoring systems.
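The idea of reporting only the failed component, ranked by importance, can be illustrated with a small sketch. It assumes a hypothetical dependency map (`TOPOLOGY`) in which each component lists what it depends on, plus a hypothetical numeric priority (1 = most critical). A failed component is reported only if none of its upstream dependencies, direct or indirect, has also failed; the surviving alerts are then ordered by priority. Real monitoring products model topology in their own ways, so treat this as an illustration of the principle rather than any particular tool’s behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    priority: int                     # hypothetical ranking: 1 = most critical
    depends_on: tuple[str, ...] = ()  # names of components this one relies on


# Hypothetical example topology; a real one might come from a CMDB or discovery.
TOPOLOGY = {
    "core-switch": Component(priority=1),
    "db-server": Component(priority=2, depends_on=("core-switch",)),
    "finance-app": Component(priority=1, depends_on=("db-server",)),
    "web-portal": Component(priority=3, depends_on=("core-switch",)),
}


def upstream(name, topology):
    """Return every component `name` depends on, directly or indirectly."""
    seen, stack = set(), list(topology[name].depends_on)
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(topology[dep].depends_on)
    return seen


def root_cause_alerts(failed, topology):
    """Keep only failures not explained by a failed upstream dependency,
    most critical first (ties broken by name for a stable order)."""
    roots = [n for n in failed if not (upstream(n, topology) & failed)]
    return sorted(roots, key=lambda n: (topology[n].priority, n))


print(root_cause_alerts({"core-switch", "db-server", "finance-app", "web-portal"}, TOPOLOGY))
# ['core-switch']
print(root_cause_alerts({"finance-app", "web-portal"}, TOPOLOGY))
# ['finance-app', 'web-portal']
```

With the core switch down, only the switch is reported even though everything behind it also fails its checks. If only the finance application and web portal fail, both are genuine root causes and the higher-priority finance application is listed first.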
3. Optimise Usage By Alerting The Right People
Monitoring alerts should always be routed to the individuals in the relevant skills group: network issues to the network team, server issues to the server team, and so on. Adopting a Network Operations Centre (NOC) mentality helps, with responsible NOC agents tracking down fault owners and confirming that issues are being addressed before standing down.
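One simple way to express this routing is a lookup table from alert category to owning team, with anything unrecognised falling back to the NOC. The sketch below assumes hypothetical category names (`network`, `server`, `database`), team names and an `acknowledged` flag; each alerting platform has its own routing configuration, so this only illustrates the principle, including the NOC rule of not standing down until every alert has an owner who has confirmed it is being dealt with.

```python
from dataclasses import dataclass

# Hypothetical routing table: alert category -> owning skills group.
ROUTES = {
    "network": "network-team",
    "server": "server-team",
    "database": "dba-team",
}
FALLBACK_OWNER = "noc"  # unrecognised categories go to the NOC for triage


@dataclass
class Alert:
    alert_id: str
    category: str
    summary: str
    acknowledged: bool = False  # set once the owning team confirms it is on the case


def route(alert):
    """Return the skills group that should receive this alert."""
    return ROUTES.get(alert.category, FALLBACK_OWNER)


def noc_can_stand_down(alerts):
    """The NOC stands down only when every open alert has been acknowledged."""
    return all(a.acknowledged for a in alerts)


alerts = [
    Alert("A1", "network", "Core switch unreachable"),
    Alert("A2", "storage", "Disk array degraded"),
]
for a in alerts:
    print(a.alert_id, "->", route(a))  # A1 -> network-team, then A2 -> noc
print(noc_can_stand_down(alerts))      # False until both are acknowledged
```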
4. Automate Wherever Possible
With skilled expert support and proven, in-depth remediation procedures, recurring issues can often be resolved automatically. For example, a software component on a critical finance server can be restarted automatically when a process crash is detected. The automation can be made even more pre-emptive by parsing the component’s log files for error conditions that typically appear before a crash. The component can then be scheduled for an out-of-hours restart, preventing the crash from happening at all.
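The crash-restart example can be sketched as a small decision function covering both the reactive and the pre-emptive case. Everything specific here is an assumption for illustration: the log patterns in `PRE_CRASH_PATTERNS`, the 02:00 maintenance window and the action names returned. The function only decides what to do; actually restarting the service is left to whatever automation platform or init system is in use.

```python
import re
from datetime import datetime, time, timedelta

# Hypothetical log patterns assumed to precede a crash of the finance component;
# in practice these would be identified from past incidents.
PRE_CRASH_PATTERNS = [
    re.compile(r"OutOfMemoryWarning"),
    re.compile(r"connection pool exhausted", re.IGNORECASE),
]
MAINTENANCE_START = time(2, 0)  # assumed out-of-hours window, 02:00 local time


def pre_crash_signals(log_lines):
    """Return the log lines that match any known pre-crash pattern."""
    return [line for line in log_lines
            if any(p.search(line) for p in PRE_CRASH_PATTERNS)]


def next_maintenance_slot(now):
    """Next occurrence of MAINTENANCE_START strictly after `now`."""
    slot = datetime.combine(now.date(), MAINTENANCE_START)
    return slot if slot > now else slot + timedelta(days=1)


def decide_action(process_running, log_lines, now):
    """Reactive: restart at once if the process has crashed.
    Pre-emptive: if warning signs appear, book an out-of-hours restart."""
    if not process_running:
        return ("restart_now", now)
    if pre_crash_signals(log_lines):
        return ("schedule_restart", next_maintenance_slot(now))
    return ("no_action", None)


logs = ["INFO request served", "WARN Connection pool exhausted (50/50)"]
print(decide_action(True, logs, datetime(2024, 5, 1, 14, 30)))
# ('schedule_restart', datetime.datetime(2024, 5, 2, 2, 0))
```

Keeping the decision separate from the restart itself makes the logic easy to test and lets the same rules drive whichever tool performs the restart.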
5. Ensure Monitoring Parameters Are Adaptable
Utilisation patterns inevitably change over the lifecycle of any infrastructure component, whether through increased usage or the addition of further access points. Such changes call for a review of threshold and alerting parameters so that monitoring remains fit for purpose. Just because a component is working well now doesn’t mean it will continue to do so in today’s evolving infrastructure.
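One way to stop thresholds going stale is to derive them from recent behaviour rather than fixing them once. The sketch below swaps in a generic statistical baseline (mean plus `k` population standard deviations over a rolling window of samples) rather than describing any specific product; the window size and `k` are illustrative assumptions. It complements, rather than replaces, the periodic review of thresholds described above.

```python
from collections import deque
from statistics import mean, pstdev


class AdaptiveThreshold:
    """Alert threshold recomputed from a rolling window of recent samples,
    so it follows changes in utilisation instead of staying fixed."""

    def __init__(self, window=288, k=3.0):
        # e.g. 288 five-minute samples = the last 24 hours (assumed cadence)
        self.samples = deque(maxlen=window)
        self.k = k

    def threshold(self):
        """Current limit, or None until there is enough history for a baseline."""
        if len(self.samples) < 2:
            return None
        return mean(self.samples) + self.k * pstdev(self.samples)

    def observe(self, value):
        """Return True if `value` breaches the current threshold,
        then fold it into the baseline."""
        limit = self.threshold()
        breached = limit is not None and value > limit
        self.samples.append(value)
        return breached


cpu = AdaptiveThreshold(window=3, k=2.0)
print([cpu.observe(v) for v in [50, 50, 50, 80, 80, 80, 80]])
# [False, False, False, True, False, False, False]
```

In the example, the first jump to 80 is flagged, but once the higher level becomes part of the baseline it stops alerting. Whether a sustained shift like that should be absorbed silently or escalated is exactly the kind of judgement a periodic threshold review should cover.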