Top 8 Network Monitoring Best Practices

Beginner’s Corner


UMBOSS Team

May 7, 2024
7 min. read

Network monitoring can feel like navigating a maze of technical details, offering engineers a wealth of methods and practices for digging into how well a network is doing. In this blog post, we're going on a journey to figure out which network monitoring practices are the best for keeping an eye on network health and performance.

What is Network Monitoring?

Discussing network monitoring best practices obviously requires a proper primer on the topic. For this purpose, we've prepared a dedicated blog post on what network monitoring is to serve as an introduction to the topic.

If you're still hungry for more information, we also recommend checking out our ebook titled Introduction to Network and IT Monitoring for Rookies.

Network Monitoring Best Practices

Discovery of all resources needed to provide services

Before any monitoring can start, you must understand what exists within the network. This is essential because, in the case of active polling, you must know which devices to poll. Even in the case of passive monitoring, where devices send events to the monitoring system, you need to understand how to interpret and map the received data.

Theoretically, a resource inventory or CMDB is where you should find all the data about devices being monitored. Unfortunately, the reality of life is that the data is often inaccurate, and you simply can’t rely on it. Therefore, an automatic discovery process must be employed to scan the entire network, detect all connected devices, learn everything there is to know about them, and reconcile the discovered data with the inventory/CMDB.

A discovery process that merely scans IP addresses in ranges and detects which services are up or down is not sufficient. A proper discovery process usually employs various SNMP, CLI, and other methods to discover a device’s chassis data, modules, interfaces, CPUs, memory banks, and other components that can be sources of events and performance data.
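To make this concrete, here is a minimal sketch of an SNMP-based discovery probe using the open-source pysnmp library (v4-style synchronous API). The community string, OIDs, and error handling are simplified assumptions; a real discovery engine would walk far more of the MIB and fall back to CLI methods.

```python
# A minimal discovery probe: query a device's sysName and sysDescr,
# the starting point for deeper discovery. Community string and timeouts
# are illustrative; production discovery also walks interfaces, modules,
# CPUs, and memory banks.
from pysnmp.hlapi import (getCmd, SnmpEngine, CommunityData,
                          UdpTransportTarget, ContextData,
                          ObjectType, ObjectIdentity)

def probe_device(ip, community="public"):
    """Return basic identity data for one device, or None if unreachable."""
    error, status, _, var_binds = next(getCmd(
        SnmpEngine(),
        CommunityData(community, mpModel=1),               # SNMPv2c
        UdpTransportTarget((ip, 161), timeout=2, retries=1),
        ContextData(),
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.5.0")),   # sysName.0
        ObjectType(ObjectIdentity("1.3.6.1.2.1.1.1.0")),   # sysDescr.0
    ))
    if error or status:
        return None
    name, descr = (str(vb[1]) for vb in var_binds)
    return {"ip": ip, "sysName": name, "sysDescr": descr}
```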

In addition to physical resources, the discovery process must identify virtual and logical resources. The most crucial part of discovery is topology discovery. This is vital because troubleshooting, topology-based alarm and performance analysis, root-cause analysis, and other aspects of monitoring rely on it.

Discovery is meaningless without reconciliation with the network inventory or CMDB. This is because a well-designed monitoring system will use the inventory/CMDB to enrich the monitoring data, thus placing alarms and performance data into the context of the network layout.


Implementation of an umbrella approach

Many organizations employ types of network monitoring specialized for specific parts of the network. For instance, the core network may use one monitoring system, while the monitoring of transport systems like DWDM may rely on a specialized element manager. Such situations often stem from siloed organizational structures and can result in poor performance during troubleshooting when network issues arise. This scenario is well illustrated in our use case on how to correlate IaaS VMware and network alarms to fix issues fast.

To overcome such situations and streamline troubleshooting, one should have the ability to centralize all alarms and perform correlation and root cause analysis to understand the true cause of network issues. Therefore, a consolidation of all event data is necessary. The consolidation capability must be implemented in a manner that ensures scalability, supporting any future network growth.

Furthermore, issues may be caused by a combination of performance degradation and network faults. To comprehend the situation, engineers must be provided with alarm data, performance data, and network inventory data combined. Some of this data comes from various network monitoring protocols. Hence, a form of unified monitoring is essential to address the issue effectively.
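As an illustration, the sketch below shows what such consolidation might look like at the data-model level: events from different sources are normalized into one common alarm record before correlation runs. The schema and field names are our own assumptions for the example, not a standard.

```python
# A sketch of event consolidation: heterogeneous sources (SNMP traps,
# syslog, element managers) are normalized into one alarm schema so that
# correlation and RCA can treat all domains uniformly.
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Alarm:
    source: str          # "snmp-trap", "syslog", "dwdm-ems", ...
    device: str          # resolved against inventory during enrichment
    severity: str        # normalized scale: critical/major/minor/warning
    event_type: str      # e.g. "link-down", "laser-bias-high"
    raw: dict            # original payload, kept for forensics
    received_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))

def normalize_syslog(line: str, device: str) -> Alarm:
    """Very rough mapping of a syslog line onto the common schema."""
    severity = "critical" if "LINK-3-UPDOWN" in line else "warning"
    return Alarm("syslog", device, severity, "link-down",
                 raw={"message": line})
```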

Ultimately, engineers need access to all technical and administrative data in a single, comprehensive view to visualize the situation and make informed decisions regarding troubleshooting steps.

These three features together constitute what is known as an umbrella monitoring system, or more precisely, an umbrella network assurance system.

All-encompassing alarm management

Alarms are generated based on collected events from the network, either through SNMP, Syslog, or other methods. Collecting events can be done either passively, waiting for devices to report an event, or actively, by polling devices. See our blog post on different types of network monitoring. Finding the right balance between the two is crucial.

Relying only on active monitoring will unnecessarily burden your devices with repetitive polling, which consumes some CPU, but it yields the most precise picture of your network's health. Passive monitoring, on the other hand, adds no extra load to your network, but if a device fails so badly that it can no longer send events, you won't know about it.

Therefore, you must understand your business needs and map them to the optimal data acquisition strategy. Once all the alarms are known, the monitoring system must execute correlations and root cause analysis (RCA) so it can precisely identify the problem when alarms are coming from everywhere. Before it can do so, however, the alarms must be enriched with data from the network inventory/CMDB and other administrative sources, as this is the only way to include geo-spatial, administrative, and other dimensions in alarm handling. Enriched alarms are also the only proper way to give your engineers the right context for an alarm.
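In practice, enrichment can be as simple as a lookup against the inventory before correlation runs. In the sketch below, the CMDB record and field names are entirely hypothetical stand-ins for a real inventory API:

```python
# A sketch of alarm enrichment: look up the alarmed device in the
# inventory/CMDB and attach location, service, and ownership context.
CMDB = {
    "edge-rtr-01": {"site": "Zagreb POP 2", "rack": "A07",
                    "services": ["MPLS-VPN cust-1142"],
                    "owner_team": "IP Core"},
}

def enrich(alarm: dict, inventory: dict = CMDB) -> dict:
    record = inventory.get(alarm["device"])
    if record:
        # Geo-spatial and administrative dimensions now travel with the
        # alarm, giving correlation, RCA, and operators full context.
        alarm["enrichment"] = record
    return alarm

print(enrich({"device": "edge-rtr-01", "event_type": "link-down"}))
```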

Alarm suppression, alarm clearing, and the ability to define maintenance windows are also good approaches to alarm management that significantly reduce alarm noise during daily operations. The monitoring system should be configured to notify you of any critical alarms when you are out of the office, especially when you don’t have a 24/7 NOC in place.
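A maintenance window check might look like the sketch below; the window definitions are purely illustrative:

```python
# A sketch of maintenance-window suppression: alarms from devices under
# a declared window are tagged or held instead of paged, cutting alarm
# noise during planned work.
from datetime import datetime

MAINTENANCE_WINDOWS = [
    {"device": "edge-rtr-01",
     "start": datetime(2024, 5, 7, 22, 0),
     "end":   datetime(2024, 5, 8, 2, 0)},
]

def is_suppressed(device: str, at: datetime) -> bool:
    return any(w["device"] == device and w["start"] <= at <= w["end"]
               for w in MAINTENANCE_WINDOWS)
```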

Yet another good practice is to maintain an alarm journal and execute automatic actions on certain alarms, for example, opening an external ticket in specific situations. Finally, historical alarms are important for forensic analysis, whether manual or automated, so pay attention to the alarm retention period and align it with your business requirements.


All-encompassing performance management

When facing the challenge of performance monitoring on a large network, one question arises immediately: which metrics should be collected, and at what sampling rate? There is no recipe that provides a straightforward answer to this million-dollar question.

The next of our network monitoring best practices is to begin by determining the right balance between metrics that are sampled more frequently and those sampled less frequently. A high polling frequency may overload the device being monitored, yet it may be necessary when observing performance degradations that play out on very short time scales, such as in broadcasting. Conversely, other metrics, like the fuel level of a power generator, can be sampled much less frequently because these values change slowly.

The shorter the sampling interval, the more storage space is required. Therefore, consider your business needs and determine a retention period for non-aggregated data, then decide how long to keep hourly and daily aggregations. These parameters will significantly impact the required storage capacity.
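A quick back-of-the-envelope calculation shows how these parameters interact. All figures below are illustrative assumptions, not sizing guidance:

```python
# Storage estimate under assumed parameters: 5,000 interfaces, 10 metrics
# each, 5-minute sampling, 90 days of raw retention, 16 bytes per stored
# sample (timestamp + value + overhead).
interfaces      = 5_000
metrics_each    = 10
interval_s      = 300           # 5-minute polling
raw_days        = 90
bytes_per_point = 16

samples_per_day = 86_400 // interval_s                      # 288
raw_points = interfaces * metrics_each * samples_per_day * raw_days
raw_gib    = raw_points * bytes_per_point / 2**30

print(f"{raw_points:,} raw samples ≈ {raw_gib:.1f} GiB")
# -> 1,296,000,000 raw samples ≈ 19.3 GiB; halving the interval doubles it.
```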

The third variable is the list of metrics to be monitored. Devices provide various metrics, and selecting those important to your business is crucial for the efficiency of your operations, as well as for the storage space and computing resources required for your performance collectors.

Your monitoring system must support persistent KPIs, which are indicators calculated from collected metric values. The number of KPIs you define also influences capacity requirements, so avoid overengineering it. Aggregate KPIs, such as average network utilization, are likely the best option for measuring the overall performance of your network.
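As a concrete example, one of the most common KPIs, interface utilization, can be derived from two successive ifInOctets counter readings. This is a generic sketch (counter wrap-around handling omitted for brevity), not a vendor-specific formula:

```python
# Interface utilization as a KPI: bits transferred between two counter
# samples divided by what the link could carry in that interval.
def utilization_pct(octets_t1: int, octets_t2: int,
                    interval_s: float, if_speed_bps: int) -> float:
    bits = (octets_t2 - octets_t1) * 8
    return 100.0 * bits / (interval_s * if_speed_bps)

# 150 MB transferred in 5 minutes on a 1 Gbit/s link -> 0.4% utilization
print(utilization_pct(0, 150_000_000, 300, 1_000_000_000))  # 0.4
```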

Once all performance data and KPIs are gathered, establishing baseline network behavior is essential. This can be a tricky task. Many systems tend to learn the behavior of the network over time and define this as the baseline behavior. However, this may not necessarily be the normal or preferred behavior. Consulting network architects to determine the targeted baseline is crucial. Then, monitor and engineer traffic or even expand network capacity to meet business objectives.

Finally, with baselining defined, the monitoring system must support threshold violation alarming, anomaly detection, trending analysis, and forecasting, which is key to proper capacity planning.
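A very simple statistical baseline could be sketched as below. Real systems account for seasonality and trends; the window size and three-sigma threshold here are arbitrary assumptions for illustration:

```python
# A minimal anomaly detector: flag samples deviating more than three
# standard deviations from a rolling window of recent values.
from collections import deque
from statistics import mean, stdev

def make_detector(window: int = 288):       # e.g. one day of 5-min samples
    history = deque(maxlen=window)
    def check(value: float) -> bool:
        anomalous = (len(history) >= 30 and
                     abs(value - mean(history)) > 3 * stdev(history))
        history.append(value)
        return anomalous
    return check
```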

Set essential automations

Monitoring is the foundation for what engineers who care about their network must actually do – ensure it works fine. Therefore, assurance must rely on diagnostic and remediation tools as well as automated remediations. A proper monitoring system should provide tools for basic and more complex tests when a fault is detected. Basic tools usually include ping, traceroute, SSH, and HTTPS terminals, along with other capabilities that allow engineers to quickly connect to faulty devices and execute informed actions to detect the right issue and fix the problem fast.

More advanced tools typically include diagnostic scripts that may involve measuring attenuation or reflectometric measurements on optical interfaces, or even more complex analyses involving external measurement probes and devices. Even though all this directly impacts resolution time, it may still be overwhelming for engineers when many things are happening in the network. In such situations, one mechanism is particularly important – automations.

With automations, your monitoring system can trigger external actions for well-known problems. The Pareto principle plays a key role again: automating the 20% of well-known problems usually reduces the load by up to 80%, a compelling case for introducing automatic remediation in your network operations center (NOC).
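Conceptually, such an automation layer can be as simple as a mapping from well-known alarm types to remediation actions, as in this sketch. The handlers are placeholders of our own, not UMBOSS APIs:

```python
# Alarm-driven automation: well-known alarm types map to remediation
# actions such as scripts or ticket creation.
def restart_optical_port(alarm):
    print(f"restarting port on {alarm['device']}")

def open_ticket(alarm):
    print(f"opening ticket for {alarm['event_type']} on {alarm['device']}")

PLAYBOOK = {
    "laser-bias-high": [restart_optical_port, open_ticket],
    "link-down":       [open_ticket],
}

def on_alarm(alarm):
    # Alarms without a playbook entry fall through to a human operator.
    for action in PLAYBOOK.get(alarm["event_type"], []):
        action(alarm)

on_alarm({"device": "edge-rtr-01", "event_type": "link-down"})
```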


Implement proper reporting

Reporting is one of the pain points of every network operations team, especially NOCs, which spend too much time generating reports instead of troubleshooting. Proper reporting is therefore a network monitoring best practice: it must ensure that periodic operational or management reports can be generated on demand or even on a schedule.

Now, it is important to understand that simply designing a report, scheduling it, and forgetting about it will create problems. It is easy to become absorbed in your daily operations and overlook the big picture.

Therefore, you must receive and analyze the reports yourself and react to any deviation from the normal behavior or performance of the network you are responsible for.

Build a unified view over the whole infrastructure

The essence of monitoring lies in gaining insight. Simply scanning through a sea of alarms won't lead you anywhere. You need data visualization that offers a 360° unified view of your network, its current overall performance and health, as well as your customers and the services you provide to them.

To achieve this, the monitoring system should feature configurable dashboards, comprising various gauges, tables, maps, and other widgets that truly provide insight into what's happening with your network globally or with specific parts of it.

Consider these examples: A heatmap of devices or data rooms (POPs) overlaid on a map of the state or city where your network resides can provide an instant, geographically oriented indication that something is amiss. Schematic and geographic network topology representations, enriched with live network data, allow you to immediately correlate performance degradation and alarms with different layers of the network, thus clearly illustrating the situation at a more nuanced level.

Live free-hand topologies are also crucial, as they transform technical diagrams into dynamic representations, aiding in better understanding of the situation, especially when something is seriously wrong with the network.

Monitoring architecture and high availability

The monitoring system is supposed to inform you about how the network is performing, whether you're facing localized or global issues. Therefore, proper design of the monitoring infrastructure is another network monitoring best practice. You don't want to overengineer it, as you won't have budget left for important necessities, but you still want the monitoring function to be resilient when things go wrong. Once again, the decision should be based on business needs.

One demanding business environment is a company with critical communication infrastructure, such as railways, air traffic control, highways, healthcare systems, airports, police, and military installations. These industries require monitoring to be bulletproof. The architecture is focused on two important aspects:

  1. The right choice of monitoring types
  2. High availability

You can learn more about different types of network monitoring in our blog post.

One approach to consider is isolating a separate monitoring network (monitoring VRF, subnet) that can provide you with the necessary monitoring data during a crisis. Other more advanced approaches involve adding passive taps and probes to the network.

In any case, the required availability, reliability, and risk analysis should guide the decision-making process.

High availability of monitoring can be considered in the context of geo-redundancy and infrastructure redundancy. In many cases, a simple virtual or cloud environment may provide sufficient resilience in the event of infrastructure faults. However, for critical communication environments, a geo-redundant high-availability approach should be considered. In this scenario, the monitoring architecture should offer resilience in case of a complete data center outage, outage of a large portion of the network, technical failures, and natural disasters.


UMBOSS's implementation of network monitoring best practices

In previous sections, we have discussed many monitoring functions and concepts that must be combined to build the right monitoring solution for your specific situation. The number of combinations is limitless, but key concepts are mandatory and must be included if you require the right monitoring solution. Make sure to read more about the benefits of network monitoring.

UMBOSS is an umbrella network and service assurance platform that implements the best practices mentioned in this blog. Its discovery and reconciliation, combined with resource inventory, guarantee that everything that needs to be monitored is discovered and documented. UMBOSS’s consolidation layer puts all sources of events and performance data in one place, and its alarm management and performance management treat all data sources as equal.

This enables all the advanced umbrella-specific functions, such as cross-domain correlations, root cause analysis, enrichment of monitoring data with technical and administrative data, autonomous automation of remediation and other processes, a unified single pane of glass view of network infrastructure overlaid with monitored data, flexible reporting, and a flexible architecture that can be adapted to all levels of complexity of the underlying network and ICT infrastructure.

Have any questions? Want to learn more? Get in touch and let us know how we can help. Send us a message or book a demo today.

Interested in discovering more?

You can read all you want about UMBOSS, but the best way is to experience it through a demo.
