In this blog, we expand our discussion from our previous post on network fault management and explore its companion function, network performance management. While fault management deals with detecting and isolating network problems, performance management focuses on gaining insight into how well the network operates and identifying potential issues with network performance.
The Importance of Network Performance Management (NPM)
Performance management provides crucial insights into the health of your network and offers opportunities for improvement. It encompasses all activities required to maintain your network’s performance as close to optimum as possible, ensuring customer/user satisfaction.
Primarily, it involves systematically gathering data about the quality of various network aspects to understand current performance. This data is then used to detect performance degradations, resolve detected and pending performance issues, analyze key insights, and establish processes to optimize performance, prevent future degradations, and execute network capacity planning.
The primary role of performance management is to prevent network degradations and faults. Therefore, it serves a preventive or protective role within overall network management, aiming to ensure high availability and reliability for the network and its services.
Key Processes in Performance Management
Since performance management involves providing actionable end-to-end insights into network performance, it is crucial to understand that it encompasses the execution of the following key processes, each with well-defined goals:
- Detection of performance degradation that may impact user/customer satisfaction, leading to quick resolution of the underlying problem.
- Daily operational performance tuning, adjusting network configuration to accommodate changes in traffic patterns.
- Performance forecasting and reporting, predicting future network or data center performance degradation based on trend analysis.
- Capacity planning, using historical data analysis and forecasting to plan effectively for future capacity expansion. On a related note, read our blog on data center capacity planning.
Insight into performance data
All these processes require the constant collection and storage of performance data, along with its processing, analysis, and operational alarming and reporting. The essential capability any monitoring system must provide is quick access to the performance data of all metrics on all devices in the form of graphs. This is usually accompanied by the ability to define the time period to visualize, get a condensed history view per day, week, or month, combine different metrics in one graph or on one screen, and so on.
Example performance graph dashboard
When we talk about statistical parameters, performance management generally collects gauge and counter performance metrics. A gauge represents a single numerical value that can go up or down arbitrarily or remain constant over time, making it ideal for metrics that fluctuate, such as current memory usage, CPU utilization, or the number of active connections on a web server or database. For example, monitoring the CPU load on a server is a typical use case for a gauge, as this number increases when the CPU is utilized more and decreases when it is utilized less.
On the other hand, a counter is a cumulative metric that only increases in value or resets to zero; it never decreases. They are used for metrics that continuously grow, such as the total number of requests processed, or total errors encountered. For instance, counting the total number of HTTP requests received by a server is a typical use case for a counter, as this count increases with each new request and resets only when the server restarts.
However, there are some exceptions to this. For example, when monitoring network interfaces using the SNMP (Simple Network Management Protocol), counters are typically used to track metrics like the number of bytes or packets sent and received. These counters continuously increase and provide a cumulative total, which is essential for calculating rates and detecting trends over time. Even though interface traffic can fluctuate up or down, the use of counters in SNMP helps in accurately measuring the total volume of traffic. By periodically polling these counters, the performance management system can determine the rate of traffic by calculating the difference between successive readings and dividing it by the time passed between these readings.
The key differences between these two types of metrics lie in their behavior and reset handling. Counters only increase or reset to zero, while gauges can increase, decrease, or stay constant. When a counter resets (e.g., during a service restart), monitoring systems must detect and handle this situation.
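The polling logic described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name is ours, and real systems often discard a sample on a suspected reset rather than applying the wrap-around formula, since the two cases cannot always be distinguished from the readings alone.

```python
def counter_rate(prev_value, curr_value, interval_seconds, max_value=2**64 - 1):
    """Derive a per-second rate from two successive counter readings.

    If the current reading is lower than the previous one, the counter
    either wrapped around its maximum or reset to zero (e.g., after a
    device restart). A common simplification, used here, is to assume
    a wrap; some systems instead discard such a sample entirely.
    """
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    if curr_value >= prev_value:
        delta = curr_value - prev_value  # normal monotonic growth
    else:
        delta = (max_value - prev_value) + curr_value + 1  # assumed wrap-around
    return delta / interval_seconds

# Two readings of a 64-bit octet counter taken 60 seconds apart,
# converted from octets per second to bits per second:
rate_bps = counter_rate(1_000_000, 4_000_000, 60) * 8
```

Polling more frequently shortens the window in which an undetected reset can distort the calculated rate.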
Performance data exists in certain contexts, so every good performance management system will enrich performance data with other technical (e.g., inventory) and non-technical data (e.g., customer CRM or even billing). This is done to provide engineers with the context of a performance metric when something is wrong. Some examples of additional context data are device data, device location, responsible person, customer, and services related to the performance metric.
Statistical parameters of a network’s performance (e.g., average utilization, main traffic hour, percentage of customers connected to the network) are usually presented graphically to make the analysis easier. This all helps engineers figure out what’s wrong or what parameter values indicate future problems.
Network Performance Management KPIs
Key performance indicators (KPIs) in network performance management are specific, measurable metrics used to evaluate the effectiveness and efficiency of a network. These indicators help network administrators monitor and assess various aspects of network performance, such as bandwidth utilization, latency, packet loss, etc. By tracking KPIs, organizations can ensure their network is operating optimally, identify potential issues before they become critical, and make informed decisions to enhance network reliability and performance.
KPIs can be obtained directly from the network (e.g., CPU load, number of connections to a server, etc.) or can be calculated using multiple metrics and mathematical formulas (e.g., signal-to-noise ratio - SNR) to provide a comprehensive view of network performance.
If we go back to the example of interface traffic, interface speed is calculated within a time window as the number of octets (bytes) transmitted or received by the interface, multiplied by 8 to get bits, and then divided by the duration of the time window. This is usually expressed in bit/s, Mb/s, or Gb/s.
However, very often, engineers want to understand the utilization of the available capacity of the interface. The way to do this is by creating another metric (variable), called utilization, which is expressed as the percentage of the total capacity of the interface used. Written out as a formula it looks like this:
Utilization = (Speed / Capacity) × 100%
This new variable is an example of a key performance indicator (KPI) of the interface. Of course, in this case utilization is a number between 0% and 100%. Any value above 90%, for example, may indicate some level of service degradation.
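The speed and utilization calculations above can be expressed directly in code. This is a simple sketch with illustrative function names and values, assuming equally spaced counter samples:

```python
def interface_speed_bps(octets_delta, window_seconds):
    """Interface speed over a time window: octets -> bits, per second."""
    return octets_delta * 8 / window_seconds

def utilization_percent(speed_bps, capacity_bps):
    """Utilization KPI: the share of total interface capacity in use."""
    return speed_bps / capacity_bps * 100

# 6 GB transferred in a 60-second window on a 1 Gb/s link:
speed = interface_speed_bps(6_000_000_000, 60)       # 800 Mb/s
util = utilization_percent(speed, 1_000_000_000)     # 80%
if util > 90:
    print("possible service degradation on this interface")
```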
There are many examples of KPIs, and monitoring systems allow for the creation of new KPIs when needed. For illustration, one KPI can be the total electrical current being consumed by a rack in a data center. This is obtained by summing several metrics of several devices providing such performance data. Also, the sum of total internet traffic of a telecom is another example of a KPI, and a very important one as it is related to the total cost of internet upstream. It is obtained by summing speeds of all interfaces through which the telecom is exchanging internet traffic with other telecoms.
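A derived KPI of this kind is often nothing more than an aggregation over several collected metrics. A minimal sketch of the rack-current example, with hypothetical device names and readings:

```python
# Rack power KPI: sum the current draw reported by each PDU in the rack.
# Device names and readings are purely illustrative.
pdu_readings_amps = {"pdu-a1": 12.4, "pdu-a2": 11.8, "pdu-b1": 13.1}

rack_current_kpi = sum(pdu_readings_amps.values())  # total amps for the rack
```

The total-internet-traffic KPI mentioned above works the same way, summing the calculated speeds of all peering and upstream interfaces instead of PDU readings.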
KPIs are the cornerstone of proper management in mobile networks. Examples of basic KPIs in LTE networks are RSRP (Reference Signal Received Power), RSRQ (Reference Signal Received Quality), SINR (Signal to Interference & Noise Ratio) and many others.
Detecting performance degradations
One important process in performance management is detecting performance degradations and anomalies. This is achieved by using fixed or dynamic thresholds. When a performance metric value becomes higher or lower than a threshold level, the performance management system will trigger a threshold violation alarm.
Thresholds can be set as fixed values (say, 70% of link utilization), with different threshold values for different periods of time, or a dynamic method can be used. In the latter case, one can define a threshold as any deviation greater than a percentage or fixed amount relative to the baseline. In this context, the baseline is calculated using statistical methods over historical performance metric or KPI values to establish "normal" behavior. Performance management systems can also detect any rapid change in a metric's or KPI's value to identify sudden shifts in its behavior. Like any other alarm, a threshold violation alarm is described by many parameters such as severity, source of alarm (agent), alarm raised (creation) time, last update time, description, etc.
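The dynamic, baseline-relative threshold described above can be sketched as follows. This is deliberately simplistic: the baseline here is just the mean of recent history, whereas production systems use far richer statistics (time-of-day profiles, seasonality, ML models); the function name and the 30% deviation are our illustrative choices.

```python
import statistics

def baseline_violation(history, current, deviation_pct=30):
    """Flag a value deviating more than deviation_pct from the baseline.

    history: recent metric or KPI values considered "normal" behavior.
    Returns True when the current value should raise a violation alarm.
    """
    baseline = statistics.mean(history)
    allowed = baseline * deviation_pct / 100
    return abs(current - baseline) > allowed

history = [40, 42, 38, 41, 39]      # % utilization over recent intervals
baseline_violation(history, 41)     # False: within normal behavior
baseline_violation(history, 70)     # True: trigger a threshold violation alarm
```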
Automated actions
When a certain degradation is detected, the performance management system must trigger an alarm. This ensures that engineers understand there is a problem in the network. Since there is an alarm in place, similar tactics can be used as in the case of synthetic alarms and root-cause analysis — utilizing automated actions to automatically apply remedies for well-known performance degradations.
For instance, detecting a sudden temperature increase within a data center is a critical event. The system can automatically initiate a telephone call to engineers to prompt immediate action. Alternatively, the problem can be addressed by triggering the HVAC system to perform urgent cooling activities.
Forecasting and capacity planning
We’ve emphasized that performance management encompasses all the activities necessary to maintain the performance of your digital infrastructure as close to optimum as possible, with the goal of ensuring customer/user satisfaction.
This definition primarily includes daily or short-term operational routines that may involve the replacement of faulty devices, tuning of air conditioning, tilting antennas, fixing optical cable attenuation, and many other activities typically coordinated by network, data center, or IT operations teams. However, all these activities may not provide significant help if the capacity of the digital infrastructure is insufficient or not being utilized optimally.
That’s why telecoms and IT organizations typically have strategic planning units or task forces responsible for optimizing, expanding, and modernizing their digital infrastructure to streamline daily operations and enable the optimal functioning of the services being provided. To accomplish all this, two long-term impactful activities are necessary: capacity management and capacity planning. Furthermore, it’s crucial to consider both business decisions and the requirements of existing services (standard operations).
Business decisions directly impact the capacity of the underlying digital infrastructure. This can be illustrated with the following example: suppose a CSP (Cloud Service Provider) is planning to launch a new cloud file storage service. In this scenario, the capacity of the data storage systems must be upgraded to accommodate the data of customers. The capacity planning process answers the question, “How much data storage is required?”
The future needs of services that are already operational also impact the capacity of the underlying digital infrastructure. For example, standard internet access service necessitates a continual increase in the utilization of network links, particularly internet uplinks, due to the growing demand for bandwidth caused by new video streaming services and the constant influx of new customers. This is just one of many factors influencing constant capacity upgrades.
Therefore, constant attention to capacity is necessary, and this focus is known as capacity management. It is a continuous process that involves the following key activities:
- Performance monitoring and recording the performance of the digital infrastructure.
- Analyzing historical performance records and detecting performance issues.
- Optimizing the utilization of existing capacities as a result of performance analysis.
- Optimizing the utilization of existing capacities to accommodate business decisions that impact capacity requirements.
- Planning capacity expansion to meet business-driven requirements, such as the introduction of new services.
- Planning capacity expansion for ongoing services based on forecasts of the expected growth in capacity needs.
- Implementing changes in the digital infrastructure to align with the plans.
As one can observe, capacity management encompasses a wide range of activities, and this post cannot cover them all in detail. However, do check out our capacity planning blog post.
The figure below illustrates a very simplified representation of capacity management. It is an iterative process in which capacity planning plays a vital role. Any changes made to the digital infrastructure as a result of capacity planning need to be re-evaluated, and appropriate re-planning should be carried out as necessary.
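At its simplest, the forecasting that feeds capacity planning is a trend extrapolated from historical KPI values. The sketch below fits a straight line by least squares and projects it forward; real capacity planning tools use more sophisticated models (seasonality, confidence intervals), and the function name and sample data are illustrative.

```python
def linear_forecast(samples, steps_ahead):
    """Fit y = slope*x + intercept by least squares, then extrapolate.

    samples: equally spaced historical KPI values (e.g., monthly peak
    utilization). Returns the projected value steps_ahead intervals out.
    """
    n = len(samples)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, samples))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope * (n - 1 + steps_ahead) + intercept

# Monthly peak uplink utilization (%): will it cross ~90% within half a year?
history = [52, 55, 59, 61, 66, 69]
projected = linear_forecast(history, 6)  # projected utilization six months out
```

A projection approaching the link's practical limit is the signal to start planning the capacity upgrade well before users feel the congestion.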
Generalized capacity planning process
Operational performance tuning and reporting
Networks require daily operational performance tuning, especially those with dynamic customer behavior. Network engineers make tuning decisions based on performance data. To facilitate easier decision-making, performance management must implement flexible reporting processes that provide engineers with relevant information. To illustrate some of the activities and corresponding reports, let's look at a few typical examples:
- Top N interface utilization: Shows which interfaces have the highest utilization so engineers can redistribute the traffic load and balance network traffic properly.
- SIP trunk statistics: Provides insight into the utilization of upstream SIP voice trunks toward the telco.
- Top N temperature per data room: Provides insight into data rack and data room temperature to facilitate decision-making on the redistribution of temperature/power load.
- Number of users connected per Wi-Fi access point: Helps redistribute the load across the available APs.
- Latency Report: Measures the time it takes for data to travel from one point to another in the network, which is crucial for assessing network performance and user experience.
- Packet Loss Report: Tracks the percentage of packets that are lost during transmission, which can indicate network issues such as congestion or faulty hardware.
- Error Rate Report: Monitors the rate of errors occurring in the network, such as CRC errors or collisions, which can help in diagnosing and resolving network problems.
- Traffic Analysis Report: Breaks down network traffic by type, source, and destination, offering insights into how network resources are being used.
- Top Talkers Report: Identifies the devices or users generating the most traffic, which can help in managing bandwidth and prioritizing critical applications.
- Application Performance Report: Assesses the performance of specific applications running on the network, helping to ensure they meet performance expectations.
- Security Incident Report: Logs and analyzes security events, such as unauthorized access attempts or malware detections, to help maintain network security.
- Capacity Planning Report: Provides insights into current resource usage and forecasts future needs, aiding in network planning and upgrades.
The list above is not exhaustive but a mere sampling, as many other reports are in common use.
Continuous improvement of processes and procedures
Performance management directly impacts how engineers care for their network. As network managers gain insights into network health, they can easily pinpoint drawbacks in existing processes and procedures related to network management. This provides an impulse to continuously revise these processes and introduce necessary steps to further improve them, avoiding future management pitfalls.
Continuous reporting of all performance data and process analytics is crucial for providing a solid foundation to execute any viable improvements in these processes and procedures.
The Connection Between Network Performance Management, Network Monitoring and Fault Management
Network performance management is a key component of network monitoring, alongside fault management. Often, engineers focus solely on fault management, which may suffice for some systems where the network is well-designed and well-dimensioned for a stable traffic load. However, network monitoring without performance management is a significantly constrained approach. Detecting pending performance degradation and reacting accordingly is crucial for ensuring the network will provide services as expected in the long term.
Let’s look at a small example illustrating how performance and fault management are both necessary to keep the network operating smoothly. Imagine a large corporation with a 1 Gb/s internet uplink shared among all its employees. If employees suddenly start using a new cloud application that consumes a lot of bandwidth, the uplink becomes overutilized, and they start experiencing issues with the new cloud service and general internet access.
Theoretically, there is no fault in the network since all components are functioning correctly. However, a properly configured fault management system will detect a severe packet drop rate on the internet uplink and trigger an alarm indicating a fault. This is where performance management kicks in. Performance management continually monitors the internet uplink utilization, allowing engineers to see that the total speed of the link is approaching its capacity, thereby understanding the cause of the problem.
Furthermore, a well-configured performance management system will have thresholds set at, say, 70% uplink utilization generating a warning alarm, 80% a major alarm, and 90% a critical alarm. This alerts network engineers to the new traffic situation well before customers start experiencing problems, giving them time to act, for example by redistributing the traffic load across other uplinks.
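The tiered thresholds from this example map naturally to a small severity table. This is a minimal sketch; the names and tier values mirror the example above and would be configurable in a real system:

```python
# Severity tiers as in the uplink example: checked highest-first so the
# most severe matching tier wins.
THRESHOLDS = [(90, "critical"), (80, "major"), (70, "warning")]

def uplink_severity(utilization_pct):
    """Map uplink utilization (%) to an alarm severity, or None if normal."""
    for limit, severity in THRESHOLDS:
        if utilization_pct >= limit:
            return severity
    return None

uplink_severity(65)   # None: no alarm
uplink_severity(84)   # "major": engineers are warned before users notice
```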
This clearly explains the relationship between fault and performance management. One means little without the other, and network monitoring and management are only complete with both components deployed.
Network Performance Management (NPM) Best Practices
The best way to manage your network's performance may depend on the architecture, type, and use of your network. However, some common recommended practices should be in place in any situation:
- Real-time Consolidated Monitoring: The best way to ensure proper performance management is to monitor (collect data) from all parts and systems of the network in real time. This includes network devices themselves, network elements managed by Element Management Systems (EMSs), data provided by active and passive probes in the network, power, HVAC, environmental sensors, etc. Combine all available options for collecting performance data (any protocol, any device).
- Define Relevant Performance Metrics and KPIs: You cannot collect every performance metric from the network, as doing so would demand enormous storage space and computing power to process all the metrics and KPIs. Limit the metrics to what is relevant for your specific situation.
- Baselining: Define the standard behavior for relevant metrics and KPIs – the process of baselining. This can be done by defining technical limits to certain parameters of network elements or using machine learning to establish baseline behavior that may be considered “normal” behavior.
- Set Thresholds and Alerts: Based on baseline behavior, configure fixed or dynamic thresholds that the performance management system will use to trigger alarms when a metric or KPI crosses upper or lower threshold values. Good practice is to define warning, minor, major, and critical threshold values so one can get progressive alerts that an outage may happen.
- Enrich Performance Data: Utilize network inventory and other technical and administrative data to provide context to performance data. This makes problem localization and isolation from performance data much easier and faster.
- Establish Performance Reports: Schedule regular performance reports such as top N utilized interfaces, top N device memory utilization, etc. This will help you detect critical portions of the network and take action to prevent any future faults.
- Automate Remediation Actions: Performance management should be able to initiate remediation actions automatically when well-known degradations are detected, shortening reaction time and reducing manual effort.
- Employ Capacity Planning Practices: Performance management is all about taking care of the network's future health. The essential process is capacity planning based on analysis of historical performance data by utilizing trend analysis, forecasting, and other methodologies.
- Integrate Network Performance Monitoring with Application Performance Monitoring: Combining the two provides clear correlations between application performance deterioration and network performance degradation, allowing you to pinpoint the root cause.
- Implement and Constantly Update Processes and Procedures: Knowing what actions to take when network performance is jeopardized is the best way to manage degradation quickly and efficiently.
By implementing these and other practices specific to your situation, you can ensure that your network operates optimally and is prepared for future demands.
Common Challenges in Network Performance Management
The best practices described in the previous section are in place to solve many of the challenges that exist when performance management is being introduced. However, some challenges still remain, and they are related to various factors that may have little or no commonality across different network monitoring deployments. The following are a few common issues one faces when introducing performance management.
One of the most frustrating elements is establishing the connectivity between the performance management system and the network elements that must be monitored. This is always an issue due to the unfortunate combination of the need to establish performance management quickly and the internal IT and security procedures that must be followed and executed by network engineers.
The scalability of the performance management system, or in other words, the ability to collect and process the enormous amount of data being collected, is yet another challenge. This relates to the technology implemented in your performance management systems. Today’s systems utilize multiple time-series processing techniques to handle performance data regardless of the size of the network.
Integrations between the monitoring system and the various systems that must be monitored are another technological challenge that must be resolved. For instance, many systems use proprietary protocols, which require significant integration expertise to fetch performance data. There are still legacy standard protocols, like CORBA, which may represent challenges when integration is necessary. Very often, one must combine multiple network monitoring protocols to obtain performance data from a single system. Unfortunately, there is no common solution to this problem, and it all depends on the ingenuity of integration engineers.
Consolidated monitoring is always a challenge as one can never be sure if all network elements are being monitored. To address this, one must combine automatic network discovery solutions with the inventory from Element Management Systems. By analyzing the discovered inventory data, one can detect potential parts that are missing for any reason (e.g., uncovered or inaccessible network segments). Only when the network inventory is complete can one be sure there are no network performance visibility gaps and that everything is under control.
Future Trends in Network Performance Management
One can hardly call any trend in network performance management a future one because all new concepts become a reality very fast. Therefore, many of the trends below are a reality today – it is only a question of their expansion in the industry:
- Performance-driven fixed network and mobile RAN automation: Performance data is used to automate remediation and other operational activities through well-defined processes. These processes are triggered and steered using a combination of fault, performance, inventory data, and external data.
- Use of AI/ML: Machine learning algorithms are a key advantage of today’s modern operations (AI Ops). Baselining, forecasting, capacity planning, and other processes become much easier and more precise thanks to the utilization of new AI technologies, including generative AI, which shows promise in capacity planning and network optimization.
- Umbrella management: End-to-end visibility is becoming a de facto standard requirement for all telecom and enterprise IT organizations. Thus, data consolidation, unified management, and an umbrella 360-view become the focus of tomorrow’s network monitoring and management.
- Integration with security systems: Network monitoring, management, and security management are becoming more integrated. Boosting security with network health data is an essential ingredient of tomorrow's management functions.
- Adopting new technologies: Performance management always tends to adopt new network technologies. Among others, there is a strong trend of managing the performance of today’s leading-edge technologies such as 5G, IoT, edge computing, and others. All these technologies have specific performance metrics and must be combined to provide an overview of the health of the entire system.
- Migration to cloud: There is a constant trend towards migrating the performance management function to the cloud. The engineering community still has many arguments against this direction due to issues related to connectivity breakdowns and other effects that may render NPM dysfunctional. However, the cloud is just other people's hardware and software. If actual connectivity redundancy, delay, jitter, and other elements can be provided at the same level of reliability as on-premises installations, then the cloud option is definitely viable.
- Integration with public clouds: Monitoring network performance in hybrid and dynamic environments is crucial for many organizations. These environments often span across public and private clouds, as well as private data centers, creating a complex web of interconnected elements. The dynamic nature of these environments, where instances of monitored elements can be dynamically provisioned and deprovisioned based on utilization, adds another layer of complexity. For instance, integrating with tools like Amazon CloudWatch and Azure Monitor is essential. Amazon CloudWatch allows for real-time monitoring of AWS resources and applications, providing metrics and logs that help in identifying performance bottlenecks and ensuring the health of the infrastructure. Similarly, Azure Monitor offers comprehensive monitoring for Azure services, enabling proactive management of applications and infrastructure. These integrations facilitate a unified view of the entire environment, making it easier to manage and optimize performance across different platforms.
UMBOSS & Performance Management
UMBOSS is an umbrella network management product that includes its own performance management system as one of its key elements. The UMBOSS Performance Management module collects and stores performance data and calculates KPIs from all underlying network elements, element management systems (EMSs), and other network management systems (NMSs). This makes UMBOSS a performance management consolidation platform. Additionally, UMBOSS enriches all alarms and performance data with non-technical data, providing a 360-degree view of network health. It is the perfect solution for implementing an umbrella approach in fault and performance management.
Have any questions? Want to learn more? Get in touch and let us know how we can help. Send us a message or book a demo today.