Data Center Monitoring: What You Need To Know

UMBOSS Team

Apr. 23, 2024
5 min. read

In our previous blog on Data Center Infrastructure Management (DCIM) we touched upon DCIM as a subset of Data Center Management activities and processes as well as it being a term for software used to support them. In this blog we discuss a specific activity that also supports Data Center Management – Data Center Monitoring.

This important activity, as well as the data center monitoring software and hardware that enables it, is key to gaining insight into the health and status of a data center as it continually tracks specific metrics. A data center monitoring system alerts engineers to any existing or pending faults that may influence data center operations.

What is Data Center Monitoring?

Data center monitoring systems follow the same principles as the monitoring of any other technical system. They involve the continuous and systematic collection of events and performance data from technical systems, their processing, analysis, and presentation to engineers. Data center engineers must consistently observe system health and promptly react to any existing or impending degradations or faults.

Data center monitoring is distinctive in that it covers both ICT systems and the data center's supporting systems, as well as the correlations between them. As mentioned in our blog on DCIM, a data center comprises many systems, and the following are specific to data centers:

  • Power sources: UPS (batteries, fuses/breakers, transfer switches), power generators (voltage, fuel level, temperature, oil level, etc.), utility grid
  • Power distribution: rack and floor PDUs, remote power panels (RPPs), rack automatic transfer switches (ATSs), busways
  • HVAC (heating, ventilation, and air conditioning, CRAC units, etc.)
  • Environmental sensors (temperature, humidity, airflow)
  • Physical security (smoke/fire detectors, doors, door locks, access control units, cameras, etc.)

Of course, ICT (communication and computing) systems are also a focus of data center monitoring: networks, servers, storage systems, storage area networks, tape libraries, virtualization platforms, and container platforms, as well as application-level monitoring. A proper data center monitoring solution provides an all-encompassing view of all the systems mentioned above.

Key Aspects of Data Center Monitoring

Power grid and transformers

The main source of power in a data center is the utility grid, which is usually high voltage (between 10 kV and 35 kV, depending on the utility operator). The status of the utility grid is usually reported by specialized sensors that provide real-time data such as voltage, per-phase power, and equipment temperature. This data is used to detect grid outages, voltage fluctuations, and phase power imbalance, and to derive other information important for proper power monitoring.

Next, transformers step the high-voltage power down to the low-voltage (0.4 kV) power used by data center equipment. Many large data center facilities monitor transformer-specific structural parameters such as temperature, oil level and quality, humidity, vibration, dissolved gases (DGA, dissolved gas analysis), winding temperature, and tap changer status, as well as service parameters such as load, utilization, and power factor. All these parameters are crucial for preventing a major transformer malfunction, as replacing these units takes many months and generates huge costs. The data is usually provided by specialized systems built into the transformer or by sensors added on top of standard transformer equipment.

Backup power generators, UPSs, batteries, transfer switches

Backup power generators are an essential component of every data center when utility grid power is down. They must provide enough power to run the entire data center until the utility grid is operational again, and ensuring they start when needed is crucial for data center uptime. Therefore, the monitoring of power generators is also essential. The parameters that must be constantly monitored include fuel level and quality, oil level and quality, engine temperature, coolant level and its temperature (both in stand-by and active states) and start battery voltage and capacity.

A good practice is to conduct test runs of the generator, usually weekly or monthly, to check that all systems work properly and that the engine starts reliably. During these test runs, as well as during actual operation, one must additionally monitor engine load and utilization, output voltage and current, engine speed, frequency of the generated voltage/current, vibration level, exhaust gas temperature, and gas emissions for environmental compliance checking.

In some data centers, the monitoring system is employed to detect instances where the primary backup power generator fails to start on time. Engineers are promptly alerted about the situation and can initiate the start-up of secondary backup power generators either remotely or manually.

An uninterruptible power supply (UPS) plays two important roles: it bridges the gap between a utility power outage and the start-up of the backup power generator (up to several minutes), and it stabilizes and conditions the power output, eliminating surges, spikes, and other anomalies. Its main component is a set of batteries that store electrical energy during normal operation and provide power during an outage.

Monitoring the UPS is therefore crucial to prevent power disruptions. The main parameters monitored include battery status (normal, low, depleted), remaining battery runtime, time on battery, battery charge, humidity, temperature, battery current, total UPS load, input line voltage, and output voltage. Monitoring ensures that engineers understand which batteries are to be replaced and when, and what upgrades are required to ensure normal data center operations.

Transfer switches are devices that shift the electric power load from one power source to another without interruption. An automatic transfer switch (ATS) senses a failure of the primary source and switches to the secondary source automatically. Transfer switches are part of UPSs but can also be used within data center racks to provide a redundant supply for devices that have a single power supply unit. Such switches can be effectively monitored and used to provide engineers with the necessary data during any power interruption in a data center.

Power distribution

A power distribution unit (PDU) is an essential component of a data center’s power distribution system. It is generally a device with multiple outlets into which end devices are plugged. PDUs come in different shapes and forms. In terms of placement, these can be rack PDUs or floor PDUs used in smaller deployments. Large data centers deploy cabinet PDUs that are placed in separate cabinets, close to the multiple racks they deliver power to. Another component used in large data center facilities is the remote power panel (RPP), which distributes power to a group of racks (e.g., the whole data room); UPSs are typically connected to RPPs. Some data centers employ overhead busway technology as one of the power distribution elements.

For the optimum operation and management of the data center, all UPSs, RPPs, and busways should be remotely manageable. This means that all the components can send statuses and performance data to the central monitoring/management system. One can monitor phase load, breaker/fuse status, and current load, and receive efficiency notifications, among other things. Measurement data is also important for maintaining proper phase load balance.
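As a rough illustration of how phase load balance might be tracked from per-phase current readings, here is a minimal Python sketch. It uses one common convention (the largest deviation from the mean phase current, expressed as a percentage of the mean). The function name and the sample readings are hypothetical and do not come from any particular PDU or monitoring product.

```python
def phase_imbalance_percent(l1_amps, l2_amps, l3_amps):
    """Largest deviation from the mean phase current, as a percentage of the mean."""
    loads = (l1_amps, l2_amps, l3_amps)
    mean = sum(loads) / len(loads)
    if mean == 0:
        return 0.0  # no load on any phase, so there is nothing to balance
    return max(abs(load - mean) for load in loads) / mean * 100.0


# Hypothetical readings from a three-phase feed, in amperes.
imbalance = phase_imbalance_percent(12.0, 10.0, 8.0)
print(f"Phase imbalance: {imbalance:.1f}%")
```

A monitoring system could evaluate a figure like this on every polling cycle and raise a warning once it crosses a limit agreed with the facility's electrical engineers.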

In colocation data centers, it is essential for PDUs to provide power usage per outlet, as this is the foundation of power consumption charging to end customers. It is also used to calculate the availability of power to end customers and check against agreed SLA parameters.
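To make the link between per-outlet measurements and charging more concrete, the sketch below turns periodic power samples into energy and then into a charge. It assumes each sample (in watts) holds for the whole sampling interval. The outlet names, the sampling interval, and the price are made-up values for illustration only.

```python
def energy_kwh(power_samples_w, interval_s):
    """Approximate energy by treating each power sample as constant over its interval."""
    return sum(power_samples_w) * interval_s / 3_600_000  # watt-seconds to kWh


def outlet_charges(samples_by_outlet, interval_s, price_per_kwh):
    """Return the charge per outlet, rounded to two decimals."""
    return {
        outlet: round(energy_kwh(samples, interval_s) * price_per_kwh, 2)
        for outlet, samples in samples_by_outlet.items()
    }


# Hypothetical data: four samples per outlet, taken every 15 minutes (900 s).
samples = {"rack12-outlet1": [500, 500, 500, 500], "rack12-outlet2": [0, 0, 0, 0]}
charges = outlet_charges(samples, interval_s=900, price_per_kwh=0.20)
```

In practice a PDU may expose a cumulative energy counter per outlet, in which case billing could use counter differences instead of integrating power samples.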

The overall power distribution measurements are continually monitored to calculate overall power consumption and power distribution efficiency, and they serve as input for calculating one of the main KPIs of the data center: PUE. PUE stands for power usage effectiveness and is defined as the ratio of the total power of the facility to the total power of the IT equipment.
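The PUE ratio can be sketched in a few lines of Python. This example takes "equipment power" to mean IT equipment power, which is the usual reading of the definition. The function name and the sample figures are illustrative assumptions.

```python
def power_usage_effectiveness(total_facility_kw, it_equipment_kw):
    """PUE = total facility power / IT equipment power."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    return total_facility_kw / it_equipment_kw


# Hypothetical snapshot: 1,500 kW drawn by the facility, 1,000 kW of it by IT equipment.
pue = power_usage_effectiveness(1500.0, 1000.0)
print(f"PUE: {pue:.2f}")
```

A value of 1.0 would mean that every watt entering the facility reaches IT equipment, so anything above 1.0 reflects overhead such as cooling and distribution losses.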

HVAC/CRAC and environmental sensors

Air conditioning (HVAC) is responsible for maintaining the temperature, humidity, and air quality at the level required by IT equipment. It is a complex system that must be adapted to heat dissipation, density, and the thermal room design. This is achieved by properly deploying CRAC (Computer Room Air Conditioning) units. Regardless of the inherent redundancy of the system, constant monitoring of many parameters is necessary to guarantee its proper operation: the status of cooling, dehumidifier, heating, compressor power, pressure, and temperature in different sections of the systems, etc. Setting high/low values for pressure, temperature, fluid flow, and other parameters is crucial to detect any impending threats to proper conditioning in the data center.
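The idea of high/low limits can be sketched as a simple range check. The parameter names and limit values below are placeholders chosen for illustration, not recommended set points. Real limits depend on the equipment and the room design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    low: float
    high: float


# Hypothetical limits, for illustration only.
THRESHOLDS = {
    "supply_air_temp_c": Threshold(low=18.0, high=27.0),
    "relative_humidity_pct": Threshold(low=20.0, high=80.0),
}


def check_reading(name, value, threshold):
    """Return an alarm message if the value is out of range, otherwise None."""
    if value < threshold.low:
        return f"{name} LOW: {value} < {threshold.low}"
    if value > threshold.high:
        return f"{name} HIGH: {value} > {threshold.high}"
    return None


alarm = check_reading("supply_air_temp_c", 29.5, THRESHOLDS["supply_air_temp_c"])
```

A production system would typically add warning and critical levels and some hysteresis, so that a value hovering around a limit does not generate a stream of alarms.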

However, the performance of HVAC/CRAC is also obtained through indirect measurements of environmental parameters within the data center.

One important aspect of air conditioning is ensuring that CRAC cold air reaches all sections of the data center as planned. For that purpose, the data center is equipped with sensors that measure airflow, temperature, and humidity. The sensors are (or should be) placed above and below the floor, and possibly in computer racks. These parameters and sensor availability must be continually monitored to ensure that cooling and humidity control work as planned and that the ability to control the parameters is present.

Safety and Physical Security

Safety in the data center is managed by the use of measurements from sensors related to smoke/fire detection and flooding. The monitoring can easily detect smoke and correlate it with the status of the fire suppression system. If the fire suppression remains inactive in the case of fire, engineers can take actions to manually trigger extinguishers or take other necessary measures.

A flooding sensor is in place to detect water in case of catastrophic events but more frequently to detect water due to condensation or coolant outflow from HVAC. In such cases, data center monitoring can help to take proper actions.

Physical security of the data center is implemented through the use of camera surveillance inside the facility and on the perimeter of the data center. Therefore, monitoring the availability and proper operation of all cameras is important. Physical access control is composed of a number of elements that must be monitored for proper operation as well as their actual status. This includes access card readers, biometric readers, keypads, pin pads, door controllers, door locks and electric strikes, turnstiles, motion detectors, and other elements.

The video control system is combined with access control to provide full control over access to data center facilities. The data center monitoring system can easily correlate the statuses of the elements to provide comprehensive information about potential threats. For instance, an unauthorized rack door unlocked alarm can be enriched with the link to a video feed that is focused on the rack to check the status.

Read more about data center environmental monitoring and how you can ensure that problems are avoided.

ICT systems monitoring

Now, besides the data center-specific monitoring aspects, the full potential of data center monitoring is achieved only by combining it with "classic" ICT system monitoring. This includes monitoring communication networks, servers (physical and virtual), storage area networks and storage systems, backup and archiving systems, operating systems, databases, and applications.

Yet another aspect that contributes to total security is IT security, usually implemented through advanced threat monitoring and detection systems, commonly used in Security Operation Centers (SOC).

How does Data Center Monitoring Work?

Data center monitoring follows the patterns of classic umbrella monitoring that must consolidate data from its diverse systems. Now, there is no out-of-the-box data center monitoring solution to collect all data, but there are typical data center monitoring systems and data exchange protocols that are used for data acquisition.

SCADA / Modbus

One of the traditional (legacy) systems used to remotely monitor and control industrial processes in general, and also used in data center facilities, is SCADA (Supervisory Control and Data Acquisition). SCADA utilizes a number of communication protocols for control and data acquisition in industrial settings, including Modbus (TCP/RTU), DNP3, IEC, and EtherNet/IP.

Therefore, to acquire data from systems managed by SCADA, one must connect using IP-based protocols. This is usually the TCP/IP version of Modbus, or even SNMP when it is available in the specific SCADA implementation.
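To give a feel for what Modbus over TCP/IP looks like on the wire, here is a minimal sketch that builds a "Read Holding Registers" request (function code 0x03) and parses the matching response using only Python's standard `struct` module. The register address, unit ID, and sample response bytes are hypothetical, and a real deployment would more likely rely on an established Modbus library than on hand-built frames.

```python
import struct

READ_HOLDING_REGISTERS = 0x03


def build_read_request(transaction_id, unit_id, start_address, quantity):
    """Build a Modbus TCP frame: 7-byte MBAP header followed by the request PDU."""
    # PDU: function code, starting register address, number of registers.
    pdu = struct.pack(">BHH", READ_HOLDING_REGISTERS, start_address, quantity)
    # MBAP: transaction id, protocol id (0 = Modbus), length (unit id + PDU), unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu


def parse_read_response(frame):
    """Return the list of 16-bit register values carried in a response frame."""
    if len(frame) < 9:
        raise ValueError("frame too short")
    _tid, _proto, _length, _unit, function, byte_count = struct.unpack(">HHHBBB", frame[:9])
    if function & 0x80:
        # In an exception response, the byte after the function code is the exception code.
        raise ValueError(f"Modbus exception code {byte_count}")
    return list(struct.unpack(f">{byte_count // 2}H", frame[9:9 + byte_count]))


request = build_read_request(transaction_id=1, unit_id=1, start_address=0, quantity=2)
# Hypothetical response carrying two registers: 230 and 400.
registers = parse_read_response(bytes.fromhex("000100000007010304" "00e6" "0190"))
```

The request bytes would be sent over a TCP connection to the device (Modbus TCP conventionally uses port 502). What each register means, and how it is scaled, has to come from the vendor's register map.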

Building Management System (BMS)

A BMS (Building Management System) is a control system used to monitor and manage various electrical and mechanical systems within a building or facility. Many data centers implement BMS to ensure enhanced operation, safety, and management of critical data center infrastructure systems. BMS is a system composed of a network of sensors and actuators. It manages HVAC, lighting, access control, energy, fire control, alarm systems, water management, and other systems. Therefore, collecting data from the BMS and correlating it with data available from ICT systems and other systems not managed by BMS is crucial to ensure proper data center management.

There are multiple ways in which monitoring data can be acquired. These include communication protocols like BACnet (Building Automation and Control Networks), Modbus, and OPC, but also API integration, legacy Web Services, SNMP, or even direct access to databases or logs. It all depends on the manufacturer and the specific implementation of the BMS.

Critical Power Management System (CPMS)

Critical Power Management System (CPMS) is a modern system designed to monitor, control, and optimize electrical power infrastructure in critical facilities like hospitals, financial institutions and data centers. It is essentially based on real-time monitoring of power grid, transformers, UPSs, backup power generators, batteries, power distribution, integration with BMS and use of the data to detect faults, analyze the situation and automate responses to critical events.

Once implemented, CPMS becomes a valuable tool for data acquisition of power data as part of the overall data center monitoring.

DCIM overhauling

As mentioned in our blog post on DCIM, DCIM includes the precise documentation and monitoring of all available data from the data center. Monitored data is correlated with inventory data, which allows for advanced operational and capacity management of data centers. DCIM encompasses data acquisition by any method mentioned before and using any protocol. It represents an umbrella approach that combines infrastructural and ICT monitoring, and it is the most complete approach to data center monitoring and management.

Benefits of Data Center Monitoring

  • Data center monitoring is crucial for various aspects of its operation and multiple roles involved in management and operations.
  • All-encompassing monitoring enables the Network Operations Center (NOC) to execute preemptive corrective actions and promptly respond to any outage or safety/security threats.
  • Modern monitoring empowers engineers to react to any situation on-site or remotely.
  • Continuous performance monitoring of power, HVAC, and ICT provides the means for operational optimization and capacity planning, minimizing potential future degradations through preventive maintenance.
  • The combination of reactive and proactive activities aims to reduce downtime and minimize the chance of any service level objective (SLO) being violated.
  • Continuous calculations of key parameters, such as power usage effectiveness (PUE), allow management to observe how optimization efforts and investments lower its value as close to 1.0 as possible.
  • When advanced monitoring is employed, such as DCIM-based monitoring with enrichment, one can boost management efforts to further reduce costs, enhance security, reduce issue resolution time, and, in general, enable informed decision-making.
  • Furthermore, through environmental data monitoring and power and cooling optimization, one can significantly reduce greenhouse gas emissions and the data center's CO2 footprint, thus minimizing its environmental impact.

Challenges in Data Center Monitoring

In an ideal data center environment, everything can be monitored, and management is optimal. However, the real world is far from ideal, and here are some of the challenges our team encounters when implementing data center monitoring and data center monitoring tools.

Existing DCIM solution is closed to integration

There are many vendors of DCIM solutions, some of which are for documentation purposes only, while others also monitor portions of the data center. When monitoring an entire data center, data must be collected from these systems and correlated against other data. However, some systems, often from well-known vendors, are closed to integration, which makes implementation challenging. The options are to replace the existing DCIM, to have the customer purchase integration licenses, or to implement parallel monitoring of the equipment already managed by the closed DCIM solution.

There is no DCIM software present at all

Non-existent DCIM software is a slightly better situation than having a closed one in place. It allows for direct component discovery and monitoring, opening up the opportunity for the customer to acquire a proper DCIM with a fully open integration API.

Legacy data center monitoring systems do not provide measurement data

In data center facilities built in the 80s or earlier, legacy systems (e.g., backup power generators, transfer switches, etc.) often lack the means to transmit data. Some old technologies still report via SMS messages, which is at least more manageable than purely analog systems. In this scenario, if proper data center management is required, data center management must purchase and install additional equipment, such as sensors, to monitor critical parameters.

Not all data center monitoring systems provide data in real-time

On many occasions, one can encounter systems that provide data in CSV files and other formats, but the delay in generating these files is, for example, 15 minutes or more. If an alarm must be generated based on the data, then the alarm will be significantly delayed.

One way to tackle this problem is to attempt to access the system's database and perform some form of reverse engineering to tap into the data in a near-real-time fashion. However, this is only a makeshift data center monitoring solution, and the best approach is to simply replace such a system.

IT Security does not allow access to certain portions of a data center

One common issue that arises during the implementation of data center monitoring is security constraints that restrict access to protected resources. Given that security measures are often stringent, it may take an extended period of persuasion and negotiation to find an alternative way to obtain the data. The usual consequence is a delay in implementation.

Distinguishing between relevant and irrelevant data

Encountering many vendors with variations in the way data is presented often poses a challenge in distinguishing which parameters being acquired are important and which are not. This challenge is typically resolved by communicating with domain experts in data center organizations or vendors themselves and using some common sense.

Best Practices for Data Center Monitoring

Umbrella monitoring of the whole data center

Data center monitoring services entail the monitoring of everything. Therefore, data center monitoring solutions must become an all-encompassing umbrella solution that consolidates all available data and provides alarming, performance management, capacity management, cross-domain correlation, service assurance, and other functionalities in a unified way.

Enrich acquired data with DCIM inventory data

Technical data acquired from systems lack context, and the primary task of the monitoring system should be to add context. Context must be provided by DCIM documentation systems that must be accurate and up to date. For that purpose, one must establish regular discovery processes that will detect any undocumented elements and reconcile them with DCIM.
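The enrichment step can be pictured as a lookup that merges a raw alarm with the matching inventory record. The sketch below uses a plain dictionary as a stand-in for a DCIM inventory. The field names, device IDs, and customer name are invented for illustration and do not reflect any specific DCIM product's data model.

```python
# Hypothetical inventory extract, keyed by device ID.
INVENTORY = {
    "pdu-r12-a": {"room": "Hall 1", "rack": "R12", "customer": "ExampleCo"},
}


def enrich_alarm(alarm, inventory):
    """Attach location and ownership context to a raw alarm."""
    record = inventory.get(alarm["device_id"])
    if record is None:
        # Unknown device: flag it so a discovery/reconciliation process can pick it up.
        return {**alarm, "documented": False}
    return {**alarm, **record, "documented": True}


raw = {"device_id": "pdu-r12-a", "type": "OUTLET_OVERLOAD"}
enriched = enrich_alarm(raw, INVENTORY)
```

Flagging alarms from undocumented devices, rather than dropping them, gives the reconciliation process a concrete list of elements to add to the inventory.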

Define main KPIs on the dashboard

One should define the main KPIs and lower and upper threshold values that should be constantly monitored. These values represent the common goal of the entire staff. One such parameter is PUE, and its constant lowering can be a shared goal for everyone involved in data center management.

Integration with ITSM and other external systems

Having a monitoring system in place without integrating it with the outside world wastes much of its value. The monitoring system must therefore be integrated with essential systems like ITSM (ticketing), but also with other non-technical systems such as CRM, billing, and others, if the full potential of monitoring is to be realized.

Implement automations

Monitoring is great, but one thing engineers need is for the monitoring to be upgraded to autonomous assurance. They want it to remediate common problems that require well-described actions for well-known issues. In our experience, addressing 20% of typical use cases saves 80% of engineers' time. Therefore, automating remediation activities for well-known issues that can be detected by, say, alarm correlations or root-cause analysis is the way to go.
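One simple way to picture this kind of automation is a playbook that maps well-known alarm types to predefined actions and escalates everything else to an engineer. The alarm types and the action below are hypothetical placeholders. A real action would call the relevant device or orchestration API rather than return a string.

```python
def restart_crac_fan(alarm):
    """Placeholder remediation: a real version would call the device's management API."""
    return f"restart fan on {alarm['device_id']}"


# Hypothetical playbook: only well-understood issues get an automated action.
PLAYBOOK = {
    "CRAC_FAN_STALLED": restart_crac_fan,
}


def handle_alarm(alarm, playbook=PLAYBOOK):
    """Run the automated action if one is defined, otherwise escalate to an engineer."""
    action = playbook.get(alarm["type"])
    if action is None:
        return ("escalate", None)
    return ("automated", action(alarm))


result = handle_alarm({"type": "CRAC_FAN_STALLED", "device_id": "crac-03"})
```

Keeping the playbook explicit makes it easy to review which issues are remediated without human involvement and to restrict who may change that list.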

Integration with IT Security monitoring

Security incidents may be a valuable source of information in the NOC. The reverse is also true: faulty conditions in systems may indicate a security breach to the SOC. Therefore, integration with the SOC's systems is recommended.

Define proper access control to monitored data

Allowing all NOC staff access to all the data is not a best practice, especially if untrained staff is allowed access to automation functions. Therefore, a proper data center monitoring system must implement access control that will allow NOC managers to run operations properly, minimizing the impact of unauthorized data access and automation activations.

NOC staff training

Finally, if the monitoring system is to be used to its full potential, proper NOC staff training is necessary. The data center monitoring system should be comprehensive and allow access to all the necessary data, but it is the engineers in the NOC who should know how to interpret the data and act accordingly. This can only be achieved through constant training and experience exchange among people working in the data center's NOC.

Data Center Monitoring Tools: UMBOSS example

There are many data center monitoring tools used today, mostly specialized versions that are part of larger DCIM solutions. Unfortunately, many of these data center monitoring solutions focus solely on power, HVAC and physical security monitoring, as we have pointed out in our previous discussion, and this is only one portion of overall data center monitoring.

When it comes to examples of complete data center monitoring tools, we can mention UMBOSS combined with a DCIM product by FNT Software called FNT Command. In this example, UMBOSS acts as an umbrella monitoring system for all components of a data center, including power distribution, air conditioning, physical security, networking, computing, etc., while FNT Command acts as a data center infrastructure (resource) management tool that documents all the resources of a data center while providing planning, forecasting, capacity management and other features. The two tools combined provide an end-to-end software solution and act as a complete data center monitoring tool.

Need help with data center monitoring?

To conclude, data center monitoring is the backbone of efficient data center management and is critical to support business operations. If you’re having a hard time finding a solution for your needs or have any questions, we’re happy to help.

Ask us your question today or schedule a demo to see UMBOSS in action.

Interested in discovering more?

You can read all you want about UMBOSS, but the best way is to experience it through a demo.
