Data Center Infrastructure Management (DCIM) Explained

Beginner’s Corner

UMBOSS Team

Apr. 11, 2024
5 min. read

Data center infrastructure management (DCIM) matters more than ever because demand for digital services continues to grow. Sustaining that growth while providing excellent service to end customers is a hard task. DCIM helps data centers operate more efficiently, save costs, plan for the future, manage risks, comply with regulations, and contribute to sustainability goals.

What is data center infrastructure management (DCIM)?

Data Center Infrastructure Management (DCIM) is the subset of data center management activities and processes that focuses on monitoring and managing the data center’s infrastructure.

Data center infrastructure is extensive. It encompasses facilities comprising one or more data rooms, often with cages, physical access control systems, fire-protection systems, security measures, and electrical power supply infrastructure that includes generators, UPSs, batteries, PDUs, etc. Additionally, there is a heating, ventilation, and air conditioning (HVAC) system, along with many environmental control sensors. Furthermore, it includes networking infrastructure and the physical data center equipment itself, which comprises servers (rack-mountable and blade), storage area networks (storage, switches, tape libraries), and hosted network devices. On top of this, there is virtual computing and networking, as well as management software, which we will delve into later.

DCIM has evolved over time from manual operation in the old days to modern software-backed operations that optimize data center operations and output. As such, DCIM today is also often used to describe data center monitoring tools and software that support DCIM processes and activities.

What are the components of DCIM?

DCIM involves many groups of activities, usually associated with different elements of data center infrastructure, their relationships, and what they need in order to serve customers in various ways. Even though the design, construction, and installation of a data center generally fall under DCIM, they are typically regarded as a separate set of one-off activities. The activities and processes listed below form the basis for defining the requirements of DCIM software and data center monitoring tools, and we will delve more thoroughly into the requirements in later sections.

Facility Documentation

Having detailed documentation of your data center facilities is crucial for properly managing essential environmental elements (power, cooling, physical security), devising business continuity plans, planning for accommodating new equipment (customers), scaling up data center capacity, as well as for certifying and auditing against industry standards like ANSI/TIA-942.

Power and Cooling Management

Power and cooling traditionally represent key elements of a data center, and along with security, they are the major drivers behind the initial construction of a data center. When not managed properly, they can cause significant costs and have a substantial impact on the financial aspects of the data center.

The goal of power and cooling management is to achieve optimum energy efficiency while meeting the targeted ANSI/TIA-942 rating levels:

  • Tier I: Basic components with up to 28.8 hours of annual downtime (including planned maintenance and outages), non-redundant power and cooling.
  • Tier II: Redundant components with up to 22 hours of annual downtime.
  • Tier III: Concurrently maintainable design with up to 1.6 hours of annual downtime, implying multiple paths for power and cooling, allowing maintenance without operational disruptions.
  • Tier IV: Fault-tolerant design with up to 26 minutes of annual downtime, implying full redundancy of all components, multiple independent distribution paths for power and cooling, allowing multiple simultaneous maintenance activities and one fault anywhere in the system.
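
The downtime figures above follow from simple arithmetic on an availability percentage. The sketch below is a minimal illustration, assuming the commonly cited availability targets of 99.671%, 99.741%, 99.982%, and 99.995% for Tiers I to IV (these percentages are not stated in this article) and a non-leap year of 8,760 hours:

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years

# Assumed availability targets per tier (commonly cited, not from this article).
TIER_AVAILABILITY_PERCENT = {
    "Tier I": 99.671,
    "Tier II": 99.741,
    "Tier III": 99.982,
    "Tier IV": 99.995,
}


def annual_downtime_hours(availability_percent: float) -> float:
    """Return the maximum annual downtime implied by an availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    for tier, availability in TIER_AVAILABILITY_PERCENT.items():
        hours = annual_downtime_hours(availability)
        print(f"{tier}: {hours:.1f} hours ({hours * 60:.0f} minutes) per year")
```

Under these assumptions, Tier II works out to roughly 22.7 hours, which the list above states as 22 hours.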

As you can see, the targeted tier significantly impacts management activities, the cost of management and maintenance, as well as the lifecycle management of components, processes, and standard operating procedures.

The management itself primarily relies on continuous real-time monitoring of faults and of the performance of power and cooling systems, especially tracking parameters at the rack level and even at the device level. Automating the response to critical scenarios is mandatory to keep downtime caused by power and cooling outages within the annual targets.

Engineers constantly analyze long-run performance data to optimize power consumption (that is, to minimize Power Usage Effectiveness, or PUE, where lower values are better) using various energy-efficiency practices. The same applies to cooling systems management, which includes the analysis of long-run temperature and humidity data and the implementation of different cooling optimization techniques, such as airflow optimization, deploying cold air containment and ventilation solutions, implementing in-row cooling, etc. This requires engineers skilled in electrical, mechanical, and civil engineering to ensure smooth operations.
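
PUE is commonly defined as the total energy drawn by the facility divided by the energy consumed by the IT equipment alone, so the ideal value is 1.0 and lower is better. A minimal sketch of the calculation, using hypothetical meter readings chosen only for illustration:

```python
def pue(total_facility_kwh: float, it_equipment_kwh: float) -> float:
    """Power Usage Effectiveness: total facility energy / IT equipment energy."""
    if it_equipment_kwh <= 0:
        raise ValueError("IT equipment energy must be positive")
    return total_facility_kwh / it_equipment_kwh


# Hypothetical monthly readings in kWh (illustrative values only).
before = pue(total_facility_kwh=180_000, it_equipment_kwh=100_000)
after = pue(total_facility_kwh=150_000, it_equipment_kwh=100_000)

print(f"PUE before optimization: {before:.2f}")
print(f"PUE after optimization:  {after:.2f}")
```

In this made-up example the IT load is unchanged, so the lower PUE reflects less energy spent on cooling, power conversion, and other overhead.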

Physical Security Management

Physical security involves multiple systems: access control (biometric access, card readers, key fobs), fire detection and fire suppression systems, perimeter security elements (fences, walls, turnstiles, mantraps), surveillance (security cameras with motion detection), and intrusion detection (motion and vibration detection).

Managing all these important systems includes many activities and it all starts with regular execution of risk assessment that aims to identify potential security vulnerabilities and threats to the data center and its customers’ data. Based on the assessment, a DCIM team devises appropriate security policies and operating procedures. This includes access control procedures, security checkpoints and security incident response procedures.

Defining visitor procedures for colocation data centers is especially important, as the operator must retain a high level of security without overcomplicating life for its customers.

The execution of the processes falls on the shoulders of the security team, which includes security staff and security managers. The primary task of managers is to continuously train the staff to follow security protocols, keep security awareness at a high level, and run frequent security drills to verify that the protocols can be executed readily and effectively.

Managers are also responsible for assessment reporting, incident analysis and reporting, and for implementing and auditing compliance with standards such as ISO/IEC 27001 or PCI DSS.

IT Assets Inventory and Configuration Management

The physical facility elements described above all serve a single purpose – accommodating ICT equipment for internal and external customers. In addition to customer equipment, there is also ICT equipment used to provide the data center’s own services. This includes passive equipment: racks, cages, optical and copper cabling, and patch panels. The data center also has its own active equipment: networking equipment used to provide connectivity to customers and the entire network, and computing and storage systems with virtual and software platforms used to offer IaaS, PaaS, and SaaS services to data center customers.

All these physical and virtual systems, combined with the physical and virtual systems of customers, comprise the data center's IT assets. Managing these assets is the most intricate task of all, as the number of different systems to manage and integrate is enormous.

In addition to defining processes and standard operating procedures, one of the most fundamental aspects of IT asset management is precise, accurate, and up-to-date documentation. For this purpose, resource (asset) inventory tools like a CMDB (Configuration Management Database) or other specialized tools that support physical, logical, and virtual asset documentation are used. One important aspect of documentation is data quality. To maintain it, the team must devise and deploy inventory check techniques, including software discovery tools that can greatly assist the process.
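
An inventory check of this kind boils down to comparing what is documented with what a discovery tool actually finds. The sketch below is a simplified illustration, not the behavior of any particular product; the record layout (serial number mapped to firmware version) and all sample values are hypothetical:

```python
def reconcile(inventory: dict[str, str], discovered: dict[str, str]) -> dict[str, list[str]]:
    """Compare documented assets with discovered ones, keyed by serial number.

    Both arguments map serial number -> firmware version (a simplified record).
    """
    documented, seen = set(inventory), set(discovered)
    return {
        # Found on the network but absent from the inventory.
        "undocumented": sorted(seen - documented),
        # Present in the inventory but not found by discovery.
        "missing": sorted(documented - seen),
        # Present in both, but the recorded firmware differs.
        "mismatched": sorted(
            serial for serial in documented & seen
            if inventory[serial] != discovered[serial]
        ),
    }


# Hypothetical sample data.
inventory = {"SN001": "1.2.0", "SN002": "3.4.1", "SN003": "2.0.0"}
discovered = {"SN001": "1.2.0", "SN002": "3.5.0", "SN004": "1.0.0"}

report = reconcile(inventory, discovered)
print(report)
```

Each of the three result lists would typically feed a different follow-up: documenting a new device, investigating a missing one, or updating a stale record.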

Naturally, one important aspect of asset management is the configuration management of hardware (configuration settings of each physical and virtual device, their specifications, hardware and firmware versions, network configurations, etc.) and software (asset software configurations, operating systems configurations, application settings, etc.). This element is necessary for the proper maintenance of systems as well as for effectively managing backups and other elements of business continuity.

Change Management

In a dynamic IT environment, changes in infrastructure are frequent, especially when a data center is focused on external customers, such as providing colocation and hosting services. The process begins with requests for change (RFC) that are manually or automatically assessed and authorized. Whether the assessment can be automated largely depends on the nature of the change, and the change itself may imply numerous modifications in the data center, depending on its magnitude. In some cases, the request may necessitate upgrades in facility rooms, power, cooling, and even security. The implementation of the change must be orchestrated in a manner that involves coordination among multiple units of the DCIM team, including the purchasing unit.

The team designs the change management process in such a way that there is clear logging and tracking of the approval process and all changes to physical assets, configurations, designs, communications, etc. The changes are supported by orchestration and automation systems that significantly reduce the time required for change implementation.
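
The logging and tracking requirement can be pictured as a small state machine in which every transition is recorded. This is only a sketch under assumptions: the state names, the allowed transitions, and the `ChangeRequest` class are hypothetical and do not describe any specific change management tool:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical RFC lifecycle: each state lists the states it may move to.
ALLOWED_TRANSITIONS = {
    "requested": {"assessed", "rejected"},
    "assessed": {"authorized", "rejected"},
    "authorized": {"implemented"},
    "implemented": {"verified"},
    "rejected": set(),
    "verified": set(),
}


@dataclass
class ChangeRequest:
    summary: str
    state: str = "requested"
    log: list = field(default_factory=list)

    def transition(self, new_state: str, actor: str) -> None:
        """Move to new_state if allowed, recording who did it and when."""
        if new_state not in ALLOWED_TRANSITIONS[self.state]:
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.log.append((datetime.now(timezone.utc), actor, self.state, new_state))
        self.state = new_state


rfc = ChangeRequest("Add two racks to a customer cage")
rfc.transition("assessed", actor="change-manager")
rfc.transition("authorized", actor="change-manager")

for timestamp, actor, old, new in rfc.log:
    print(f"{timestamp.isoformat()} {actor}: {old} -> {new}")
```

Rejecting disallowed transitions, for example skipping straight from "authorized" to "verified", is what keeps the audit trail trustworthy.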

Preventive Maintenance and Lifecycle Management

Power, cooling, networking, and storage infrastructure all require regular preventive maintenance in accordance with the maintenance procedures and intervals prescribed by equipment vendors. Power system maintenance usually includes procedures such as regular battery tests, power outage tests, generator tests, as well as the replacement of outdated components. Cooling systems maintenance involves regular changes of cooling fluids, cleaning air filters, stress testing cooling systems, etc. Networking maintenance includes regular firmware updates and upgrades, configuration audits, and other practices.
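
Interval-based maintenance is straightforward to track once the last execution date and the prescribed interval are recorded for each task. A minimal sketch, in which the task names, dates, and intervals are hypothetical examples rather than vendor recommendations:

```python
from datetime import date, timedelta

# Hypothetical tasks: (name, date last performed, prescribed interval in days).
TASKS = [
    ("UPS battery test", date(2024, 1, 1), 90),
    ("Air filter cleaning", date(2024, 3, 1), 30),
    ("Generator test run", date(2024, 3, 20), 30),
]


def next_due(last_done: date, interval_days: int) -> date:
    """Date on which the task is due again."""
    return last_done + timedelta(days=interval_days)


def due_tasks(tasks, today: date) -> list[str]:
    """Names of tasks whose next due date is today or earlier."""
    return [
        name for name, last_done, interval in tasks
        if next_due(last_done, interval) <= today
    ]


print(due_tasks(TASKS, today=date(2024, 4, 5)))
```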

From a business continuity standpoint, one important aspect of maintenance is the regular checking of data backup and replication processes.

Thanks to the asset inventory, the team can easily implement processes and standards to procure, deploy, and, once at the end of its life, decommission equipment. Of course, based on performance tracking, fault frequency, and other elements, one can define lifecycle processes and durations that do not necessarily correspond to the specifications of vendors. For instance, the organization can repurpose older servers for use in other non-critical applications, thereby optimizing the overall cost of investment.

Fault and Performance Monitoring - Assurance

Even though preventive maintenance and lifecycle management greatly contribute to achieving the reliability and availability targets of data center infrastructure, the key component is still assurance: continuous monitoring, performance analysis, proactive measures to prevent incidents, and incident management itself.

The key factor in proper incident management is consolidated data center monitoring of all networking, computing, power, cooling, and security infrastructure of the data center.

When the collected events and alarms are enriched with asset and administrative data, the DCIM management team has an umbrella view of the situation and can trigger automatic and manual activities to rectify any fault that arises. Integration with a ticketing system is necessary to properly support the incident management process and combine it with monitoring data. Therefore, integration of all systems is key to effective incident management.
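
Enrichment in this sense means joining a raw alarm with what the inventory knows about its source. The sketch below illustrates the idea with plain dictionaries; the field names, device identifiers, and customer labels are all hypothetical:

```python
# Hypothetical asset inventory keyed by device identifier.
ASSETS = {
    "pdu-17": {"room": "DR-2", "rack": "R12", "customer": "Customer A"},
    "crac-03": {"room": "DR-1", "rack": None, "customer": None},
}


def enrich(alarm: dict, assets: dict) -> dict:
    """Return a copy of the alarm merged with inventory data for its source."""
    context = assets.get(alarm["source"], {})
    return {**alarm, **context, "known_asset": alarm["source"] in assets}


raw_alarm = {"source": "pdu-17", "severity": "major", "text": "Load above threshold"}
enriched = enrich(raw_alarm, ASSETS)
print(enriched)
```

With the room, rack, and customer attached, an operator or an automated rule can judge the impact of the alarm without looking the device up separately. An alarm from a source missing in the inventory is flagged rather than dropped, which also helps reveal documentation gaps.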

Performance monitoring is key to preventing incidents and tuning the overall infrastructure for optimum performance. Like fault monitoring, it must collect data from networking, computing, power, cooling, and security infrastructure. Teams must devise proper KPIs to be constantly monitored and used as an indication of sub-optimal system operation. One such example is Power Usage Effectiveness (PUE), which indicates how efficiently the power and cooling systems support the IT load.

Forecasting performance KPI data is crucial for predicting future degradations in operation or even faults, such as overheating in certain parts of the data center, battery capacity dropping below the allowed minimum, or a switch’s processor overloading. This allows engineers to take preventive action and remove the cause of a problem before a fault happens.
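
One of the simplest ways to forecast a KPI is to fit a straight line to recent samples and estimate when it will cross a threshold. The sketch below uses ordinary least squares on hypothetical daily temperature readings; real forecasting tools typically use more robust models, so treat this purely as an illustration of the idea:

```python
def linear_fit(xs, ys):
    """Ordinary least-squares line fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_threshold(days, values, threshold):
    """Days after the last sample until the rising trend reaches the threshold.

    Returns None if the trend is flat or falling, and 0.0 if the trend line
    is already at or above the threshold at the last sample.
    """
    slope, intercept = linear_fit(days, values)
    if slope <= 0:
        return None
    crossing_day = (threshold - intercept) / slope
    return max(0.0, crossing_day - days[-1])


# Hypothetical daily rack inlet temperatures in degrees Celsius.
days = [0, 1, 2, 3, 4]
temperatures = [24.0, 24.5, 25.0, 25.5, 26.0]

remaining = days_until_threshold(days, temperatures, threshold=30.0)
print(f"Estimated days until 30.0 C is reached: {remaining}")
```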

However, performance management and forecasting have another key role – determining the baseline for capacity planning.

Capacity Planning

Capacity planning is a complex process for determining the future capacity needs of a data center, and it includes many elements. A well-maintained and accurate asset inventory and documentation allow the identification of existing bottlenecks, short-term capacity shortage threats, as well as underutilized resources.

Historical data on capacity use is very important for foreseeing future requirements. When establishing current capacity and forecasting future needs, one must be careful not to rely only on boilerplate (nameplate) numbers. It is crucial to use actual metered historical performance data; otherwise, there is a risk of overestimating or underestimating existing and future capacity needs.
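
The gap between boilerplate and metered figures is easy to see with a small calculation. The numbers below are hypothetical: ten servers with a 750 W nameplate rating each, a metered peak of 310 W each, and a rack power budget of 5 kW:

```python
def utilization(load_kw: float, capacity_kw: float) -> float:
    """Fraction of the available capacity that the load consumes."""
    return load_kw / capacity_kw


# Hypothetical rack (illustrative values only).
SERVER_COUNT = 10
NAMEPLATE_KW = 0.75      # boilerplate rating per server
METERED_PEAK_KW = 0.31   # measured peak draw per server
RACK_CAPACITY_KW = 5.0   # power budget of the rack

nameplate_util = utilization(SERVER_COUNT * NAMEPLATE_KW, RACK_CAPACITY_KW)
metered_util = utilization(SERVER_COUNT * METERED_PEAK_KW, RACK_CAPACITY_KW)

print(f"Utilization by nameplate: {nameplate_util:.0%}")
print(f"Utilization by metering:  {metered_util:.0%}")
```

In this example the nameplate figures suggest the rack is overloaded, while the metered data shows spare headroom, which is exactly the kind of misjudgment that relying on boilerplate numbers alone can cause.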

Another important element to account for is the impact of future technology on infrastructure needs. It is a balancing game that involves figuring out how to reduce the physical footprint while also meeting larger capacity demands driven by new digital services.

The capacity plans must account for all network, computing, room size, and power and cooling requirements.

Business Continuity

One important activity of the DCIM team is the implementation and execution of business continuity activities. It all starts with risk assessment and analysis, determining the severity of each risk as well as its impact on the business. What follows is the business continuity plan, a composition of well-designed emergency response and crisis communication plans. These plans are translated into many business continuity procedures, which are step-by-step actions that must be executed in crisis situations. The procedures are executed by the business continuity team, which must be well-trained and supported by regular drills.

It is crucial to emphasize the importance of the IT disaster recovery plan, which includes designs, procedures, and automation for the recovery of IT services and data either in the same facility or in a remote facility. This plan must employ a well-designed IT redundancy design, as well as activities of regular data backup and restoration.

What are the benefits and challenges of DCIM?

The many components of DCIM imply that without well-structured processes and procedures, it would be nearly impossible to provide quality services to data center customers. Therefore, well-structured DCIM provides many benefits.

DCIM:

  • Provides ways to document all physical assets precisely and accurately.
  • Improves the efficiency and reliability of power and cooling in a data center, significantly reducing costs.
  • Establishes and manages physical security in the data center.
  • Provides clear operating procedures for all changes, reducing the probability of error and faults in the data center.
  • Provides real-time fault and performance data and the ability to execute automated assurance actions to achieve targeted service level objectives, maximizing infrastructure uptime.
  • Provides procedures to utilize historical measured data to extrapolate future needs, optimizing maintenance and investment costs.
  • Results in the consolidation and integration of management software and allows for procedure automation, thus improving the DCIM team’s productivity.

While the benefits are many, there are also numerous challenges to achieving proper DCIM and executing the processes and plans on a daily basis. The following challenges are ones we encountered when working with our customers, but there are many others:

  • Highly complex integration of all physical infrastructure elements together, as well as with management software.
  • Proper data center monitoring and determination of environmental parameters in all racks due to a limited number of sensors.
  • Enforcement of change management processes and related asset inventory.
  • Correct documentation of passive elements such as patch panels, cables, cable trays, and equipment positioning in racks.
  • Appropriate capacity planning in newly founded data centers due to a lack of historical data and unforeseen demands.
  • Unconsolidated DCIM management software providing a limited overview of the available assets and their performance.
  • Suboptimal cooling design due to limitations of facilities.
  • Limited recruiting ability for technicians and engineers due to limitations in the labor market.
  • Non-existent umbrella view over infrastructure, its performance, services, and customers.

What is data center management software?

We have discussed various DCIM processes and highlighted instances where robust data center management software support is essential for their effective execution. The need for strong data center management software support is so pronounced that the term DCIM is often used to describe software tools that facilitate DCIM processes. In the following table, we attempt to summarize all the essential features of DCIM software that genuinely supports DCIM processes.

| Functionality | Description |
| --- | --- |
| Facility, power and cooling inventory and documentation | Documentation of real estate, properties, contracts, users, graphical representation, and documentation of facilities with 3D floor plans, including racks, floor-standing devices, tiles, cable trays, areas, enclosures, power and cooling systems, and zones. Power plan and distributor configuration (power buses, rails, UPSs, fuses, etc.) and power signal tracing are also covered. This involves hierarchical structuring, the ability to reserve resources, and provide collision detection and prevention, among other functionalities. |
| Active and passive IT assets inventory and documentation | Complete inventory of all physical, logical, and virtual IT assets, including racks, backbone and rack cabling, patch panels, inter-site cabling, patch cabling, signal tracing chain between any pair of ports, patch protection, auto-routing function, etc. |
| Assets discovery and reconciliation | Automatic discovery of all active devices with algorithm configurations providing internal designation-specific detection and rack placing. Discovery of hardware and firmware versions, serial numbers, part numbers, modules, interfaces, management modules, operating systems, and active topologies, as well as reconciliation with facility and IT asset inventory, with the ability to define reconciliation policies. |
| Umbrella fault and performance monitoring and assurance automation | End-to-end consolidated monitoring (event and performance data acquisition) of power, cooling, environmental, physical security, network, computing, storage, and other systems in one unified tool, allowing for data correlation and enrichment with IT, facility, and external business data. The ability to automatically trigger remediation actions for well-defined faulty conditions, service monitoring and assurance with automatic service impact analysis, and operational and management reporting. The assurance function is fully integrated with an external ticketing system. |
| Data center assets and health status visualization | Advanced and integrated visualization capabilities, such as 3D facility plan representations with overlaid measured performance KPIs, data dashboard visualization, realistic front and back views of racks and all containing equipment, cable distribution, patch panels, patch cables, etc. |
| Change management processes support and tracking | Support for change management process definition and orchestration integrated with external systems such as ticketing. Ability to define change management planning protocols, generate work orders for team members, validation checkpoints prior to installation, and post-implementation verifications, with the ability to visualize planned change procedures. Ability to document and track all steps of the CM process. |
| Capacity planning | Dashboards for the visualization of current power, cooling, and space utilization, with the ability to set thresholds and generate warnings. Extrapolation of future needs based on boilerplate and monitoring-based actual space, energy, and network utilization, combined with input data from sales and marketing plans. Calculation of future needs and determination of required resources and costs in the future. |
| Efficiency control and optimization | Combining monitored performance data with asset data to detect potential bottlenecks and calculate power, cooling, and network reconfiguration settings to optimize current resource consumption. |
| Operational and Management Reporting | The DCIM tool must provide efficient operation and management reporting capabilities across all monitored and asset data, with the ability for the team to create and schedule their own reports as needed. |
| Preventive maintenance support | The asset boilerplate data and monitored performance data are combined to generate a regular and preemptive maintenance activities schedule to prevent any future performance degradation. |
| Floorspace and rack optimization algorithms | DCIM tools may provide algorithms that can calculate the optimum distribution of racks across a data center and determine the maximum weight, power, and thermal loads based on predefined facility design. |
| Integration capabilities | DCIM tools must have the ability for northbound and southbound integrations. Northbound integration should support interaction with peer systems, such as ticketing and customer portals, while southbound integration must allow connectivity with data center devices and element managers of specialized data center systems, such as power and cooling (e.g., Modbus) and element managers (e.g., API integration). |
| Multitenancy support | In some cases, customers of the data center may be allowed to access restricted parts of the DCIM solution and use its functions to manage their own collocated resources. |

The full list of functionalities is much longer. However, even with this essential set, a data center can run efficiently and meet the needs of its customers at minimum cost.

Data center infrastructure management tools

The number of commercially available data center monitoring tools in the market is large, and the choice is not easy. Some of the tools tend to encompass all of the functions mentioned in the previous section, while others focus on inventory and processes and collaborate closely with other vendors to provide an end-to-end solution.

One example of a successful cooperation is between the companies FNT Software, specialized in inventory and processes, and UMBOSS, specialized in monitoring and assurance. The major advantage of such a combination is the ability to extend the use to service assurance, inventory of an external and active telecom plant, and so on.

Other vendors such as Nlyte, Vertiv, Device42, Schneider Electric, ABB, Sunbird, and others provide all-encompassing solutions. However, some of them are limited only to data center infrastructure and are not able to delve deeper into network monitoring, service assurance, and other aspects of data center operation.

Conclusion

If you’ve stuck with this blog until the end, we hope you agree that the TL;DR version can be summarized as this: DCIM is both a framework of operational processes and procedures and a set of data center monitoring tools that support the installation, maintenance, and management of a data center.

By achieving the optimum performance of data center infrastructure and services, investment and maintenance costs will be minimized while customer satisfaction is maximized.

Implementation of all the working parts is complex, but by combining systems, best practices, and management tools, data centers are becoming more effective in their growing market. If you’re interested in learning even more, read our Data Center Monitoring – Explained blog now.
