Maintenance · 11 min

Continuous CRAC Monitoring: Early Warning Signs That Prevent Cooling Failure

CRAC Services Engineering

A compressor does not fail without notice. Current draw rises weeks before seizure. Discharge temperatures creep upward. Suction pressures drift. The question is whether anyone is watching the data.

Why reactive maintenance fails in data centres

The traditional CRAC maintenance model is calendar-based: a technician visits quarterly, checks refrigerant pressures, cleans filters, logs readings on a paper form, and leaves. Between visits, the unit runs unmonitored except for basic high-temperature alarms.

This model made sense when data centres had generous cooling redundancy and moderate heat loads. It no longer works. Modern data halls run at higher utilisation and tighter thermal margins, with AI/GPU racks generating 60 to 120 kW per cabinet (compared with 5 to 15 kW a decade ago). A single CRAC failure can push the remaining units past their rated capacity within minutes, and a second failure cascades into a thermal event.

The Uptime Institute's 2024 annual survey found that 70% of significant data centre outage incidents involved power or cooling failures. The average thermal event lasts 29 minutes and costs roughly $9,000 per minute in direct losses. Calendar-based maintenance is not enough to prevent those events.

What continuous monitoring actually measures

Continuous CRAC monitoring goes beyond the binary alarm state (unit running / unit alarmed) that most BMS installations provide. It tracks the trend data that reveals problems weeks before they become failures.

The parameters that matter most:

Compressor current draw

This is the single most predictive metric for compressor health. A healthy scroll compressor draws a stable current that correlates closely with cooling load. When bearings wear, scroll clearances open, or motor windings degrade, current draw rises. A 3 to 5% increase over baseline, sustained across multiple operating cycles, signals a failure developing. This pattern typically appears 4 to 8 weeks before the compressor locks out or trips on overcurrent.

Without trending, this rise is invisible. The compressor still runs, the unit still cools, and the quarterly inspection may or may not catch the shift depending on when it occurs relative to the visit.
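The detection logic is simple to sketch. The function name, the 3% threshold, and the requirement for consecutive operating cycles below are illustrative assumptions drawn from the figures above, not a vendor specification; a production system would also normalise current draw against cooling load:

```python
def current_draw_alert(readings, baseline, threshold_pct=3.0, sustained_cycles=3):
    """Flag a developing compressor fault when current draw stays above
    baseline by threshold_pct for sustained_cycles consecutive readings.

    readings: amperage samples, one per operating cycle.
    baseline: healthy-unit amperage established during commissioning.
    Thresholds are illustrative, not vendor-specified.
    """
    consecutive = 0
    for amps in readings:
        deviation = (amps - baseline) / baseline * 100
        if deviation >= threshold_pct:
            consecutive += 1
            if consecutive >= sustained_cycles:
                return True  # sustained rise: schedule inspection
        else:
            consecutive = 0  # isolated spikes (e.g. hot-day load) reset
    return False
```

Requiring the deviation to be sustained across cycles is what separates a developing fault from a one-off spike caused by a hot afternoon or a load shift.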

Discharge air temperature

Every CRAC unit has a design delta-T between return air and supply air. For a typical DX unit, this is 10 to 14 degrees. If the discharge temperature creeps up by 1 to 2 degrees over two weeks, with no corresponding change in room load or ambient conditions, the unit is losing cooling capacity. Common causes: low refrigerant charge, evaporator coil fouling, or failing expansion valve.

Suction and discharge pressure

Refrigerant circuit pressures reveal system health that temperature alone cannot. Falling suction pressure with stable load suggests refrigerant loss. Rising discharge pressure with stable ambient suggests condenser fouling or fan degradation. The ratio between the two (compression ratio) indicates compressor efficiency.

Chilled water delta-T (CDW/CHW units)

For chilled water CRAC and CRAH units, the temperature difference between supply and return water lines indicates heat exchanger effectiveness. A narrowing delta-T means the coil is not transferring heat efficiently, typically due to fouling, air locks, or control valve drift.

Sensor calibration drift

Temperature and humidity sensors drift over time. A sensor reading 0.5 degrees low causes the unit to under-cool, while the BMS reports everything normal. Comparing return air sensor readings against independent reference measurements (portable loggers placed at the intake) reveals drift that the system cannot self-diagnose.

The standby unit blind spot

The most dangerous monitoring gap in most data centres is the standby unit fleet. Standby CRAC units may cycle on for a few hours per week during rotation schedules, but spend most of their time idle. During idle periods, monitoring data is minimal (the unit is not cooling, so there are no meaningful trend points).

This creates the exact scenario that causes most cooling outages: a primary unit fails, the standby unit is called into service, and the standby unit also fails because it had low refrigerant, a stuck expansion valve, or a degraded compressor that was not visible during brief rotation cycles.

Continuous monitoring of standby units requires a different approach. Rather than watching for trend degradation during normal operation, it means verifying health during each rotation cycle: confirming the unit reaches rated capacity within its expected ramp time, that pressures stabilise at expected values, and that it can sustain full load for the duration of the rotation window.
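Those three pass criteria can be expressed as a single check over the samples logged during a rotation cycle. All numbers here (ramp limit, pressure band, 95% hold margin) are illustrative placeholders for unit-specific commissioning values:

```python
def rotation_health_check(samples, rated_kw, ramp_limit_min=5,
                          pressure_band=(350, 450)):
    """Evaluate one standby rotation cycle.

    samples: (minute, cooling_kw, suction_kpa) tuples logged during the cycle.
    Pass criteria (illustrative): rated capacity reached within ramp_limit_min,
    suction pressure stabilised inside pressure_band after ramp-up, and
    capacity held for the remainder of the rotation window.
    """
    ramp_ok = any(m <= ramp_limit_min and kw >= rated_kw
                  for m, kw, _ in samples)
    after_ramp = [(kw, p) for m, kw, p in samples if m > ramp_limit_min]
    hold_ok = all(kw >= 0.95 * rated_kw for kw, _ in after_ramp)
    pressure_ok = all(pressure_band[0] <= p <= pressure_band[1]
                      for _, p in after_ramp)
    return ramp_ok and hold_ok and pressure_ok
```

A sluggish expansion valve shows up here as a failed ramp check: the unit eventually reaches capacity, but not within the window a real failover would allow.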

BMS integration versus dedicated monitoring

Most data centres have a building management system (BMS) that receives basic alarm feeds from CRAC units via BACnet or Modbus. This is not the same as continuous monitoring.

A typical BMS integration captures:

  • Unit on/off state
  • High temperature alarm
  • Filter alarm
  • General fault alarm

Continuous monitoring captures:

  • Compressor amperage at 1 to 5 minute intervals
  • Supply and return air temperature at 1 minute intervals
  • Refrigerant pressures (where sensor-equipped)
  • Fan speed and power draw
  • Control valve position (CDW/CHW units)
  • Trend data with 90+ day history for comparison

The difference is the data density needed for predictive analysis. A BMS alarm tells you something has already gone wrong. Trending data tells you something is going to go wrong.

Dedicated monitoring platforms (whether embedded in modern CRAC controllers like Vertiv iCOM-S or deployed as third-party IoT sensor networks) feed this data into dashboards with threshold alerting, trend analysis, and anomaly detection. The investment is modest ($2,000 to $5,000 per unit for sensor installation and first-year monitoring) against the $250,000+ cost of a thermal event.

What the data actually looks like before a failure

A real-world example from a Brisbane data centre:

Week 1: Liebert PEX4 compressor drawing 18.2A against its 17.8A baseline (rated load amperage). Delta of 2.2% above baseline. No alarm triggered.

Week 3: Current draw at 18.9A. Delta now 6.2%. Discharge temperature up 0.8 degrees. Still within alarm thresholds.

Week 5: Current draw at 19.4A. Delta 9.0%. Suction pressure has dropped 12 kPa. The unit is losing refrigerant and the compressor is working harder to compensate.

Week 6: Without intervention, the compressor trips on overcurrent protection during a hot afternoon when ambient conditions push the condenser above its comfort zone. The unit drops offline. The standby unit activates but takes 8 minutes to reach rated output because its expansion valve is sluggish from disuse.

With continuous monitoring, the week 3 readings would have triggered an alert. A technician would have found the slow refrigerant leak, repaired it during a scheduled maintenance window, and the unit would never have tripped.
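The percentage deltas in the timeline above follow directly from the logged amperages against the 17.8A baseline; a quick check (helper name illustrative):

```python
BASELINE_AMPS = 17.8  # week-1 rated load amperage from the timeline

def pct_above_baseline(amps):
    """Percentage deviation of a current-draw reading from baseline."""
    return round((amps - BASELINE_AMPS) / BASELINE_AMPS * 100, 1)
```

At a 3% sustained-deviation threshold, the week 3 reading is the first to cross the line, which is what makes it the actionable alert in this scenario.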

Implementation priority

For sites that do not currently have continuous monitoring, we recommend a phased rollout:

  • Start with compressor amperage monitoring on all units (active and standby). This is the highest-value, lowest-cost sensor addition and catches the most common failure mode.
  • Add supply/return air temperature trending at 1-minute intervals. Most modern CRAC controllers already log this; it just needs to be exported to a central dashboard rather than read locally.
  • Install refrigerant pressure transducers on units older than 8 years or units with a history of refrigerant top-ups. This catches slow leaks months before they become emergency callouts.
  • Implement rotation health checks for standby units: automated tests during each rotation cycle that verify the unit reaches rated capacity and holds stable pressures.

We install and configure monitoring systems for Vertiv/Liebert, Stulz, Uniflair, Daikin, Climaveneta, and Mitsubishi Heavy CRAC units across Australian data centres. Contact us for a monitoring assessment of your cooling fleet.