Methods for Data Center Cooling

Michael Atkin
December 8, 2022

Submitted as coursework for PH240, Stanford University, Fall 2022

Introduction

Fig. 1: The power consumption of data centers. ICT denotes information and communications technology; ET and UPS denote the electricity transformer and uninterruptible power supply, respectively. [2] (Source: M. Atkin)

Data centers account for 1.8% of global power usage. [1] Within a data center, cooling accounts for 37% of power consumption (see Fig. 1). [2] Improving control algorithms for data center cooling is therefore of significant interest.

Problem Statement

Data centers are typically cooled via HVAC systems. [3] We have two control variables for each HVAC component, fan speed and water flow, and two state measurements, differential air pressure and cold-aisle temperature. We wish to keep these measurements within reasonable bounds. Distributing cool air throughout a building sounds straightforward, but it is in fact a multifaceted control problem, for a few reasons. First, the flow dynamics of air throughout a building are complex, typically described by partial differential equations. Second, the performance of HVAC equipment is a non-linear function of its power consumption. Lastly, each building has a different configuration of HVAC components, so no single solution generalizes to all buildings.
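To make this concrete, the sketch below (in Python) encodes the per-component control and state variables and a bounds check. The numeric limits are hypothetical placeholders; real operating ranges depend on the facility.

from dataclasses import dataclass

@dataclass
class HvacControls:
    fan_speed: float    # fraction of maximum fan speed, in [0, 1]
    water_flow: float   # fraction of maximum chilled-water flow, in [0, 1]

@dataclass
class HvacState:
    diff_air_pressure: float  # differential air pressure (Pa)
    cold_aisle_temp: float    # cold-aisle temperature (deg C)

# Hypothetical operating bounds; real limits depend on the facility.
PRESSURE_BOUNDS = (5.0, 20.0)  # Pa
TEMP_BOUNDS = (18.0, 27.0)     # deg C

def within_bounds(state: HvacState) -> bool:
    """Check that both state measurements are inside their safe ranges."""
    return (PRESSURE_BOUNDS[0] <= state.diff_air_pressure <= PRESSURE_BOUNDS[1]
            and TEMP_BOUNDS[0] <= state.cold_aisle_temp <= TEMP_BOUNDS[1])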

Methods

The traditional method of data center cooling is to install a PID (proportional-integral-derivative) controller in each HVAC component. Each controller takes the state measurements as input: if a measurement rises above a set bound, the HVAC component adjusts its control variables to bring it back down.
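As an illustration, a minimal discrete-time PID controller might look like the following; the gains, setpoint, and sampling interval are hypothetical.

class PID:
    """Minimal discrete-time PID controller (gains are illustrative)."""

    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt):
        """Return a control adjustment from the current measurement."""
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: adjust fan speed to hold the cold aisle at 24 deg C,
# sampled once per minute.
controller = PID(kp=0.5, ki=0.05, kd=0.1, setpoint=24.0)
adjustment = controller.update(measurement=26.0, dt=60.0)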

This approach has the advantage of simplicity. A side effect of simplicity is safety: a simple system is less prone to catastrophic errors. However, this approach makes no attempt to tackle the complexity of the control problem, so it leaves plenty of energy savings on the table.

A newer approach is model predictive control (MPC). Under this approach, each controller maintains a model of the building's cooling dynamics. At each time step, the controller generates a predicted trajectory from its dynamics model and chooses its control variables to minimize cost. In other words, the controller plans ahead rather than adjusting control variables on an ad-hoc basis.
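The sketch below shows one MPC step, using random shooting as a deliberately simple trajectory optimizer for illustration; the optimizer and dynamics model in the cited work differ. The model and cost functions are assumed to be supplied.

import numpy as np

def mpc_step(model, cost, state, horizon=10, n_candidates=100, rng=None):
    """One receding-horizon MPC step via random shooting.
    model(state, control) -> predicted next state;
    cost(state, control) -> scalar stage cost."""
    rng = rng or np.random.default_rng()
    best_cost, best_u0 = np.inf, None
    for _ in range(n_candidates):
        # Candidate control sequence: (fan speed, water flow) in [0, 1].
        controls = rng.uniform(0.0, 1.0, size=(horizon, 2))
        s, total = state, 0.0
        for u in controls:
            total += cost(s, u)
            s = model(s, u)  # roll the trajectory forward through the model
        if total < best_cost:
            best_cost, best_u0 = total, controls[0]
    # Apply only the first control, then re-plan at the next time step.
    return best_u0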

In order to use model predictive control, one must first define the model; this is known as system identification. One approach to system identification is to fit the model to historical data of the control and state variables. This can yield a rich model, but it requires historical data, which may not always be available. Additionally, if there is too little historical data, the model may overfit, which could lead to catastrophic error.
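As an illustration, the sketch below fits a linear dynamics model to logged state and control trajectories by least squares. A linear fit is one common system-identification choice, not necessarily the model class used in the cited work.

import numpy as np

def fit_linear_dynamics(states, controls):
    """Fit s[t+1] ~ A s[t] + B u[t] to historical logs by least squares.
    states: array of shape (T+1, n_s); controls: array of shape (T, n_u)."""
    X = np.hstack([states[:-1], controls])     # regressors: (T, n_s + n_u)
    Y = states[1:]                             # targets: (T, n_s)
    W, *_ = np.linalg.lstsq(X, Y, rcond=None)  # solves X @ W ~ Y
    n_s = states.shape[1]
    A, B = W[:n_s].T, W[n_s:].T
    return A, B  # predicted next state: A @ s + B @ u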

Another approach to system identification is randomized exploration: we let the controller try out various control values and observe the corresponding state measurements. In this way, the model learns the building's cooling dynamics. Lazic et al. use a random walk that is range-limited to manually defined safe control values. [1] Random exploration is widely used in reinforcement learning and has the capacity to learn a rich model. However, it is known to suffer from slow and inconsistent convergence. Additionally, random exploration of control values in a real data center carries significant safety risks, as sketched below.
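A minimal sketch of such range-limited exploration: a random walk over the control values, clipped at every step to manually specified safe bounds. The step size and bounds here are illustrative, not those of the cited work.

import numpy as np

def safe_random_walk(u_init, u_min, u_max, step_scale=0.05, n_steps=1000, rng=None):
    """Random-walk exploration of control values, clipped to safe bounds."""
    rng = rng or np.random.default_rng()
    u = np.asarray(u_init, dtype=float)
    trajectory = []
    for _ in range(n_steps):
        u = u + rng.normal(scale=step_scale, size=u.shape)  # small random step
        u = np.clip(u, u_min, u_max)  # never leave the safe operating range
        trajectory.append(u.copy())
    return np.array(trajectory)

# Example: explore (fan speed, water flow) within safe fractions of maximum.
controls = safe_random_walk(u_init=[0.5, 0.5], u_min=[0.3, 0.2], u_max=[0.8, 0.9])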

Results

Lazic et al. evaluate three methods: PID controllers, an MPC trained on historical data, and an MPC trained via exploration. [1] They find that the MPC trained on historical data decreases dollar cost by 17.0% relative to the PID controllers, and the MPC trained via exploration decreases it by 17.9%.

However, there are significant limitations to these results. For one, this is a single experiment, and experiments on real data centers are noisy: dollar cost is in large part determined by uncontrollable factors, such as server power usage and the temperature of the entering cold water, which may vary across experiments. Additionally, dollar cost fails to account for other desirable metrics, such as safety and ease of deployment.

© Michael Atkin. The author warrants that the work is the author's own and that Stanford University provided no input other than typesetting and referencing guidelines. The author grants permission to copy, distribute and display this work in unaltered form, with attribution to the author, for noncommercial purposes only. All other rights, including commercial rights, are reserved to the author.

References

[1] N. Lazic et al., "Data Center Cooling Using Model-Predictive Control," in Advances in Neural Information Processing Systems 31 (NeurIPS 2018), ed. by S. Bengio et al. (Curran Associates, 2018), p. 3814.

[2] Q. Zhang et al., "A Survey on Data Center Cooling Systems: Technology, Power Consumption Modeling and Control Strategy Optimization," J. Syst. Archit. 119, 102253 (2021).

[3] Y. Ma et al., "Predictive Control for Energy Efficient Buildings With Thermal Storage: Modeling, Simulation, and Experiments," IEEE Control Systems Magazine 32, 44 (2012).