Energy in Deep Learning

Alexander Conklin
December 9, 2021

Submitted as coursework for PH240, Stanford University, Fall 2021

Introduction

Fig. 1: A depiction of a three-layer, fully connected neural network. The red neuron's activation is a weighted sum of all of its incoming inputs (also highlighted red). (Source: A. Conklin)

The last decade (2010 - 2020) saw a resurgence in the development of machine learning and artificial intelligence (AI), most notably deep learning. [1] The driver behind this resurgence was the realization that the parallel architectures of GPUs - Graphics Processing Units - could greatly accelerate the training of neural networks. [1] Simultaneously, more powerful processors (usually CPUs), both in the cloud and on the edge (i.e., cellphones), made the deployment of trained models viable. However, as deep learning has made headlines for new feats in self-driving cars, protein folding, and playing the game of Go, the industry is also reckoning with an energy problem. Training a moderately sized natural language processing model - the type behind advanced automated chatbots - can release 284 metric tons of CO2, comparable to the lifetime emissions of five cars. [2]

In this article we will explore why deep learning is energy-hungry and examine the implications this energy appetite has for scaling networks. Specifically, we will show that the recent growth in the compute used to train the world's largest models is unsustainable on digital hardware.

What Makes Deep Learning Energy Intensive?

The computational overhead (and by extension energy overhead) of deep learning models is a direct product of their structure. Deep learning networks are composed of sequential layers, each containing neurons and synapses, as depicted in Fig. 1. At each neuron, inputs from the previous layer undergo a weighted sum (a vector-matrix multiplication plus an added bias) followed by the application of a non-linear function such as the sigmoid. These two basic operations are repeated as the inputs pass through subsequent layers of the network. More advanced architectures introduce variations on this core structure - notably matrix-matrix multiplication - but for the most part preserve these core operations. Training a neural network means finding the synaptic weights and biases that allow the network to map a set of inputs to their respective outputs. Finding this mapping in such a large parameter space is nontrivial. The dominant training algorithm, backpropagation, calculates the gradient of the network's error with respect to every parameter in the network. The gradient is then used to update each parameter toward a value that reduces the error. Depending on model size, this process, which is amenable to parallelization, is typically run over multiple GPUs.
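To make these two core operations concrete, the following sketch implements the forward pass of one fully connected layer in NumPy. The layer widths and random parameters are purely illustrative, not drawn from any particular model.

    import numpy as np

    rng = np.random.default_rng(0)

    def sigmoid(z):
        # Element-wise non-linearity applied after the weighted sum.
        return 1.0 / (1.0 + np.exp(-z))

    def dense_layer(x, W, b):
        # Weighted sum of the incoming activations (vector-matrix
        # multiplication plus an added bias), then the non-linearity.
        return sigmoid(W @ x + b)

    n_in, n_out = 4, 3                      # hypothetical layer widths
    x = rng.standard_normal(n_in)           # activations from the previous layer
    W = rng.standard_normal((n_out, n_in))  # learnable synaptic weights
    b = rng.standard_normal(n_out)          # learnable biases

    y = dense_layer(x, W, b)                # activations fed to the next layer

Stacking such calls, one per layer, yields the full forward pass of the network in Fig. 1; training additionally traverses these operations in reverse to compute gradients.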

The bedrock of both of the aforementioned processes - inference and training - is fixed-precision vector-matrix and matrix-matrix multiplication. On digital hardware these operations scale as O(n^2) and O(n^3), respectively. As the number of synapses and neurons within each layer grows, the number of operations the processor must carry out grows super-linearly. For larger models this overhead is further increased by the need for more training data and more cycles of backpropagation. Broadly, the size of a neural network is parametrized by the total number of synapses and neurons, which reflects the total number of learnable parameters. Naturally, the number of floating point operations (FLOP) required to train a model emerges as a powerful proxy for the energy consumption of deep learning. For reference, Nvidia's latest A100 GPU can compute 312 TOP/s (tera-operations per second, or 3.12 × 10^14 operations per second) at half-precision floating point and 624 TOP/s at 8-bit integer precision while consuming 300 watts. [3]
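To see this scaling concretely, note that the weighted sum in a square layer of width n costs roughly 2n^2 FLOP (one multiply and one add per synapse). The short sketch below counts these operations and compares them against the A100 half-precision peak quoted above; the layer widths are arbitrary, and real workloads achieve well under peak throughput.

    # Back-of-the-envelope FLOP count for a single square dense layer.
    A100_PEAK_OPS = 312e12  # A100 half-precision peak, operations per second

    def dense_layer_flops(n_in, n_out):
        # One multiply and one add per weight, plus one add per bias.
        return 2 * n_in * n_out + n_out

    for width in (1_000, 10_000, 100_000):
        flops = dense_layer_flops(width, width)
        print(f"width {width:>7}: {flops:.2e} FLOP, "
              f"~{flops / A100_PEAK_OPS:.2e} s at peak")

Each tenfold increase in width multiplies the operation count a hundredfold - the O(n^2) growth described above.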

Trends in Deep Learning Scaling

Since 2012 the amount of compute used to train the largest AI models has grown exponentially, doubling every 3.4 months. [4] Moore's law - the observation that the number of transistors in an integrated circuit doubles every 24 months - implies that the energy efficiency of digital processors also doubles roughly every two years. [5] The mismatch between the growth rate of the compute budget for large AI models and the growth rate of energy efficiency in digital processors is substantial and will impede the seemingly unabated growth we see today. More concerning yet, Moore's law is on its last legs, as transistor dimensions approach fundamental physical limits and challenges persist in thermal dissipation. [5]

We can quantify the energy cost of scaling AI models by forecasting the amount of compute required to train a large model in 20 years, assuming this growth rate continues (following the trend analysis of Theis and Wong). [5] We start with the assumption that today's large models use 5,000 PFLOP/s-days (4.32 × 10^23 FLOP). The number of doubling periods (P) in 20 years is given by

P = 240 months / 3.4 months ≈ 70.6

Therefore, in 20 years we would expect the amount of compute (C), measured in FLOP, required to train the model to be:

C = 4.32 × 10^23 FLOP × 2^70.6 = 7.73 × 10^44 FLOP
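This extrapolation is easy to reproduce in a few lines of Python; the 5,000 PFLOP/s-day baseline is the stated assumption above, not a measured value.

    BASE_FLOP = 5_000 * 1e15 * 86_400  # 5,000 PFLOP/s-days = 4.32e23 FLOP
    P = 240 / 3.4                      # doubling periods in 20 years (~70.6)
    C = BASE_FLOP * 2**P               # projected training compute
    print(f"C = {C:.2e} FLOP")         # -> ~7.7e44 FLOP (the text rounds P to 70.6)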

Assume this model is trained on state-of-the-art digital hardware as of this writing - Nvidia A100 GPUs. Although Nvidia boasts impressive efficiency and peak performance numbers, when workloads are distributed over multiple GPUs resource utilization is much lower. This discrepancy can be explained by differences in clock speed between silicon dies and by challenges in thermal management. [6] We assume the hardware achieves 33 percent of its measured efficiency of 91 GFLOP/s per watt (9.1 × 10^10 FLOP J^-1). [6] The energy required to train the model is then:

E = (7.7 × 10^44 FLOP) / (0.33 × 9.1 × 10^10 FLOP J^-1) = 2.56 × 10^34 J
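The same division, expressed in code; the 33 percent utilization factor and the 91 GFLOP/s per watt efficiency are the assumptions stated above.

    C = 7.7e44           # projected training compute, FLOP
    EFFICIENCY = 9.1e10  # measured hardware efficiency, FLOP per joule
    UTILIZATION = 0.33   # assumed fraction of peak efficiency achieved
    E = C / (UTILIZATION * EFFICIENCY)
    print(f"E = {E:.2e} J")  # -> ~2.6e34 J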

This toy calculation demonstrates that the exponential growth in the compute requirements of deep learning models will soon hit an energy barrier. For reference, the present energy budget of civilization is 5.5 × 10^20 J y^-1, so this single training run would consume roughly 5 × 10^13 years of the world's entire energy output. Repeating the calculation under the assumption that Moore's law (a doubling of energy efficiency every two years, or ten doublings over 20 years) is kept alive through advanced packaging techniques such as monolithic 3D integration and chip-to-chip bonding yields similarly prohibitive energy numbers:

E = (2.56 × 10^34 J) / 2^10 = 2.5 × 10^31 J
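Putting the last two steps together, a short script shows that even ten efficiency doublings leave the requirement tens of billions of times civilization's annual energy budget.

    E = 2.56e34            # energy from the calculation above, J
    WORLD_BUDGET = 5.5e20  # civilization's annual energy use, J per year
    E_moore = E / 2**10    # ten doublings of efficiency over 20 years
    print(f"E_moore = {E_moore:.1e} J, "
          f"~{E_moore / WORLD_BUDGET:.0e} years of world energy use")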

Conclusion

Software abstractions of neural networks are predicated on underlying operations that scale poorly on digital hardware. Moreover, advances in compute efficiency due to Moore's law are occurring too slowly to remediate the problem. The energy burden of training large AI models suggests current scaling trends in neural networks will soon hit a wall. For comparison, the human brain has roughly 100 trillion synapses and 89 billion neurons, yet consumes only 20 watts of power. [7] To build AI systems on par with the scale of the brain, engineers will likely have to look to new computing paradigms. One source of inspiration may be novel circuit elements that naturally express non-linear dynamics and other rich analog phenomena. [8]

© Alexander Conklin. The author warrants that the work is the author's own and that Stanford University provided no input other than typesetting and referencing guidelines. The author grants permission to copy, distribute and display this work in unaltered form, with attribution to the author, for noncommercial purposes only. All other rights, including commercial rights, are reserved to the author.

References

[1] D. Ciresan, U. Meier and J. Schmidhuber, "Multi-Column Deep Neural Networks for Image Classification," 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE 6248110, 16 Jun 12.

[2] E. Strubell, A. Ganesh and A. McCallum, "Energy and Policy Considerations for Deep Learning in NLP," in Proc. 57th Annual Meeting of the Association for Computational Linguistics, ed. by A. Korhonen, D. Traum, and L. Màrquez (Association for Computational Linguistics, 2019), p. 3645.

[3] J. Choquette et al., "Nvidia A100 Tensor Core GPU: Performance and Innovation," IEEE Micro 41, 29 (2021).

[4] R. Schwartz et al., "Green AI," Commun. ACM 63, 54 (2020).

[5] T. N. Theis and H.-S. Wong, "The End of Moore's Law: A New Beginning for Information Technology," Comput. Sci. Eng. 19, 41 (2017).

[6] M. Spetko et al., "DGX-A100 Face to Face DGX-2 - Performance, Power and Thermal Behavior Evaluation," Energies 14, 376 (2021).

[7] H. Markram, "The Human Brain Project," Sci. Am. 306, No. 6, 50 (June 2012).

[8] S. Kumar, R. S. Williams, and Z. Wang, "Third-Order Nanocircuit Elements for Neuromorphic Engineering," Nature 585, 518 (2020).