Data Collection Techniques for Integration of Solar Energy into the Electric Grid

William Marshall
February 20, 2019

Submitted as coursework for PH241, Stanford University, Winter 2019

Introduction

Fig. 1: This image shows an example of a residential solar panel installation. Advances in technology have made such installations cheaper and more common. (Source: Wikimedia Commons.)

In recent years, technological advancements and social factors have led to an increase in the deployment of solar panels and other renewable energy sources. Additionally, most experts predict that there will have to be a significant rise in the use of solar and other forms of renewable energy if even modest goals for reducing greenhouse gas emissions are to be met. However, integrating large amounts of solar energy into the electric grid poses significant challenges. [1] Energy generated from solar panels varies greatly according to the amount of solar radiation in the area of deployment at a given time. Furthermore, particularly with residential solar deployments (see Fig. 1), the exact location of all panels is not known. This makes it difficult to predict efficiently how much energy will need to be transported to which locations. Combined, these factors make large-scale grid integration a non-trivial task to complete at low cost. In order to help solve this problem, it is necessary to have accurate estimates of the location and size of solar deployments.

Estimation Techniques

Currently, the best database for this type of solar information comes from the Stanford-based project DeepSolar. [2] The team of researchers, led by Jiafan Yu, used a deep learning approach to develop a model that uses satellite images of the continental United States to estimate the size and location of solar deployments. The team reports an estimated 1.4702 million ± 0.0007 million solar installations and pinpoints the location of each one. [2] In contrast, the previous best estimate for this number was 1.02 million, and that estimate did not include locations or any other additional information about the installations. [2]

It should be noted, however, that there are several reasons to be skeptical of the level of precision claimed in the paper. The team arrived at this number by using their model to predict the presence of solar panels in over a billion images from across the United States and extrapolating to the areas they could not directly analyze. For the images they did have access to, they obtained the estimate by assuming that the presence or absence of solar panels in a given sample is distributed as a Bernoulli random variable, independent of all other samples. Given this assumption, one can use the sensitivity and specificity estimated from evaluating the model on a test set (see below for more details) to calculate the expected number of positive samples and the standard deviation of that number. However, this calculation relies on the assumption of independence, which (although somewhat standard and used in many settings) is not necessarily true in these circumstances.
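The Python sketch below illustrates the flavor of such a calculation; it is not the authors' exact procedure, the raw detection count is a hypothetical placeholder, and the precision and recall values are borrowed from the residential-area figures reported later in this article.

    # Illustrative sketch only: correct a raw detection count for false
    # positives and false negatives, and compute the spread of the count
    # under the independence assumption described above. Not the authors'
    # exact procedure; the raw count is a hypothetical placeholder.
    import math

    def corrected_count(detections, precision, recall):
        # Remove the expected false positives, then add back the expected
        # installations the model missed.
        true_positives = detections * precision
        return true_positives / recall

    def count_std(detections, precision):
        # If each detection is an independent Bernoulli(precision) trial,
        # the true-positive count has variance n * p * (1 - p).
        return math.sqrt(detections * precision * (1.0 - precision))

    raw = 1_500_000  # hypothetical raw detection count
    print(corrected_count(raw, precision=0.931, recall=0.885))  # point estimate
    print(count_std(raw, precision=0.931))                      # roughly 300

Under this kind of independence assumption the standard deviation grows only like the square root of the count, which is why the reported uncertainty can be so small relative to the estimate itself.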

Perhaps more important is the set of images chosen for processing and extrapolation. Given the computational resources required to evaluate so many images, the team decided to train and evaluate only on images taken from areas that exceeded a certain threshold of nighttime lights. Although this avoided spending computational resources on images of remote wilderness that contain no solar panels, it is not clear a priori that the number of solar installations located away from nighttime lights is insignificant. Furthermore, the team made no attempt to justify this choice as it pertains to the final number reported.
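A minimal sketch of this kind of pre-filtering is shown below; the tile structure, the radiance units, and the threshold value are assumptions chosen for illustration, not details taken from the DeepSolar paper.

    # Minimal sketch: keep only image tiles whose nighttime-light radiance
    # exceeds a threshold. Data structures and threshold are assumed for
    # illustration only.
    from typing import List, NamedTuple

    class Tile(NamedTuple):
        lat: float
        lon: float
        night_radiance: float  # assumed units, e.g. nW / cm^2 / sr

    def tiles_to_scan(tiles: List[Tile], radiance_threshold: float = 1.0) -> List[Tile]:
        # Tiles below the threshold are never scanned, so any installations
        # in dimly lit areas cannot contribute to the final count.
        return [t for t in tiles if t.night_radiance >= radiance_threshold]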

However, despite these estimation challenges, the data still suggest that their prediction is more accurate than the previous best estimate. The choice of samples evaluated could only make the estimate too small, not too large, and even though the samples are not fully independent, the correlation between most pairs of samples is likely small, resulting in only a minor effect on the final estimate.

Public Database

The project provides a public database of its findings and analyzes the results by a variety of factors, ranging from the sun exposure of the area around a solar installation to average income levels. This type of information could be used by researchers and policy makers to speed the conversion of the energy system to renewable sources.

Training Issues

DeepSolar marks a significant improvement over previous attempts at similar tasks, but it is not without drawbacks. First, as with any machine-learning-based prediction model, the results are not perfect. The team evaluated their model on a test set of 93,500 images, each hand-labeled by a human with a binary label representing the presence or absence of solar panels. These images were sampled from across the continental United States in such a way that no image in the test set was of a region in close proximity to any image in the training set, creating a test set that was robust both to over-fitting of the training data and to changes in appearance across different parts of the country. On this test set, the model achieved a precision (the number of samples correctly predicted as positive divided by the total number of samples predicted as positive) of 93.1% and a recall (sensitivity) of 88.5% in residential areas, and a precision of 93.7% and a recall of 90.5% in non-residential areas. [2] These numbers are all the more impressive considering that, given the sparsity of solar installations, even a small false positive rate would significantly reduce the precision of the model. Nonetheless, these precision and recall numbers reiterate that the predictions are estimates and are by no means an exact or fine-grained tool. While such errors tend to even out in aggregate statistics, any application requiring fine-grained data and higher confidence in its predictions would need a different source of data.
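To make the sparsity point concrete, the short calculation below, using hypothetical prevalence and error rates, shows how quickly precision erodes when positive images are rare.

    # Hypothetical numbers chosen only to make the effect of sparsity visible.
    def precision_from_rates(prevalence, sensitivity, specificity):
        # Precision = TP / (TP + FP) for a given base rate of positive images.
        tp = prevalence * sensitivity
        fp = (1.0 - prevalence) * (1.0 - specificity)
        return tp / (tp + fp)

    # If only 1 image in 500 contains a panel, a model with 99.9% specificity
    # still produces about one false positive for every two true detections:
    print(precision_from_rates(prevalence=0.002, sensitivity=0.9, specificity=0.999))
    # ~0.64, well below what the raw specificity alone would suggest.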

Additionally, the DeepSolar model takes nearly one month to scan enough satellite images to generate results for the entire continental United States. [2] While this is relatively fast compared to other deep learning techniques, and certainly faster than survey-based approaches, it is still not ideal. This kind of compute requirement remains fairly significant and is not suitable for applications that are more time-sensitive in nature.
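A rough back-of-envelope calculation, using the approximate figures above (over a billion images and roughly one month of processing), gives a sense of the sustained throughput this implies; both inputs are rounded, so the result is only an order-of-magnitude estimate.

    # Order-of-magnitude estimate only; both inputs are rounded figures
    # from this article, not exact numbers from the paper.
    images = 1_000_000_000          # over a billion image tiles
    seconds = 30 * 24 * 3600        # roughly one month
    print(images / seconds)         # about 385 images per second, sustained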

Conclusion

Overall, despite these drawbacks, DeepSolar represents a significant advance in data availability, and it will hopefully aid electric companies, policy makers, and researchers alike in building towards a cheaper and cleaner energy future.

© William Marshall. The author warrants that the work is the author's own and that Stanford University provided no input other than typesetting and referencing guidelines. The author grants permission to copy, distribute and display this work in unaltered form, with attribution to the author, for noncommercial purposes only. All other rights, including commercial rights, are reserved to the author.

References

[1] M. Burnett, "Energy Storage and the California Duck Curve," Physics 240, Stanford University, Fall 2015.

[2] J. Yu et al., "DeepSolar: A Machine Learning Framework to Efficiently Construct a Solar Deployment Database in the United States," Joule 2, 2605 (2018).