Analyzing the Efficiency of Horizontal Photovoltaic Cells in Various Climate Regions

This research presents the development of linear regression models to predict horizontal photovoltaic power output. We collected a dataset from 14 global Department of Defense (DoD) installations over a timeframe of one year using an experimental apparatus, resulting in 24,179 usable data points. We developed a linear model to predict power output, which incorporated site-specific weather and geographical characteristics, along with Köppen-Geiger climate classifications in order to determine the effect of adding climate to the model. After performing a Wald test between the full model and a reduced model without Köppen-Geiger climate variables, it was determined that including Köppen-Geiger climate variables improved the model’s ability to account for horizontal photovoltaic power variation by 3%. Although adding Köppen-Geiger variables provided added value when modeling the training dataset, these variables were less effective in predicting the validation dataset. From the analysis, the ideal Köppen-Geiger region was determined to be a warm temperate main classification, a fully humid precipitation classification and a warm summer temperature classification. This region possessed a 30% greater average power production than the mean value of the base climate classification. We found that the cost-effectiveness of a photovoltaic array depends on Köppen-Geiger climate regions, in addition to weather characteristics and the orientation of the array.


Introduction
While the effects of weather variables on photovoltaic power production are discussed frequently in the literature, there is limited information regarding the impact of the climate classification zone [1]. The few studies that have examined climate's effect on photovoltaic power production are based upon fixed-angle arrays rather than real-world data collected from horizontal systems [2]. With new technology being developed based on horizontal arrays, such as solar pavements, organizations with a global presence could benefit from an analysis showing the effect of climate region on the energy produced from horizontal PV panels.
Therefore, the purpose of this research effort was to determine the correlation between the power output of horizontal polycrystalline PV panels and the Köppen-Geiger climate classification system. This involved analyzing data collected from test systems placed in various climate regions to determine the most beneficial areas for photovoltaic investment.
This study built upon prior research efforts that identified candidate test sites, completed the system design, assembled the experimental test equipment, and shipped the test equipment to 38 locations worldwide [3][4][5]. More information regarding these areas is provided in the remainder of this section.
The 2006 Kottek et al. version of the Köppen-Geiger climate classification was used for this research because it had the highest number of climate regions [7]. The Köppen-Geiger system categorizes climate into five major climate zones: arid, warm temperate, snow, polar, and equatorial. Each zone has several types and subtypes based on precipitation and temperature. There are six precipitation classifications (desert, steppe, fully humid, summer dry, winter dry, and monsoonal) and eight temperature classifications (hot arid, cold arid, hot summer, warm summer, cool summer, extremely continental, polar frost, and polar tundra) [6][7][8][9][10].
In designing this experiment, Nussbaum performed an analysis of variance (ANOVA) on the latitude and longitude of 1,763 DoD installations and found concentrations of installations in 25 distinct regions [3]. Then, a Pareto analysis was conducted to determine the climate regions with the largest number of installations. The Pareto analysis showed the DoD had installations in 14 of the 31 distinct Köppen-Geiger climate classifications. Test locations were selected based on these analyses, and test systems were then constructed and shipped to these locations. With spare parts, one additional test system was set up near Wright-Patterson Air Force Base, OH, USA, to facilitate the diagnosis of system malfunctions. The final test sites are shown as red dots on the map in Figure 1. Some of the locations in Figure 1 are close to one another and therefore appear as a single dot; e.g., the U.S. Air Force Academy and Peterson AFB are both located near Colorado Springs, CO, USA.
The main components of the system were an ALEKO 25 Watt, 12 Volt mono-crystalline solar panel, a Renogy 50 Watt, 12 Volt poly-crystalline solar panel, a Raspberry Pi 3, model B, version 1.2 computer inside a weatherized case, a weather probe, and an external power source, as shown in Figure 2. The system was also equipped with red, yellow, and green light-emitting diodes (LEDs) to indicate, respectively, that an error had occurred within the system, that the system was operational, and that a reading was taking place. The external power used to run the Raspberry Pi was supplied through a 20-foot extension cord connected to an outdoor power source or by an additional PV panel and battery.
The Raspberry Pi computer took measurements at 15-minute intervals [11][12]. These measurements included the current and voltage of the panels and the ambient air temperature and humidity of the site. The current was read using a noninvasive Hall-effect current sensor [13]. Voltage was measured as the voltage drop across a known resistance. Ambient air temperature and humidity were measured using a probe located on the outside of the weatherized case. The data were recorded on a micro Secure Digital (SD) card.

Test Equipment
To ensure conformity and ease of system setup, each installer was instructed to place their systems in a flat orientation. This zero-degree tilt angle enables the data to be applied to potential solar pavement applications. To ensure that each panel received the maximum amount of sunlight per day, participants were instructed to place the systems so they would have a clear view toward the southern, eastern, and western horizons. Finally, participants were instructed to check system function once each day and to ensure the systems were clear of debris, snow or high amounts of dust.

Data Collection and Configuration
Of the original 38 test sites, data was only received from 28 locations. After reception, the data had to be configured and compiled in order to provide a proper format for analysis. During each 15-minute interval, 64 readings for voltage and current were recorded for each panel, respectively. After multiplying these values together, the maximum power value was obtained for each interval. Next, the time for every location was adjusted from military or "Zulu" time to its respective time zone. Zulu time was the default time setting on each Raspberry Pi computer.
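The two configuration steps above can be sketched in a few lines of Python. This is only one plausible reading of the procedure, not the authors' code: it assumes the voltage and current readings are paired in order, and it uses a fixed UTC offset (which ignores daylight saving time). The function names `max_power` and `zulu_to_local` and the sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def max_power(voltages, currents):
    """Return the largest instantaneous power (W) among paired readings."""
    if len(voltages) != len(currents):
        raise ValueError("voltage and current lists must be the same length")
    return max(v * i for v, i in zip(voltages, currents))


def zulu_to_local(zulu_stamp, utc_offset_hours):
    """Shift a naive Zulu (UTC) timestamp to a fixed-offset local time.

    Assumption: a constant offset is used, so daylight saving is ignored.
    """
    utc_time = zulu_stamp.replace(tzinfo=timezone.utc)
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    return utc_time.astimezone(local_zone)


# Hypothetical readings; a real interval would hold 64 pairs per panel.
voltages = [12.0, 12.5, 13.0]
currents = [1.0, 2.0, 1.5]
interval_power = max_power(voltages, currents)

# Hypothetical site eight hours behind Zulu time.
local_stamp = zulu_to_local(datetime(2017, 11, 28, 20, 15), -8)
```

A production version would more likely use named time zones so that seasonal clock changes are handled per site.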
Upon completion of data compilation, other variables were incorporated with the dataset to support further analysis. Location-specific variables added to the dataset include latitude, altitude, Köppen-Geiger climate classification, and cloud ceiling obtained from the National Oceanic and Atmospheric Administration [14]. Each cloud ceiling measurement was matched to the closest 15-minute interval using RStudio statistical software.
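The matching step was performed in RStudio; the Python sketch below only illustrates the idea of snapping an observation to its closest 15-minute interval. The function name `nearest_interval`, the tie-breaking rule (exact midpoints round up), and the sample observations are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def nearest_interval(stamp, minutes=15):
    """Round a timestamp to the closest multiple of `minutes` past the hour."""
    step = minutes * 60
    seconds = stamp.minute * 60 + stamp.second
    # Adding half a step before flooring rounds to the nearest boundary.
    rounded = int((seconds + step / 2) // step) * step
    base = stamp.replace(minute=0, second=0, microsecond=0)
    return base + timedelta(seconds=rounded)


# Hypothetical cloud ceiling observations: (time, ceiling in hundreds of feet).
ceiling_obs = [
    (datetime(2018, 3, 1, 10, 52), 250),
    (datetime(2018, 3, 1, 11, 20), 80),
]
ceiling_by_interval = {nearest_interval(t): c for t, c in ceiling_obs}

# Panel readings arrive on the 15-minute grid; unmatched slots yield None.
panel_times = [
    datetime(2018, 3, 1, 10, 45),
    datetime(2018, 3, 1, 11, 0),
    datetime(2018, 3, 1, 11, 15),
]
matched = [(t, ceiling_by_interval.get(t)) for t in panel_times]
```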
After initial data analysis, it was determined that only 16 of the sites had reliable and continuous data. Other sites were eliminated because too little data was acquired during the 16-month collection period or because of errors in the date and time stamps of the recordings. The dates of the dataset ranged from June 2017 to September 2018, with 528,569 total data points. The data collected for each site can be seen in Figure 3.
Next, the dataset was narrowed further to remove outliers and errors. The first discrepancy observed was the power output recorded for the mono-crystalline panel. The values ranged from 0 W to 500 W for a 25 W rated panel. Over 64,000 readings were higher than 50 W, and it is unlikely for the panel to consistently read at such a high output. As a result, this data was removed, and no further analysis was performed on the mono-crystalline panel. Next, Learmonth Solar Observatory in Northwestern Australia was removed because it was the only site in the southern hemisphere. Capturing differences between the hemispheres, such as the opposite seasons, could have complicated the model and risked having key variables excluded. After this removal, the model's final application and interpretation are limited to the northern hemisphere.
Following the removal of Learmonth, a calibration break-in period was identified for several locations, and the affected readings were removed. These periods were identified by comparing histograms of power, plotted against month and hour, for every location. The histograms displayed high power recordings at night along with power recordings above the poly-crystalline panel's rating of 50 W. However, these high recordings only occurred during the first several months of data collection (see Figure 4). Eventually, the power recordings behaved in a normal manner, with readings of 1 W or less at night and no recordings above 40 W.
The last discrepancy identified in the data was associated with low temperature readings. The lowest temperature recorded was -40°C. However, these readings appeared to be errors because they were recorded at southern sites such as Jonathan Dickinson Missile Tracking Station in Florida. For this reason, data with temperature readings lower than -39.3°C were removed. In addition, several high temperature and power readings were removed because they were extreme outliers that were easily identified as errors.
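The threshold-based part of this cleaning can be expressed as a simple record filter. The sketch below is illustrative only: it applies the -39.3°C temperature floor and the 50 W panel rating from the text as hard limits, whereas the authors also used histograms and judgment to remove break-in periods and outliers. The record layout, field names, and sample values are hypothetical.

```python
PANEL_RATING_W = 50.0      # poly-crystalline panel rating from the text
MIN_VALID_TEMP_C = -39.3   # readings below this were treated as sensor errors


def is_plausible(record):
    """Keep a record only if its temperature and power are physically credible."""
    return (
        record["temp_c"] >= MIN_VALID_TEMP_C
        and 0.0 <= record["power_w"] <= PANEL_RATING_W
    )


# Hypothetical records: one clean, one with a bad temperature, one over rating.
records = [
    {"site": "A", "temp_c": 21.5, "power_w": 18.2},
    {"site": "B", "temp_c": -40.0, "power_w": 12.0},
    {"site": "C", "temp_c": 30.1, "power_w": 480.0},
]
cleaned = [r for r in records if is_plausible(r)]
```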

Model Variables
After collecting and filtering the data, individual variables were considered to create a simple model that allowed for easy interpretation and could be applied to all sites. First, the final variables for the model were selected. These variables included poly-crystalline panel power output, test site latitude, Köppen-Geiger climate classification, altitude, month, hour, temperature, cloud ceiling, and humidity. The rationale for including each variable in the model is discussed next.
The first variable in the model was the power output of the poly-crystalline panel. The panel is rated at 50 W, but this value was never obtained at any of the sites after removing the discrepancies from the break-in period. At the lower end of the power range, any values lower than 0.25 W were removed from the dataset. Such low values can result from very dense cloud coverage or from snow or other debris accumulating on the panel. These readings were also removed to account for the potential error of the panel recording power from an artificial light source, such as nearby street and sidewalk lamps; prior work found that 0.14 W could be generated from a halogen lamp illuminating a polycrystalline panel with an area of 0.37 m² and an efficiency of 18% [15][16][17]. This source, combined with other potential calibration errors within the test system itself, is why values lower than 0.25 W were eliminated from the data readings.
Latitude was the next variable incorporated into the model and was treated as a continuous variable measured in degrees. Latitude was selected to account for the angle of the sun's irradiance. The ideal angle for the irradiance to strike the panel in order to maximize the area exposed is 90° [18]. This is why many fixed solar panels are tilted at an angle equal to their latitude: a horizontal panel receives irradiance directly at 90° only at 0° latitude, on the equator.
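The geometric role of latitude can be illustrated with the standard solar-noon relation, elevation = 90° - |latitude - declination|. The sketch below is a simplified illustration rather than part of the study: it ignores the atmosphere and time of day, and its default declination of 0° corresponds to an equinox. The function names are hypothetical.

```python
import math


def noon_elevation_deg(latitude_deg, declination_deg=0.0):
    """Solar elevation above the horizon at solar noon, in degrees."""
    return 90.0 - abs(latitude_deg - declination_deg)


def horizontal_fraction(latitude_deg, declination_deg=0.0):
    """Fraction of direct-beam irradiance normal to a flat panel at solar noon."""
    elevation = noon_elevation_deg(latitude_deg, declination_deg)
    # The sine of the elevation is the cosine of the angle of incidence.
    return max(0.0, math.sin(math.radians(elevation)))


equator = horizontal_fraction(0.0)       # sun directly overhead at an equinox
mid_latitude = horizontal_fraction(40.0)  # hypothetical site at 40 degrees north
```

Under these assumptions a flat panel at 40° latitude intercepts roughly three quarters of the direct beam that the same panel would intercept at the equator.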
Köppen-Geiger climate classifications were included in the model to identify how effectively they predict solar panel power production. Climate classification can account for location-specific characteristics that may not be included in other variables, such as wind speed, precipitation, vegetation, and geographical landmarks such as mountains. The Köppen-Geiger climate classification, along with each sub-classification, for every site location is shown in Table 1.
Other weather variables, such as temperature, cloud ceiling, and humidity, were added to the model due to their effect on solar power production. Temperature affects how efficiently the panel generates power, while cloud ceiling affects how much irradiance the panel receives [14][19][20][21][22][23]. Humidity affects both the efficiency of the panel and the amount of irradiance the panel receives: water vapor in the air affects the amount of diffuse irradiance that reaches the panel, and humidity can also have a soiling effect on the panel if water vapor seeps into the glass casing [24][25][26]. The Köppen-Geiger climate classifications were treated as categorical variables, while temperature, cloud ceiling, and humidity were treated as continuous variables. Temperature was measured in degrees Celsius, cloud ceiling in hundreds of feet, and humidity as a percentage.
Next, altitude was incorporated into the model to help account for the intensity of the irradiance on the panel. As irradiance travels to Earth, it can be deflected and diffused by water vapor and other particles in the air [26]. As altitude increases, there is a lower chance for irradiance to be deflected and diffused, resulting in a higher amount of direct irradiance hitting the solar panel compared to panels at lower altitudes. Altitude was measured in meters above sea level and treated as a continuous variable.
Finally, time was incorporated into the model to account for the position of the sun throughout the day and its seasonal effects. Time was accounted for by using the variables hour and month. Hour accounted for the position of the sun as it traverses the sky from east to west across the panel. Minute was not included because the position of the sun does not change significantly between the 15-minute measurements compared to its position after 60 minutes. As a result, hour was treated as a categorical variable with values from 0 to 23. The time frame for this model was further limited to the daylight hours between 10:00 AM and 3:00 PM. Creating a standard time frame helped eliminate bias in variable coefficients when the sun was not present, because northern locations have a shorter daylight period around the winter solstice [27]. Month helped account for seasonal changes throughout the year as well as the sun's elevation in the sky with reference to the southern horizon. Month was also treated as a categorical variable with values from 1 to 12.
After consolidating the data within the ranges of each variable, the final dataset consisted of 24,179 data points and 14 test sites (Curacao did not have any cloud ceiling measurements). In conclusion, these variables aided in analyzing the effect of climate classification on horizontal solar panel power output while holding influential variables constant.
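Treating month (or hour, or climate class) as categorical means each observation is coded with 0/1 indicator columns, with one category left out as the reference level. The sketch below shows that coding for month, taking January as the reference month; the helper name `dummy_encode` and the `is_` column prefix are hypothetical, and the study's own coding was done in RStudio.

```python
def dummy_encode(value, categories, baseline):
    """Return 0/1 indicators for every category except the baseline."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return {f"is_{c}": int(value == c) for c in categories if c != baseline}


MONTHS = list(range(1, 13))

# A November reading sets exactly one indicator to 1.
november = dummy_encode(11, MONTHS, baseline=1)

# A January reading is the reference level, so every indicator is 0.
january = dummy_encode(1, MONTHS, baseline=1)
```

Twelve months therefore contribute eleven indicator columns, and the effect of the reference month is absorbed by the model intercept.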

Analysis
After the model variables were finalized, 1,000 points were randomly removed to provide a validation set to confirm the model's predictive ability. Next, the conceptual model (see Equation 1) was specified into an additive, statistical model for use in the empirical analysis (Equation 2). After the full statistical model was completed, two reduced models were developed in order to conduct Wald tests for joint significance of the weather and climate coefficients [28]. The Wald test was used in place of the standard F-test because not all the underlying assumptions for an OLS model were met during initial data examination.
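A random hold-out of this kind can be sketched as follows. The seed, the function name `split_validation`, and the use of integer stand-ins for data rows are assumptions made for illustration; the text does not state how the random draw was performed.

```python
import random


def split_validation(rows, n_validation, seed=0):
    """Randomly set aside n_validation rows; return (training, validation)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    held_out = set(rng.sample(range(len(rows)), n_validation))
    training = [r for i, r in enumerate(rows) if i not in held_out]
    validation = [r for i, r in enumerate(rows) if i in held_out]
    return training, validation


# Stand-in rows: one integer per data point in the final dataset.
rows = list(range(24179))
training, validation = split_validation(rows, 1000)
```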
The full and reduced models were compared using RStudio and conclusions were made on the effectiveness of the Köppen-Geiger climate classification system to predict photovoltaic power output compared to weather data.
The conceptual model can be seen in Equation 1; it identifies three specific factors that impact photovoltaic power production as expressed earlier in the paper. These factors can be broken into specific variables to better understand their influence on power, as shown in Equation 2. This equation contains 28 variables and associated coefficients. The variables grouped together represent categorical variables that have multiple dummy variables. Dummy variables are equivalent to 0 or 1 in an equation such that one category is represented at a time. There is also one less dummy variable than there are categories for each group due to one variable being the baseline, which is taken into account by the intercept or β0. Finally, each numbered coefficient is defined in Table 2 below.
Equation (1) is the conceptual photovoltaic power prediction model.
Equation (2) is the additive statistical model.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{3\text{-}13} X_{3\text{-}13} + \beta_{14\text{-}18} X_{14\text{-}18} + \beta_{19} X_{19} + \beta_{20} X_{20} + \beta_{21} X_{21} + \beta_{22\text{-}27} X_{22\text{-}27} \qquad (2)$$
Next, assumptions were tested to determine if the proposed model could be viable. Assumptions that were tested include multicollinearity, serial correlation, normality, homoscedasticity, and coefficient significance. These were all tested in RStudio to determine if further analysis could be carried out. If these assumptions were not met, appropriate measures needed to be taken to draw valid conclusions from the model.
After all assumptions were tested, a reduced model was created for both weather and climatic variables in order to conduct the Wald test. The Wald test compared these reduced models to the full model to determine if the variables provided any value in the prediction of power. The reduced models can be seen below in Equations 3 and 4.
Equation (3) is the reduced prediction model without Köppen-Geiger climate classifications.
Equation (4) is the reduced prediction model without the weather variables.
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_{3\text{-}13} X_{3\text{-}13} + \beta_{14\text{-}18} X_{14\text{-}18} + \beta_{22\text{-}27} X_{22\text{-}27} \qquad (4)$$
A Wald test was conducted for each reduced model. Depending on the test statistic and the associated chi-squared critical value, the null hypothesis is either rejected or not rejected. If the null hypothesis is rejected, the test concludes that the coefficients are not jointly equal to zero and thus add value to the power prediction model. If the null hypothesis cannot be rejected, there is insufficient evidence that the coefficients differ from zero, and the tested variables add no demonstrable value to the power prediction model. In other words, the variables tested cannot be shown to affect the power output of horizontal photovoltaic cells.
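For reference, a joint-significance Wald statistic is commonly written in the textbook form below. This is a sketch of the standard form, not the authors' stated computation. The symbols are notation introduced here and do not appear in the original: $R$ is a $q \times k$ matrix that selects the $q$ coefficients dropped from the full model, $\hat{\beta}$ is the estimated coefficient vector of the full model, and $\hat{V}$ is its estimated (robust) covariance matrix.

```latex
H_0 : R\beta = 0, \qquad
W = (R\hat{\beta})^{\top} \left( R \hat{V} R^{\top} \right)^{-1} (R\hat{\beta})
\;\overset{a}{\sim}\; \chi^2_q \quad \text{under } H_0 .
```

In this form, the null hypothesis is rejected when $W$ exceeds the chi-squared critical value with $q$ degrees of freedom, where $q$ is the number of coefficients removed to obtain the reduced model.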
Finally, the two reduced models were compared and analyzed to determine their effectiveness. First, each model's goodness-of-fit was tested by calculating its R-squared value after inputting the 1,000 random validation points. Next, the models' predictive abilities were tested. Each model's root mean squared error and mean absolute error were calculated to measure how well the model predicted power from the input variables relative to the actual power recorded. Together, the results of these tests determined the effectiveness of incorporating Köppen-Geiger climate classifications.
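The three comparison metrics have simple closed forms, sketched below with plain Python. The function names and the four sample values are hypothetical, and the R-squared shown is the unadjusted form; the study's values were produced in RStudio.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between recorded and predicted power."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def mae(actual, predicted):
    """Mean absolute error between recorded and predicted power."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Share of variance in the recorded values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical recorded and predicted power values in watts.
actual = [10.0, 20.0, 30.0, 40.0]
predicted = [12.0, 18.0, 33.0, 37.0]

model_rmse = rmse(actual, predicted)
model_mae = mae(actual, predicted)
model_r2 = r_squared(actual, predicted)
```

Because RMSE squares each error before averaging, it penalizes a few large misses more heavily than MAE does, which is why reporting both gives a fuller picture.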

Figure 5. Normality Q-Q plot.
The full model was estimated using the remaining 23,179 data points after removal of the validation dataset. First, normality was tested by creating a quantile-quantile (Q-Q) plot, as shown in Figure 5. If the dataset were normally distributed, the graphed points would follow the slanted, dotted line across the plot; instead, the tail ends of the plotted points stray from the ideal line, indicating a non-normal distribution. However, the sampling distribution of the coefficients can still be treated as approximately normal because of the central limit theorem, under which estimates from a large sample approach normality regardless of the shape of the population distribution [29].
Next, multicollinearity was tested amongst the independent variables to determine if any of the variables were dependent upon each other. The variance inflation factor (VIF) was determined for each variable. An ideal VIF for a variable would be 1; however, VIFs under 10 are acceptable [28]. In the full model, two variables had a VIF above 10: climate classification had a VIF of 37.64, while altitude had a VIF of 12.49. First, altitude was removed from the model, which reduced the VIF for climate classification from 37.64 to 4.20, indicating that altitude and climate classification were highly correlated. Apart from this change, no other VIF value changed by more than 0.08. After the removal of altitude, the highest VIF was that of temperature, with a value of 4.41. As a check, climate classification was instead removed and altitude was reinserted into the model. This also lowered all variables' VIFs below 10, with temperature having the highest VIF of 3.69. Because climate classification was the variable under investigation in this research, altitude was ultimately left out of the model.
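The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below covers only the simplest case of two predictors, where that R² equals the squared Pearson correlation between them; it is an illustration, not the multi-variable calculation performed for the study, and the function names and sample values are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x, y):
    """VIF shared by both predictors when the model has exactly two."""
    r = pearson_r(x, y)
    return 1.0 / (1.0 - r * r)


# Hypothetical, moderately correlated predictor values.
first = [1.0, 2.0, 3.0, 4.0, 5.0]
second = [2.0, 1.0, 4.0, 3.0, 5.0]
vif = vif_two_predictors(first, second)
```

A VIF of 1 corresponds to uncorrelated predictors, and the value grows without bound as the correlation approaches 1 in magnitude.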
After multicollinearity, serial correlation was tested. To test this assumption, the data was first organized alphabetically by location and then chronologically within each location. A plot of the residuals was then created, as shown in Figure 6 below. From this graph, autocorrelation can be clearly identified by the tendency of the data to stay continually above or below the x-axis at y = 0, marked by the red line. Due to this evident trend, it was concluded that the residuals were correlated, which was further verified by the data failing a Durbin-Watson test [28]. Serial correlation was accounted for by using robust standard errors, thus allowing for valid statistical inference. This process was completed in RStudio using the "sandwich" package and the "coeftest" command [30].
Following serial correlation, homoscedasticity of the residuals was tested by first examining a plot of the residuals versus the fitted values of the model, as shown in Figure 7. In the plot, the values are at first closely grouped but spread progressively further apart moving left to right across the plot. This pattern is indicative of heteroscedasticity. Under homoscedasticity, the values would ideally follow a random pattern with no specific clustering throughout the plot. Heteroscedasticity was confirmed with a Breusch-Pagan test [28]. Like serial correlation, heteroscedasticity was handled by using robust standard errors to adjust the standard error of each estimated coefficient and thereby obtain the correct p-value and its significance. The robust standard errors were again calculated in RStudio using the "sandwich" package and the "coeftest" command, this time specifying within the command a correction for both serial correlation and heteroscedasticity [30].
Finally, each variable's significance was determined with a t-test using an alpha of 0.05 [29]. Any variable with a p-value lower than 0.05 was considered significant. After implementing robust standard errors, the climate classification variables and Month 11 had the largest changes in p-value. Of the climate classification variables, the p-value of Csa increased the most, making the variable insignificant. This insignificance led to the conclusion that there is no distinguishable difference in the effect on power production between Csa and Af (the base climate of the model). The locations with the climate classification Csa are both in California, while the location with the climate classification Af is in Hawaii. The insignificance can potentially be explained by the two climates sharing similarities in local weather patterns and other climatic features while controlling for temperature, humidity, and cloud ceiling. Similarly, Dfa and Csb also became insignificant after adjusting the standard errors for serial correlation and heteroscedasticity. Again, there could be similarities between these climates and Af, such as precipitation or wind speed, while controlling for temperature, humidity, and cloud ceiling. These variables were not removed from the model in order to maximize the number of locations to which the model can be applied. In conclusion, the climates Csa, Dfa, and Csb could not be differentiated from Af in predicting the power output of horizontal photovoltaic panels.
The only other variable within the model that was insignificant was Month 11 (November), with a p-value of 0.390. This high p-value could be due to November having a similar effect on horizontal photovoltaic power prediction when compared to the base month of the model (January) while controlling for the other variables in the model. This similarity most likely reflects the position of the sun (solar elevation) in the sky throughout the month. In summary, the effect on power production could not be distinguished between November and January. However, November was retained because it is a key variable that will be used for prediction.
In conclusion, the data obtained from the different test locations was developed into a simplified linear horizontal photovoltaic power model. The model was unable to meet the serial correlation and homoscedasticity assumptions, requiring a robust calculation of each coefficient's standard error and p-value. Overall, the Köppen-Geiger climate classification determined to have the highest positive effect on horizontal photovoltaic power production compared to Af is Cfb (warm temperate, fully humid, and warm summer).
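The serial correlation finding summarized above rests in part on the Durbin-Watson statistic, which has a simple closed form: the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below computes only the statistic, not the associated critical values, and the residual sequences are hypothetical.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic for residuals in time order."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    denominator = sum(e * e for e in residuals)
    return numerator / denominator


# Residuals that stay on one side of zero for long runs (positive correlation).
persistent = [1.0, 1.2, 0.9, 1.1, -1.0, -1.2, -0.8, -1.1]

# Residuals that flip sign every step (negative correlation).
alternating = [1.0, -1.0, 1.0, -1.0]

dw_persistent = durbin_watson(persistent)
dw_alternating = durbin_watson(alternating)
```

Values near 2 suggest no first-order autocorrelation, values toward 0 suggest positive autocorrelation (the pattern of residuals lingering above or below zero), and values toward 4 suggest negative autocorrelation.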
With the completion of an initial model, two reduced models were compared against the full model to determine if climate classification and the weather variables (temperature, humidity, and cloud ceiling) added value to the model. This analysis was conducted using a Wald test in order to properly account for the model's robust standard errors, which were needed because the model exhibited heteroscedasticity and serial correlation. Both robust Wald tests were completed with the null hypothesis stating that the variables removed in the reduced models did not add value to the model [28]. In both instances, the null hypothesis was rejected. It was therefore determined that the Köppen-Geiger climates and the weather variables (temperature, humidity, and cloud ceiling) added value to the full model.
Next, the models' predictive abilities were assessed based upon their ability to fit the validation dataset. First, the adjusted R-squared values of the full and reduced models were compared to determine each model's goodness-of-fit, or how well the model fit the validation data, as shown in Table 3. The model with weather variables and no climate variables can explain, on average, 21.75% more variance of the power output of horizontal photovoltaic cells compared to the model with climate variables and no weather variables. Although the weather variables provide a better fit, incorporating climate did increase the amount of variation the model can explain by 3.05%. An example of each model's goodness-of-fit can be seen in Figure 8 below. In the figure, the predicted power output is graphed against the actual power recorded at March Air Reserve Base between Nov 28 and Nov 30, 2017. Because the model covers only the period between 10:00 AM and 3:00 PM, times outside this period were all represented as 0 W.
Finally, the predictive abilities of the full and reduced models were compared. This was completed by inputting the validation dataset into each model and measuring the difference between the actual values recorded and the predicted values produced by the models. Root mean squared error and mean absolute error were used to compare the models. Running the 1,000 validation points through the models produced the values in Table 4. The results are similar to the R-squared values in Table 3 above, showing that the reduced model with weather variables but no climate variables produced less error while predicting horizontal photovoltaic power. However, when climate variables were included (i.e., the full model), the error decreased further. In conclusion, weather variables within the model were able to fit the validation data better while producing less error compared to climate variables.

Conclusion
Horizontal panel power output and weather data were collected from 28 test locations around the globe between June 2017 and September 2018. The dataset started with over half a million data points collected from 16 reliable sites. However, the data was narrowed down to 14 test sites, resulting in 24,179 usable data points and leading to the development of a linear model to predict power output. The model incorporated site-specific weather and geographical characteristics, along with Köppen-Geiger climate classifications, in order to determine the effect of adding climate to the model. After performing a Wald test between the full model and the reduced model without Köppen-Geiger climate variables, it was determined that the climate variables did provide added value to the full model. Although adding Köppen-Geiger variables provided added value to the model, these variables were less effective in fitting and predicting the validation dataset.
After analyzing each model's goodness-of-fit and predictive abilities, it was concluded that cloud ceiling, temperature, and humidity were, on average, able to account for more variation compared to Köppen-Geiger climates. However, adding climate to these weather variables further increased the amount of variation explained by 3% and lowered the overall error within the model. The best Köppen-Geiger climate classification was Cfb, which was able to produce, on average, 30% more power compared to the climate Af. Similarly, the worst climate was BSk, which produced, on average, 9% less power compared to climate Af. Overall, the model can predict the power output of horizontal poly-crystalline photovoltaic panels at 1,213 DoD installations between the hours of 1000 and 1500. However, the final model was only able to account for approximately 55% of the variation within the data. In conclusion, it was discovered that the cost-effectiveness of a photovoltaic array depends on Köppen-Geiger climate regions, in addition to weather characteristics and the orientation of the array.
Authors' Note: The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof. Reference to specific commercial products does not constitute or imply its endorsement, recommendation, or favoring by the United States Government. The authors declare this is a work of the United States Government and is not subject to copyright protection within the United States. This article was cleared with case number 88ABW-2019-1585. Additionally, the authors thank Jada Williams for assistance with manuscript preparation.