An R package for long-term electricity demand forecasting in hourly resolution on a country-wide scale.
The functions included in the package can be used separately or combined in the function full_forecast().
## Installation

Once on CRAN (aspirational):

```r
install.packages("oRaklE")
```
Alternatively, install the development version from GitHub:

```r
install.packages("devtools")
devtools::install_github("Autarky-Power/oraklE_R")
```
Install requirements:

```r
packages <- c("caret", "countrycode", "doParallel", "dplyr", "ggplot2", "ggthemes",
              "glmnet", "httr", "jsonlite", "lubridate", "MLmetrics", "MuMIn",
              "parallel", "patchwork", "purrr", "R.utils", "readxl", "xml2")
install.packages(setdiff(packages, rownames(installed.packages())))
library("oRaklE")
```
As a first step, the package gets the load data from the Transparency Platform of the European Network of Transmission System Operators for Electricity (ENTSO-E), https://transparency.entsoe.eu/. If the country is not an ENTSO-E member, a dataset needs to be supplied manually; an example with South African load data is included at the end of this section. You can either supply your own ENTSO-E Transparency Platform API key or use one of the deposited keys by setting api_key to "default".
```r
# Get initial data
demand_data <- get_entsoE_data(2017, 2021, "France", api_key = "default")
```
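If a country is not covered by ENTSO-E, the input has to be constructed manually instead. Below is a minimal sketch of what such a data frame could look like; the column names and structure are illustrative assumptions, not the package's documented interface:

```r
# Hypothetical manually built dataset (column names are assumptions);
# `my_hourly_load_mw` stands in for your own vector of hourly load values in MW
demand_data <- data.frame(
  date = seq(as.POSIXct("2017-01-01 00:00", tz = "UTC"),
             as.POSIXct("2021-12-31 23:00", tz = "UTC"), by = "hour"),
  load = my_hourly_load_mw,
  country = "South Africa"
)
```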
The next step checks for missing data and replaces NaN values with the load data from one week prior at the same time: a missing value on a Thursday at 8 pm, for example, will be filled with the load value at 8 pm the Thursday before. It also adjusts the dataset to account for changes in time resolution, ensuring that the data is standardized to an hourly resolution. French load data, for example, is reported at an hourly resolution from 2017 to 2021 and at a half-hourly resolution from 2022 onwards. It is therefore recommended to run this function even if there are no missing values.
```r
# Fill missing values
demand_data_filled <- fill_missing_data(demand_data)
```
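For intuition, here is a minimal sketch of the week-prior imputation idea; this is an illustration only (the package handles it internally) and it assumes a numeric `load` column:

```r
# Replace each missing hourly value with the value observed 168 hours
# (one week) earlier
x <- demand_data$load
gaps <- which(is.na(x))
gaps <- gaps[gaps > 168]   # a full week of history is needed
x[gaps] <- x[gaps - 168]
```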
After the data is downloaded and standardized, the load time series is decomposed into three components: a yearly long-term trend, a daily mid-term seasonality, and an hourly short-term seasonality. If the data is available only at a daily resolution, the calculation of hourly seasonality is skipped.
```r
# Decompose the load data
decomposed_data <- decompose_load_data(demand_data_filled)
```
The function returns a list of three data frames, one for each time series component:

```r
decomposed_data$longterm
decomposed_data$midterm
decomposed_data$shortterm
```
In the following steps, each time series will be modelled individually.
First, historical data from the ENTSO-E archive, starting from 2006, is added to the long-term trend series. A dataset with historic yearly demand data for each ENTSO-E member country is included in the library.
```r
# Get historical data for the respective country
longterm <- get_historic_load_data(decomposed_data$longterm)
```
The long-term electricity demand trend is estimated with different regression algorithms based on macro-economic covariates. Ten macro-economic indicators are fetched via an API call to the World Bank Development Indicators database. Additional macro-economic covariates can be added manually.
```r
# Get macro-economic covariates for the respective country
longterm_all_data <- get_macro_economic_data(longterm)
head(longterm_all_data)
```
```
country year avg_hourly_demand population          GDP industrial_value_added ...
     FR 2006          54417.12   63628261 2.279283e+12               19.28408
     FR 2007          54769.12   64021737 2.334550e+12               19.13674
     FR 2008          56214.75   64379696 2.340502e+12               18.81383
     FR 2009          55409.97   64710879 2.273252e+12               18.30484
...
```
It should be noted that the average hourly demand (avg_hourly_demand) refers to the average demand over each hour of the respective year (typically in MW). To calculate the total annual demand, multiply avg_hourly_demand by 8760 hours.
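For example, the average hourly demand of 54417.12 MW reported for France in 2006 translates into a total annual demand of roughly 476.7 TWh:

```r
# Total annual demand (MWh) = average hourly demand (MW) x 8760 h
54417.12 * 8760   # ~476,700,000 MWh, i.e. ~476.7 TWh
```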
After the dataset is fully prepared, the best long-term prediction models are derived with multiple linear regression and k-fold cross-validation. Details on the mathematical approach are specified in the accompanying paper. The test_set_steps variable defines how many years are used for the test set (also commonly referred to as the validation set). The testquant variable defines how many of the initial best models are subjected to cross-validation.
```r
# Calculate the best long-term trend prediction models
longterm_predictions <- long_term_lm(longterm_all_data, test_set_steps = 2, testquant = 500)
```
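For intuition, a minimal sketch of the chronological split implied by test_set_steps = 2 (illustrative only; the function performs the split internally):

```r
# With test_set_steps = 2, the last two years of the yearly long-term data
# are held out as the test (validation) set
n <- nrow(longterm_all_data)
train_set <- longterm_all_data[1:(n - 2), ]
test_set  <- longterm_all_data[(n - 1):n, ]
```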
The three best models as well as plots for each model are generated and saved.
Once the best models are derived, future predictions can be made. Since these models rely on macroeconomic covariates, which are unknown for future years, forecasts for these indicators are needed. The library can obtain these forecasts from the World Economic Outlook Database of the International Monetary Fund, or users can manually include predictions from other sources or specific scenarios.
The end_year variable specifies until which year the predictions will be made. If the WEO dataset is used, predictions can only be made up to 2028. If the dataset variable is set to any option other than WEO, the function will prepare the data frame structure and list the required macroeconomic indicators.
```r
## Prepare dataset for future predictions

# With the dataset option set to the default ("WEO")
longterm_future_macro_data <- long_term_future_data(longterm_predictions, end_year = 2028,
                                                    dataset = "WEO")
tail(longterm_future_macro_data)
```
```
country year avg_hourly_demand population          GDP industrial_value_added ...
     FR 2021          53225.29   67764304 2.575192e+12               16.39563
     FR 2022                NA   71764951 2.640147e+12               17.56864
     FR 2023                NA   75808393 2.665256e+12               17.73190
     FR 2024                NA   77673096 2.701124e+12               17.71412
     FR 2025                NA   79192297 2.750043e+12               18.21028
     FR 2026                NA   80760586 2.795947e+12               18.48387
     FR 2027                NA   82230070 2.838581e+12               18.72595
     FR 2028                NA   83540334 2.879402e+12               19.03256
```
```r
# With the dataset option set to anything else (any string works)
longterm_future_macro_data <- long_term_future_data(longterm_predictions, end_year = 2028,
                                                    dataset = "manual")
```
Output:

```
If you want to use your own dataset you will need predictions
for the following macro-economic variables:

for model 1:
GDP, GNI, industrial_value_added, rural_population

for model 2:
GDP_growth, household_consumption_expenditure, rural_population, service_value_added

for model 3:
GDP, industrial_value_added, rural_population, service_value_added
```
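In that case the future rows of the listed covariates have to be filled with your own projections before continuing. A minimal sketch, where my_gdp_scenario is a hypothetical vector of scenario values:

```r
# Overwrite the future rows of a required covariate with your own scenario;
# future rows can be identified by the still-missing demand values
future_rows <- is.na(longterm_future_macro_data$avg_hourly_demand)
longterm_future_macro_data$GDP[future_rows] <- my_gdp_scenario
```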
After the dataset for the future predictions is prepared, long-term forecasts up to the designated year can be made. One forecast with each of the three best models is calculated, and the results are saved and plotted.
```r
# Generate future long-term component forecasts
longterm_future_predictions <- long_term_future(longterm_future_macro_data)
```
The mid-term component is the difference between the average hourly demand of the respective day and the average hourly demand of the respective year:

\[D_M(y,d) = D_m(y,d) - D_L(y)\]

where \(D_M(y,d)\) refers to the mid-term component at day \(d\) of year \(y\), \(D_m(y,d)\) refers to the average hourly load of day \(d\), and \(D_L(y)\) refers to the average hourly load of year \(y\).
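A minimal sketch of this calculation, assuming a hypothetical data frame daily_load with a date column and the average hourly load of each day in a load column:

```r
library(dplyr)
library(lubridate)

# Illustrative computation of the mid-term component D_M(y,d)
midterm <- daily_load %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  mutate(D_L = mean(load)) %>%   # yearly average hourly load D_L(y)
  ungroup() %>%
  mutate(D_M = load - D_L)       # mid-term component D_M(y,d) = D_m(y,d) - D_L(y)
```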
The mid-term time series is modelled using seasonal, calendar, and temperature-related variables. Seasonal covariates include the month (January-December), day of the week (Sunday-Saturday), and a dummy variable indicating whether the day is a holiday or a workday. Information about public holidays is retrieved from https://date.nager.at.
```r
# Get all national holidays within the respective time period
midterm_demand_data <- add_holidays_mid_term(decomposed_data$midterm)
```
Daily temperature values are obtained by first retrieving the 20 most populated regions of the respective country or area from https://rapidapi.com/wirefreethought/api/geodb-cities. Next, the nearest weather station for each region is identified using https://rapidapi.com/meteostat/api/meteostat, and the daily temperature data is downloaded. A population-weighted daily average temperature for the country is then calculated from the population shares and temperature values of the 20 regions. You can either supply your own API key (from https://rapidapi.com/) that is subscribed to the GeoDB Cities and Meteostat APIs or use one of the deposited keys by setting api_key to "default".
```r
# Get daily average temperature values
midterm_demand_and_weather_data <- get_weather_data(midterm_demand_data, api_key = "default")
```
This function returns a list of two data frames: one with the prepared demand data for the models and one with the daily temperature values for the 20 regions:

```r
midterm_demand_and_weather_data$demand
midterm_demand_and_weather_data$temperature_data
```
The mid-term seasonality can then be modelled using the midterm_demand_and_weather_data$demand data frame. Similar to the long-term trend models, the mid-term component is modelled with different regression techniques. Because the relationship between electricity demand and temperature is non-linear, the library implements two different options to account for this non-linearity:
1.) If method = "temperature transformation"

The daily temperature values are transformed into heating and cooling degree days (HD and CD):

\[HD = \max \lbrace T_{\text{Ref}} - T, 0 \rbrace \quad CD = \max \lbrace T - T_{\text{Ref}}, 0 \rbrace\]

where \(T_{\text{Ref}}\) is the reference temperature (usually 18 °C for Central European countries). The reference temperature can be set with the Tref option. Squared, cubed, and up to two-day lagged values of the calculated heating and cooling degree days, along with the original daily temperatures, are also included as covariates. The covariate selection is then done with a LASSO method (cite); a short sketch of the transformation follows after this list.
2.) If method = "spline"
A spline regression is instead used without the transformation of the temperature data.
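For option 1, a minimal sketch of the degree-day transformation (illustrative; the package builds these covariates internally):

```r
library(dplyr)

Tref <- 18  # reference temperature in °C
covariates <- data.frame(temperature = c(-2, 8, 18, 25, 31)) %>%
  mutate(HD  = pmax(Tref - temperature, 0),        # heating degree days
         CD  = pmax(temperature - Tref, 0),        # cooling degree days
         HD2 = HD^2, HD3 = HD^3,                   # squared and cubed terms
         CD2 = CD^2, CD3 = CD^3,
         HD_lag1 = lag(HD), HD_lag2 = lag(HD, 2),  # up to two-day lags
         CD_lag1 = lag(CD), CD_lag2 = lag(CD, 2))
```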
The test_set_steps variable defines how many days are used for the test set. It should be consistent across all three model components. Because we set the test set steps for the long-term model to 2 years, it should be set to 730 in the mid-term model (2 years × 365 days).
```r
# Calculate the best mid-term seasonality prediction model
midterm_predictions <- mid_term_lm(midterm_demand_and_weather_data$demand, test_set_steps = 730,
                                   Tref = 18, method = "temperature transformation")
```
After identifying the best mid-term model, future seasonality predictions can be generated. While the future calendar and holiday covariates are deterministic, the future daily average temperatures are impossible to know with certainty. To estimate these, the library calculates the average of the past three observations for the same calendar day. For example, the average temperature for July 22, 2025, would be estimated by averaging the temperatures observed on July 22 in 2022, 2023, and 2024. The variable end_year specifies the final year up to which future predictions will be made and should be consistent across all three model components (long-, mid-, and short-term).
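A minimal sketch of this heuristic, assuming a hypothetical data frame temperature_history with date and temperature columns:

```r
library(dplyr)
library(lubridate)

# Estimate the temperature of a future date as the mean of the last three
# observations for the same calendar day (illustrative, not the package code)
estimate_future_temperature <- function(temperature_history, future_date) {
  temperature_history %>%
    filter(month(date) == month(future_date),
           day(date)   == day(future_date)) %>%
    arrange(date) %>%
    slice_tail(n = 3) %>%   # the three most recent years
    summarise(estimate = mean(temperature)) %>%
    pull(estimate)
}

# e.g. estimate_future_temperature(temperature_history, as.Date("2025-07-22"))
```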
```r
# Generate future mid-term component forecasts
midterm_future_predictions <- mid_term_future(midterm_predictions, end_year = 2028)
```
The short-term component \(D_S(y,d,h)\) corresponds to the respective intra-day patterns. It is isolated by subtracting the long-term and mid-term components \(D_L(y)\) and \(D_M(y,d)\) from the hourly load data \(D_s(y,d,h)\) of year \(y\), day \(d\), and hour \(h\):

\[D_S(y,d,h) = D_s(y,d,h) - D_M(y,d) - D_L(y)\]
The short-term time series is modelled with multiple regression using only the type of hour and a holiday indicator as covariates. Information about public holidays is again retrieved from https://date.nager.at via API.
```r
# Get all national holidays within the respective time period
shortterm_demand_data <- add_holidays_short_term(decomposed_data$shortterm)
```
A separate short-term model is generated for each combination of day type and month, resulting in a total of 84 models (12 months × 7 day types). This approach has proven to yield better results than using a single model for the entire time series. The test_set_steps variable defines how many hours are used for the test set. It should be consistent across all three model components. Because we set the test set steps for the long-term model to 2 years, it should be set to 17520 in the short-term model (2 years × 8760 hours).
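For intuition, a minimal sketch of the one-model-per-(month, day type) idea; this is illustrative, not the package's implementation, and assumes columns date, hour, holiday, and the short-term component shortterm:

```r
library(dplyr)
library(lubridate)

# Fit a separate linear model for each of the 12 x 7 = 84 (month, weekday) groups
models <- shortterm_demand_data %>%
  mutate(month = month(date), weekday = wday(date)) %>%
  group_by(month, weekday) %>%
  group_map(~ lm(shortterm ~ factor(hour) + holiday, data = .x))

length(models)   # 84
```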
```r
# Calculate the best short-term seasonality models
shortterm_predictions <- short_term_lm(shortterm_demand_data, test_set_steps = 17520)
```
Because all covariates used are deterministic, the future short-term seasonality forecast can easily be generated.
```r
# Generate future short-term component forecasts
shortterm_future_predictions <- short_term_future(shortterm_predictions, end_year = 2028)
```
After all three components have been modelled successfully, the predictions can be combined into the full demand series. The longterm_model_number option specifies which of the three best long-term models will be used. The function also prints four statistical metrics: MAPE, RMSE, accuracy, and the R² value for both the training and the validation set.
```r
# Combine all model predictions and evaluate the accuracy
full_model_predictions <- combine_models(longterm_predictions, midterm_predictions,
                                         shortterm_predictions, longterm_model_number = 1)
```
Output:

```
*** Final Model Metrics ***

MAPE:
Training Set: 0.0267
Test Set: 0.0384

RSQUARE:
Training Set: 0.9754
Test Set: 0.9474

ACCURACY:
Training Set: 97.33 %
Test Set: 96.16 %

RMSE:
Training Set: 1863.2 MW
Test Set: 2635.4 MW
```
Note that the algorithm reaches a MAPE below 4% over the validation set of 17520 hours (2 years) for French demand data.
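The printed metrics follow the usual definitions, with the accuracy apparently reported as (1 − MAPE) × 100, consistent with the output above. A minimal sketch, where actual and predicted are hypothetical vectors of observed and forecast demand:

```r
# Illustrative metric definitions
mape     <- mean(abs((actual - predicted) / actual))
rmse     <- sqrt(mean((actual - predicted)^2))
rsquare  <- cor(actual, predicted)^2
accuracy <- (1 - mape) * 100   # in percent; 1 - 0.0267 gives the 97.33 % above
```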
Similarly, full future predictions up to the specified end year can be generated.
```r
# Generate full future demand predictions in hourly resolution
full_model_future_predictions <- combine_models_future(longterm_future_predictions,
                                                       midterm_future_predictions,
                                                       shortterm_future_predictions,
                                                       longterm_model_number = 1)
```
The full_forecast function automatically performs all the steps described above and generates a complete forecast for a specified country. The future option determines whether future forecasts should be made and, if so, up to which year.
```r
forecast_data <- full_forecast(start_year = 2017, end_year = 2021, country = "France",
                               test_set_steps = 2, future = "yes",
                               end_year_future = 2028)  # forecast horizon; argument name assumed
```