Published on Feb 29, 2020
This post describes a group project I worked on in 2019 for Correlated Data, a class at Macalester College taught by Prof. B. Heggeseth. We learned a range of statistical techniques in the class and chose a time-series analysis of coal consumption in the U.S. as our capstone project. In summary, we modeled monthly coal consumption data since 1973 with an ARIMA model and created a forecast for the next two years. It was a case study in applying time-series analysis to real-world data.
I worked with three other Macalester students (also friends of mine): Jack Freier ’20, Karrin Khandelwal ’20, and Carl Francalangia ’20.
Data Correlation
In statistics, modeling is the use of mathematical ideas to approximate the reality that generates the data we observe. Using the model, we create a forecast, which is a projection of what could happen in the future based on historical data. When we build a model, we often assume that the observations are independent.
However, in real-world data this assumption often does not hold. For example, if we take a survey in a classroom, your answer is likely to be related to the answer of the person sitting next to you. We try to account for these relationships by including control variables such as age or gender. Even then, there may still be a relationship between you and the person next to you that the controls can’t explain.
With time-series data, instead of comparing you and your neighbor, we compare the observations from today and yesterday, or from this month and last month. Whatever kind of data we have, once we assume that the observations depend on each other, the model changes significantly, and so does the forecast.
We got the data from the website of the U.S. Energy Information Administration (EIA). The website has datasets of domestic energy consumption, production, and other related measures for each source of energy. We looked at monthly coal consumption from January 1973 to May 2019. We modeled it with ARIMA, which stands for autoregressive integrated moving average. Before fitting the ARIMA model, we broke the original values into three components.
The first component is the trend, which is the long-term average change over time: is the value going up, going down, or staying flat? We then subtract the estimated trend from the original values so that we can deal with the second component, seasonality. Seasonality is a pattern that repeats in every time cycle. In this dataset, for example, January and August have higher values than other months, likely because of factors such as heating and air conditioning. We remove the seasonality from the detrended values and are left with the third component, the residuals. The residuals are the high-frequency variability in the observations that is not captured by trend or seasonality. This is where we look for correlation between the values and fit the ARIMA model.
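The three-step breakdown above can be sketched in a few lines of Python. This is only an illustration on a made-up series: the straight-line trend, the monthly-mean seasonality, the toy numbers, and the function names (`fit_line`, `monthly_means`, `decompose`) are all assumptions for the example, not the models we actually fit.

```python
from statistics import mean

def fit_line(y):
    """Least-squares straight line through the points (0, y[0]), (1, y[1]), ..."""
    n = len(y)
    t_bar = (n - 1) / 2
    y_bar = mean(y)
    sxx = sum((t - t_bar) ** 2 for t in range(n))
    sxy = sum((t - t_bar) * (v - y_bar) for t, v in enumerate(y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    return [intercept + slope * t for t in range(n)]

def monthly_means(values, period=12):
    """Average of all observations that fall in the same position of the cycle."""
    return [mean(values[m::period]) for m in range(period)]

def decompose(y, period=12):
    """Split y into trend + seasonal + residual (additive decomposition)."""
    trend = fit_line(y)
    detrended = [v - tr for v, tr in zip(y, trend)]
    cycle = monthly_means(detrended, period)
    seasonal = [cycle[t % period] for t in range(len(y))]
    residual = [d - s for d, s in zip(detrended, seasonal)]
    return trend, seasonal, residual

# Made-up data: 4 years of monthly values with a rising trend and a yearly pattern.
toy_pattern = [3, 1, -1, -2, -2, 0, 2, 3, 0, -2, -1, -1]
y = [100 + 0.5 * t + toy_pattern[t % 12] for t in range(48)]

trend, seasonal, residual = decompose(y)
```

By construction, the three pieces add back up to the original series, which is what lets us recombine them later into a forecast.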
The graph shows the original data of domestic coal consumption between 1973 and 2019. We can see an overall increasing trend until 2006, after which the value drops rapidly. We also observed clear seasonality: repeating patterns of ups and downs within each one-year cycle. The purple curve is our approximation of the trend. We used a method called splines, which fits separate polynomial pieces to intervals that each have a similar trend and joins them at the breakpoints. We split the x-axis into three intervals, with breakpoints at 2006 and 2010, to capture the different trends.
The graph shows the de-trended data. We got these values by subtracting the trend from the original values. Intuitively, the purple line from the previous graph is now the horizontal line at zero. Next, we take the seasonality out of these values.
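To make the spline idea concrete, here is a minimal sketch of a piecewise trend with two breakpoints. It uses straight-line pieces, a simplification chosen for the example, and fits them by ordinary least squares. The toy data, the time scale (years since 1973, so the breakpoints 2006 and 2010 become 33 and 37), and the helper names are assumptions for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def spline_basis(t, knots):
    """Intercept, slope, and one 'hinge' term per knot that changes the slope there."""
    return [1.0, t] + [max(0.0, t - k) for k in knots]

def fit_linear_spline(ts, ys, knots):
    """Least-squares coefficients, via the normal equations (X'X) beta = X'y."""
    X = [spline_basis(t, knots) for t in ts]
    p = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    xty = [sum(row[i] * y for row, y in zip(X, ys)) for i in range(p)]
    return solve(xtx, xty)

def predict(coefs, t, knots):
    return sum(c * b for c, b in zip(coefs, spline_basis(t, knots)))

# Made-up series: rises until year 33, falls steeply until year 37, then falls more slowly.
knots = [33.0, 37.0]
ts = [i / 12 for i in range(557)]  # 557 months, January 1973 to May 2019
ys = [50 + 1.0 * t - 6.0 * max(0.0, t - 33) + 3.0 * max(0.0, t - 37) for t in ts]

coefs = fit_linear_spline(ts, ys, knots)
```

Each hinge term is zero before its breakpoint and grows linearly after it, so the fitted curve stays continuous while its slope is free to change at 2006 and 2010.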
We can see the seasonality by splitting this graph into one-year cycles. Each line represents one year, and we observe the repeating seasonal patterns. Summer and winter months have higher values than other months, probably because of air conditioning and heating. To estimate the seasonality, we fit a linear model with a term for each month and an interaction term for the period after 2008 (because the seasonal swings are larger in the recent data).
We removed the seasonality from the de-trended data and were finally left with the residuals. The values are scattered around zero and show little pattern compared to the other plots. Strictly speaking, the residuals are still more variable in recent years, which the interaction term for seasonality did not capture, but the plot looks good otherwise.
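One way to read the seasonal model is as a separate average for each month, estimated once for the earlier period and once for the later period. A linear model with month indicators fully interacted with a before/after indicator gives exactly these group means as fitted values. The sketch below assumes that reading; the cutoff rule (`year > 2008`), the toy values, and the function name are assumptions, not our actual specification.

```python
from statistics import mean

def seasonal_effects(detrended, months, years, cut_year=2008):
    """Mean de-trended value for every (month, is_after_cut_year) combination."""
    groups = {}
    for value, month, year in zip(detrended, months, years):
        groups.setdefault((month, year > cut_year), []).append(value)
    return {key: mean(vals) for key, vals in groups.items()}

# Made-up de-trended data: the same yearly shape, twice as large after 2008.
base = [3.0, 1.0, -1.0, -2.0, -2.0, 0.0, 2.0, 3.0, 0.0, -2.0, -1.0, -1.0]
years = [yr for yr in range(2005, 2013) for _ in range(12)]
months = [m for _ in range(2005, 2013) for m in range(1, 13)]
detrended = [base[m - 1] * (2.0 if yr > 2008 else 1.0) for m, yr in zip(months, years)]

effects = seasonal_effects(detrended, months, years)

# Seasonal estimate for each observation, then the residuals.
seasonal = [effects[(m, yr > 2008)] for m, yr in zip(months, years)]
residuals = [d - s for d, s in zip(detrended, seasonal)]
```

Here `effects[(1, False)]` is the January effect up to 2008 and `effects[(1, True)]` is the January effect afterwards, so the estimated seasonal swing is allowed to differ between the two periods.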
To measure the correlation, we took every pair of observations separated by a given distance in time, called the lag, and calculated the correlation across those pairs. For example, we can look at all pairs of adjacent points (a lag of 1) and see how correlated their values are.
The ACF (autocorrelation function) shows the correlation between the pairs at each lag. The PACF (partial ACF) conditions on the observations in between, so it shows the direct correlation at each lag. We see a strong sign of correlation between consecutive months but not at the other lags. We also see a weak correlation at lags of around 10 to 14, which may indicate an annual correlation.
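As a sketch of the calculation, the sample autocorrelation at lag k can be computed as below. The function name and the toy series are assumptions for the example.

```python
def acf(x, max_lag):
    """Sample autocorrelation of x at lags 0, 1, ..., max_lag."""
    n = len(x)
    m = sum(x) / n
    c0 = sum((v - m) ** 2 for v in x)
    return [
        sum((x[t] - m) * (x[t + k] - m) for t in range(n - k)) / c0
        for k in range(max_lag + 1)
    ]

# Made-up series that flips sign every step: neighbors are strongly
# negatively correlated, points two steps apart strongly positively.
x = [1.0 if t % 2 == 0 else -1.0 for t in range(20)]
r = acf(x, max_lag=2)
```

The lag-0 value is always 1, since every observation is perfectly correlated with itself.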
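The partial autocorrelations can be computed from the autocorrelations with the Durbin-Levinson recursion. The sketch below applies it to the theoretical ACF of an AR(1) process with coefficient 0.6 (a made-up value): the PACF is 0.6 at lag 1 and zero afterwards, which is the "strong at lag 1, nothing else" signature described above. The function name is an assumption.

```python
def pacf_from_acf(rho):
    """Partial autocorrelations at lags 0..K from autocorrelations rho[0..K] (rho[0] = 1)."""
    pacf = [1.0]
    phi_prev = []  # AR coefficients of the best predictor of order k - 1
    for k in range(1, len(rho)):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi.append(phi_kk)
        pacf.append(phi_kk)
        phi_prev = phi
    return pacf

# Theoretical ACF of an AR(1) process: rho_k = 0.6 ** k.
rho = [0.6 ** k for k in range(6)]
partial = pacf_from_acf(rho)
```

Note that the ACF of this process decays slowly (0.6, 0.36, 0.216, ...) even though only adjacent observations are directly related, which is why the PACF is the more useful plot for choosing the AR order.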
Based on the plot, we fit an ARIMA model with AR(1) and seasonal MA(1) components to the residuals. Since we now have estimates of all three components, we can combine them to create a forecast. In the following graph, the red line represents our forecast for the 24 months after May 2019. The gray shading shows the variability of the forecast values.
Our final forecast shows the same declining trend and seasonality, but the seasonal fluctuation looks smaller. Our guess is that the reduced seasonality comes from giving equal weight to every observation in nearly 50 years of data, which underestimates the larger seasonal fluctuations of recent years. This implies that further work on the seasonality and residual models may be necessary to make the forecast more accurate.
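To show how a residual model turns into a forecast with a widening band, here is a simplified sketch that fits only the AR(1) part by least squares and leaves out the seasonal MA(1) term. The simulated residuals, the true coefficient 0.6, and the function names are assumptions for the example; in practice one would use a statistics package to fit the full model.

```python
import random

def fit_ar1(x):
    """Least-squares estimate of phi in x[t] = phi * x[t-1] + noise, for mean-zero x."""
    num = sum(a * b for a, b in zip(x[1:], x[:-1]))
    den = sum(b * b for b in x[:-1])
    phi = num / den
    errors = [a - phi * b for a, b in zip(x[1:], x[:-1])]
    sigma2 = sum(e * e for e in errors) / (len(errors) - 1)
    return phi, sigma2

def forecast_ar1(last, phi, sigma2, horizon):
    """(point forecast, standard error) for 1..horizon steps ahead."""
    out = []
    var = 0.0
    for h in range(1, horizon + 1):
        var += sigma2 * phi ** (2 * (h - 1))  # forecast-error variance accumulates
        out.append((phi ** h * last, var ** 0.5))
    return out

# Simulated residuals from an AR(1) process with phi = 0.6 (made-up value).
random.seed(1)
x = [0.0]
for _ in range(2000):
    x.append(0.6 * x[-1] + random.gauss(0, 1))

phi_hat, sigma2_hat = fit_ar1(x)
forecast = forecast_ar1(x[-1], phi_hat, sigma2_hat, horizon=24)
```

The point forecast of the residual shrinks toward zero as the horizon grows, while its standard error increases and then levels off. The full forecast adds the extrapolated trend and the seasonal estimate for each future month back onto these residual forecasts.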