Time series decomposition involves thinking of a series as a combination of level, trend, seasonality, and noise components. Decomposition provides a useful abstract model for thinking about time series generally and for better understanding problems during time series analysis and forecasting.
The decomposed components are also important inputs for machine learning based forecasting methods.
Andrew McCallum (Senior Economist, Board of Governors of the Federal Reserve System) reached out to me, as the author of the stR package, about a number of challenges in using the package on real-world data. He wrote:
We have run into the following problems with both stR and TBATS.
1. Weekends and U.S. federal holidays have zero values. We can drop
weekends but need to control for U.S. holiday effects. Working with
logs means we have missing values for these days. Our ad hoc solution
was to replace zeros with very small values. After ad hoc replacement
we still need to control for the pattern implied by those very small
2. The NSA series has obvious seasonal patterns that do not have a
simple calendar periodicity. For example, the 22nd and 23rd business days
(not calendar days) of each month tend to be about twice the size of
the average within a month. Similar patterns exist for the 1st and
11th business days. We hoped that business day dummy covariates would
3. It does appear that our seasonal patterns change over time and
we want to ensure we are controlling for that.
4. When implementing STR with covariates, we were not able to figure
out how to define length.out for TimeKnots. Instead, we copy the
settings chosen by AutoSTR.
5. Auerbach and Gorodnichenko (2016) handled the inclusion of
covariates by pre-filtering the data. That pre-filter was an OLS
regression of the log NSA data on an orthogonal polynomial with
relevant dummies. What is your opinion of that approach? We had hoped
that STR would eliminate the need to pre-filter.
The data is publicly available daily spending (current and future) on defence contracts by the US government during the period 1994-2013 (dod.defense.gov/News/Contracts). How the data was extracted is explained thoroughly in “Effects of fiscal shocks in a globalised world” by A. Auerbach and Y. Gorodnichenko (eml.berkeley.edu/~ygorodni/AG_global.pdf). The last 3 years of data are magnified in the graph below:
We then normalise the data by converting it to a log scale:
Zooming into the last 3 years provides additional insight into the structure of the data:
Zooming into the data further demonstrates some clear seasonality.
The graphs above suggest the following observations:
- There is a stable uptrend along the whole data set.
- Yearly seasonality is probably present in the data.
- Weekly seasonality is present in the data.
- Monthly seasonality might be present in the data.
- There are outliers which might have a pattern.
- There are at least two visually distinguishable periods: 1994-2008, 2009-2013.
Preparation & data cleansing
The data is comparatively clean, so we only needed to take the following steps:
- All weekends were removed as they had no impact on the result (the US government did not work on weekends).
- All zero values were replaced with NA values (NA stands for Not Available in R). This was done because zero values indicate either missing data or mistakes. In both cases, replacing zero values with NA lets STR treat these days as missing and estimate values for them.
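The two cleansing steps, followed by the log transform described earlier, can be sketched as below. This is only an illustration in plain Python, not the code used for the analysis: the `prepare` function, the list-of-pairs layout, and the toy amounts are all assumptions, and `None` stands in for R's NA.

```python
import math
from datetime import date, timedelta

def prepare(series):
    """Drop weekends, turn zeros into missing values, and take logs.

    `series` is a list of (date, amount) pairs. The name and the layout
    are assumptions made for this sketch, not the original data format.
    """
    cleaned = []
    for day, amount in series:
        if day.weekday() >= 5:          # 5 = Saturday, 6 = Sunday
            continue
        # None plays the role of R's NA: zero spending is treated as missing
        value = math.log(amount) if amount > 0 else None
        cleaned.append((day, value))
    return cleaned

# Toy data (invented amounts): one calendar week starting Monday 2013-07-01
start = date(2013, 7, 1)
amounts = [120.0, 80.0, 95.5, 0.0, 300.0, 0.0, 0.0]
toy = [(start + timedelta(days=i), a) for i, a in enumerate(amounts)]
prepared = prepare(toy)
```

Marking zeros as missing, rather than replacing them with very small values, avoids introducing large negative spikes on the log scale.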
The preliminary decomposition contains the following components:
- Weekly seasonality
- Monthly seasonality
- Yearly seasonality
The main observation from the above graph is that the random component exhibits some structure. In particular, at least at the beginning and at the end of the time series, the outliers appear at almost equal intervals. Further investigation showed that:
- Positive outliers (unexplained excessive spending) in most cases happen on the 15th and on the last day of every month. If the 15th or the last day happens to be a holiday or a weekend, the outlier appears on the first business day after (in many cases, they are the 10th, 11th, 20th, 21st, 22nd and 1st business days of every month). I believe that such behaviour relates to some policies and/or business practices of
- Positive outliers also happen on the last Friday of every November, unless it is the last day of the month. In such a case, the outlier happens on the previous Friday.
These observations were taken into account in the final decomposition.
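The two calendar rules above can be turned into dummy regressors roughly as follows. This is a minimal sketch under stated assumptions: all function names are hypothetical, the `holidays` argument is a placeholder that the caller would have to fill with the actual U.S. federal holidays, and the code is not taken from the stR package or from the original analysis.

```python
import calendar
from datetime import date, timedelta

def is_business_day(day, holidays=frozenset()):
    # `holidays` is a caller-supplied set of dates (empty by default)
    return day.weekday() < 5 and day not in holidays

def roll_forward(day, holidays=frozenset()):
    """Return `day` if it is a business day, else the next business day."""
    while not is_business_day(day, holidays):
        day += timedelta(days=1)
    return day

def day_15_and_last(year, month, holidays=frozenset()):
    """Dates on which a 'Day 15 and 31' dummy would equal 1 for one month."""
    last = date(year, month, calendar.monthrange(year, month)[1])
    return {roll_forward(date(year, month, 15), holidays),
            roll_forward(last, holidays)}

def november_friday(year):
    """Last Friday of November, or the previous Friday when the last
    Friday is 30 November (the last day of the month)."""
    day = date(year, 11, 30)
    while day.weekday() != 4:           # 4 = Friday
        day -= timedelta(days=1)
    if day.day == 30:
        day -= timedelta(days=7)
    return day

# June 2013: the 15th is a Saturday and the 30th is a Sunday,
# so both dates roll forward to the following Monday.
june_2013 = day_15_and_last(2013, 6)
```

Note that a rolled-forward month end lands in the next month, which is consistent with the observation that some outliers fall on the 1st business day.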
The final decomposition
The final decomposition contains the following components:
- Weekly seasonality
- Monthly seasonality
- Yearly seasonality
- Day 15 and 31
- Last Friday in November
- Random component
It is common practice to identify seasonal patterns via their importance to the model which is used for decomposition (robjhyndman.com/hyndsight/detecting-seasonality). I followed this practice and considered a decomposition “valid” or “better” than others when the cross-validation error decreased. The following attempts were made to reduce the error:
- Bi-weekly seasonality did not improve the cross-validated MSE and was therefore not used.
- “Day 15” and “Day 31” components were tested separately and did not improve the cross-validated MSE, so a single regressor “Day 15 and 31” was used in the final decomposition.
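The selection rule (keep a component only if it lowers the cross-validated MSE) can be illustrated with a deliberately simple stand-in model. The sketch below does not use STR: it fits group means on the training folds and scores them on the held-out fold. The function `cv_mse`, the interleaved fold layout, and the toy series are all assumptions made for illustration.

```python
def cv_mse(y, groups, folds=5):
    """K-fold cross-validated MSE of a group-mean model.

    `groups[i]` is the dummy value of observation i. Passing all zeros
    gives the baseline model without the candidate regressor.
    """
    n = len(y)
    squared_error = 0.0
    for k in range(folds):
        test = set(range(k, n, folds))          # interleaved folds
        sums, counts = {}, {}
        for i in range(n):
            if i not in test:
                sums[groups[i]] = sums.get(groups[i], 0.0) + y[i]
                counts[groups[i]] = counts.get(groups[i], 0) + 1
        overall = sum(sums.values()) / sum(counts.values())
        for i in test:
            g = groups[i]
            # Fall back to the overall mean for groups unseen in training
            prediction = sums[g] / counts[g] if g in counts else overall
            squared_error += (y[i] - prediction) ** 2
    return squared_error / n

# Toy series (invented): small deterministic wiggle plus a spike every 7th point
n = 70
spike = [1 if i % 7 == 3 else 0 for i in range(n)]
y = [1.0 + 0.02 * ((i * 37) % 11 - 5) + 2.0 * spike[i] for i in range(n)]

mse_without = cv_mse(y, [0] * n)
mse_with = cv_mse(y, spike)
keep_regressor = mse_with < mse_without
```

In the actual analysis the same comparison is made between STR decompositions with and without the candidate component.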
Seasonally adjusted data
Seasonally adjusted data is the data with all seasonal components and, in this case, the outliers removed. Thus, the seasonally adjusted data is the sum of the trend and the random components:
It is clear that the seasonally adjusted data is far easier to interpret and understand.
Future work can be split into two quite different directions.
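For an additive decomposition on the log scale, the adjustment amounts to simple arithmetic, sketched below with invented numbers. The function and variable names are hypothetical and only two of the removed components are shown.

```python
def seasonally_adjusted(observed, removed):
    """Subtract every component listed in `removed` from `observed`."""
    adjusted = list(observed)
    for component in removed:
        adjusted = [a - c for a, c in zip(adjusted, component)]
    return adjusted

# Invented component values for four consecutive business days
trend = [10.0, 10.1, 10.2, 10.3]
random_part = [0.05, -0.02, 0.00, 0.03]
weekly = [0.4, -0.1, -0.3, 0.0]
day_15_and_31 = [0.0, 0.0, 1.2, 0.0]

observed = [t + r + w + d
            for t, r, w, d in zip(trend, random_part, weekly, day_15_and_31)]

# Removing the seasonal and outlier components leaves trend + random
adjusted = seasonally_adjusted(observed, [weekly, day_15_and_31])
```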
The first direction is testing various ways to improve the cross-validation error of the decomposition. For example, we have not tried other possible seasonal periods, such as quarterly or semi-annual, which might give further improvements in the decomposition.
The second main direction is to research ways to satisfy various constraints imposed on the decomposition process. For example, one such constraint is that spending during every period (year, month, week, etc.) should be the same in the original and the seasonally adjusted data. These constraints sound logical, but when we have multiple seasonalities which are not aligned (for example, a month usually contains a fractional number of weeks), they may be impossible to satisfy. For example, suppose Monday is the day of most spending. If we require that monthly spending in the original data be equal to monthly spending in the seasonally adjusted data, we ignore the possibility that some months have four Mondays and some have five.
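The misalignment is easy to check with Python's standard `calendar` module. The helper name `mondays_per_month` is an assumption, and 2013 is used only because it is the last year in the data.

```python
import calendar

def mondays_per_month(year):
    """Count the Mondays in each month of `year`."""
    counts = {}
    for month in range(1, 13):
        # monthcalendar returns weeks as 7-item lists starting on Monday
        # (the module default); days outside the month are 0
        weeks = calendar.monthcalendar(year, month)
        counts[month] = sum(1 for week in weeks if week[calendar.MONDAY] != 0)
    return counts

counts_2013 = mondays_per_month(2013)
five_monday_months = [m for m, c in counts_2013.items() if c == 5]
```

With a strong Monday effect, months in `five_monday_months` would carry more of that effect in the original data than the remaining months, so equal monthly totals before and after adjustment cannot hold exactly.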
If such constraints are useful, it is still possible to adjust the decomposition by adding a separate smooth component which will fix the discrepancies in misaligned periods in the original and adjusted data, but this requires further investigation.