

Champagne Analytics


Let's use the Time Series tools to forecast Champagne sales. We have monthly sales data for the Perrin Freres label, from January 1964 to September 1972 (to answer any new question, we need to start with a clean dataset).



The Problem

The goal is to produce a 12-month forecast of monthly sales for the Perrin Freres label.



The Dataset

The dataset has 105 monthly observations (January 1964 through September 1972), with champagne sales recorded in millions.



Data Investigation

A best practice for any predictive modeling project is to make sure that there aren’t any records missing from the dataset. We are going to use the Field Summary tool to check this. From the Field Summary tool, we see that there are no records missing. (Double-check the values: the records may not be null, but they could still contain erroneous values.) The reports also give us more statistics about the dataset, such as the min, max, median, mean, and standard deviation, all of which can be important when learning about your dataset.
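
For readers who want to see what such a check amounts to outside Designer, here is a minimal sketch in plain Python. It is not what the Field Summary tool runs internally; `field_summary` is a hypothetical helper, and the `sales` values are made-up placeholders, not the Perrin Freres figures.

```python
import math
import statistics

def field_summary(values):
    """Count missing entries and compute basic statistics for one numeric column.

    None and NaN are both treated as missing.
    """
    present = [v for v in values
               if v is not None and not (isinstance(v, float) and math.isnan(v))]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        # A non-null value can still be wrong, e.g. zero or negative sales.
        "non_positive": sum(1 for v in present if v <= 0),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),  # sample standard deviation
    }

# Placeholder values for illustration only (not the real sales data).
sales = [2.8, 2.7, None, 3.1, 4.9, 3.0]
summary = field_summary(sales)
print(summary)
```

The `non_positive` count is one example of the "erroneous values" check mentioned above: a record can be present and still be implausible.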






Time Series Investigation


In this step, we want to learn more about how the data evolves over time. We are interested in the time series plot, the season plot, the decomposition plot (data, seasonal, trend, remainder), and the autocorrelation function plots (ACF, PACF).







The Time Series Plot 

The TS Plot tool plots all the data points in a single univariate time series plot, which lets you look at trends in the data. Good news! We can see that there is an increasing trend in sales over time.



Season plot 

This graph plots all the years in one single view, allowing for a year-over-year comparison. From this graph, we can see that the Champagne sales decline every year in August. Hmm. This might be a good question to ask the team, or look into the seasonal logistics of Champagne production.
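
The year-over-year comparison that the season plot makes visually can also be sketched as a small calculation: for each year, find the month with the lowest value and see whether it is always the same month. This is only an illustration in plain Python; `lowest_month_per_year` is a hypothetical helper and the records are synthetic placeholders with a dip deliberately placed in month 8.

```python
def lowest_month_per_year(records):
    """Map each year to the month holding that year's smallest value.

    `records` is an iterable of (year, month, value) tuples.
    """
    best = {}
    for year, month, value in records:
        if year not in best or value < best[year][1]:
            best[year] = (month, value)
    return {year: month for year, (month, _) in best.items()}

# Synthetic placeholder data: two years with an artificial dip in month 8.
records = []
for year, lift in ((2000, 0.0), (2001, 0.5)):
    for month in range(1, 13):
        value = 1.0 if month == 8 else 3.0 + 0.1 * month
        records.append((year, month, value + lift))

lowest = lowest_month_per_year(records)
print(lowest)  # {2000: 8, 2001: 8}
```

If the same month comes back for every year, that is a strong hint of a seasonal effect worth asking the team about.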







Decomposition plot




The decomposition plot contains four graphs:


  1. Data – these are your data points plotted over time (this is the same as the time series plot)
  2. Seasonal – this graph displays the repeating short-term cycle in the series (here, a season is a fixed and known duration - think about a fiscal quarter, rather than winter or spring)
  3. Trend – this graph displays the increasing or decreasing long-term direction in the series
  4. Remainder – the residuals are what's left after the season and trend series are removed; this is noise or irregularity


Keep reading for more information on time series decomposition and seasonality.
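
To make the four panels concrete, here is a minimal sketch of a classical additive decomposition in plain Python. Designer's tool may well use a different decomposition method, so treat this as the textbook approach rather than a description of the tool; `decompose_additive` is a hypothetical helper and the series is synthetic.

```python
def decompose_additive(series, period=12):
    """Classical additive decomposition: series = trend + seasonal + remainder.

    The trend is a centered moving average, so it is undefined (None)
    for the first and last period // 2 points.
    """
    n = len(series)
    half = period // 2
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half:t + half + 1]
        if period % 2 == 0:
            # Even period: half-weight the two end points (a 2 x period average).
            total = 0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]
        else:
            total = sum(window)
        trend[t] = total / period

    # Average the detrended values for each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    offset = sum(means) / period  # center the seasonal effects on zero
    cycle = [m - offset for m in means]

    seasonal = [cycle[t % period] for t in range(n)]
    remainder = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                 for t in range(n)]
    return trend, seasonal, remainder

# Synthetic series: linear trend plus a repeating 12-step pattern, no noise.
pattern = [1, -1, 2, -2, 0.5, -0.5, 3, -3, 0, 0, 1.5, -1.5]
series = [10 + 0.5 * t + pattern[t % 12] for t in range(36)]
trend, seasonal, remainder = decompose_additive(series, period=12)
```

Because the synthetic series has no noise, the remainder comes out at (numerically) zero; on real data the remainder panel is where the irregularity shows up.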



Autocorrelation Function Plot

The autocorrelation function (ACF) and partial autocorrelation function (PACF) both summarize the strength of the relationship between an observation in a time series and the observations at earlier time steps (lags). The ACF at lag k is the correlation between the series and a copy of itself shifted by k periods, so it includes both the direct relationship and any indirect relationship carried through the intermediate observations. The PACF, on the other hand, is the correlation between the series and its lag-k values after the effect of the intermediate lags (1 through k-1) has been removed. An applied interpretation of ACF and PACF models can be found here.
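
The difference between the two is easier to see in a small calculation. The sketch below computes the sample ACF directly and derives the PACF from it with the Durbin-Levinson recursion, one standard way to do this (statistical packages may use other estimators). The function names and the short series are illustrative assumptions.

```python
def acf(series, max_lag):
    """Sample autocorrelations for lags 0..max_lag."""
    n = len(series)
    mean = sum(series) / n
    dev = [x - mean for x in series]
    denom = sum(d * d for d in dev)
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]

def pacf(series, max_lag):
    """Partial autocorrelations for lags 1..max_lag (Durbin-Levinson recursion)."""
    rho = acf(series, max_lag)
    phi_prev = []  # coefficients of the order k-1 autoregression
    result = []
    for k in range(1, max_lag + 1):
        if k == 1:
            phi_kk = rho[1]
        else:
            num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
            den = 1.0 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
            phi_kk = num / den
        # Update the lower-order coefficients, then append the new one.
        phi_new = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_new.append(phi_kk)
        phi_prev = phi_new
        result.append(phi_kk)
    return result

# Short made-up series, for illustration only.
demo = [1, 3, 2, 5, 4, 6, 5, 8, 7, 9, 8, 11]
rho = acf(demo, 3)
partial = pacf(demo, 3)
print(rho)
print(partial)
```

At lag 1 the two always agree, because there are no intermediate lags to remove; from lag 2 onward they can differ considerably.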








Build ARIMA and ETS models


Now it's time to build the model. Especially for time series model building, it's important to have both an estimation dataset and a validation (holdout) dataset. Here's why: we need most of the dataset to create an effective model, but we also need records the model has never seen to check it, so we will retain the chronologically latest records of the dataset to validate the model. The rule of thumb is an 80/20 split - 80% of the records to create the model and 20% of the records to test the model. In this scenario, we will use the year 1972 as our validation dataset, as it is the most recent and does not have a complete year of data. That leaves 96 records for estimation and 9 for validation, a smaller holdout than the rule of thumb suggests.


Reminder: you must keep your records complete and chronological, so traditional sampling methods can be detrimental - no random sampling!  For time series analysis, the holdout set should fall immediately after the estimation set in time. As a result, the data stream should be sorted from the most distant record in time to the most recent before creating the estimation and holdout sample. Create your sample cutoffs based on dates and seasons.  Another good example of this can be found in Help> Sample Workflows> Predictive Tool Samples> Predictive Analytics> 15_Time_Series_Forecasting_Sample
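
As a rough sketch of a date-based cutoff, the plain Python below sorts records oldest-first and splits them at a cutoff date instead of sampling at random. `split_by_date` is a hypothetical helper; the dates cover the article's range (January 1964 to September 1972), while the values are left as `None` placeholders.

```python
from datetime import date

def split_by_date(records, cutoff):
    """Sort (date, value) records oldest-first and split them at `cutoff`.

    Records dated before the cutoff form the estimation set; records on or
    after it form the holdout set, which therefore follows immediately in time.
    """
    ordered = sorted(records, key=lambda r: r[0])
    estimation = [r for r in ordered if r[0] < cutoff]
    holdout = [r for r in ordered if r[0] >= cutoff]
    return estimation, holdout

# One record per month from January 1964 to September 1972 (values omitted).
months = [date(y, m, 1)
          for y in range(1964, 1973)
          for m in range(1, 13)
          if (y, m) <= (1972, 9)]
records = [(d, None) for d in months]

estimation, holdout = split_by_date(records, cutoff=date(1972, 1, 1))
print(len(estimation), len(holdout))  # 96 9
```

Using 1972 as the holdout gives 9 of 105 records, noticeably less than 20%, which is a trade-off between keeping whole seasons together and following the rule of thumb.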


You have two configurable model options in Alteryx Designer -  the ARIMA tool and ETS tool  (please see links for configuration assistance). To compare the validity of each model, we need to configure both to see which model is more accurate. We can look at the Report (R) output from each tool for the analysis.  There are different measures that you'll want to compare. Personally, I like to look at the information criteria (AIC, AICc, and BIC) and the in-sample error measures.
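
For reference, the three information criteria can be written out from their standard definitions. The sketch below is illustrative only: the log-likelihood, parameter count, and observation count are hypothetical numbers, not values from the report, and software packages differ in which parameters they count.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Standard AIC, AICc, and BIC from a model's maximized log-likelihood.

    Lower values indicate a better trade-off between fit and complexity.
    """
    aic = 2 * n_params - 2 * log_likelihood
    # AICc adds a small-sample correction that shrinks as n_obs grows.
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}

# Hypothetical inputs for illustration; not taken from the ARIMA or ETS report.
criteria = information_criteria(log_likelihood=-100.0, n_params=3, n_obs=96)
print(criteria)
```

These criteria are only comparable between models fitted to the same data, and comparing them across different model families (such as ARIMA and ETS) should be done with caution.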


To help with the comparison assessment, we will now compare the two models using the TS Compare tool. The L input of the TS Compare tool is the unioned model objects from your ETS and ARIMA tools. The R input will be your validation dataset, as the models have not yet seen these data points. In the O output from the TS Compare tool, we get the holdout sample error measures for both models (values closer to zero are better). The fit statistics for both models are similar except for the Mean Error. This can subjectively be taken as evidence that the ARIMA model will produce a better forecast.
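
To show what the common holdout error measures capture, here is a minimal sketch using their usual definitions. It is not the TS Compare tool's implementation; `holdout_errors` is a hypothetical helper and the actual and forecast values are made up.

```python
import math

def holdout_errors(actual, forecast):
    """Common forecast error measures; error is defined as actual - forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    pct = [100.0 * e / a for e, a in zip(errors, actual)]  # assumes no zero actuals
    n = len(errors)
    return {
        "ME": sum(errors) / n,                              # mean error (bias)
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),  # root mean squared error
        "MAE": sum(abs(e) for e in errors) / n,             # mean absolute error
        "MPE": sum(pct) / n,                                # mean percentage error
        "MAPE": sum(abs(p) for p in pct) / n,               # mean absolute percentage error
    }

# Made-up values for illustration only.
actual = [100, 200, 400]
forecast = [110, 190, 400]
measures = holdout_errors(actual, forecast)
print(measures)
```

In this example the Mean Error is zero even though two of the three forecasts are off, because positive and negative errors cancel. That is why it is worth reading ME (a measure of bias) alongside RMSE, MAE, and MAPE rather than on its own.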








Within the R output, we see a report summary for both models. This gives us a glance at the actual values compared to the values that the model predicted:





The I output is an interactive chart that plots all three series (the actual values and the forecasts from both models) in a single pane. This gives us a visual comparison of the actual values against each model.




We can drill down towards the end of the graph to see the actual vs forecasted values.



From the steps above, the ARIMA model has a lower holdout sample error when compared to the ETS model.



Generate the Forecast


Based on our testing, we will go ahead and use the ARIMA model for our forecasting. Using the same ARIMA configuration as in the model-building phase, we are going to generate a forecast. This time we fit the model on the full dataset, including the 1972 records held out earlier, since using the most recent data should give a better forecast. Using the TS Forecast tool, we are going to forecast the next 12 months. One important aspect of time series modeling to keep in mind is that the error of a model increases as the forecasts move further forward in time, which results in wider confidence intervals. Thus, the first few forecasted periods are going to be more accurate than the later forecasted periods.
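
The widening of the intervals can be illustrated with the simplest possible case. For a random walk with step standard deviation sigma, the forecast error h steps ahead has standard deviation sigma times the square root of h. The sketch below uses that textbook case, not the ARIMA model fitted above, and the `sigma` value is an arbitrary assumption.

```python
import math

def random_walk_interval_half_widths(sigma, horizon, z=1.96):
    """Half-width of an approximate 95% forecast interval at each step 1..horizon.

    For a random walk, forecast error variance grows linearly with the horizon,
    so the half-width grows with the square root of the horizon.
    """
    return [z * sigma * math.sqrt(h) for h in range(1, horizon + 1)]

# sigma = 1.0 is an arbitrary illustrative value.
widths = random_walk_interval_half_widths(sigma=1.0, horizon=12)
for h, w in enumerate(widths, start=1):
    print(f"month {h:2d}: +/- {w:.2f}")
```

Seasonal ARIMA models have their own interval formulas, but the qualitative behavior is the same: intervals do not shrink as the horizon grows, so month 12 of the forecast deserves less confidence than month 1.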














Can you show how to use xgboost to run a forecast, please?


Thanks Neil


Awesome post and some very concise explanations on material that can be hard to explain (e.g. ACF, PACF).


I recently took this course in my MS in Analytics from Georgia Tech. If you're looking for a deep dive into Time Series Analysis, that course as well as Penn State's are great resources.