DiganP
Alteryx Alumni (Retired)

champagne2.JPG

Champagne Analytics

 

Let's use the Time Series tools to forecast Champagne sales. We have monthly sales data for the Perrin Freres label, from January 1964 to September 1972 (because to answer any new question, we need to start with a clean dataset).

 

 

The Problem

The goal is to produce a 12-month forecast of monthly sales for the Perrin Freres label.

 

 

The Dataset

The dataset has 105 monthly observations, with champagne sales recorded in millions.

1.png

 

Data Investigation

Best practice for any predictive modeling project is to make sure that no records are missing from the dataset. We are going to use the Field Summary tool to check this. From the Field Summary tool, we see that there are no records missing. (Double-check the values: the records may not be null, but they could still contain erroneous values.) The reports also give us more statistics about the dataset, such as the min, max, median, mean, and standard deviation, all of which can be important when learning about your datasets.
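For readers who want to see the same kind of check outside Designer, here is a minimal sketch in plain Python. It is not what the Field Summary tool runs internally; the `field_summary` helper and the sample values are made up for illustration, with one missing record planted on purpose.

```python
import math
import statistics

# Made-up monthly values for illustration only; None marks a missing record.
sales = [120, 115, None, 130, 128, 140, 95, 90, 125, 180, 240, 310]

def field_summary(values):
    """Count missing or suspicious records and summarize the remaining values."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        # Non-null does not mean valid: sales of zero or less deserve a second look.
        "non_positive": sum(1 for v in present if v <= 0),
        "min": min(present),
        "max": max(present),
        "median": statistics.median(present),
        "mean": statistics.mean(present),
        "std_dev": statistics.stdev(present),
    }

summary = field_summary(sales)
for name, value in summary.items():
    print(f"{name}: {value}")
```

The `non_positive` count is one example of the "erroneous values" check mentioned above; the right sanity rules depend on your own data.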

 

 

 

2.png

3.png




Time Series Investigation

 

In this step, we want to learn more about how the data evolves over time. We are interested in the time series plot, the season plot, the decomposition plot (data, seasonal, trend, remainder), and the autocorrelation function plots (ACF, PACF).

 

 

4.png

 

 

 

The Time Series Plot 

The TS Plot tool plots all the data points as a univariate time series, which lets you look at trends in the data. Good news! We can see that there is an increasing trend in sales over time.


5.png





 

Season plot 

This graph plots all the years in one single view, allowing for a year-over-year comparison. From this graph, we can see that the Champagne sales decline every year in August. Hmm. This might be a good question to ask the team, or look into the seasonal logistics of Champagne production.
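The year-over-year comparison behind a season plot can be sketched as a simple grouping by calendar month. The records below are invented (a mild upward trend plus made-up monthly offsets with a dip in August), so only the mechanics carry over, not the numbers.

```python
from collections import defaultdict
from statistics import mean

# Made-up seasonal offsets per calendar month (1 = January), with a dip in August.
SEASONAL = {1: 0, 2: -5, 3: 5, 4: 10, 5: 15, 6: 10,
            7: 5, 8: -60, 9: 20, 10: 40, 11: 80, 12: 120}

# Three illustrative years of (year, month, sales) records with a mild upward trend.
records = [
    (year, month, 300 + 2 * ((year - 1964) * 12 + month) + SEASONAL[month])
    for year in (1964, 1965, 1966)
    for month in range(1, 13)
]

# Group the values by calendar month so the years line up on top of each other.
by_month = defaultdict(list)
for year, month, sales in records:
    by_month[month].append(sales)

month_means = {month: mean(values) for month, values in by_month.items()}
lowest_month = min(month_means, key=month_means.get)
print("Lowest average month:", lowest_month)
```

Plotting each year's twelve values against the month number gives the season plot itself; the month averages are just a quick way to spot the recurring low point.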


 

6.png

 

 

 

 

Decomposition plot


7.png

 

 

The decomposition plot contains four graphs:

 

  1. Data – these are your data points plotted over time (this is the same as the time series plot)
  2. Seasonal – this graph displays the repeating short-term cycle in the series (here, a season is a fixed and known duration - think about a fiscal quarter, rather than winter or spring)
  3. Trend – this graph displays the increasing or decreasing long-term direction in the series
  4. Remainder – the residuals are what's left after the season and trend series are removed; this is noise or irregularity
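To make the four panels concrete, here is a rough sketch of a classical additive decomposition in plain Python. This is a simpler stand-in technique, not necessarily the method the decomposition plot uses, and the series is synthetic (a straight-line trend plus a made-up 12-month pattern). The `decompose_additive` helper is hypothetical and assumes an even period and at least two full cycles of data.

```python
from statistics import mean

def decompose_additive(series, period=12):
    """Split a series into trend, seasonal and remainder (additive model)."""
    n = len(series)
    half = period // 2
    # Trend: centered moving average; the two end points get half weight
    # so that an even period (such as 12) is centered on a single month.
    trend = [None] * n
    for t in range(half, n - half):
        window = series[t - half : t + half + 1]
        trend[t] = (0.5 * window[0] + sum(window[1:-1]) + 0.5 * window[-1]) / period
    # Seasonal: average detrended value per position in the cycle, centered on zero.
    detrended = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            detrended[t % period].append(series[t] - trend[t])
    raw = [mean(values) for values in detrended]
    offset = mean(raw)
    seasonal = [raw[t % period] - offset for t in range(n)]
    # Remainder: what is left once trend and seasonal are removed.
    remainder = [
        None if trend[t] is None else series[t] - trend[t] - seasonal[t]
        for t in range(n)
    ]
    return trend, seasonal, remainder

# Synthetic example: linear trend plus a made-up monthly pattern that sums to zero.
pattern = [-10, -5, 0, 5, 10, 5, 0, -40, 5, 10, 10, 10]
series = [100 + 2 * t + pattern[t % 12] for t in range(48)]
trend, seasonal, remainder = decompose_additive(series)
```

Because the synthetic series has no noise, the remainder comes out at essentially zero here; on real sales data it would hold the irregular movement that neither trend nor season explains.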

 

Keep reading for more information on time series decomposition and seasonality.

 

 

Autocorrelation Function Plot

Autocorrelation functions (ACF) and partial autocorrelation functions (PACF) both summarize how strongly a time series is related to its own past values. The ACF at lag k is the correlation between observations k periods apart, which includes any indirect influence carried through the observations in between. The PACF at lag k, on the other hand, is the correlation between observations k periods apart after the effect of the intermediate values has been removed. An applied interpretation of ACF and PACF models can be found here.
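If it helps to see the arithmetic, the sketch below computes a sample ACF directly and derives the PACF from it with the Durbin-Levinson recursion. The `acf` and `pacf` helpers are illustrative, and the input is a simulated series with a fixed random seed, not the champagne data.

```python
import random
from statistics import mean

def acf(series, max_lag):
    """Sample autocorrelation for lags 0..max_lag."""
    n = len(series)
    mu = mean(series)
    dev = [x - mu for x in series]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(max_lag + 1)
    ]

def pacf(series, max_lag):
    """Sample partial autocorrelation for lags 1..max_lag (Durbin-Levinson)."""
    rho = acf(series, max_lag)
    phi_prev = []  # coefficients of the previous (k - 1) lag fit
    result = []
    for k in range(1, max_lag + 1):
        num = rho[k] - sum(phi_prev[j] * rho[k - 1 - j] for j in range(k - 1))
        den = 1 - sum(phi_prev[j] * rho[j + 1] for j in range(k - 1))
        phi_kk = num / den
        phi_new = [phi_prev[j] - phi_kk * phi_prev[k - 2 - j] for j in range(k - 1)]
        phi_new.append(phi_kk)
        phi_prev = phi_new
        result.append(phi_kk)
    return result

# Simulated series where each value depends on the previous one (seeded for repeatability).
rng = random.Random(0)
series = []
previous = 0.0
for _ in range(200):
    previous = 0.7 * previous + rng.gauss(0, 1)
    series.append(previous)

rho = acf(series, 5)
partial = pacf(series, 5)
```

At lag 1 the two functions agree by construction, since there are no intermediate values to remove; they only start to differ from lag 2 onward.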

 

 

 

8.png

 



 

 

Build ARIMA and ETS models

 

Now it's time to build the model. Especially for time series model building, it's important to have both an estimation dataset and a validation (holdout) dataset. Here's why: we need most of the dataset to create an effective model, but we also need records the model has never seen to check its accuracy, so we will hold back the chronologically latest records to validate the model. The rule of thumb is an 80/20 split: 80% of the records to create the model and 20% of the records to test the model. In this scenario, we will use the year 1972 as our validation dataset (9 of the 105 records, so a smaller share than the rule of thumb), as it is the most recent and does not have a complete year of data.

 

Reminder: you must keep your records complete and chronological, so traditional sampling methods can be detrimental - no random sampling!  For time series analysis, the holdout set should fall immediately after the estimation set in time. As a result, the data stream should be sorted from the most distant record in time to the most recent before creating the estimation and holdout sample. Create your sample cutoffs based on dates and seasons.  Another good example of this can be found in Help> Sample Workflows> Predictive Tool Samples> Predictive Analytics> 15_Time_Series_Forecasting_Sample
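As a small illustration of a date-based cutoff, the sketch below sorts the records oldest-first and splits them at January 1972. The record values are placeholders and `split_by_date` is a hypothetical helper; only the 105-month date range matches the dataset described above.

```python
from datetime import date

# 105 monthly records starting January 1964; the values are placeholders, not real sales.
records = [
    (date(1964 + i // 12, i % 12 + 1, 1), 100 + i)
    for i in range(105)
]

def split_by_date(rows, cutoff):
    """Sort rows oldest-first, then split into estimation (< cutoff) and holdout (>= cutoff)."""
    ordered = sorted(rows, key=lambda row: row[0])
    estimation = [row for row in ordered if row[0] < cutoff]
    holdout = [row for row in ordered if row[0] >= cutoff]
    return estimation, holdout

estimation, holdout = split_by_date(records, cutoff=date(1972, 1, 1))
print(len(estimation), "estimation records,", len(holdout), "holdout records")
```

Splitting on a date rather than a row count keeps whole seasons together and guarantees the holdout set starts right where the estimation set ends.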

 

You have two configurable model options in Alteryx Designer -  the ARIMA tool and ETS tool  (please see links for configuration assistance). To compare the validity of each model, we need to configure both to see which model is more accurate. We can look at the Report (R) output from each tool for the analysis.  There are different measures that you'll want to compare. Personally, I like to look at the information criteria (AIC, AICc, and BIC) and the in-sample error measures.
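For reference, the three information criteria follow standard textbook formulas, sketched below. The inputs are made-up numbers, not values from the model reports, and the way parameters are counted can differ between software packages, so treat this as an illustration of the formulas rather than a way to reproduce the tools' output.

```python
import math

def information_criteria(log_likelihood, n_params, n_obs):
    """Standard AIC, AICc and BIC for a fitted model (lower is better)."""
    aic = 2 * n_params - 2 * log_likelihood
    # AICc adds a small-sample correction that fades as n_obs grows.
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    return {"AIC": aic, "AICc": aicc, "BIC": bic}

# Made-up inputs: a log-likelihood of -100 from a 3-parameter model fit on 96 records.
example = information_criteria(log_likelihood=-100.0, n_params=3, n_obs=96)
print(example)
```

All three reward a better fit and penalize extra parameters, which is why they are only meaningful when comparing models fit to the same data.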

 

To help with the comparison, we will now compare the two models using the TS Compare tool. The L input of the TS Compare tool is the unioned model objects from your ETS and ARIMA tools. The R input will be your validation dataset, as the models have not yet seen these data points. In the O output from the TS Compare tool, we get the holdout sample error measures for both models (values closer to zero are better). The fit statistics for both models are similar except for the Mean Error. This can be taken, somewhat subjectively, as evidence that the ARIMA model will produce the better forecast.
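The common holdout error measures are simple to compute by hand, as in this sketch. The `holdout_errors` helper and the three actual/forecast pairs are made up for illustration, and the error is taken as actual minus forecast, which is an assumption about sign convention.

```python
import math

def holdout_errors(actual, forecast):
    """Common holdout error measures, with error defined as actual minus forecast."""
    errors = [a - f for a, f in zip(actual, forecast)]
    pct = [100 * e / a for e, a in zip(errors, actual)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,                              # mean error (bias)
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),  # root mean squared error
        "MAE": sum(abs(e) for e in errors) / n,             # mean absolute error
        "MPE": sum(pct) / n,                                # mean percentage error
        "MAPE": sum(abs(p) for p in pct) / n,               # mean absolute percentage error
    }

# Made-up values: one forecast too high, one too low, one exact.
actual = [100, 200, 400]
forecast = [110, 190, 400]
measures = holdout_errors(actual, forecast)
print(measures)
```

In this toy case the Mean Error is zero because the over- and under-forecasts cancel, while MAE and RMSE are not. That is why it is worth reading the Mean Error alongside the other measures rather than on its own.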

 

 

 

9.png

 

 

 

Within the R output, we see a report summary for both models. This gives us a glance at the actual values compared to the values that the model predicted:

 

 

 

10.png



The I output is an interactive chart that plots all three series (the actual values and each model's predictions) in a single pane. This gives us a visual comparison of the actual values against the two models.


 

11.png

 



We can drill down towards the end of the graph to see the actual vs forecasted values.




12.png

 



From the steps above, the ARIMA model has a lower holdout sample error when compared to the ETS model.




13.png

 




Generate the Forecast

 

Based on our testing, we will go ahead and use the ARIMA model for our forecasting. Using the same ARIMA configuration as in the building phase, we are going to generate a forecast. This time we fit the model on the full dataset, since including the most recent records should give a better time series forecast. Using the TS Forecast tool, we are going to forecast the next 12 months. One important aspect of time series modeling to keep in mind is that the error of a model increases as the forecasts move further forward in time, which results in wider confidence intervals. Thus, the first few forecasted periods are generally going to be more accurate than the later forecasted periods.
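To see why the intervals widen, consider the simplest possible case: a random walk, where each month's value is the previous value plus random noise. This is a swapped-in toy model, not the ARIMA model fitted above, and the starting value and noise level are made up. For a random walk the forecast standard error grows with the square root of the horizon.

```python
import math

def random_walk_interval(last_value, sigma, horizon, z=1.96):
    """Approximate 95% forecast intervals for a random walk, one tuple per step ahead."""
    return [
        (h, last_value - z * sigma * math.sqrt(h), last_value + z * sigma * math.sqrt(h))
        for h in range(1, horizon + 1)
    ]

# Made-up inputs: last observed value of 500 and a one-step noise level of 20.
intervals = random_walk_interval(last_value=500.0, sigma=20.0, horizon=12)
widths = [upper - lower for _, lower, upper in intervals]
for (h, lower, upper) in intervals:
    print(f"{h:2d} months ahead: {lower:7.1f} to {upper:7.1f}")
```

The interval four months ahead is exactly twice as wide as the interval one month ahead in this toy model; a fitted ARIMA model will widen at a different rate, but the direction is the same.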

 

 

16.png





15.png





14.png

 

 

 

 

Additional Time Series resources:

Comments
helpplease
5 - Atom

Hi,

 

Can you show how to use xgboost to run a forecast please.

 

Thanks Neil

mtouiti
Alteryx Alumni (Retired)
dmccandless
8 - Asteroid

Awesome post and some very concise explanations on material that can be hard to explain (e.g. ACF, PACF).

 

I recently took this course in my MS in Analytics from Georgia Tech. If you're looking for a deep dive into Time Series Analysis, that course as well as Penn State's are great resources.

aweiner
7 - Meteor

This was great practice, thank you! I ran the attached workflow and without making any changes, the TS compare output of the ETS model in step 3 had different values than the screenshots in the article. Most of the numbers were higher. ARIMA was fine. Do you know if the dataset changed since the article was written? It did not affect the rest of your workflow since step 4 used ARIMA. I tried building it myself as well and got the same result. Either way, this was super helpful. Thank you!!