Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
Register for the upcoming Live Community Q&A Session - and don't forget to submit your questions for @DeanS regarding the future role of analytics here.
16 - Nebula
16 - Nebula

A friend on the Alteryx community recently asked me about recreating the Excel trendline capabilities within Alteryx.

 

excel-options.png

 

If you have the predictive tools installed then Alteryx has a Linear Regression tool which will fit a linear model to the set. This uses R and produces a model for the entire dataset. For my implementation, I wanted to allow for groups of data and go back to first principles and use the core Alteryx tools (i.e. no SDKs, R, or Python) to build it.

 

Excel has 6 options for fitting a trendline to a dataset. Apart from Moving Average these are all fitted using the least squares method. The five models are:

 

  • Exponential: exp.svg
  • Linear: lin.svg
  • Logarithmic: log.svg
  • Power: pow.svg
  • Polynomialpoly.svg depends on Order

For my Alteryx macro, I plan to support Exponential, Linear, Logarithmic, and Power (Moving Average isn't really the same and Polynomial will need more work).

 

The Excel tool also allows you to specify an intercept (the value when x_0.svg) for Linear, Exponential, and Polynomial. In other words, fixing the value of A in each case. In the Power case, the intercept is always 0 and for the Logarithmic case it will be an error as the logarithm is not defined at 0. I want my macro to also support this.

 

Finally, you can get the trendline in Excel to output both the equation and the value of r2.svg, so the last requirement is to do this as well.

 

A quick shout out to the LaTeX Previewer by Troy Henderson - I used it to create the SVG of all the LaTeX in this post. Onto the maths...

 

Least Squares Method for Regression

 

Let's imagine we have a dataset of points xy.svg.

 

The goal of the least squares method is to fit a function fx.svg, which minimises the square of the errors, e.svg, where e.svg  is defined as ei.svg. In other words:

 

min_ei.svg

 

So, let's work this out for the Linear case. Let's expand out the summation and then see if we can simplify it:

 

First, substitute for e.svg:

 

sum_ei_yi.svg

 

Next, replace fx.svg:

 

sum_ei_lin.svg

 

Now, expand out the square:

 

sum_ei_lin2.svg

 

sum_ei_lin3.svg

 

The goal is to find A and B to minimise the above expression:

 

min_ei_lin3.svg

 

Going back to calculus basics, a maximum or minimum will be when the first derivative is 0. So let's differentiate with respect to A:

 

da_lin.svg

 

We want to find when this is 0. Rearranging (and dividing by 2):

 

da_lin2.svg

 

We can then expand the sums for this:

 

da_lin3.svg

 

Dividing by n, this gives us:

 

da_lin4.svg

 

Where x_bar.svg  and y_bar.svg are the average of x and y respectively.

 

Now back to the original expression, this time let's differentiate with respect to B:

 

db_lin.svgdb_lin1.svg

 

Following a similar approach, we can get that:

 

db_lin2.svg

 

If we have a known intercept, then we can substitute this for A above. Otherwise, we can substitute our expression we had above for A:

 

db_lin3.svg

 

db_lin4.svg

 

db_lin5.svg

 

This means to find A and B, all we need to compute is:

 

  • n - the number of records
  • x_bar.svg - the average of x
  • y_bar.svg - the average of y
  • sum_x_sq.svg - the sum of x squared
  • sum_xy.svg- the sum of x times y

All of this is straight forward using the Summarize tool.

 

Building the First Macro

 

Let's start building the macro. This first version will handle computing A and B for the linear model.

 

macro_v1.png

 

We start by taking a standard macro input. I have chosen not to expose a FieldMap but instead created new variables called __X__ and __Y__. I used a dropdown box to allow you to map the field to each, using an action tool to update the raw XML of a pair of formula tools.

 

Next, I computed values for __XX__ and __XY__ which I needed to compute the totals. Then it was on to the Summarize tool to compute the five values I needed. Additionally, I used a List Box to allow selection of the Group By within this macro. This was a little fiddly inside the formula for action tool, but basically, it worked by adding the group by entries to the raw XML of the summarise:

 

IIF([#1]='""',
    '',
    '<SummarizeField field=' + 
      Replace([#1], '|||', ' action="GroupBy" /><SummarizeField field=') +
    ' action="GroupBy" />')
+
[Destination]

Finally, the last step was to compute the Slope and Intercept and to use a select tool to drop all the intermediary fields. One last little catch is to remember to select the Output fields change based on macro's configuration or data input option within the Interface Designer.

 

output_fields.png

 

Expanding to Other Models

 

Currently, it can only solve Linear. However, we've basically finished the hard work and all we need to do is see how we can transform the inputs to get each of the other models.

 

Logarithmic: log.svg

 

This is straight forward. If we take the log of x as __x__ then we have the linear model.

 

Exponential:exp.svg

 

Let's take the log of both sides:

 

exp_model.svg

 

Again, we have gotten to a linear model. If we take the log of y as __y__ and then take the exponential of the intercept we can compute this model.

 

Power: pow.svg

 

Let's take the log of both sides:

 

pow_model.svg

 

As before, this is a linear model. In this case, we have to take both the log of x as __x__ and the log of y as __y__ and then finally take the exponential of the intercept to compute this model.

 

Expanded Macro

 

First, let's add a model drop down with a manual set of values:

 

Linear (y=A+Bx):Linear
Logarithmic (y=A+B ln(x)):Logarithmic
Exponential (y=Ae^(Bx)):Exponential
Power (Y=Ax^B):Power

We can feed this into a new formula tool after the one creating __X__ and __Y__ values to create the adjusted series we need.

 

__X__ = IIF([#1] IN ('Logarithmic', 'Power'),LOG([__X__]),[__X__])
__Y__ = IIF([#1] IN ('Exponential', 'Power'),LOG([__Y__]),[__Y__])

And then add an additional step to formula tool, adjusting the intercept for Exponential and Power:

 

Intercept = IIF([#1] IN ('Exponential', 'Power'), EXP([Intercept]),[Intercept])

 

Equation

 

Now that we have different models it seems a good time to add a new column containing the expression. A simple formula tool will handle this:

 

ToString([Intercept]) + 
IF [#1] = 'Logarithmic' THEN
 ' + ' + ToString([Slope]) + ' LOG(X)'
ELSEIF [#1] = 'Exponential' THEN
' EXP(' + ToString([Slope]) + ' X)'
ELSEIF [#1] = 'Power' THEN
' X ^ ' + ToString([Slope])
ELSE
	' + ' + ToString([Slope]) + ' X'
ENDIF

 

R Squared

 

R Squared is a measure which measures how much of the dependent variable is predicted by the model. It is expressed as:

 

r_sq.svg

 

In order to compute this, I needed to join the model values back to the original series. As we might not always have a grouping field, I needed to add a dummy variable, __D__, which can be included in all joins as a common field. Dynamically creating joins is a little harder than adding group by clauses (as we must do it twice!). In this case, we need to take something that looks like "A"|||"B"|||"C" to something like:

 

<Field field="A" />
<Field field="B" />
<Field field="C" />

This is like the last case but needs to be applied to two nodes in the Join configuration. The expression below changes the list to the raw XML needed:

 

IIF([#1]='""','','<Field field=' + + Replace([#1], '|||', ' /><Field field=')+' />')
+
[Destination]

 

action_join.png

 

Now to compute the values we need for R squared, I first used a formula tool to evaluate the top and bottom of the fraction and then used a summarise to make totals. Again, this summarise tool needs to be grouped in the correct manner. Finally, I computed the R squared value for each set. After this, I knew that the output of this calculation would be in the same order as the output of the Slope/Intercept calculation so I joined by record position to add the value to the output. At this point the macro looked like:

 

macro_v2.png

 

Fixed Intercepts

 

The last step was to take in an optional set of intercepts. Alteryx supports optional Macro Inputs, they will come through as an empty (0 row) copy of the template if no input is supplied. Unfortunately, as we were building the grouping fields dynamically, we couldn't have a template that would work for all cases.

 

The workaround I chose, was to take the output of the Summarise tool, sample 0 rows from it, and use it to create a complete set of all the columns we need. This can then be unioned to the Macro Input tool to create a superset of columns. Using similar expressions to the above, we can reconfigure a Join Multiple tool. I chose to use this over a Join tool as it meant that the order of the records from the Summarise tool was preserved.

 

optional_input.png

 

The formula tool at the end copes with the Exponential and Power models, where the Intercept needs to be adjusted. Finally, the Slope calculation needs to be updated to use the Intercept if provided:

 

IIF(IsNull([Intercept]),
([Sum___XY__]-[Count]*[Avg___X__]*[Avg___Y__]) / ([Sum___XX__]-[Count]*[Avg___X__]*[Avg___X__]),
([Sum___XY__]-[Count]*[Intercept]*[Avg___X__]) / [Sum___XX__]
)

 

Wrapping Up

 

finished_macro.png

 

You can download the final macro here.

 

This macro should re-create most of the capabilities of Excel trendlines. The direct manipulation of raw XML does mean that some field names (e.g. containing a ", ||| or >) may cause it some issues.

 

If you want to fit another expression you can hopefully use the techniques above to get you started. Some of the techniques in building the macro are quite advanced but provide a powerful way to add new functionality.

Comments
8 - Asteroid

Hi @jdunkerley79 , 

 

Thanks for this really interesting macro - sounds perfect for something that I wanted. Using a simple (made-up) data set, I'm getting some interesting differences between the macro and Excel, and I'm wondering if they can be easily explained. (I do appreciate that this post is from almost a year ago!!).

 

Using the same data set in Excel & Alteryx through your macro for two groups I get the following:

 

ProductExcel - EquationExcel R²Macro SlopeMacro R²
Product Xy = -0.001x + 10.383R² = 0.9637-0.0009950.963705
Product Yy = -3E-05x + 8.4273R² = 0.00020.0004820.051478

 

It is strange to me that the R2 are so different for the (poorly correlated) Product Y group (but identical for the 1st!).

 

I can't seem to attach files to this post, but data is as follows:

 

Spoiler
Using X = Units
And Y = ASP

ProductCustomer#Customer nameUnitsRevenueASP
Product XCustomer1Customer 11844156748.5
Product XCustomer2Customer 2424424010
Product XCustomer3Customer 3410410010
Product XCustomer4Customer 4203203010
Product XCustomer5Customer 5105910060.59.5
Product XCustomer6Customer 61836156068.5
Product XCustomer7Customer 75225115.69.8
Product XCustomer8Customer 81934164398.5
Product XCustomer9Customer 9154813467.68.7
Product XCustomer10Customer 10150813119.68.7
Product XCustomer11Customer 115125017.69.8
Product XCustomer12Customer 129188904.69.7
Product XCustomer13Customer 131806153518.5
Product XCustomer14Customer 14266266010
Product XCustomer15Customer 151424128169
Product XCustomer16Customer 16192516362.58.5
Product XCustomer17Customer 171498134829
Product XCustomer18Customer 1866064689.8
Product XCustomer19Customer 19402402010
Product XCustomer20Customer 205525409.69.8
Product XCustomer21Customer 21182182010
Product XCustomer22Customer 22270270010
Product XCustomer23Customer 23196916736.58.5
Product XCustomer24Customer 2461059789.8
Product XCustomer25Customer 251274114669
Product XCustomer26Customer 26157413693.88.7
Product XCustomer27Customer 279138856.19.7
Product XCustomer28Customer 285345233.29.8
Product XCustomer29Customer 29169914781.38.7
Product XCustomer30Customer 308067818.29.7
Product XCustomer31Customer 318698429.39.7
Product XCustomer32Customer 32175114883.58.5
Product XCustomer33Customer 33338338010
Product XCustomer34Customer 349839535.19.7
Product XCustomer35Customer 357096948.29.8
Product XCustomer36Customer 36354354010
Product XCustomer37Customer 371520132248.7
Product XCustomer38Customer 38422422010
Product XCustomer39Customer 391916162868.5
Product XCustomer40Customer 401321118899
Product XCustomer41Customer 41106110079.59.5
Product XCustomer42Customer 427717478.79.7
Product XCustomer43Customer 43332332010
Product XCustomer44Customer 448017769.79.7
Product XCustomer45Customer 45170714850.98.7
Product XCustomer46Customer 46174215155.48.7
Product XCustomer47Customer 4759057829.8
Product XCustomer48Customer 48319319010
Product XCustomer49Customer 491826155218.5
Product XCustomer50Customer 50159213850.48.7
Product YCustomer1Customer 1258.81022070.4818
Product YCustomer2Customer 2244.7112447.1110
Product YCustomer3Customer 31860.70814885.668
Product YCustomer4Customer 41484.17611873.418
Product YCustomer5Customer 5426.82914268.29110
Product YCustomer6Customer 6364.60743281.4679
Product YCustomer7Customer 71673.01213384.18
Product YCustomer8Customer 848.32538483.253810
Product YCustomer9Customer 91270.20411431.849
Product YCustomer10Customer 101385.1099695.7657
Product YCustomer11Customer 11950.23056651.6137
Product YCustomer12Customer 1262.06437496.5158
Product YCustomer13Customer 13127.63541148.7199
Product YCustomer14Customer 14828.26877454.4199
Product YCustomer15Customer 151866.31218663.1210
Product YCustomer16Customer 161595.05112760.418
Product YCustomer17Customer 1796.31437674.20067
Product YCustomer18Customer 181152.3399218.7158
Product YCustomer19Customer 191025.0819225.7289
Product YCustomer20Customer 20984.67968862.1179
Product YCustomer21Customer 21645.15715806.4149
Product YCustomer22Customer 22394.84762763.9337
Product YCustomer23Customer 23636.06765088.548
Product YCustomer24Customer 24248.47741987.8198
Product YCustomer25Customer 251001.2958010.3648
Product YCustomer26Customer 261850.03816650.349
Product YCustomer27Customer 271047.35510473.5510
Product YCustomer28Customer 281152.8259222.6048
Product YCustomer29Customer 29149.63831197.1068
Product YCustomer30Customer 301311.02711799.259
Product YCustomer31Customer 311177.5078242.5497
Product YCustomer32Customer 32201.66141613.2918
Product YCustomer33Customer 33982.21539822.15310
Product YCustomer34Customer 341146.4839171.8648
Product YCustomer35Customer 35587.7734702.1848
Product YCustomer36Customer 36615.16125536.4519
Product YCustomer37Customer 371935.33419353.3410
Product YCustomer38Customer 38369.22613323.0359
Product YCustomer39Customer 39643.67094505.6967
Product YCustomer40Customer 401128.7811287.810
Product YCustomer41Customer 41727.12746544.1469
Product YCustomer42Customer 42302.92133029.21310
Product YCustomer43Customer 43490.0544410.4869
Product YCustomer44Customer 441560.27812482.238
Product YCustomer45Customer 451609.53214485.799
Product YCustomer46Customer 461122.4310101.879
Product YCustomer47Customer 47627.08265643.7449
Product YCustomer48Customer 48158.95541112.6887
Product YCustomer49Customer 49890.68918016.2029
Product YCustomer50Customer 501911.79619117.9610

 

Best wishes,

 

James

5 - Atom

Can someone please give this man a medal. This is fantastic

5 - Atom

How to created Batch Macros ?