Dear fellow Alteryx fans,
I would like to create a Monte Carlo simulation using the simulation sampling tool and I'm a little confused how to set it up. I have anonymized my data so that we can discuss the issue, and hope that by working this through on the forum it will be of help to others.
The example is a little contrived, so please bear with me!
I have three customer profiles. Assume my website serves each of these profile groups in proportion. I have a dataset containing 250,000 rows. I have analysed these customers coming from four different channels (A, B, C and D) and looked at the relationship between the number of page visits versus the total revenue. Specifically, I wish to examine the percentage of page views by channel versus the percentage of revenue by channel.
Please find attached a simple tableau representation of the full data, this details:
Page visits by channel in terms of the three profile groups
revenue by channel in terms of the three profile groups
performance, i.e. the percentage of total revenue divided by the percentage of total page visits
We can see that for the entire dataset, in channel A customer profile group 1 outperforms by 11%, but group 2 underperforms by 7%.
I would like to use the Monte Carlo simulation to construct levels of confidence around these percentages. I'm thinking that one way we can do this is by taking a sample of data and making the same calculation, and then repeating this exercise again and again. I would expect the mean outperformance in channel A for customer profile group 1 to be 11%, but what's the standard deviation around this? This is where the simulation tool would be useful if I knew how to use it!
Hopefully somebody out there will find this interesting!
Very grateful for your help.
Best wishes,
Jonathan
Solved! Go to Solution.
Would something like this work for you:
Basically, this uses generate rows to make a set of simulations. Within each simulation, it creates a set of trials and then picks random data points from the input data.
Finally, it creates percentages for each simulation and then averages and std dev over the simulation set
Sample attached for you to look at
Aurora,
Super kind of you to take the time to reply to my question.
It looks like a very comprehensive answer, though avoids the simulation tool… No problem there, just need to get the job done!
I will go through your answer in more detail and revert if I have any questions. In the meantime, thank you kindly.
Best wishes,
Jonathan
Thank you both for the interesting problem and example.
I have one question - In the Formula tool, is there any particular reason you used:
RandInt([RowCount]-1) + 1
Instead of picking a random number between 1 and 242,095 for RecordID?
The latter seems to make more sense than picking a RecordID between 0 and 1 for the first trial, between 0 and 100 for the 100th trial, between 0 and 10,000 for the 10,000th trial, etc.
Does the approach in the screenshot bias the selection towards the first 10,000 data points in the sample dataset, vs. considering the full set of 242,095 sample data points?
Thank you
Hi,
The `[RowCount]` is the count of rows from the input data side and is a constant 242,095. It should be uniformly distributed over the set of input rows.
James
Hi, @jdunkerley79
Dear master, I send a private message to your inbox before 3 months ago, so could you please see that if you have free time ?
Thanks for the reply!
I think it may be a small typo in the workflow - the "Count" column is the constant you refer to; "RowCount" is the column identifying each trial within a simulation
Either way, thank you for clarifying that I was thinking about it the right way!