Data creation with Alteryx
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I have a data creation problem that I'd like to solve with Alteryx.
Say we want create a dummy list of customers; 5000 of them. Then we want to allocate them across 5 locations as follows:
10% to Location1,
20% to Location2,
30% to Location3,
25% to Location4,
15% to Location5.
Then we want to assign them across 5 sales reps so that:
SalesRep1 gets 10%,
SalesRep2 gets 20%,
SalesRep3 gets 40%,
SalesRep2 gets 5%,
SalesRep2 gets 25% of the customers across each location.
Then we want to assign them a rating: 10% get A, 40% get AA, 50% get AAA. The rating percentages are per location and per sales rep… and then there are couple of other similar nested rules.
Does anyone know of an easy way to achieve that with Alteryx? I also want to endure I don't lose extreme values on the way, e.g. 5% of 5% of 5% so there needs to be some rounding up... and the model should always produce 5000 rows.
Thank you!
Solved! Go to Solution.
- Labels:
- Expression
- Join
- Preparation
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I think the "RandomSample" tool should be able to get the sample splits for you at each step (e.g. by grabbing a random N% of the records; or just a random N records, which may be easier than trying to recalculate new percentages).
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
A little bit of process to produce I think.
First generate the rows and assign a unique ID.
The first task is to convert to a set of customer id bins and assign a location. We can use a summarise tool to ensure each customer ID appears only once.
Then shuffle within each location to a random order and assign an Location Customer ID for each customer in each location bin.
We repeat the process to assign each Location Customer ID to a Sales Rep.
Then repeat again to assign a rating.
Have attached a flow doing this.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
This works great! Thank you!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I've attached another workflow that uses R to allow flexibility in how you apply your breakdowns: just describe them however you like similar to the three categories already shown; and add them to the given mulit-join tool. The given R command will then generate the sample data, adding it to the input data frame.
The R code first processes the "breakdown matrix" and determines how many rows (customerIDs) are needed for each possible "breakdown" ... then just loops through those combinations, sampling the appropriate number of rows (from rows not already sampled) and populating the test values. (Perhaps @jdunkerley79 could turn this approach into Alteryx-only tools? :-)
Enjoy!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Am game to try :) - must make some progress on my v0.3 Addins though first (HexBins tonight I hope!)
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
That left me time to putz with it. :-) The attached workflow does it in similar fashion to what I described above.
Minor caveats to this one:
- It's not quite as "randomized all over the place" as it just constructs the appropriate number of rows for each combination and then does a RecordID on the final construction.
- It can be prone to having fewer-than-expected records in the last combination, due to rounding issues building up to it; (recommend using "Ceil" instead of "Round" and having a few too many rows in the end.
- If you add new "breakdown" categories, you will have to modify other tools in the workflow, e.g to select them, include them in the formula, etc.
Otherwise it runs fast and doesn't use R.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Nice solution @JohnJPS
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I like the R solution, thanks! We're hoping for a bit of randomnes, not an exact split ideally and that seems to the job. It seems fairly scalable too as we need to run the same process independently for a few [independent] sets of metrics and combine everything together. I'll give it a go now...