Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
SOLVED

Generate Synthetic Data

Alteryx Partner

Hi Team,

I have a requirement to generate massive amounts of synthetic data for a particular data model. We already have the range values, expected field calculations, etc.

I expect to generate over 1B records. Is this achievable with Alteryx?

 

15 - Aurora

That sounds like a lot of data, but in general, yes: Alteryx is great for generating random data sets.

Use a "Generate Rows" tool to generate as many rows as you need, then a Formula tool to fill them with random data. The Math > RandInt(n) and Math > Rand() functions are available in the Formula expression.

If you have ranges, something like [RangeBeginning] + RandInt([RangeEnd] - [RangeBeginning]) will give you an integer in your range.
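
For readers who want to check the range arithmetic outside Alteryx, here is a minimal Python sketch of the same expression. It assumes RandInt(n) returns an integer from 0 to n inclusive, which is my understanding of the function; the names rand_int and random_int_in_range are hypothetical and not part of Alteryx.

```python
import random

def rand_int(n, rng=random):
    """Stand-in for RandInt(n): assumed to return an integer from 0 to n inclusive."""
    return rng.randint(0, n)

def random_int_in_range(range_beginning, range_end, rng=random):
    """Mirror of [RangeBeginning] + RandInt([RangeEnd] - [RangeBeginning])."""
    return range_beginning + rand_int(range_end - range_beginning, rng)

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same values
    print([random_int_in_range(10, 20, rng) for _ in range(5)])
```

Under that assumption, both ends of the range can occur, because the offset runs from 0 up to the full width of the range.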

Alteryx Partner

Thanks.
Is there a Rand() function that allows me to set a range of values?

16 - Nebula

I have attached a sample generator I did for another question on here.

 

The easiest way to do a range is:

Rand() * ([Upper]-[Lower]) + [Lower]
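
The same scaling can be sketched in Python to see why it works: a value in [0, 1) is stretched to the width of the range and then shifted to start at the lower bound. This assumes Rand() behaves like Python's random.random() (at least 0 and less than 1); the function name random_in_range is hypothetical.

```python
import random

def random_in_range(lower, upper, rng=random):
    """Mirror of Rand() * ([Upper] - [Lower]) + [Lower]."""
    # rng.random() returns a float in [0.0, 1.0), assumed to match Rand()
    return rng.random() * (upper - lower) + lower

if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so repeated runs give the same values
    print([round(random_in_range(5.0, 15.0, rng), 3) for _ in range(5)])
```

Unlike the RandInt approach, this produces fractional values, so round or truncate the result if the field needs integers.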
15 - Aurora

[Edit: removing duplicate solution - same as @jdunkerley79's]

Alteryx

@JohnJPS & @jdunkerley79 have given you the solution, but I would like to add one thing because of the amount of data you're dealing with.

 

Play around with the order of tools to increase speed. Under "Workflow Properties > Runtime" you can turn on Performance Profiling to see how long each tool takes. Generally, generate your categorical variables first, as that will be less data. You may also find that changing the position of a join significantly reduces the time your workflow takes to run. For example, don't try to join a billion rows to a billion rows based on a field; instead randomise/sort/organise and then join by record position.
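
As a rough illustration of the join-by-position idea, here is a small Python sketch. It is an analogy only, not an Alteryx workflow: the column names, category values, and the 0 to 1000 range are all hypothetical. Two columns are generated independently, one is shuffled, and the rows are then paired by position rather than matched on a key field.

```python
import random

def build_rows(n, seed=0):
    """Generate two columns independently and pair them by record position."""
    rng = random.Random(seed)
    regions = ["North", "South", "East", "West"]  # hypothetical categorical values

    # Generate the categorical column first, then the numeric one.
    region_col = [rng.choice(regions) for _ in range(n)]
    amount_col = [rng.random() * 1000 for _ in range(n)]  # hypothetical 0-1000 range

    # Randomise the order of one side, then pair row i with row i.
    # No key lookup or matching is needed, so the pairing is a single pass.
    rng.shuffle(amount_col)
    return list(zip(region_col, amount_col))

if __name__ == "__main__":
    for row in build_rows(5, seed=42):
        print(row)
```

Pairing by position avoids the matching work a key-based join has to do, which is the saving the post describes at billion-row scale.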

 

Kane
