Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Sample based on two columns

Inactive User
Not applicable

TL;DR: I need a 5% sample of a population for each distinct combination of 2 sets of categories.

 

Hi everyone,

 

Let's say I have a data set where I collect the rating responses for different desserts in different cities (where I have my dessert superstores!). Now I want to contact a small sample of customers again for a longer survey to figure out how to refine my desserts and be more successful.

I have a range of 6 desserts that I sell in 3 different cities, with thousands of ratings per dessert&city combination.

Not all desserts have the same amount of ratings; however, I want to sample 5% of each dessert&city combination. So, if I have "Los Angeles&Apple Pie" 1000 times, I want to get a random 50 ratings; if I have "Chicago&Cherry Blossom Mochi" only 200 times, I want to get 10 random ratings. 🙂

 

Example data showing all 6 dessert types:

LocationDessertRating
Los AngelesApple Pie7.2
Los AngelesMarzipan Bar3.3
Los AngelesChocolate Pudding7.4
Los AngelesMacarons9.8
Los AngelesPistachio Ice Cream6.1
Los AngelesCherry Blossom Mochi9.9
New York CityApple Pie5.7
New York CityMarzipan Bar8.3
New York CityChocolate Pudding9.8
New York CityMacarons9.1
New York CityPistachio Ice Cream7.5
New York CityCherry Blossom Mochi5.0
ChicagoApple Pie6.6
ChicagoMarzipan Bar8.0
ChicagoChocolate Pudding6.8
ChicagoMacarons3.4
ChicagoPistachio Ice Cream7.1
ChicagoCherry Blossom Mochi8.7

 

I'm grateful for any suggestions!

 

Longer example data attached. 😃

5 REPLIES 5
AngelosPachis
16 - Nebula

Hi @Inactive User ,

 

Would something like this work for you?

 

Screenshot 2020-10-27 180951.jpg

 

The dataset is not large enough to see the sample, but it should work in a larger dataset

atcodedog05
22 - Nova
22 - Nova

Hi @Inactive User 

 

Here is a workflow for the task.

Output:

atcodedog05_0-1603822239542.png

 

Workflow:

atcodedog05_1-1603822275885.png

I can creating a RandomID and sorting by it to random zine the data.

 

Then use sample tool n%

atcodedog05_0-1603822390339.png

Currently 5% is too small with the number of rows. With more rows it will be more effective🙂

 

Hope this helps 🙂


If this post helps you please mark it as solution. And give a like if you dont mind 😀👍

Inactive User
Not applicable

@AngelosPachis Oh I'm so embarrassed, I can't believe it was this easy 🙈 Clearly I need to take a break from working on this -- thank you so much!

atcodedog05
22 - Nova
22 - Nova

Hi @Inactive User 

 

As per your requirement last line said you wanted random sample right

 

Just using a sample tool will not pick random sample. It only pick first n% rows.

My workflow randomizes the selection with help of a randomID

 

And 1 in every n rows is not same as N%

AngelosPachis
16 - Nebula

Hi @Inactive User ,

 

Don't be embarrassed since as it turned out I wasn't totally correct. As @atcodedog05 passionately pointed out, indeed my previous answer won't give you a perfectly random sample.

 

Instead, you can either follow the workflow he proposed, but sample for the first 5% of rows to get the correct answer. Otherwise, you can use a batch macro with a random sample tool, which you can also find attached

 

Screenshot 2020-10-27 184038.jpg

 

Hope that solves it

Labels