Hi all,
I am new to Alteryx.
I am trying to achieve the following:
1) To read in input data
2) To read in sample size = 13
3) To create sample table
4) To pick samples from each group based on the number listed in the sample table
Input data
Record ID | Group |
1 | 22 |
2 | 22 |
3 | 23 |
4 | 23 |
5 | 23 |
6 | 23 |
7 | 24 |
8 | 27 |
9 | 27 |
10 | 27 |
11 | 28 |
12 | 28 |
13 | 28 |
14 | 28 |
15 | 32 |
16 | 32 |
17 | 32 |
18 | 32 |
19 | 32 |
20 | 32 |
21 | 32 |
22 | 32 |
23 | 32 |
24 | 32 |
25 | 32 |
26 | 32 |
27 | 32 |
28 | 32 |
29 | 32 |
30 | 32 |
31 | 32 |
32 | 32 |
33 | 37 |
34 | 37 |
35 | 37 |
Sample size: 13
To create a sample table
1) To read the sample size = 13 (manual input)
2) To count the number of records per each group
3) To calculate the % of no of records per group over the whole population (higher the percentage, more sample would be selected from that group)
4) To calculate the number of samples to be picked for each group. Each group must have at least one picked, no matter what the percentage is.
5) To pull addition samples from the ones with the highest %, to get to the required total sample size (in this case 13).
Group | Count | % population | Calculated sample size | Adjusted sample size | |
22 | 2 | 0.057142857 | 1 | 1 | |
23 | 4 | 0.114285714 | 1 | 1 | |
27 | 3 | 0.085714286 | 1 | 1 | |
24 | 1 | 0.028571429 | 1 | 1 | |
28 | 4 | 0.114285714 | 1 | 1 | |
32 | 18 | 0.514285714 | 6 | 7 | |
37 | 3 | 0.085714286 | 1 | 1 | |
Total | 35 | 12 | 13 |
Sample output (random sampling, number of samples to be picked is based on "Adjusted sample size" column in sample table above.
Record ID | Group |
1 | 22 |
4 | 23 |
9 | 27 |
7 | 24 |
13 | 28 |
21 | 32 |
25 | 32 |
22 | 32 |
24 | 32 |
29 | 32 |
30 | 32 |
31 | 32 |
35 | 37 |
It is a very Excel driven idea as I am not familiar with Alteryx. If I can by-pass the sample table creation step and get to final sample output that would be great.
Any idea/suggestions would be greatly appreciated.
Solved! Go to Solution.
Hi @tww
Here's one way to do it.
The bottom branch calculates the group totals, the grand total and the number of records selected from each group. This is summarized into the Current Sample. The portion in the container allocates one additional row to each of the largest groups until the total number sampled is equal to the your target. In the top branch a random number is added to each row and the rows are sorted by group and rand to scramble the item in the group. Then the top records are selected from each group according to the calculated group sample size.
This method works with your sample data, but you might need to adjust it when applied to real data. You'll need to modify the portion in the container if you want to distribute your additional records by another method, i.e. biased by group size.
Dan
Thank you so much Dan! This works great.