Alteryx Designer Desktop Discussions

rshack2005 · ‎10-23-2021

Hello!

Looking for a little guidance on the best way to set up a workflow to select a "weighted sample" that selects the largest proportion of samples from a high risk bucket, a moderate selection of samples from a medium risk bucket, and the smallest proportion from a low risk bucket.

The sample size will vary based on (a) population size, and (b) required confidence level. Blue table below shows what my sample size might look like in different population and confidence scenarios.

Each transaction will be assigned to a "risk bucket" (green table below), which is what should drive the sample weighting. The weight would be as follows..

Select 75% of sample high risk transactions
Select 20% of sample medium risk transactions
Select 5% of sample low risk transactions

Any guidance on how to set this up? The oversample tool seems to have a different purpose and I don't see how you designate the sample size. Would it be distilling all of the above into a complex web of IF statements?

Thanks a lot!!

apathetichell · ‎10-23-2021

The easiest way to do this is to use two filters - first filter checks for if something is high risk bucket (feeds to random % sample at 75%) second filter checks if something is medium risk bucket - feeds to sample % at 20% false anchor feeds to sample % tool 5%. Sample % tools go to Union tool to rejoin.

This is the most straight-forward way. I'd use a batch macro with two control parameters - but that's not as straight forward.

rshack2005 · ‎10-23-2021

Thanks for this - interesting. One issue I can see is if there aren't enough transactions in the high risk bucket to satisfy the 75% target sample size, I would need to pass the remaining sample allotment to the next lower bucket.

But I guess the tricky part is the varying sample size. Can you feed/pass a target sample size to the sampling tool? Or could it be done with the batch macro that you mentioned?

danilang · ‎10-24-2021

Hi @rshack2005

The sample tool is good if you have fixed cut offs, but if you have a complex situation like yours you need to roll-your-own sampler. Here's a generic sampling methodology to select X rows from a list

1. Add a random number

2. Sort by the random number, which randomizes the list.

3. Select the first X rows.

Your case is a bit more complex since the X varies by population and also risk level.

The salmon coloured "Create sample data..." container does what it says on the label. Replace this with your own data.

The blue "Determine limits" container is where you calculate the sample sizes for risk levels based on the population size.

The two text inputs hold your population ranges and sample amounts(bottom one) and percentage based sample amounts if you have more data than the maximum in the ranges. The range input is expanded to give one record for each value in the range. Your input data is counted and the message tool throws an error if you have fewer than the low bound of the smallest range. The count is joined to the expanded range data. There only two possible outcomes for this join.

1. There is a match to a single range record which comes out the J output and the R output is empty. In this case just pass the sample sizes out of the container

2. There are no matches in which case the R output contains the count and the J output is empty. The R output uses the percentages to calculates sample sizes based on the population count.

The final Union tool passes the record from condition 1 or 2

The green container applies the first two steps of the randomizing algorithm from above. The single record from the "Determine Limits" is appended to randomized data and then the Filter can apply Step 3 of the randomizing algorithm

([Risk]="High" and [RowInGroup]<=[High]) OR
([Risk]="Medium" and [RowInGroup]<=[Medium]) OR 
([Risk]="Low" and [RowInGroup]<=[Low])

This picks the first X records from the randomized sample data for each of the High, Medium and Low groups.

In the "Create Sample data..." container change the Generate rows tool Condition expression from RowCount <=250 to a different value to test the various ranges and also over and under conditions

Dan

Alteryx Designer Desktop Discussions

Sampling tool