Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
GELÖST

Sampling based on the proportion of the entity in the population

aabughali
Meteoroid

Hi,

 

I'm working on the attached workflow which will determined the sample size and select samples based on the proportion of the entities in the population. I can't figure out the last part 'Select Remaining Samples'. I want the remaining samples to be selected based on the % of the entity in the population. The previous steps were easy since the entities won't change. But the remaining samples might pertain to entities with different names that I can't just filter in.

 

Thanks in advance!

 

Capture.PNG

  • Apps
7 ANTWORTEN 7
Qiu
20 - Arcturus
20 - Arcturus

@aabughali 
The input file is missing in your attachment, so we can not really run the flow.
but for the remaining samples, I guess we can only join the Sampled Data with overall data then take the data from left anchor as  remaining samples, if we can.

aabughali
Meteoroid

The input file has one field “legal entity”, it doesn’t matter what’s in that column.

The workflow already has the remaining population to be sampled. The question is how to select x number of samples (defined in the macro) from a population based on the entity’s proportion in the population.

apathetichell
Pollux

I used R (sample.split) whenever I needed to do this...

aabughali
Meteoroid

I’m not familiar with R coding, can you give an example please?

aabughali
Meteoroid

I solved it using a Python tool, and here's the code:

# Import necessary libraries
import pandas as pd

# Read the dataframes from Alteryx
df1 = Alteryx.read("#1")
df2 = Alteryx.read("#2")

# Calculate the total sample size
total_sample_size = int(df2.iloc[0, 0])

# Calculate the count of records for each legal entity
entity_counts = df1['Legal Entity'].value_counts()

# Calculate the proportion of records for each legal entity
entity_proportions = entity_counts / len(df1)

# Calculate the sample size for each legal entity based on its proportion
sample_sizes = (entity_proportions * total_sample_size).round().astype(int)

# Perform sampling for each legal entity
sampled_data = pd.DataFrame()
for legal_entity, size in sample_sizes.items():
entity_data = df1[df1['Legal Entity'] == legal_entity]
sampled_group = entity_data.sample(n=size, replace=True) # Sampling with replacement
sampled_data = pd.concat([sampled_data, sampled_group])

# If the total sample size is not met, adjust the sample size for the last entity
while len(sampled_data) < total_sample_size:
oversample_entity = sample_sizes.idxmax() # Get the entity with the highest sample size
sampled_group = df1[df1['Legal Entity'] == oversample_entity].sample(n=1, replace=True)
sampled_data = pd.concat([sampled_data, sampled_group])

# Trim the excess samples if more than the total sample size
sampled_data = sampled_data.head(total_sample_size)

# Reset index of sampled data
sampled_data.reset_index(drop=True, inplace=True)

# Output the sampled data
Alteryx.write(sampled_data, 1)

--

Thank you all!

aabughali
Meteoroid

Code

aabughali
Meteoroid

.

Beschriftungen