Hi all,
I am trying to replicate the following pandas functionality in Alteryx:
import pandas as pd

# hr and data are loaded earlier in the script
column_list = pd.DataFrame(['Col1', 'Col2', 'Col3', 'Col4', 'Col5',
                            'Col6', 'Col7', 'Col8', 'Col9', 'Col10',
                            'Col11', 'Col12'])
final_hr = pd.DataFrame()
for column in range(len(column_list)):
    hr_new = hr.copy()
    # Drop rows containing NaN in the current column for the new merge
    hr_new.dropna(subset=[column_list.iloc[column, 0]], inplace=True)
    # Creating a new column for the merge
    hr_new['ID_final'] = hr_new[column_list.iloc[column, 0]]
    # Case folding
    hr_new['ID_final'] = hr_new['ID_final'].str.strip().str.upper()
    # Merge data
    merged_data = pd.merge(hr_new, data, how='left', left_on='ID_final', right_on='OtherID')
    # Concatenating all data together
    final_hr = final_hr.append(merged_data)
HR_attestation = final_hr.drop_duplicates()
My thinking on the logic: for each iteration of the loop, it takes one of the user ID fields (Col1 to Col12) and writes it to the ID_final field, then tries to use that field to join to "data", and finally appends everything to the final_hr dataframe.
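To make that concrete, here is roughly what a single pass of the loop does on a tiny made-up example (the small hr/data frames below are purely illustrative, not my real data):

import pandas as pd

# Tiny illustrative frames, not the real data
hr = pd.DataFrame({'Employee': ['A', 'B'],
                   'Col1': [' x1 ', None],
                   'Col2': ['y1', 'y2']})
data = pd.DataFrame({'OtherID': ['X1', 'Y2'],
                     'System': ['sys1', 'sys2']})

# One iteration, with the current column being 'Col1'
hr_new = hr.dropna(subset=['Col1']).copy()
hr_new['ID_final'] = hr_new['Col1'].str.strip().str.upper()
merged = hr_new.merge(data, how='left', left_on='ID_final', right_on='OtherID')
# Employee A matches OtherID 'X1'; employee B is dropped because Col1 is missing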
I am having trouble replicating this in Alteryx. What I have tried is:
1) Joins between "hr_new" and "data", joining OtherID to each of Col1 to Col12.
2) Taking the inner join of each and unioning them together.
3) Taking "hr_new" and joining it to that union on Record ID.
4) A second union taking the left and inner joins of this.
Any help would be appreciated.
These joins/unions are dropping some records, so I am wondering if my logic is correct.
Hi @wonka1234
I'm not entirely clear on what you need to accomplish, so a further explanation would help, but I think the process is simpler than it looks (thinking in Alteryx rather than in code). From what I understand:
column_list = pd.DataFrame(['Col1','Col2','Col3','Col4', 'Col5','Col6','Col7','Col8','Col9','Col10','Col11','Col12'])
- This is a Select Tool, where you select which columns of the Dataset you want to keep, which to discard, and you can rename them and change types.
final_hr = pd.DataFrame()
for column in range(len(column_list)): -> You don't need this; Alteryx processes your data stream record by record.
hr_new=hr.copy()
#Drop rows containing NAN for column 'col4' for new merge
hr_new.dropna(subset=[column_list.iloc[column,0]], inplace = True)
- This part sounds like a Filter Tool with the expression !IsNull([Col4]) (or [Col4]!='NAN' if the missing values come through as the literal string 'NAN')
#Creating a new column for merge
hr_new['ID_final']=hr_new[column_list.iloc[column,0]]
- This is a Formula Tool
#case folding
hr_new['ID_final']=hr_new['ID_final'].str.strip().str.upper()
- Another Formula Tool (or a Cleanse Tool)
#Merge data
merged_data = pd.merge(hr_new, data, how='left', left_on='ID_final', right_on ='OtherID')
- This is a Join Tool on [ID_final] and [OtherID]; since the pandas merge is a left join, union the J and L outputs so unmatched hr rows are kept
#Concatenating all data together
final_hr = final_hr.append(merged_data)
HR_attestation = final_hr.drop_duplicates()
- This can be resolved with a Unique Tool or a Summarize Tool
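If it helps to see the whole thing without the loop, here is a rough pandas sketch of the same idea in one pass (the melt step is roughly what a Transpose Tool does in Alteryx; hr, data and OtherID are taken from your snippet, and the output won't be byte-for-byte identical to the loop because the twelve ID columns get collapsed into one variable/value pair):

id_cols = ['Col1', 'Col2', 'Col3', 'Col4', 'Col5', 'Col6',
           'Col7', 'Col8', 'Col9', 'Col10', 'Col11', 'Col12']

# Reshape to one row per (hr row, ID column) pair -- roughly a Transpose Tool
long_hr = hr.melt(id_vars=[c for c in hr.columns if c not in id_cols],
                  value_vars=id_cols, value_name='ID_final')

# Drop missing IDs and case-fold -- Filter + Formula/Cleanse Tools
long_hr = long_hr.dropna(subset=['ID_final'])
long_hr['ID_final'] = long_hr['ID_final'].str.strip().str.upper()

# Left join to the lookup data, then de-duplicate -- Join (J+L unioned) + Unique
HR_attestation = (long_hr
                  .merge(data, how='left', left_on='ID_final', right_on='OtherID')
                  .drop_duplicates())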
First, consider whether you really need to iterate over rows in a DataFrame at all. Iterating through pandas DataFrame objects is generally slow and defeats the whole purpose of using a DataFrame; it is an anti-pattern you should reach for only when you have exhausted every other option. It is better to look for a list comprehension, a vectorized solution, or the DataFrame.apply() method.
Pandas DataFrame loop using list comprehension
result = [(x, y, z) for x, y, z in zip(df['Name'], df['Promoted'], df['Grade'])]
Pandas DataFrame loop using DataFrame.apply()
result = df.apply(lambda row: row["Name"] + " , " + str(row["TotalMarks"]) + " , " + row["Grade"], axis=1)
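For context, here is a minimal self-contained version of those two patterns (the Name/TotalMarks/Promoted/Grade columns are just illustrative placeholders):

import pandas as pd

df = pd.DataFrame({'Name': ['Ann', 'Bob'],
                   'TotalMarks': [82, 67],
                   'Promoted': [True, False],
                   'Grade': ['A', 'B']})

# List comprehension over zipped columns
pairs = [(name, promoted, grade)
         for name, promoted, grade in zip(df['Name'], df['Promoted'], df['Grade'])]

# Row-wise apply building one summary string per row
summary = df.apply(lambda row: f"{row['Name']} , {row['TotalMarks']} , {row['Grade']}", axis=1)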