Hi,
I have a workflow that does many joins.
I am joining 10 fields from dataset A(Left), to 1 field in dataset B(right) in each join, Taking the left output and inner of each join and using a union to bring these fields together.
Is there a faster way of accomplishing this? It seems to take my workflow long to run.
Hi @wonka1234 ,
probably, but because of my brain thing can you provide some mock data and show us what you're trying to do?
M.
Where is the data coming from?
If its coming from a database that you are connecting to. I'd recommend using the InDB Join tool as this will build a query to execute against your data base directly. (Also if your input data is coming from a data base that could also be why the workflow is running slow)
Please give us a better idea of where the data is coming from and what it looks like.
If its just Excel files or if its other flat files:
How many records are you using for each join? More or less than 100,000?
Are there any expected duplicate matches in the Join tools? Or are all the matches 1 to 1? If its less than 100,000 and there are no expected duplicate matches then the Find and Replace tool can be used instead. You'll need to Concatenate the 10 join fields together and find and replace.
If its greater than 100,000 records then the Join tool will likely be your best option.
data coming from large excel files.
I am joining 430 records from Dataset A with 5 million records from dataset B.
I join from 12 different columns from Dataset A, to the Same column each time on Dataset B.
The purpose of this was to replicate a python loop through columns + a left outer join.
Their are duplicates at the join, however I should be able to remove duplicates prior to join.
**bleep** this sucks, its takes awhile.
Could you transpose Dataset A first, to put all 12 joining columns into one? At least then you only have to do a single join to the 5 million records in Dataset B!
Try the Find and Replace Tool. Use your A data set as the R input for the Find and Replace. You'll need to concatenate the 12 field in both data sets together to create a Primary Key. Find and replace on this key.
If this is just replicating a pre existing Python Loop: How fast does the Python code run? Have you consider just moving the Python Script into Alteryx using the Python Tool?
Also just to clarify... how many input files do you have?
If you just have 2 data sets A and B, are you loading Dataset A into Alteryx multiple times?
A snapshot of the rest of your workflow would be very helpful to see the big picture here:
You can hide the annotations if needed so you don't have to redact each one:
I have two input files - and loading dataset A and B into Alteryx once. Dataset A is around 152MB, I should also clarify that its around 430 records going into the join, I do a filter after dataset A is loaded.
Hopefully this paints a clearer photo - its still running and I dont want to stop it.
Thanks for your patience.
not sure if this will help but here is my python code I am trying to replicate.
column_list = pd.DataFrame(['ColA','ColB','ColC','ColD'])
final_df = pd.DataFrame()
for column in range(len(column_list)):
hr_new=hr.copy()
hr_new.dropna(subset=[column_list.iloc[column,0]], inplace = True)
hr_new['ID']=hr_new[column_list.iloc[column,0]]
merged_data = pd.merge(hr_new, dataframenotshown, how='left', left_on='ID', right_on ='IDtwo')
final_df = final_df.append(merged_data)
MegaDF = final_df.drop_duplicates()
I think the python helped... confirming my mock up of what I think is happening. I still might not be getting the entire picture though. Mind sharing a better mock up?
Dataset A
Data Set A | |||
ID | Field1 | Field2 | Field3 |
A | F1 | F2 | F3 |
B | F1 | F2 | F3 |
C | F1 | F2 | F3 |
Dataset B
Data Set B | |||
ID | Field2 | ||
A | Data1 | ||
B | Data1 | ||
C | Data1 | ||
A | Data2 | ||
B | Data2 | ||
C | Data2 |
Merged | |||
ID | Field2 | Unioned Data | |
A | Data | F1 | |
B | Data | F1 | |
C | Data | F1 | |
A | Data | F1 | |
B | Data | F1 | |
C | Data | F1 | |
A | Data | F2 | |
B | Data | F2 | |
C | Data | F2 | |
A | Data | F2 | |
B | Data | F2 | |
C | Data | F2 | |
A | Data | F3 | |
B | Data | F3 | |
C | Data | F3 | |
A | Data | F3 | |
B | Data | F3 | |
C | Data | F3 |