
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Better way to replace many joins

wonka1234
10 - Fireball

Hi,

 

I have a workflow that does many joins. 

I am joining 10 fields from dataset A (left) to 1 field in dataset B (right), one join per field, then taking the Left and Inner outputs of each join and using a Union to bring these fields back together.

 

Is there a faster way of accomplishing this? My workflow takes a long time to run.

 

wonka1234_0-1639584974914.png

 

mceleavey
17 - Castor

Hi @wonka1234 ,

 

Probably, but to help my brain out, can you provide some mock data and show us what you're trying to do?

 

M.




csmith11
11 - Bolide

Where is the data coming from?

 

If it's coming from a database that you are connecting to, I'd recommend using the In-DB Join tool, as this will build a query to execute directly against your database. (If your input data is coming from a database, that could also be why the workflow is running slowly.)

 

Please give us a better idea of where the data is coming from and what it looks like.

 

If it's just Excel files or other flat files:

How many records are you using for each join - more or fewer than 100,000?

 

Are there any expected duplicate matches in the Join tools, or are all the matches 1-to-1? If it's fewer than 100,000 records and there are no expected duplicate matches, then the Find Replace tool can be used instead. You'll need to concatenate the 10 join fields together and find-and-replace on that key.

 

If it's greater than 100,000 records, then the Join tool will likely be your best option.
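For intuition, the concatenated-key Find Replace idea amounts to a dictionary lookup instead of a join. A minimal pandas sketch (the column names `k1`, `k2`, `repl` are made up for illustration, not from the thread):

```python
import pandas as pd

# Mock data with hypothetical column names.
left = pd.DataFrame({"k1": ["a", "a", "b"], "k2": ["x", "y", "x"], "val": [1, 2, 3]})
lookup = pd.DataFrame({"k1": ["a", "b"], "k2": ["x", "x"], "repl": ["A-X", "B-X"]})

# Concatenate the join fields into a single key on both sides.
left["key"] = left["k1"] + "|" + left["k2"]
lookup["key"] = lookup["k1"] + "|" + lookup["k2"]

# A 1-to-1 mapping behaves like a Find Replace: one pass, no row explosion.
mapping = dict(zip(lookup["key"], lookup["repl"]))
left["repl"] = left["key"].map(mapping)

print(left[["k1", "k2", "val", "repl"]])
```

Unmatched keys simply come back as NaN, mirroring the Left (unjoined) output of a Join tool.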

wonka1234
10 - Fireball

The data is coming from large Excel files.

I am joining 430 records from dataset A with 5 million records from dataset B.

I join on 12 different columns from dataset A, to the same column each time on dataset B.

The purpose of this was to replicate a Python loop through columns plus a left outer join.

 

There are duplicates at the join; however, I should be able to remove duplicates prior to the join.

 

**bleep** this sucks, it takes a while.

Christina_H
14 - Magnetar

Could you transpose Dataset A first, to put all 12 joining columns into one? Then you only have to do a single join to the 5 million records in Dataset B!
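A rough pandas sketch of the same idea (Alteryx's Transpose tool roughly corresponds to `melt`; the column names below are made up for illustration):

```python
import pandas as pd

# Hypothetical mock-up: Dataset A has several ID-like columns; 3 shown for brevity.
a = pd.DataFrame({"row": [1, 2], "ColA": ["x", "y"], "ColB": ["y", "z"], "ColC": ["z", "x"]})
b = pd.DataFrame({"IDtwo": ["x", "y", "z"], "info": [10, 20, 30]})

# Transpose the join columns into one long column (Transpose ~= melt) ...
long_a = a.melt(id_vars="row", var_name="source_col", value_name="ID")

# ... then a single join to the big table instead of 12 separate joins.
merged = long_a.merge(b, how="left", left_on="ID", right_on="IDtwo")

print(merged)
```

The big table is scanned once instead of 12 times, which is usually where the time goes.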

csmith11
11 - Bolide

Try the Find Replace tool. Use your A dataset as the R input for the Find Replace. You'll need to concatenate the 12 fields in both datasets together to create a primary key, then find-and-replace on this key.

 

If this is just replicating a pre-existing Python loop: how fast does the Python code run? Have you considered just moving the Python script into Alteryx using the Python tool?

csmith11
11 - Bolide

Also just to clarify... how many input files do you have?

 

If you just have two datasets, A and B, are you loading dataset A into Alteryx multiple times?

 

A snapshot of the rest of your workflow would be very helpful to see the big picture here:

 

You can hide the annotations if needed so you don't have to redact each one:

csmith11_0-1639663910741.png

 

 

 

 

wonka1234
10 - Fireball

I have two input files, and I'm loading datasets A and B into Alteryx once. Dataset A is around 152 MB. I should also clarify that only around 430 records go into the join; I apply a filter after dataset A is loaded.

 

wonka1234_0-1639667043627.png

 

wonka1234_1-1639667088156.png

 

wonka1234_2-1639667141276.png

 

Hopefully this paints a clearer picture - it's still running and I don't want to stop it.

 

Thanks for your patience.

wonka1234
10 - Fireball

Not sure if this will help, but here is the Python code I am trying to replicate.

import pandas as pd

column_list = ['ColA', 'ColB', 'ColC', 'ColD']

final_df = pd.DataFrame()
for col in column_list:
    # Drop rows where this column is empty, then use it as the join key
    hr_new = hr.copy()
    hr_new.dropna(subset=[col], inplace=True)
    hr_new['ID'] = hr_new[col]

    merged_data = pd.merge(hr_new, dataframenotshown,
                           how='left', left_on='ID', right_on='IDtwo')

    # DataFrame.append was removed in pandas 2.0; concat is the replacement
    final_df = pd.concat([final_df, merged_data])

MegaDF = final_df.drop_duplicates()
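For comparison, the loop can also be collapsed into a single melt plus one merge. A hedged sketch using stand-in data (it assumes `hr` holds only the four columns; if it has more, pass them as `id_vars` to keep them):

```python
import pandas as pd

# Hypothetical stand-ins for the data frames used in the loop.
hr = pd.DataFrame({"ColA": ["a", None], "ColB": ["b", "c"],
                   "ColC": [None, "d"], "ColD": ["e", "f"]})
other = pd.DataFrame({"IDtwo": ["a", "b", "c", "d", "e", "f"],
                      "extra": [1, 2, 3, 4, 5, 6]})

# Melt the four columns into one ID column; dropping NaNs replaces the
# per-column dropna inside the loop.
long_hr = (hr.melt(value_vars=["ColA", "ColB", "ColC", "ColD"],
                   value_name="ID", ignore_index=False)
             .dropna(subset=["ID"]))

# One left merge instead of four, then dedupe as the loop's final step does.
merged = long_hr.merge(other, how="left", left_on="ID", right_on="IDtwo")
result = merged.drop_duplicates()

print(result)
```

This is the pandas analogue of the Transpose-then-single-Join suggestion earlier in the thread.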

 

csmith11
11 - Bolide

I think the Python helped... it confirms my mock-up of what I think is happening. I still might not be getting the entire picture, though. Mind sharing a better mock-up?

 

Dataset A

Data Set A

ID | Field1 | Field2 | Field3
A  | F1     | F2     | F3
B  | F1     | F2     | F3
C  | F1     | F2     | F3

 

 

Dataset B

 

Data Set B

ID | Field2
A  | Data1
B  | Data1
C  | Data1
A  | Data2
B  | Data2
C  | Data2

 

 

Merged

ID | Field2 | Unioned Data
A  | Data   | F1
B  | Data   | F1
C  | Data   | F1
A  | Data   | F1
B  | Data   | F1
C  | Data   | F1
A  | Data   | F2
B  | Data   | F2
C  | Data   | F2
A  | Data   | F2
B  | Data   | F2
C  | Data   | F2
A  | Data   | F3
B  | Data   | F3
C  | Data   | F3
A  | Data   | F3
B  | Data   | F3
C  | Data   | F3

 

 
