
Alteryx Designer Desktop Discussions


Too many records after join (also after using Unique tool and deleting duplicated IDs)

Juliajagtman
5 - Atom

Here you see the simple workflow:

[screenshot: Juliajagtman_1-1654873188736.png]

I want to join two datasets. One of the datasets has 215,511 records.

[screenshot: Juliajagtman_2-1654873298865.png]

I want to integrate this dataset with another dataset based on the postcode. This other dataset has 489,653 records. 

[screenshot: Juliajagtman_3-1654873377229.png]

 

This bigger data file contains information that I want to integrate with the smaller dataset based on the postcode.

So I expect that the final dataset after the join contains (again) 215,511 records.

HOWEVER, whatever I try, it returns 227,035 records (11,524 records too many).

[screenshot: Juliajagtman_5-1654873553685.png]

 

Can someone help me with this? Thank you all so much in advance.

 

7 REPLIES
IraWatt
17 - Castor

Hey @Juliajagtman,

This is because the fields you're joining on are not unique. Add a Unique tool before the 215,511-record input and select the fields you're going to join on.

 

The reason for this is that what you are doing is a many-to-many join: SQL Many-to-Many Relationships - TruthSeekers
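Outside of Alteryx, the same effect is easy to reproduce, for example in pandas (the column names below are made up for illustration): when the join key is duplicated on both sides, every copy on the left matches every copy on the right.

    import pandas as pd

    left = pd.DataFrame({"postcode": ["1011AB", "1011AB", "2511CV"], "customer": ["A", "B", "C"]})
    right = pd.DataFrame({"postcode": ["1011AB", "1011AB", "2511CV"], "region": ["Noord", "Noord", "Zuid"]})

    joined = left.merge(right, on="postcode", how="inner")
    # 3 rows joined to 3 rows gives 5 rows: the duplicated postcode matches 2 x 2 = 4 times
    print(len(left), len(right), len(joined))  # 3 3 5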

Adrian_T
Alteryx Alumni (Retired)

Hi @Juliajagtman,

 

Adding on to what @IraWatt has mentioned, I believe the issue might also be duplicate postcodes in your larger dataset of 489,653 records. When a postcode appears more than once on both sides, the Join tool effectively performs a Cartesian join for that key, matching every row from one table to every matching row from the other table. To avoid duplicated output, you would therefore need to ensure that the postcodes in the larger dataset are unique - by using the Unique tool before the Join tool.
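As a rough sketch of that idea in pandas (not Alteryx itself, and with invented column names): deduplicate the lookup side on the join key first, and the output keeps the row count of the smaller table.

    import pandas as pd

    customers = pd.DataFrame({"postcode": ["1011AB", "2511CV", "3011AD"], "customer": ["A", "B", "C"]})
    lookup = pd.DataFrame({"postcode": ["1011AB", "1011AB", "2511CV", "3011AD"],
                           "region": ["Noord", "Noord", "Zuid", "West"]})

    # the Unique-tool step: keep one row per postcode in the larger (lookup) table
    lookup_unique = lookup.drop_duplicates(subset=["postcode"])

    enriched = customers.merge(lookup_unique, on="postcode", how="left")
    assert len(enriched) == len(customers)  # 3 rows in, 3 rows out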

Juliajagtman
5 - Atom

Thank you for your replies, Adrian and Ira!

As you can see here I tried your tip with the Unique tool:

[screenshot: Juliajagtman_0-1654952077495.png]

 

[screenshot: Juliajagtman_1-1654952109530.png]

And as you can see, I still end up with 226,102 records...

Which is still a big difference from the 215,511 records.

:(

mceleavey
17 - Castor

Hi @Juliajagtman ,

 

To follow up on what @IraWatt and @Adrian_T have said, this is due to the ID not being unique. When you are using the Unique tool, you need to identify the fields which should be included in determining whether a row is unique or not.

 

For example, in the following dataset I have three rows, but only two unique IDs:

 

[screenshot: mceleavey_0-1654952723133.png]

If I attach a Unique tool and configure as follows:

[screenshot: mceleavey_1-1654952750171.png]

I will not remove any "duplicates", because the rows are unique not at ID level but at ID-and-Date level. So the Unique tool here is looking at unique combinations of the selected fields.

[screenshot: mceleavey_3-1654952856148.png]

 

 

If I attach a different Unique tool and configure it as follows:

[screenshot: mceleavey_2-1654952807086.png]

I will now remove one "duplicate" row. This is because the Unique tool is only looking for unique ID values, and does not take into account the date:

[screenshot: mceleavey_4-1654952872050.png]

 

You need to be aware of the level of granularity of your data. If you think each row is at ID level but there are multiple instances of an ID, then that is not the case. You can either use the Unique tool, or the Summarize tool with a Group By, to get the data to the level of granularity you require.
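The same granularity point in a small pandas sketch (invented data, just to mirror the screenshots above): deduplicating on ID and Date keeps all three rows, while deduplicating on ID alone drops one.

    import pandas as pd

    df = pd.DataFrame({"ID": [1, 2, 2], "Date": ["2022-01-01", "2022-01-01", "2022-02-01"]})

    print(len(df.drop_duplicates(subset=["ID", "Date"])))  # 3 -> every ID/Date combination is unique
    print(len(df.drop_duplicates(subset=["ID"])))          # 2 -> only unique IDs kept; the second date for ID 2 is dropped

    # the Summarize-with-Group-By alternative: pick one row per ID explicitly
    print(df.groupby("ID", as_index=False)["Date"].max())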

If you can post your data, or a sample (remove any identifying columns like names), we can help you with this.

I hope this helps.

 

M.

 

 




twazi
5 - Atom

Thanks to all the contributors, this feedback was just what I needed today. Mine was a little different, though, and turned my 72k records into 32 million. Both files had blank values in the match column, but it was appropriate that the many blanks in the Left file stayed unmatched (the L and J outputs progress from there). So in my case, using a Unique tool on the Right file wasn't an option, because I'd still get a match on the one remaining blank. My solution was simpler: I just needed to ditch all the rows with a blank in the match column in my Right file, which is simply the cross-reference.
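For anyone hitting the same thing, here is a minimal pandas sketch of that fix (column names are made up): drop the blank keys from the cross-reference before the join, and the blanks on the left simply stay unmatched.

    import pandas as pd

    left = pd.DataFrame({"key": ["A", "", "", "B"], "value": [1, 2, 3, 4]})
    xref = pd.DataFrame({"key": ["A", "B", ""], "description": ["a", "b", "blank"]})

    # drop the blank match keys from the cross-reference (Right) file
    xref_clean = xref[xref["key"] != ""]

    joined = left.merge(xref_clean, on="key", how="inner")
    print(len(joined))  # 2 -> the blank rows on the left fall out of the J output, as intended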

Raj
16 - Nebula

Have you tried using two primary keys, if possible?

This should not happen with unique primary keys.
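A quick pandas sketch of that suggestion (invented columns): when no single column is unique on its own, joining on a composite key can keep the match one-to-one.

    import pandas as pd

    left = pd.DataFrame({"postcode": ["1011AB", "1011AB"], "house_no": [1, 2], "customer": ["A", "B"]})
    right = pd.DataFrame({"postcode": ["1011AB", "1011AB"], "house_no": [1, 2], "region": ["Noord", "Noord"]})

    # joining on postcode alone would fan out (2 x 2 matches); the composite key stays one-to-one
    joined = left.merge(right, on=["postcode", "house_no"], how="inner")
    assert len(joined) == len(left)  # 2 rows, no fan-out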

twazi
5 - Atom

Thanks Raj, good to know. In the smaller xref file it wasn't the primary key. I also realised afterwards that I could have just filtered out the rows with a blank value instead of removing them in the source xref file.
