Here you see the simple workflow:
I want to join two datasets. One of the datasets has 215,511 records.
I want to integrate this dataset with another dataset based on the postcode. This other dataset has 489,653 records.
The bigger file contains the information that I want to integrate with the smaller dataset based on the postcode.
So, I expect that the final dataset after the join contains (again) 215,511 records.
HOWEVER, whatever I try, it returns 227,035 records (11,524 records too many).
Can someone help me with this? Thank you all so much in advance.
Hey @Juliajagtman,
This is because the fields you're joining on are not unique. Add a Unique tool before the 215,511-record input and select the fields you're going to join on.
The reason for this is that what you are doing is a many-to-many join: SQL Many-to-Many Relationships - TruthSeekers
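If it helps to see the mechanics outside Alteryx, here is a minimal pandas sketch of the same effect; the column names and values are made up, but it shows how a duplicated join key multiplies the output rows:

import pandas as pd

# One left-hand row, but the same postcode appears twice on the right,
# so the join returns two rows instead of one.
left = pd.DataFrame({"postcode": ["1234AB"], "customer": ["A"]})
right = pd.DataFrame({"postcode": ["1234AB", "1234AB"],
                      "region": ["North", "North-East"]})

joined = left.merge(right, on="postcode", how="inner")
print(len(joined))  # 2 rows, not 1: each duplicate key on the right adds an extra output row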
Hi @Juliajagtman,
Adding on to what @IraWatt has mentioned, I believe the issue might also be duplicate postcodes in your larger dataset of 489,653 records. When the join key is not unique, the Join Tool matches every row from one table against every matching row from the other, effectively a Cartesian product within each postcode. To avoid duplicated output, you would therefore need to ensure that the rows in the larger dataset are unique on the postcode, by using the Unique Tool before the Join Tool.
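To make that concrete (purely as an illustration, not your exact workflow), deduplicating the lookup side before the join could look like this in pandas, with hypothetical column names:

import pandas as pd

# Hypothetical lookup table where a postcode appears more than once.
lookup = pd.DataFrame({"postcode": ["1234AB", "1234AB", "5678CD"],
                       "region":   ["North", "North", "South"]})

# Equivalent of putting a Unique Tool keyed on postcode before the Join Tool:
deduped = lookup.drop_duplicates(subset=["postcode"])
# A join on postcode can now return at most one match per input row.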
Thank you for your responses, Adrian and Ira!
As you can see here I tried your tip with the Unique tool:
And as you can see, I still end up with 226,102 records...
Which is still a big difference from the 215,511 records.
:(
Hi @Juliajagtman ,
To follow up on what @IraWatt and @Adrian_T have said, this is due to the ID not being unique. When you are using the Unique tool, you need to identify which fields should be included when determining whether a row is unique or not.
For example, in the following dataset I have three ID rows, but only two are unique:
If I attach a Unique tool and configure as follows:
I will not remove any "duplicates" as the rows are not at ID level, but ID and Date level. So, the Unique tool here is creating unique combinations of the selected fields.
If I attach a different Unique tool and configure it as follows:
I will now remove one "duplicate" row. This is because the Unique tool is only looking for unique ID values, and does not take into account the date:
You need to be aware of the level of granularity of your data. If you think each row is at ID level but there are multiple rows per ID, then that assumption does not hold. You can either use the Unique tool, or the Sum tool with a Group By function, to get the data to the level of granularity you require.
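If it helps to see the same idea outside Alteryx, here is a small pandas sketch with made-up data; it mirrors the two Unique configurations above:

import pandas as pd

# Three rows, two distinct IDs: the data is at ID + Date level, not ID level.
df = pd.DataFrame({"ID":   [1, 1, 2],
                   "Date": ["2022-01-01", "2022-02-01", "2022-01-01"]})

# "Unique" on ID and Date together: nothing is removed, every combination is already unique.
print(len(df.drop_duplicates(subset=["ID", "Date"])))  # 3

# "Unique" on ID only: the date is ignored, so one of the duplicate IDs is dropped.
print(len(df.drop_duplicates(subset=["ID"])))          # 2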
If you can post your data, or a sample (remove any identifying columns like names), we can help you with this.
I hope this helps.
M.
Thanks to all the contributors, this feedback was just what I needed today. Mine was a little different though, and turned my 72k records into 32 million. Both files had blank values in the match column, but it was appropriate that the many blanks in the Left file stayed unmatched (L and J Progress from here). So in my case, using a Unique tool on the Right file wasn't an option, because I'd still get a match on the one remaining blank. My solution was simpler: I just needed to ditch all the rows with a blank in the match column in my Right file, which is simply the cross reference.
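For what it's worth, a rough pandas equivalent of that fix (hypothetical column names, just to show the idea) would be to drop the blank keys from the Right file before joining:

import pandas as pd

# Hypothetical cross-reference with blank values in the match column.
xref = pd.DataFrame({"postcode": ["1234AB", "", "5678CD", ""],
                     "info":     ["a", "b", "c", "d"]})

# Equivalent of filtering out blank match values before the join:
xref_clean = xref[xref["postcode"] != ""]
# Left rows with a blank postcode now simply fall out unmatched, as intended.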
Have you tried joining on two primary keys, if possible?
This should not happen with unique primary keys.
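As a hypothetical sketch of that suggestion (the column names are made up), a join on two key fields in pandas would look like:

import pandas as pd

# Joining on a composite key: both fields must match for rows to pair up.
left  = pd.DataFrame({"postcode": ["1234AB"], "house_no": [10], "customer": ["A"]})
right = pd.DataFrame({"postcode": ["1234AB"], "house_no": [10], "region": ["North"]})

joined = left.merge(right, on=["postcode", "house_no"], how="inner")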
Thanks Raj, good to know. In the smaller xref file it wasn't the primary key. I also realised afterwards that I could have just filtered out the rows with blank values instead of removing them from the source xref file.