Can anyone explain what's going on here? I have a 24 GB file with 31M lines joining to a 15 MB file with 51k lines. You can see from the stats that the join is blowing up.
Hi @Watermark ,
That's a common issue with a join when both the L and R inputs have duplicate records. There are many posts in the community that address this.
The most common solution is either to put a Unique/Summarize tool before your join or to increase the number of fields you are joining on. Also, if you regularly work with that many records, I'd suggest exploring the Calgary tool palette: it indexes your database and your workflow will run much faster.
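If it helps to see the arithmetic outside of Alteryx, here's a minimal pandas sketch (column names invented) of how duplicate keys multiply rows, and how deduplicating one side before the join fixes it:

```python
import pandas as pd

# Left input: 4 rows sharing one URL; right input: 3 rows sharing the same URL.
left = pd.DataFrame({"URL": ["a.com"] * 4, "left_val": range(4)})
right = pd.DataFrame({"URL": ["a.com"] * 3, "right_val": range(3)})

# An inner join matches every left row with every right row that shares
# the key, so 4 x 3 = 12 output rows -- the "blow up".
joined = left.merge(right, on="URL")
print(len(joined))  # 12

# Deduplicating one side before the join (the Unique/Summarize step)
# restores one output row per left row: 4 x 1 = 4.
fixed = left.merge(right.drop_duplicates("URL"), on="URL")
print(len(fixed))  # 4
```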
Hope that helps
Angelos
Angelos,
It's a simple CSV joining to a spreadsheet, and there's only one field to join on: the URL. I'm going to go look at the two links you posted.
Hi @Watermark,
It's also worth mentioning that empty or null columns will create thousands of duplicates too, so keep that in mind whenever you use the Join tool.
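Here's a small pandas sketch of that effect (values invented): blank keys match every other blank key, so they multiply just like real duplicates:

```python
import pandas as pd

# 1,000 left rows and 50 right rows all carry an empty-string URL.
left = pd.DataFrame({"URL": [""] * 1000, "left_val": range(1000)})
right = pd.DataFrame({"URL": [""] * 50, "right_val": range(50)})

# Every blank key matches every other blank key: 1,000 x 50 = 50,000
# output rows from records that arguably should not join at all.
joined = left.merge(right, on="URL")
print(len(joined))  # 50000

# Filtering out blank keys before the join avoids the blow-up.
clean = left[left["URL"].str.len() > 0].merge(
    right[right["URL"].str.len() > 0], on="URL"
)
print(len(clean))  # 0
```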
Are you expecting there to be only one row per URL?
If that's not the case, then you may need to investigate the data to understand which other data elements are causing the URLs to appear on multiple rows. Perhaps a filter needs to be applied to the data, or you can pare down the number of columns and follow @AngelosPachis's suggestion of using the Summarize tool to remove duplicates (a quick sketch of that check is below).
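As a hypothetical sketch of that investigation step (data invented; in Alteryx you'd do this with a Summarize tool grouping by URL with a Count, then a Filter on Count > 1):

```python
import pandas as pd

# Stand-in for the smaller spreadsheet side (values invented).
lookup = pd.DataFrame({
    "URL": ["a.com", "a.com", "b.com", None, None, "c.com"],
    "category": ["x", "y", "x", "z", "z", "y"],
})

# Rows per URL, nulls included -- anything above 1 will multiply
# matching rows in the join.
counts = lookup["URL"].value_counts(dropna=False)
print(counts[counts > 1])  # a.com: 2, NaN: 2

# Drop blank/null keys, then keep one row per URL (the Unique-tool step).
deduped = lookup.dropna(subset=["URL"]).drop_duplicates(subset="URL")
print(deduped)
```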
Yep, an enormous number of duplicates (not expected, lesson learned), as well as a hefty chunk of nulls (also not expected). Thanks for the help.