Can anyone explain what's going on here? I have a 24 GB file with 31M lines that I'm joining to a 15 MB file with 51k lines. You can see the stats on the join are blowing up.
Hi @Watermark,
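That's a common issue with a join when both the L and R inputs have duplicate records in the join field: every duplicate on one side matches every duplicate on the other, so the output multiplies. There are many posts in the community that address this: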
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Join-returns-too-many-records/td-p/308215
https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Why-My-Join-Is-Getting-More-Records-than-Expected/td-p/531159
The most common solution is either to put a Unique or Summarize tool before your join or to increase the number of fields you are joining on. Since you are working with that many records, I would also suggest exploring the Calgary tool palette. It indexes your database, and your workflow will run much faster.
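To illustrate why duplicates on both sides blow up the output, here is a minimal sketch in plain Python (not Alteryx). The sample URLs and the function names `predicted_join_rows` and `dedupe` are made up for illustration. For a join on a single field, each key contributes (count on the left) x (count on the right) output rows, and removing duplicates from one side first brings that back down.

```python
from collections import Counter

def predicted_join_rows(left_keys, right_keys):
    """Number of rows an inner join on a single key field would produce."""
    left_counts = Counter(left_keys)
    right_counts = Counter(right_keys)
    # Each key yields (occurrences on left) * (occurrences on right) rows.
    # Counter returns 0 for keys that are missing on the right.
    return sum(n * right_counts[key] for key, n in left_counts.items())

def dedupe(keys):
    """Keep the first occurrence of each key, preserving order."""
    return list(dict.fromkeys(keys))

# Hypothetical sample data: "a.com" is duplicated on both sides.
left = ["a.com", "a.com", "a.com", "b.com"]
right = ["a.com", "a.com", "b.com", "c.com"]

print(predicted_join_rows(left, right))          # 3*2 + 1*1 = 7
print(predicted_join_rows(left, dedupe(right)))  # 3*1 + 1*1 = 4
```

Running the same count on the real join field before joining should show which keys are responsible for the extra records.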
Hope that helps
Angelos
Angelos,
It's a simple CSV connecting to a spreadsheet. It only has one field to join on, and that's the URL. I'm going to go look at the two links you posted.
Hi @Watermark,
It is also worth mentioning that empty or null values in the join field will also create thousands of duplicates. So it is worth keeping that in mind every time you use the Join tool.
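As a rough way to check for this before joining, here is a small sketch in plain Python (not Alteryx). The sample keys and the helper names `is_blank`, `blank_key_report` and `drop_blank_keys` are hypothetical, and it assumes that blank keys on one side match blank keys on the other, as described above. It lumps null, empty and whitespace-only values together, so the product it reports is a worst case.

```python
def is_blank(key):
    """True for None, empty strings, and whitespace-only strings."""
    return key is None or str(key).strip() == ""

def blank_key_report(left_keys, right_keys):
    """Count blank keys per side and the worst-case rows they could add."""
    left_blank = sum(1 for k in left_keys if is_blank(k))
    right_blank = sum(1 for k in right_keys if is_blank(k))
    # Worst case: every blank on the left pairs with every blank on the right.
    return left_blank, right_blank, left_blank * right_blank

def drop_blank_keys(keys):
    """Remove blank keys so they cannot take part in the join."""
    return [k for k in keys if not is_blank(k)]

# Hypothetical sample data with nulls and empty values in the join field.
left = ["a.com", None, "", "  ", "b.com"]
right = [None, "", "a.com"]

print(blank_key_report(left, right))  # (3, 2, 6)
print(drop_blank_keys(left))          # ['a.com', 'b.com']
```

Filtering the blank keys out of one or both inputs before the join keeps them from pairing up.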
Yep, an enormous number of duplicates (not expected, lesson learned), as well as a hefty chunk of nulls (also not expected). Thanks for the help.