Hi there
I want to match data between two data sources on a common field. My issue is that I have a large data set (200k-500k rows) that contains duplicate records which I would like to keep. Because of the volume, the Find Replace tool seems to have performance issues and keeps producing fluctuating record counts, while the Join tool creates a massive Cartesian join with millions of rows. I can use a Unique tool to reduce that back down to a normal size, but it significantly increases the runtime.
I would really like to use the Find Replace tool, but because of its inconsistent results I can only rely on the Join tool, which is a pain in the * to use because of the Cartesian join. Can anyone suggest what I could do in this situation?
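To illustrate what I mean, here is a minimal sketch in pandas (not Alteryx, just an analogy) of the blow-up I am seeing; the "CustomerID" field and the sample values are made up:

```python
# Sketch of the Cartesian-join behaviour: duplicate keys on BOTH sides
# of a join multiply into key-by-key Cartesian products.
import pandas as pd

# Hypothetical data: "CustomerID" is the common field; both sides hold duplicates.
left = pd.DataFrame({"CustomerID": [1, 1, 2], "Sale": [100, 150, 200]})
right = pd.DataFrame({"CustomerID": [1, 1, 2], "Region": ["North", "North", "South"]})

joined = pd.merge(left, right, on="CustomerID", how="inner")
# 5 rows: the two ID-1 rows on the left each match both ID-1 rows on the
# right (2 x 2 = 4), plus one ID-2 match.
print(len(joined))
```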
@Alteryxuserhere
This is the first time I have heard of a tool failing because of the amount of data; it has never happened to me before.
You say that you want to keep the duplicates, which is fine, but if you end up with even more duplication after the Join tool, it means you have duplicates on both sides, L and R.
I assume the main data is on the L side. Put a Unique tool before the R input so that side has unique entries; that way, once it is joined to L, you will not get additional duplication at the J output.
Example 1:
L: 2 similar rows
R: 1 row
Result at J: 2 rows
L1R1
L2R1

Example 2:
L: 2 similar rows
R: 2 similar rows
Result at J: 4 rows
L1R1
L1R2
L2R1
L2R2
So just ensure that you have one input with unique entries and you are good.
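Here is a minimal sketch of the same idea in pandas (again just an analogy for the Alteryx flow; the "CustomerID" field is made up): deduplicating the lookup (R) side on the join key before merging keeps the main (L) side's duplicates but stops them multiplying.

```python
import pandas as pd

# Hypothetical inputs: "CustomerID" is the common field.
left = pd.DataFrame({"CustomerID": [1, 1, 2], "Sale": [100, 150, 200]})    # L: keeps its duplicates
right = pd.DataFrame({"CustomerID": [1, 1, 2], "Region": ["North", "North", "South"]})

# Equivalent of putting a Unique tool on the R input: one row per key.
right_unique = right.drop_duplicates(subset="CustomerID")

joined = left.merge(right_unique, on="CustomerID", how="left")
# 3 rows: same count as L, duplicates preserved, no Cartesian blow-up.
print(len(joined))
```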