Hi Everyone,
I am dealing with a huge file (more than 300,000 rows and more than 200 fields).
I need to filter the rows that have the same customer, same division and same year.
Here is a sample file with the 4 fields that need to be filtered.
Could anyone help me with it?
thank you
Esme
Use the Unique tool to find unique values on the Customer, Year, and Division fields. Then use the Duplicate output to rejoin to the original data on those same fields, deselecting the duplicate fields in the Join output. There were no duplicates in the dataset you provided, so I had to create one.
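For readers outside Alteryx, the same "keep only rows whose key combination occurs more than once" logic can be sketched in pandas. This is a minimal sketch with a made-up toy dataset; only the field names Customer, Division, and Year come from the thread.

```python
import pandas as pd

# Toy stand-in for the real file (which has 300,000+ rows and 200+ fields).
df = pd.DataFrame({
    "RecordID": [1, 2, 3, 4, 5],
    "Customer": ["A", "A", "B", "C", "C"],
    "Division": ["X", "X", "Y", "Z", "Z"],
    "Year":     [2020, 2020, 2021, 2022, 2022],
})

# keep=False marks EVERY row whose (Customer, Division, Year) combination
# appears more than once -- the rows the original poster wants to pull out.
dupes = df[df.duplicated(subset=["Customer", "Division", "Year"], keep=False)]
```

Here `dupes` keeps RecordIDs 1, 2, 4, and 5 (the two duplicated combinations) and drops RecordID 3, which is unique.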
Good evening Mathieu,
Thank you for your help.
One good thing I learned: I did not know I could use a file in both the left and right sides of a Join with the Filter tool! Good to know.
Unfortunately, though, the formula [recordID]!=[right_recordID] returns no matching IDs.
One more thing I forgot to mention: add a Group By on the Customer, Year, and Division fields between the Unique and Join tools before rejoining with the original data. This creates a list of the unique duplicated key combinations present in the data, in case there are more than two instances of the same set of values.
Hi NB,
Thank you for this additional comment. I do not understand what this Group By adds, though. Could you please explain?
Hello @Esmeralda_In_Paris 🙂
Can you provide a simple sample and tag the lines you want in the output?
That would help.
Thanks
If there are more than two instances of a given Customer/Year/Division combination in your dataset (i.e., more than one duplication), the additional Group By gives you one row per duplicated combination. Rejoining on that list yields a clean list of the record IDs that are duplicates. If you rejoin without the Group By, the join uses ALL of the duplicate rows against the original dataset, blowing the result up to roughly (N*(N+1)) rows. You can still collapse the result with another Group By on all fields afterwards, but slimming the data down before the join will make your workflow much faster. See attached.