Alteryx Designer Desktop Discussions

udles5666 · ‎02-02-2021

Hi Community,

I am stuck with a data cleanse exercise where I need to replace data in few columns once I have identified that they are duplicate rows.

Please find the attached workflow for reference & below is what I am after.

For example: Sales Document 150502162202 & 6602165304 are duplicates in the dataset & have got multiple rows for same material and customer code. (Record ID 24-31 for example), I want the # or 0's to be replaced with where there is a value. Further only keep the clean data and remove the bad duplicates. Please refer to the screenshots attached and let me know if something is still not clear.

Note: I have around 1 million rows in my original data set.

Thanks a ton in advance.

clmc9601 · ‎02-03-2021

Hi @udles5666,

I built out the rest of your workflow. I used a unique tool to determine the duplicates, but you could replace this with a filter based on your "duplicate" column from the multi-row formulas. Just depends on how you want to determine duplicate fields. The rest of the workflow then separates the duplicate rows, grabs the correct values, and then unions them back together. I hope it helps!

Qiu · ‎02-03-2021

@udles5666

Finally get something. but sure the plant ID and name are not included for unique ident?

udles5666 · ‎02-03-2021

Hi @clmc9601

Thanks a ton for that.

Really appreciate that but, the results don't match. I think the Sales Document field & SLA are mixed up.

Meaning : For Sales Doc : 6602165304, SLA should be 12 & for Sales Document 150502162202 SLA should be 4. But it is the other way around. Can you please look into this again?

Thanks,

clmc9601 · ‎02-03-2021

Hi @udles5666,

Sorry about that. I wondered if the dynamic rename would cause problems (which it did). Thankfully it's an easy fix.

One quirk about crosstab is it replaces any special characters and spaces in field names with underscores, and it alphabetizes all the columns in the output. So in order to get the columns back to your original names, I used dynamic rename positionally by column. Positional by column is kind of a shortcut because it's not as dynamic and sensitive to changes to column order. I used it after glancing over your results and seeing that most of them matched. However, the double capitalization rearranged the sort order, which I didn't notice before.

I changed the settings in the dynamic rename, so it should work now! If there are other fields you want to appear at the beginning, you can add them to the formula tool where it creates a field called "Rank".

udles5666 · ‎02-03-2021

Hi @clmc9601

Please don't apologize as its because of people like yourself this is the best community/forum we have. Any effort anyone puts in is very much appreciated.

Your solution is bang on and worked with my original data set as well.

Thanks once again for your help. Much appreciated.

udles5666 · ‎02-03-2021

Hi @Qiu

Thanks heaps for looking into this. I have marked your solution correct as well however, @clmc9601 's solution is much efficient (as in faster to perform with 1 mill rows).

Also, Plant name/number is fine as it is. Thanks heaps once again for helping out.

Qiu · ‎02-03-2021

@udles5666

Thank you every much!😁

clmc9601 · ‎02-04-2021

Hi @udles5666,

I'm glad this solution worked for you! I realized it could be even faster by adding a sample tool to the rename data stream as shown below. No big deal if you don't need it anymore-- just thought I'd mention it anyway!

Alteryx Designer Desktop Discussions

Data Cleanse Exercise for duplicate records - Replace Bad Data Columns with Clean Data