nav[aria-label="Primary Navigation"] { padding: 0; & ul { list-style: none; width: 100%; display: flex; flex-direction: row; justify-content: start; align-items: start; gap: 30px; padding: 0; & li { margin: 0; } & ul li { list-style: none; } } }

Data Cleanse Exercise for duplicate records - Replace Bad Data Columns with Clean Data

udles5666

Hi Community,

I am stuck with a data cleanse exercise where I need to replace data in few columns once I have identified that they are duplicate rows.

Please find the attached workflow for reference & below is what I am after.

For example: Sales Document 150502162202 & 6602165304 are duplicates in the dataset & have got multiple rows for same material and customer code. (Record ID 24-31 for example), I want the # or 0's to be replaced with where there is a value. Further only keep the clean data and remove the bad duplicates. Please refer to the screenshots attached and let me know if something is still not clear.

Note: I have around 1 million rows in my original data set.

Thanks a ton in advance.

Data Cleanse.yxzp

Output.PNG

How the data should change.PNG

Transformation

Preparation

Expression

Accepted answers

Qiu

@udles5666

Finally get something. but sure the plant ID and name are not included for unique ident?

Data Cleanse-r1.yxzp

clmc9601

Hi @udles5666,

Sorry about that. I wondered if the dynamic rename would cause problems (which it did). Thankfully it's an easy fix.

One quirk about crosstab is it replaces any special characters and spaces in field names with underscores, and it alphabetizes all the columns in the output. So in order to get the columns back to your original names, I used dynamic rename positionally by column. Positional by column is kind of a shortcut because it's not as dynamic and sensitive to changes to column order. I used it after glancing over your results and seeing that most of them matched. However, the double capitalization rearranged the sort order, which I didn't notice before.

I changed the settings in the dynamic rename, so it should work now! If there are other fields you want to appear at the beginning, you can add them to the formula tool where it creates a field called "Rank".

Data Cleanse Fixed Rename.yxzp

clmc9601

Hi @udles5666,

I'm glad this solution worked for you! I realized it could be even faster by adding a sample tool to the rename data stream as shown below. No big deal if you don't need it anymore-- just thought I'd mention it anyway!

Screen Shot 2021-02-04 at 9.31.40 AM.png

Data Cleanse.yxmd

All comments

clmc9601

Hi @udles5666,

I built out the rest of your workflow. I used a unique tool to determine the duplicates, but you could replace this with a filter based on your "duplicate" column from the multi-row formulas. Just depends on how you want to determine duplicate fields. The rest of the workflow then separates the duplicate rows, grabs the correct values, and then unions them back together. I hope it helps!

Data Cleanse.yxzp

Qiu

@udles5666

Finally get something. but sure the plant ID and name are not included for unique ident?

Data Cleanse-r1.yxzp

udles5666

Hi @clmc9601

Thanks a ton for that.

Really appreciate that but, the results don't match. I think the Sales Document field & SLA are mixed up.

Meaning : For Sales Doc : 6602165304, SLA should be 12 & for Sales Document 150502162202 SLA should be 4. But it is the other way around. Can you please look into this again?

Thanks,

Expected Result.PNG

Other way round.PNG

Data Cleanse-updated.yxzp

Quick Links

Popular Tags

This months top contributors

atcodedog05 19598

Qiu 15907

binu_acs 15783

MarqueeCrew 13709

apathetichell 13703