Hi Everyone,
I have some pretty messed up data which I need to map to a master look up. I have attached a file that holds what the raw data would look like, an example of the look up and what the required output would be.
A screenshot of the desired output below. As you can see, we are getting some really crappy Raw Codes and Descriptions!!! I dont really know where to begin to get this data cleansed.
Can anyone offer any starting points or sample workflows? I would be eternally grateful...
Many many thanks!!
Solved! Go to Solution.
Not so hard with pattern matching (and a little practice/experience). I (personally) didn't start using Regular Expressions until I became active in the community. Once I did start, they changed my approach to strings. Watch tutorials on RegEx and soak it in!
I've attached the workflow and will review the steps with their results:
IIF(IsEmpty([RAW CODE]),GetWord([RAW Desc], 0),[RAW CODE])Basically, if the code is empty then take the first word out of the description. Follow this with an update to the description to take the first word out if it is what was found for the code.
IIF([RAW CODE]=GetWord([RAW Desc], 0),replace([RAW Desc], [RAW CODE]+' ',''),[RAW Desc])
REGEX_Replace([RAW CODE], "(\d{3})(\d{3})(\d{7})", '$1-$2-$3')Format numeric (13 bytes) with dashes
REGEX_Replace([RAW CODE], "[[:punct:]]", '')
Based upon the sample data this is the workflow necessary. If you end up with more "unjoined-left" data, you'll need additional rules constructed or you'll have a small set of outliers to manually correct.
Cheers,
Mark
Hi @fiorano
The attached workflow is, I believe, successfully matching all 24 records in your sample table. However, a lot of what @MarqueeCrew solution includes are some excellent practices to pre-process/cleanse your values in preparation for matching, especially when you run this through a much larger table.
Hi,
thanks @MarqueeCrew and @jrgo !!!!! Both are amazing solutions which look to cover almost all data quality issues that we are facing at the moment.
Huge thanks to both of you!
Hi @MarqueeCrew,
Just a quick query on one of your expressions if I may.
Can you explain what
REGEX_Replace([RAW CODE], "(\d{3})(\d{3})(\d{7})", '$1-$2-$3')
is doing with an example?
Many thanks!
Fiorano
User | Count |
---|---|
17 | |
15 | |
15 | |
8 | |
5 |