Inconsistent data format causing Text to Column anomalies
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi Team,
Need your help
I have data coming in via different externals system and majority (90%) the format of the "External system ID" is in the following format "ABC_123456_Mason"
Consequently i have used the Text to Column functionality to split the data into 3 columns anytime it find an "_" (See below SC)
External System | External ID | Column A | Column B | Column C |
Mason1 | ABC_123456_Morgan | ABC | 123456 | Mason |
Mason2 | AVI_600291_CV15NR | AVI | 600291 | CV15NR |
However there is one external system "Mason3" which does it a bit differently and provides data into the following format "AAA_123456Morgan". Which means anything after the first "_" is bundled into column B
External System | External ID | Column A | Column B | Column C |
Mason1 | ABC_123456_Morgan | ABC | 123456 | Mason |
Mason2 | AVI_600291_CV15NR | AVI | 600291 | CV15NR |
Mason3 | MAS_461598TA44NG | SKA | 461598TA44NG |
Just for an external system "Mason3" is there a way to extract the first 6chars after the "_" and put into column b, any other digits after the 6 digits go into column c ?
Expected Outcome
Primary System | External ID | Column A | Column B | Column C |
Mason1 | ABC_123456_Morgan | ABC | 123456 | Mason |
Mason2 | AVI_600291_CV15NR | AVI | 600291 | CV15NR |
Mason3 | MAS_461598TA44NG | SKA | 461598 | TA44NG |
Looking forward to your help
Many thanks
- Labels:
- Common Use Cases
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @Masond3
From this output extracted, I would recommend you to do the formula tool.
Save the output in Column B
If length is >6 for Column B
then Left(column b,6)
else column B
Save the output in Column C
If length is >6 for Column B
then right(column b,6)
else column C
External System | External ID | Column A | Column B | Column C |
Mason1 | ABC_123456_Morgan | ABC | 123456 | Mason |
Mason2 | AVI_600291_CV15NR | AVI | 600291 | CV15NR |
Mason3 | MAS_461598TA44NG | SKA | 461598TA44NG |
Many thanks
Shanker V
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@Felipe_Ribeir0 So do you think its best to do the text to columns as first step and then apply the formula solution afterwards ?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I suggest to Split in to columns and then apply the formula.
As for huge dataset, it might be Mason#, mason4 ...5 etc.
Many thanks
Shanker V
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @Masond3
Fair question!
Thinking better, it seems like a good idea to add the necessary _ to the column and use the text to columns after it. I personally would go with this idea:
If you need to keep the External ID as it is, you can use the formula to create a new External ID with the _ just to parse it correctly.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @Masond3
You can use the RegEx tool here in parse mode:
The following RegEx will parse 3 columns, A with everything before the first underscore, B with all the digits after that, but before the next underscore (if there), and C with everything after the final underscore (if there) or first non-digit to the end if no underscore
(.+?)_(\d+)_?(.+)
Hope that helps,
Ollie
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Me personally, I would use @OllieClarke design pattern as it 1 requires less work to create but it's also more robust.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Generally the regex path leads to a higher computational cost than solutions using simple components, when the second case is possible. With the Enable Performance Profiling it is possible to see that. Ofc this will not be a problem considering small datasets, but i think that this is something to keep in mind.
