Splitting data without delimiters, referring to other data in the same column
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I see today is is the day of asking for help for me :)
I have values in a column, which I want to split for further processing. Part of them is comma separated, but part of them not. Each value is a string, which can contain two words or more.
What I can assume is that for most cases where there are more than one value in a field, I can find other field in the same column where each of the values will be alone, or separated by a comma.
Initially I tried to split data into rows simply by comma, list all unique values for the column and filter out longest ones (base assumption: if there is no delimiter, then the field after the split should have the longest strings in it), and then use find-replace to add artificial delimiters to the data. But that doesn't work perfectly as sometimes one single value is longer than combination of two others.
As a workaround I do a split of the data as they are now, aggregate results, look what should be corrected and then put it manually on a source list for find-replace before the split and re-run the workflow.
But I look for something less manual, as the current way is not good for broader variety of data.
Below the sample how data are organized currently:
1 | Blue Carrot |
2 | Orange Carrot |
3 | Very Big and Delicious Green Potato, Orange Carrot |
4 | Very Big and Delicious Green Potato |
5 | Very Big and Delicious Green Potato Black Berry |
6 | Black Berry, Black Potato |
7 | Blue Carrot Very Big and Delicious Green Potato Black Berry |
8 | Black Berry Very Big and Delicious Green Potato, Blue Carrot |
9 | Black Berry Blue Berry |
And the output should look like that:
1 | Blue Carrot |
2 | Orange Carrot |
3 | Very Big and Delicious Green Potato |
3 | Orange Carrot |
4 | Very Big and Delicious Green Potato |
5 | Very Big and Delicious Green Potato |
5 | Black Berry |
6 | Black Berry |
6 | Black Potato |
7 | Blue Carrot |
7 | Very Big and Delicious Green Potato |
7 | Black Berry |
8 | Black Berry |
8 | Very Big and Delicious Green Potato |
8 | Blue Carrot |
9 | Black Berry |
9 | Blue Berry |
Solved! Go to Solution.
- Labels:
- Best Practices
- Parse
- Preparation
- Workflow
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Actually I crafted something similar, but didn't add space to the "find" field - which did the trick (mostly).
The problem is that in the data I have it not always adds commas to all values. There is one or two which this workflow avoid and I don't know why (visually they are identical so I have i.e. "Blue Orange" and "Blue Orange Red Berry" and the comma isn't added with find-replace), I don't know what may cause the problem, as when I put both into the excel and use "search" function, it doesn't have any issues with finding the string.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Could you update your question with an example that doesn't work perhaps?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Sure, I've just had to track down the issue first, to not send a list of hundred records 😅
It seems that the issue is when you have overlapping data in the source file for replacement.
It seems that if your original file was like that:
AAA, BBB, CCC, DDD |
AAA |
EEE, FFF |
AAA, BBB CCC DDD EEE, FFF |
BBB |
CCC |
FFF |
And after the split by comma you have a list (which is your source for replacement) like that:
AAA |
BBB |
CCC |
DDD |
EEE |
FFFF |
BBB CCC DDD EEE |
Then find/replace will not put commas in every block of "BBB CCC DDD EEE" which it will find
So for example "AAA BBB CCC DDD EEE FFF" will be replaced into "AAA,BBB CCC DDD EEE,FFF" instead of "AAA,BBB,CCC,DDD,EEE,FFF".
If you then split the data into rows and do next round of find/replace then it will clear the remaining part, but you never know how many additional rounds of such operation (split and then replace) you will need until your data will be finally cleared.
The file with the sample data showing the problem is below.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Seems that it is doing its job, can you guide me a little on how to create such macro?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Yeah, the problem with the original solution is that the Find Replace will always look for the longer string match first. An iterative solution wouldn't be necessary if we could just match the shortest items in the list.
So this iterative solution is doing the exact same thing as before, but doing it in a loop repeatedly looking for substrings (and adding that comma) until...
How does it know when to stop? If there is the same number of rows before and after the Text-to-Columns then no further splitting was done. So I make sure all the records go to the regular output (as opposed to iterative output) when [Rows Before] = [Rows After]. An iterative macro will stop iterating when no records are passed to the Iterative Output.
You can search the community for more general information on how Iterative Macros work and how to create them.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thanks, I did some research and created a macro which works for me.
In the meantime I realized that there is also another nuance to cover, but I didn't found a way to do it.
The thing is that sometimes I have two values which are different, but have the same core, i.e. "Blue Berry" and "Blue Berry Flavoured" If they are in the longer string, during the fist iteration they will be separated into two separate strings (as you said, Find/Replace takes the longer string into consideration).
In the second iteration the "Blue Berry Flavoured" will be split into "Blue Berry" and "Flavoured" which I would like to avoid.
I was thinking of some kind of cross check with the first split for the values which don't match (the "Flavoured" will not appear on the list) and then combining them wit the proper beginning (I assume that would be value in the row before the one where not matching value was found), but not sure how to craft it.
Not mentioning, that even such additional clearance will leave some data for manual correction...
But for now, the macro is doing most of the necessary job.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Excellent. Can you mark as a solution?
