Parse white space
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hello
I have a set of data (see sample below). I want to parse the data so that there are 6 columns. the 5th and 6th column are either a number or whitespace depending on the position of digit in the string.
1000 All Goodwill ASSETS 20500000
1050 All Amortisation Goodwill ASSETS 20500000
I have the following expression, however I can't seem to add a capturing group in the 5th group to capture either the digit or additional white space, and then the 6th group capturing the digit or nothing.
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)\s*(\d*)
Any help would be much appreciated
Thanks
Solved! Go to Solution.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
i think this is not what he is looking for
as much i understood he is looking for this
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi - thanks for your response
As mentioned I need to parse the data and not use text to columns. The data above is only example and I can't use a space as a delimiter.
The parsing works because everything in between "All" and "(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)" is the column description and must be all in one column.
The result should be as follows (either Field 5 or Field 6 being a digit or Null (or empty / or otherwise))
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @FECL3
Here you go, try this Regex:
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)\s(\d*|\s*)(\s+\d*|\s*)
I've stuck a Data Cleansing tool on to remove the duplicate whitespace and you get this:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thanks @DavidSkaife - very useful however I fear I have oversimplified the data. When I add this to my workflow it doesn't work because there is more than one space character (\s) between the 5th and 6th capturing group.
The data actually looks like this and as you can see there is multiple white space characters between "Assets" and the value. However the number of whitespace characters differ (this number also differs across the rest of my data set so can't define a value) depending on whether I want it in the 5th or 6th capturing group.
To hopefully add some more context I have added a snip from Regex101. The yellow highlighted section should come through as the 5th capturing group, whilst the proceeding number should be in the 6th capturing group.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
can you share 5 line of un simplified data , so we can try new logic for same
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @FECL3
Ok try this:
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)\s(\d*|\s+)\s(\s+\d*|\s*)
However i suspect that if there is no structure as to the number of whitespace after ASSETS it's going to be tricky to find a Regex that works in one go.
For the 5th group, will it always be 8 characters/8 whitespace?
How do you know if the final value is in the 5th or 6th based on the structure?
As @Raj said sharing a subset of your data may be helpful for us to troubleshoot for you
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thank you for the suggestions. Please see example data below in the attached workflow.
Please note I have changed the regex to as the numbers have commas and full stops in.
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERSEQ|INCOME|EXPENSE|BROKER)\s*(\d{1,3},\d{1,3}.\d{2}|\d{1,3}.\d{2}|\d{1,3},\d{1,3},\d{1,3}.\d{2})
The 5th group will have a range of 3 to 11 characters in.
I know the final value is in the 6th structure because it sits within a different column in the source data.