Hello
I have a set of data (see sample below). I want to parse the data so that there are 6 columns. the 5th and 6th column are either a number or whitespace depending on the position of digit in the string.
1000 All Goodwill ASSETS 20500000
1050 All Amortisation Goodwill ASSETS 20500000
I have the following expression, however I can't seem to add a capturing group in the 5th group to capture either the digit or additional white space, and then the 6th group capturing the digit or nothing.
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)\s*(\d*)
Any help would be much appreciated
Thanks
Solved! Go to Solution.
i think this is not what he is looking for
as much i understood he is looking for this
Hi - thanks for your response
As mentioned I need to parse the data and not use text to columns. The data above is only example and I can't use a space as a delimiter.
The parsing works because everything in between "All" and "(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)" is the column description and must be all in one column.
The result should be as follows (either Field 5 or Field 6 being a digit or Null (or empty / or otherwise))
Hi @FECL3
Here you go, try this Regex:
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)\s(\d*|\s*)(\s+\d*|\s*)
I've stuck a Data Cleansing tool on to remove the duplicate whitespace and you get this:
Thanks @DavidSkaife - very useful however I fear I have oversimplified the data. When I add this to my workflow it doesn't work because there is more than one space character (\s) between the 5th and 6th capturing group.
The data actually looks like this and as you can see there is multiple white space characters between "Assets" and the value. However the number of whitespace characters differ (this number also differs across the rest of my data set so can't define a value) depending on whether I want it in the 5th or 6th capturing group.
To hopefully add some more context I have added a snip from Regex101. The yellow highlighted section should come through as the 5th capturing group, whilst the proceeding number should be in the 6th capturing group.
can you share 5 line of un simplified data , so we can try new logic for same
Hi @FECL3
Ok try this:
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)\s(\d*|\s+)\s(\s+\d*|\s*)
However i suspect that if there is no structure as to the number of whitespace after ASSETS it's going to be tricky to find a Regex that works in one go.
For the 5th group, will it always be 8 characters/8 whitespace?
How do you know if the final value is in the 5th or 6th based on the structure?
As @Raj said sharing a subset of your data may be helpful for us to troubleshoot for you
Thank you for the suggestions. Please see example data below in the attached workflow.
Please note I have changed the regex to as the numbers have commas and full stops in.
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERSEQ|INCOME|EXPENSE|BROKER)\s*(\d{1,3},\d{1,3}.\d{2}|\d{1,3}.\d{2}|\d{1,3},\d{1,3},\d{1,3}.\d{2})
The 5th group will have a range of 3 to 11 characters in.
I know the final value is in the 6th structure because it sits within a different column in the source data.