Parse white space
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hello
I have a set of data (see sample below). I want to parse the data so that there are 6 columns. the 5th and 6th column are either a number or whitespace depending on the position of digit in the string.
1000 All Goodwill ASSETS 20500000
1050 All Amortisation Goodwill ASSETS 20500000
I have the following expression, however I can't seem to add a capturing group in the 5th group to capture either the digit or additional white space, and then the 6th group capturing the digit or nothing.
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERS EQ|INCOME|EXPENSE|BROKER)\s*(\d*)
Any help would be much appreciated
Thanks
Solved! Go to Solution.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @FECL3
Which rows have the numerical data that should sit in field 6?
EDIT: Other than you knowing the data sits in field 6, is there any other way of knowing from the data? If there isn't it's going to be nigh on impossible to translate that into logic..
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
They will sit in rows 2,3,4,6 and 7
It's difficult as ASSETS, EXPENSE will generally be in the 5th capturing group, LIABILITIES, OWNERSEQ, and INCOME will be in the 6th capturing group. However this is not always the case and as sometimes they aren't (for instance row 2 and row 5), hence the desire to try capture them based on their position relative to the amount of whitespace in the data.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @FECL3
This is horrible code, but, try this:
(\d{4,6})\s*(All)\s(.*)(ASSETS|LIABILITY|OWNERSEQ|INCOME|EXPENSE|BROKER)\s{23,27}(\d{1,3},\d{1,3}.\d{2}|\d{1,3}.\d{2}|\d{1,3},\d{1,3},\d{1,3}.\d{2}|\s*)\s*(\d{1,3},\d{1,3}.\d{2}|\d{1,3}.\d{2}|\d{1,3},\d{1,3},\d{1,3}.\d{2}|\s*)
It seems to work on the sample you sent, but i've had to make some assuptions based on the number of whitespace you have after the ASSETS/LIABILITIES group.
You can define an upper and lower limits in Regex, to account for variations in whitespace. So what i did was find the number for each row using
REGEX_CountMatches([Field1], '\s')
which returns this:
I plug this into the Regex code you provided and replace the \s+ with \s{23,27} just before the numerical regex code to account for the min/max value before the first set of numerical data (these upper/lower values may need to be changed depending on your whole dataset). Subsequently i repeat the same batch of Regex to find the 6th field but this time i leave the separator as \s*
I also added in |\s* at the end of the numerical code as both field 5 or field 6 can be whitespace.
(\d{4,6})
\s*(All)
\s(.*)
(ASSETS|LIABILITY|OWNERSEQ|INCOME|EXPENSE|BROKER)
\s{23,27}(\d{1,3},\d{1,3}.\d{2}|\d{1,3}.\d{2}|\d{1,3},\d{1,3},\d{1,3}.\d{2}|\s*)
\s*(\d{1,3},\d{1,3}.\d{2}|\d{1,3}.\d{2}|\d{1,3},\d{1,3},\d{1,3}.\d{2}|\s*)
There is probably a way of making this a bit tidier...but you do appear to get the right output:
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thank you very much - a neat and useful solution. Thank you for the explanation too - I have had to edit your code slightly to fit the different permeations across the whole dataset - but it has worked well!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @FECL3
I'm glad it works, and i learnt something new about the upper/lower parameters so it's a win all round :)
- « Previous
-
- 1
- 2
- Next »