Hi all,
Seeking guidance once again on how best to go about extracting a piece of information, I'm thinking with RegEx.
Below are a couple of examples of the sorts of strings that will be formatted. I'm unable to share the real sample but it will contain tens of thousand of rows. Each important piece of information is between brackets with the word "and" being the separator between these pieces.
My aim is extract all of the information relating to Safety Rating. Information on Safety Rating can appear twice such as in points 1-4 below. So in example 1 below, I would want it to take out:
In some strings information on Safety Rating will only appear once, as in points 5 and 6 below. So in example 5 below, I would only want to extract:
(Safety Rating (Low, Multi, F/I, Latest) is greater than 149)
Important to note, the Safety Rating piece can appear in a different position in different strings, e.g. not the final bracketed piece of information as in example 1.
1. (Housing Type in (Flat)) and (Country Of Property in (GERMANY)) and (Purchase Currency in (Euro)) and (Safety Rating (Low, Multi, F/I, Latest) = 149 or higher) and (Safety Rating (Low, Multi, F/I, Latest) = 350 or lower) and (Days Until Review Date is less than 10950)
2. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than 150) and (Safety Rating (Low, Multi, F/I, Latest) = 349 or lower)
3. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) = 149 or higher) (Safety Rating (Low, Multi, F/I, Latest) is less than 350)
4. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than 149) and (Safety Rating (Low, Multi, F/I, Latest) is less than 350)
5. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than 149)
6. (Housing Type in (Flat)) and (Country Of Property in (GERMANY)) and (Purchase Currency in (Euro)) and (Safety Rating (Low, Multi, F/I, Latest) = 350 or lower) and (Days Until Review Date is less than 10950)
Any help with this is greatly appreciated, please let me know if I can add any more colour to the above. Thank you!
Solved! Go to Solution.
Hey there. Yes, Regex with output method = Tokenize is the right path.
Here's the expression I used: \(Safety Rating \(Low, Multi, F/I, Latest\) (?:is|=|<|>|<=|>=|in) [^)]+\)
If there's more/less of the string that you were intending to retain you can modify the regex expression as needed.
Here's a screenshot of the workflow (i've also attached it)
If this solved it for you, please mark this as solution!
Hello @Deano_280
This is an interesting problem but I believe regex can help.
I have used the following expression and set the regex tool to tokenize: (\(Safety Rating.*?\d.*?\)). This asks the tool to extract out all occourences of the expression \(Safety Rating.*?\d.*?\). Ive tried to explain how this expression works below:
I have attached the worklfow below to try and assist further.
Please let me know how you get on.
Regards - Pilsner
Hi Pilsner,
Thanks for the solution and the explanation too, it's nearly there. Apologies I should have specified that within the end part of the ratings piece where it will say something like = 149 or higher, in the real sample this will actually be capitalised letters, for example it might be = EEE+ or higher, or is greater than EEE+.
The middle bracketed section (Low, Multi, F/I, Latest) is redundant in this instance and will also vary between rows.
Is there also a way to keep the rest of the information within the rows instead of only keeping the ratings information? I guess what I would be looking for is something similar to the below.
(Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than EEE+) | (Safety Rating (Low, Multi, F/I, Latest) is greater than EEE+) | [Null] |
(Housing Type in (Flat)) and (Country Of Property in (GERMANY)) and (Purchase Currency in (Euro)) and (Safety Rating (Low, Multi, F/I, Latest) = EEE + or higher) and (Safety Rating (Low, Multi, F/I, Latest) = FFF- or lower) and (Days Until Review Date is less than 10950) | (Safety Rating (Low, Multi, F/I, Latest) = EEE+ or higher) | (Safety Rating (Low, Multi, F/I, Latest) = FFF- or lower) |
Thank you for the help so far!
Hello @Deano_280
No worries, thank you for clarifying the capitalisation in the brackets. (I have changed all the numbers to EEE for the dummy data used in the example below.)
In order for regex to work, there has to be some common character or pattern that we can identify. Based on your update, I have assumed that all of these safety ratings have 3 capital letters. With this in mind, we can simply swap the \d from the original regex with [A-Z]{3} (which asks for 3 capital letters). This makes the overall regex look like:
(\(Safety Rating.*?[A-Z]{3}.*?\))
Make sure to deselect the case insensitive option.
If you want the original text also in your output, I would recommend first pivoting the output from the regex tool using a Record ID tool, grouped per record ID (I have called it group ID) and a crosstab tool.
Next, you can join your pivoted table back to the original table based on record ID to get your final output.
I have attached the updated workflow below.
Please let me know how you get on.
Regards - Pilsner
Hi @Deano_280
If there is only ever going to be up to two instances of Safety Rating appearing in the string then using @Pilsner excellent solution you can switch it to 'Split to Rows' instead of columns, and the job can be done in one tool:
Thank you all for the help so far. I think it's nearly there, the actual rating is not always 3 characters. So for example it could be E, E+, E-, EE, EE+, EE-, or EEE.
Some of the rows are producing an output whereas some are nulling.
Please may you kindly advise how this would affect the expression?
Probably worth adding it won't always be the letter E either!
Hello @Deano_280
To be able to extract this information, we need to identify a pattern that is consistent among all the records you're interested in. If you are able to share what you believe the consistent pattern to be, then I'd be happy to try and help convert this into regex.
Based on your update, I have edited the regex to create the following "(\(Safety Rating.*?\(.*?\).*?[A-Z].*?\))"
I've updated my solution with this new regex and attached it below. I do really like @davidskaife's solution as its much more streamlined than mine, if you do know how many columns you will need I would recommend going with his suggestion.
Regards - Pilsner
It looks as though that has done it, I'll begin my testing now. The ratings which were only one character, e.g. E+, were the problem before but it appears they are now being extracted which is brilliant.
Thank you @Pilsner @davidskaife @jrlindem for your ideas and solutions. Certainly helping me understand RegEx more and more!
Have a good day.