We are celebrating the 10-year anniversary of the Alteryx Community! Learn more and join in on the fun here.
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Extracting info between a set of brackets, where the string contains multiple brackets.

Deano_280
7 - Meteor

Hi all,

 

Seeking guidance once again on how best to go about extracting a piece of information, I'm thinking with RegEx.

 

Below are a couple of examples of the sorts of strings that will be formatted. I'm unable to share the real sample but it will contain tens of thousand of rows. Each important piece of information is between brackets with the word "and" being the separator between these pieces.

 

My aim is extract all of the information relating to Safety Rating. Information on Safety Rating can appear twice such as in points 1-4 below. So in example 1 below, I would want it to take out:

 

  • (Safety Rating (Low, Multi, F/I, Latest) = 149 or higher)
  • (Safety Rating (Low, Multi, F/I, Latest) = 350 or lower)

In some strings information on Safety Rating will only appear once, as in points 5 and 6 below. So in example 5 below, I would only want to extract:

  • (Safety Rating (Low, Multi, F/I, Latest) is greater than 149)

 

Important to note, the Safety Rating piece can appear in a different position in different strings, e.g. not the final bracketed piece of information as in example 1.

 

1. (Housing Type in (Flat)) and (Country Of Property in (GERMANY)) and (Purchase Currency in (Euro)) and (Safety Rating (Low, Multi, F/I, Latest) = 149 or higher) and (Safety Rating (Low, Multi, F/I, Latest) = 350 or lower) and (Days Until Review Date is less than 10950)


2. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than 150) and (Safety Rating (Low, Multi, F/I, Latest) = 349 or lower)


3. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) = 149 or higher) (Safety Rating (Low, Multi, F/I, Latest) is less than 350)


4. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than 149) and (Safety Rating (Low, Multi, F/I, Latest) is less than 350)

 

5. (Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than 149)


6. (Housing Type in (Flat)) and (Country Of Property in (GERMANY)) and (Purchase Currency in (Euro)) and (Safety Rating (Low, Multi, F/I, Latest) = 350 or lower) and (Days Until Review Date is less than 10950)

 

 

Any help with this is greatly appreciated, please let me know if I can add any more colour to the above. Thank you!

11 REPLIES 11
jrlindem
11 - Bolide

Hey there.  Yes, Regex with output method = Tokenize is the right path.

 

Here's the expression I used:   \(Safety Rating \(Low, Multi, F/I, Latest\) (?:is|=|<|>|<=|>=|in) [^)]+\)

 

If there's more/less of the string that you were intending to retain you can modify the regex expression as needed.

 

Here's a screenshot of the workflow (i've also attached it)

 

jrlindem_0-1753292378925.png

 

If this solved it for you, please mark this as solution!

 

Pilsner
13 - Pulsar

Hello @Deano_280 

This is an interesting problem but I believe regex can help.

I have used the following expression and set the regex tool to tokenize: (\(Safety Rating.*?\d.*?\)). This asks the tool to extract out all occourences of the expression \(Safety Rating.*?\d.*?\). Ive tried to explain how this expression works below:

 

  • First, I have specified that it must start by opening the brackets "\("
  • Then there must be the words Safety Rating followed by some charactors. "Safety Rating.*?"
  • After this the expression requires a number "\d"
  • Before finishing with either a bracket imediatly, or some more charactors, followed by a bracket ".*?\)"

1.png

I have attached the worklfow below to try and assist further. 

Please let me know how you get on.

Regards - Pilsner

Deano_280
7 - Meteor

Hi Pilsner,

 

Thanks for the solution and the explanation too, it's nearly there. Apologies I should have specified that within the end part of the ratings piece where it will say something like = 149 or higher, in the real sample this will actually be capitalised letters, for example it might be = EEE+ or higher, or is greater than EEE+.

 

The middle bracketed section (Low, Multi, F/I, Latest) is redundant in this instance and will also vary between rows.

 

Is there also a way to keep the rest of the information within the rows instead of only keeping the ratings information? I guess what I would be looking for is something similar to the below.

 

(Housing Type in (House)) and (Country Of Property in (UK)) and (Purchase Currency in (Sterling)) and (Days Until Review Date = 7300 or lower) and (Safety Rating (Low, Multi, F/I, Latest) is greater than EEE+)

(Safety Rating (Low, Multi, F/I, Latest) is greater than EEE+)[Null]
(Housing Type in (Flat)) and (Country Of Property in (GERMANY)) and (Purchase Currency in (Euro)) and (Safety Rating (Low, Multi, F/I, Latest) = EEE + or higher) and (Safety Rating (Low, Multi, F/I, Latest) = FFF- or lower) and (Days Until Review Date is less than 10950)(Safety Rating (Low, Multi, F/I, Latest) = EEE+ or higher)(Safety Rating (Low, Multi, F/I, Latest) = FFF- or lower)

 

Thank you for the help so far!

Pilsner
13 - Pulsar

Hello @Deano_280 

No worries, thank you for clarifying the capitalisation in the brackets. (I have changed all the numbers to EEE for the dummy data used in the example below.)

In order for regex to work, there has to be some common character or pattern that we can identify. Based on your update, I have assumed that all of these safety ratings have 3 capital letters. With this in mind, we can simply swap the \d from the original regex with [A-Z]{3} (which asks for 3 capital letters). This makes the overall regex look like:

(\(Safety Rating.*?[A-Z]{3}.*?\))

Make sure to deselect the case insensitive option. 

1.png



If you want the original text also in your output, I would recommend first pivoting the output from the regex tool using a Record ID tool, grouped per record ID (I have called it group ID) and a crosstab tool.

2.png




Next, you can join your pivoted table back to the original table based on record ID to get your final output. 

3.png

  



I have attached the updated workflow below.

Please let me know how you get on.

Regards - Pilsner

davidskaife
14 - Magnetar

Hi @Deano_280 

 

If there is only ever going to be up to two instances of Safety Rating appearing in the string then using @Pilsner excellent solution you can switch it to 'Split to Rows' instead of columns, and the job can be done in one tool:

 

DavidSkaife_0-1753347987017.png

 

Deano_280
7 - Meteor

Thank you all for the help so far. I think it's nearly there, the actual rating is not always 3 characters. So for example it could be E, E+, E-, EE, EE+, EE-, or EEE.

 

Some of the rows are producing an output whereas some are nulling.

 

Please may you kindly advise how this would affect the expression? 

Deano_280
7 - Meteor

Probably worth adding it won't always be the letter E either!

Pilsner
13 - Pulsar

Hello @Deano_280 

To be able to extract this information, we need to identify a pattern that is consistent among all the records you're interested in. If you are able to share what you believe the consistent pattern to be, then I'd be happy to try and help convert this into regex.

Based on your update, I have edited the regex to create the following "(\(Safety Rating.*?\(.*?\).*?[A-Z].*?\))"

 

  • This first looks for "(Safety Rating"  using "\(Safety Rating"
  • Followed by another bracket "(" using ".*?("
  • Then some more unknown characters using ".*?"
  • Another bracket ")" using "\)"
  • Then some more text which contains a capital letter, before the final bracket e.g "EE+ )" using ".*?[A-Z].*?\)"
    4.png

I've updated my solution with this new regex and attached it below. I do really like @davidskaife's solution as its much more streamlined than mine, if you do know how many columns you will need I would recommend going with his suggestion.

Regards - Pilsner

Deano_280
7 - Meteor

It looks as though that has done it, I'll begin my testing now. The ratings which were only one character, e.g. E+, were the problem before but it appears they are now being extracted which is brilliant.

 

Thank you @Pilsner @davidskaife @jrlindem for your ideas and solutions. Certainly helping me understand RegEx more and more!

 

Have a good day.

Labels
Top Solution Authors