Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Regex Tokenize | RegEx split

akshatkumar87
7 - Meteor

Hi all I have a simple problem. I have a text  column which has values like below

 

Text

-------------------------------

30% off HG Baselayer
2 for $30 Graphic T's
BOGO 50% Graphic T's
30% off Velocity 1/4 Zip
$19.99 ColdGear Baselayer Tops
40% off Mo'Money Short

40% off

Graphic T's

 

I need to split this text into something like

Column 1           Column 2

30% off              HG Baselayer

BOGO 50%        Graphic T's

2 for $30            Graphic T's

30% off               Velocity 1/4 Zip

$19.99                 ColdGear Baselayer Tops

40% off               null

null                      Graphic T's

 

The rule which i am looking to implement is

Split the text if you encounter 'off' or any $/d (number starting with dollar sign) also if none is there col 1 is null but still col 2 can be not null

 

Any help is much appreciated

 

Thank you

7 REPLIES 7
KaneG
Alteryx Alumni (Retired)

Hi @akshatkumar87,

 

You could try to write one big REGEX string for this, but it would be easier to split the lines up depending on the obvious differences. A huge advantage of Alteryx is being able to split data using the Filter tool and deal with it separately before bringing it back together.

 

So, that being said, the different categories I see are:

  • x%/$x off product
  • Data with no numbers
  • The rest

 

So, I would start with a couple of filters to separate the above into 3 datasets.

 

REGEX-Filter_for_Regex.png

 

then dealing with the first stream is via a Parse REGEX option (as you have 2 named ranges, so Tokenize won't work).

  • ([\d%$]+ off)(.*?)$

In the first named range, you are looking for any combination of digits, or symbols ($,%) followed by a space then the word off

In the second named range, you are looking for everything else before the end of the line (symbolised by the $)

 

The second stream will be anything that has a digit in it and as it's so inconsistent I would look for combinations of 3 or more numbers, symbols etc and then add a delimiter after that to split on

  • Replace: ([\%\$\d\.]{3,} ) with $1|

which will add a pipe after each combination that is found.

  • Everything in the square brackets are the search terms
  • The part in the squiggly brackets means 3 or more.
  • $1 means everything in the normal brackets; first marked group

The rest should be pretty simple to deal with

REGEX-Parsing_Deals.png

jdunkerley79
ACE Emeritus
ACE Emeritus

 

^(.*?(\$[0-9\.]+|[0-9\.]+%)( off)?)? ?(.*?)$

This looks for a number block starting with a $ or ending with a %. There can be text before (e.g. BOGO of 2 for) and there can be an ' off' afterwards. Breaking it down:

 

  • ^ - start of text
  • .*? - non greedy match of text at beginning 
  • (\$[0-9\.]+|[0-9\.]+%) - match number with either a $ at start or % at end
  • ( off)? - match off if present

You get a couple of extra matching blocks to delete bu can do this with a select

2018-10-23_08-01-58.png

 

Sample attached

PaulN
Alteryx Alumni (Retired)

Great answers @jdunkerley79 and @KaneG,

 

I would humbly suggest the following RegEx to avoid the select tool:

 

^(.*?(?:\$[.0-9]+|[.0-9]+%)(?: off)?)?\s?(.*?)$

 

It's completely borrowed from James' solution -so all credits go to him honestly- but it doesn't keep track of unneeded groups (through ?:) like 

(\$[.0-9]+|[.0-9]+%)

or 

 

( off)

meaning that only 2 columns are left as requested.

 

Thanks,

 

Paul Noirel

Sr Customer Support Engineer, Alteryx

jdunkerley79
ACE Emeritus
ACE Emeritus

@PaulN nice enhancement - I always forget about those!

KaneG
Alteryx Alumni (Retired)

Nice work to @jdunkerley79 & @PaulN, awesomely concise. The power of REGEX!!!

 

That gives a few options, whether you want it explained and easier for someone else to edit (taking the performance penalty) or whether you would like it nice, simple and efficient with minimal tools. Testament to the community and how there is multiple ways to solve every problem, but there is always a way in Alteryx.

MarqueeCrew
20 - Arcturus
20 - Arcturus

@KaneG + @jdunkerley79 + @PaulN,

 

Those were some consumate RegEx recommendations.  But I also feel that they were egregious in that they demand RegEx expertise.  The following single formula applies the rules necessary to parse the incoming text following these rules:

  • If you see "OFF", take the left most characters (including Off) as the first token and everything following as the second.
    • Otherwise, if you see a $ symbol then get everything up to the next space as the first token and everything following that space as the second.
      • Otherwise, if you see a % symbol then take the left most characters (including the %) as the first token and Trim anything following as the second.
        • Lastly make the whole text the second.

 

IF
	Contains([Text],"off")	THEN Trim(Replace([Text], "off", 'off|'))	ELSEIF
	Contains([Text],"$")	THEN //Find first space after $ symbol
		Left([Text],ToNumber(FindString(Substring([Text],FindString([Text], "$"))," "))+
		ToNumber(FindString([Text], "$"))) + "|" +
		Substring([Text],ToNumber(FindString(Substring([Text],FindString([Text], "$")),
			" "))+1)						ELSEIF
	Contains([Text],"% ")	THEN Replace([Text],"% ","%|")
Else    "|"+[Text]
ENDIF

Then you can split the Output using "|" as a delimiter.

 

Cheers,

 

Mark

 

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
akshatkumar87
7 - Meteor

Thank you so much for the solution it worked out pretty well except few exceptions where the text has stuff like

19.99 for Graphic T or $34.99/$45.99 for graphic T which i figured. All the solutions are right so thank you everyone

Labels