Regex Tokenize | RegEx split
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi all I have a simple problem. I have a text column which has values like below
Text
-------------------------------
30% off HG Baselayer
2 for $30 Graphic T's
BOGO 50% Graphic T's
30% off Velocity 1/4 Zip
$19.99 ColdGear Baselayer Tops
40% off Mo'Money Short
40% off
Graphic T's
I need to split this text into something like
Column 1 Column 2
30% off HG Baselayer
BOGO 50% Graphic T's
2 for $30 Graphic T's
30% off Velocity 1/4 Zip
$19.99 ColdGear Baselayer Tops
40% off null
null Graphic T's
The rule which i am looking to implement is
Split the text if you encounter 'off' or any $/d (number starting with dollar sign) also if none is there col 1 is null but still col 2 can be not null
Any help is much appreciated
Thank you
Solved! Go to Solution.
- Labels:
- Data Investigation
- Fuzzy Match
- Regex
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @akshatkumar87,
You could try to write one big REGEX string for this, but it would be easier to split the lines up depending on the obvious differences. A huge advantage of Alteryx is being able to split data using the Filter tool and deal with it separately before bringing it back together.
So, that being said, the different categories I see are:
- x%/$x off product
- Data with no numbers
- The rest
So, I would start with a couple of filters to separate the above into 3 datasets.
then dealing with the first stream is via a Parse REGEX option (as you have 2 named ranges, so Tokenize won't work).
- ([\d%$]+ off)(.*?)$
In the first named range, you are looking for any combination of digits, or symbols ($,%) followed by a space then the word off
In the second named range, you are looking for everything else before the end of the line (symbolised by the $)
The second stream will be anything that has a digit in it and as it's so inconsistent I would look for combinations of 3 or more numbers, symbols etc and then add a delimiter after that to split on
- Replace: ([\%\$\d\.]{3,} ) with $1|
which will add a pipe after each combination that is found.
- Everything in the square brackets are the search terms
- The part in the squiggly brackets means 3 or more.
- $1 means everything in the normal brackets; first marked group
The rest should be pretty simple to deal with
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
^(.*?(\$[0-9\.]+|[0-9\.]+%)( off)?)? ?(.*?)$
This looks for a number block starting with a $ or ending with a %. There can be text before (e.g. BOGO of 2 for) and there can be an ' off' afterwards. Breaking it down:
- ^ - start of text
- .*? - non greedy match of text at beginning
- (\$[0-9\.]+|[0-9\.]+%) - match number with either a $ at start or % at end
- ( off)? - match off if present
You get a couple of extra matching blocks to delete bu can do this with a select
Sample attached
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Great answers @jdunkerley79 and @KaneG,
I would humbly suggest the following RegEx to avoid the select tool:
^(.*?(?:\$[.0-9]+|[.0-9]+%)(?: off)?)?\s?(.*?)$
It's completely borrowed from James' solution -so all credits go to him honestly- but it doesn't keep track of unneeded groups (through ?:) like
(\$[.0-9]+|[.0-9]+%)
or
( off)
meaning that only 2 columns are left as requested.
Thanks,
Paul Noirel
Sr Customer Support Engineer, Alteryx
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@PaulN nice enhancement - I always forget about those!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Nice work to @jdunkerley79 & @PaulN, awesomely concise. The power of REGEX!!!
That gives a few options, whether you want it explained and easier for someone else to edit (taking the performance penalty) or whether you would like it nice, simple and efficient with minimal tools. Testament to the community and how there is multiple ways to solve every problem, but there is always a way in Alteryx.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@KaneG + @jdunkerley79 + @PaulN,
Those were some consumate RegEx recommendations. But I also feel that they were egregious in that they demand RegEx expertise. The following single formula applies the rules necessary to parse the incoming text following these rules:
- If you see "OFF", take the left most characters (including Off) as the first token and everything following as the second.
- Otherwise, if you see a $ symbol then get everything up to the next space as the first token and everything following that space as the second.
- Otherwise, if you see a % symbol then take the left most characters (including the %) as the first token and Trim anything following as the second.
- Lastly make the whole text the second.
- Otherwise, if you see a % symbol then take the left most characters (including the %) as the first token and Trim anything following as the second.
- Otherwise, if you see a $ symbol then get everything up to the next space as the first token and everything following that space as the second.
IF Contains([Text],"off") THEN Trim(Replace([Text], "off", 'off|')) ELSEIF Contains([Text],"$") THEN //Find first space after $ symbol Left([Text],ToNumber(FindString(Substring([Text],FindString([Text], "$"))," "))+ ToNumber(FindString([Text], "$"))) + "|" + Substring([Text],ToNumber(FindString(Substring([Text],FindString([Text], "$")), " "))+1) ELSEIF Contains([Text],"% ") THEN Replace([Text],"% ","%|") Else "|"+[Text] ENDIF
Then you can split the Output using "|" as a delimiter.
Cheers,
Mark
Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thank you so much for the solution it worked out pretty well except few exceptions where the text has stuff like
19.99 for Graphic T or $34.99/$45.99 for graphic T which i figured. All the solutions are right so thank you everyone
