Engine Works

ned_blog · ‎03-01-2010

Adam Riley talked about the Regex parse documentation seeming incorrect. His understanding of the Regex Tokenize is in fact correct, our help is clearly missing an example and potentially misleading as well. It is however, fairly straightforward to tokenize on a delimiter or set of characters. The part that is missing from the documentation is that the Tokenize extracts either the entire match or the 1st marked part of a match. This allows you to extract just part of a match.

Since the tool outputs the part that matches, we have to mark the part in between the delimiters for output. We have to be careful here if we are matching everything to either exclude the match or do a non-greedy match. The non-greedy match ends the match at the 1st possible place a match can end, whereas a default match will take the longest match possible. That might include multiple delimiters. We then want to find out match ending with a delimiter, or the end of the line. Matching to the end of line is important, otherwise you might drop your last token.

Here is a regular expression that tokenizes based on commas.

(.+?)(?:,|$)

() - This is simply creating a marked expression. The part enclosed is what will be output in the separate fields or rows.
.+? - A non-greedy match of 1 or more characters. Since it isn't greedy, it will terminate at the 1st match of what follows.
(?:) - A non marking group. We need this so that we can select between our delimiter and the end of line ($).
,|$ - matches wither a comma OR the end of line

To get this far would be really easy to do with the Text To Columns tool. It only gets interesting when we do more. Now that we have the pattern down for Tokenize, it is really easy to change it to match other things:

(.+?)(?:[[:punct:]]|$) - tokenizes on any punctuation characters
(.+?)(?:[[:punct:]]| [0-9]|$) - tokenizes on any punctuation characters or digits
(.+?)(?:[a@]|$) - tokenizes on an a or @
(.+?)(?:def|$) - tokenizes on def. You need all 3 characters together to break a token.

etc... You can see how easy it is to create a custom tokenizer. A module to demonstrate and start playing with these regular expressions can be found at the end of this post.

On an amusing note (or not so much): while writing this blog I found a bug in the tokenize method. It will let you create a 0 character token which of course generates an infinite number of results. Not a good thing.

Download the mentioned files here.

rgor · ‎10-11-2015

I think I may have found another bug in the tokenize method: I am matching a certain pattern in the input data and, when the input record is null, it reports the same match as the record above at the output. I am really new here, so not sure if there's a way to report it?

Example:

For the following input data (note row 5 is null)

UID Codes
1 1QWE-A1234,1QWE-A1234,xxxx
2 2WER-S2345,1QWE-A1234,yyy
3 3ERT-D3456
4 4RTY-F4566,3ERT-D3456,3ERT-D3456,zz
5
6 5TYU-G5678
7 6YUI-H6789
8 7UIO-J7890
9 8IOP-K890

And using a regular expression to match unique cases of a digit, followed by 3 letters, followed by slash, followed by a single letter, followed by 1 or more digits:

(\d{1}[A-Z]{3}-[A-Z]{1}\d{1,})(?!.*\1)

The output looks like:

UID Codes

1 1QWE-A1234
2 2WER-S2345
2 1QWE-A1234
3 3ERT-D3456
4 4RTY-F4566
4 3ERT-D3456
5 3ERT-D3456 I believe this is wrong; the input is null so the output should be null, but it shows last match from the row above...
6 5TYU-G5678
7 6YUI-H6789
8 7UIO-J7890
9 8IOP-K890

Ned · ‎10-12-2015

I am not able to replicate that using 10.0. I agree that what you are seeing would be a bug, but again, I'm not seeing that. If you have a simple module to replicate so we can look in more depth, feel free to post on the forums or email me: ned (at) alteryx.com

Ned · ‎10-12-2015

This was in fact a bug - thank you @rgor for privately sending a module to replicate. It has been fixed in the code base for the next release.

rgor · ‎10-12-2015

Thanks, that was quick!
Just in case, an easy work-around for now is to pre-filter the null values so they don't go through the RegEx tool, and union them at the end.

anushaa · ‎10-23-2018

Hello,

I had a query. Say I had a set of email ids as the data:

abc@ab.cfy.com

huj@ab.yhg.com

I want to extract the data between "@" and ".com " . Since it has a common delimiter "." , I am not able to extract the required text (i.e. ab.cyf ) . I was hoping someone could guide me.

Thank you,

Anusha

mbarone · ‎10-24-2018

Hi anushaa,

This should do the trick: (?<=@)(.*)(?=.com)

You can use the RegEx tool with either the Tokenize or Parse option. Both will get you what you need.

anushaa · ‎10-24-2018

Hey!

Thanks ! That works!

mbarone · ‎10-24-2018

Welcome.

paulvu0911 · ‎08-06-2019

We have a list of formulas being stored as strings, and I'm trying to tokenize all variables in the formula. For example, for the formula: DateTimeDiff([curr_date],[s_next_due_date],"months"), I want to tokenize and put [curr_date], [s_next_due_date] in two separate columns. What would be the right regex expression in this case?

Thanks.

Paul-Evans · ‎10-04-2022

This was so helpful!

For anyone looking to tokenize based on text you want to keep, try using a lookahead

(?=

instead of

(?:

For example, I needed to tokenize a string of department codes & names where the department number was always 6 digits.

I was able to use the following expression to tokenize it, whereas using an unmarked group would only capture every other department (i.e. the code for the second department was the unmarked group at the end of the first token, and then not able to be used to tokenize the second department).

(\d{6}.+?)(?=\d{6}|$)

dholder15 · ‎01-24-2023

I'm trying to extract the first and last name from an email address like this: first.last+google.gmail.com. All I want is the first and last name and I don't want the period between them either. Is there a regular expression for this?

NeilR · ‎01-24-2023

@dholder15

(.+)\.(.+)@.+

dholder15 · ‎01-24-2023

Thanks Neil. I have a follow up, what happens if there is a +tq100 before the email address. What should I do expression wise for that? So the example would look like tam.woolen+tq100@gmail.com

dholder15 · ‎01-24-2023

I want to have everything after the + sign gone. @NeilR

NeilR · ‎01-24-2023

A small tweak to the RegEx should do the trick. Have you tried anything so far?

dholder15 · ‎01-24-2023

Yep I just figured it out!

dholder15 · ‎01-24-2023

Thank you!