community
cancel
Showing results for 
Search instead for 
Did you mean: 

Engine Works Blog

Under the hood of Alteryx: tips, tricks and how-tos.
Asteroid

Adam Riley talked about the Regex parse documentation seeming incorrect. His understanding of the Regex Tokenize is in fact correct, our help is clearly missing an example and potentially misleading as well. It is however, fairly straightforward to tokenize on a delimiter or set of characters. The part that is missing from the documentation is that the Tokenize extracts either the entire match or the 1st marked part of a match. This allows you to extract just part of a match.

 

Since the tool outputs the part that matches, we have to mark the part in between the delimiters for output. We have to be careful here if we are matching everything to either exclude the match or do a non-greedy match. The non-greedy match ends the match at the 1st possible place a match can end, whereas a default match will take the longest match possible. That might include multiple delimiters. We then want to find out match ending with a delimiter, or the end of the line. Matching to the end of line is important, otherwise you might drop your last token.

 

Here is a regular expression that tokenizes based on commas.

 

 

(.+?)(?:,|$)

 

() - This is simply creating a marked expression. The part enclosed is what will be output in the separate fields or rows.
.+? - A non-greedy match of 1 or more characters. Since it isn't greedy, it will terminate at the 1st match of what follows.
(?:) - A non marking group. We need this so that we can select between our delimiter and the end of line ($).
,|$ - matches wither a comma OR the end of line

To get this far would be really easy to do with the Text To Columns tool. It only gets interesting when we do more. Now that we have the pattern down for Tokenize, it is really easy to change it to match other things:

 

(.+?)(?:[[:punct:]]|$) - tokenizes on any punctuation characters
(.+?)(?:[[:punct:]]| [0-9]|$) - tokenizes on any punctuation characters or digits
(.+?)(?:[a@]|$) - tokenizes on an a or @
(.+?)(?:def|$) - tokenizes on def. You need all 3 characters together to break a token.

etc... You can see how easy it is to create a custom tokenizer. A module to demonstrate and start playing with these regular expressions can be found at the end of this post.

 

On an amusing note (or not so much): while writing this blog I found a bug in the tokenize method. It will let you create a 0 character token which of course generates an infinite number of results. Not a good thing.

 

Download the mentioned files here.

Comments
Atom

I think I may have found another bug in the tokenize method: I am matching a certain pattern in the input data and, when the input record is null, it reports the same match as the record above at the output. I am really new here, so not sure if there's a way to report it?

 

Example:

For the following input data (note row 5 is null)

 

UID    Codes
 1       1QWE-A1234,1QWE-A1234,xxxx
 2       2WER-S2345,1QWE-A1234,yyy
 3       3ERT-D3456
 4       4RTY-F4566,3ERT-D3456,3ERT-D3456,zz
 5      
 6       5TYU-G5678
 7       6YUI-H6789
 8      7UIO-J7890
 9      8IOP-K890

And using a regular expression to match unique cases of a digit, followed by 3 letters, followed by slash, followed by a single letter, followed by 1 or more digits:

(\d{1}[A-Z]{3}-[A-Z]{1}\d{1,})(?!.*\1)

 

The output looks like:

 

UID    Codes

 1      1QWE-A1234
 2       2WER-S2345
 2       1QWE-A1234
 3       3ERT-D3456
 4      4RTY-F4566
 4      3ERT-D3456
 5      3ERT-D3456   I believe this is wrong; the input is null so the output should be null, but it shows last match from the row above...
 6      5TYU-G5678
 7      6YUI-H6789
 8      7UIO-J7890
 9      8IOP-K890

 

Founder
Founder
I am not able to replicate that using 10.0. I agree that what you are seeing would be a bug, but again, I'm not seeing that. If you have a simple module to replicate so we can look in more depth, feel free to post on the forums or email me: ned (at) alteryx.com
Founder
Founder
This was in fact a bug - thank you @rgor for privately sending a module to replicate. It has been fixed in the code base for the next release.
Atom

Thanks, that was quick!
Just in case, an easy work-around for now is to pre-filter the null values so they don't go through the RegEx tool, and union them at the end.

Alteryx Certified Partner

Hello,

 

I had a query. Say I had a set of email ids as the data:

abc@ab.cfy.com

huj@ab.yhg.com

 

I want to extract the data between "@"  and ".com " . Since it has a common delimiter "."  , I am not able to extract the required text (i.e. ab.cyf ) . I was hoping someone could guide me.

 

Thank you,

Anusha

 

Pulsar
Pulsar

Hi anushaa,

 

This should do the trick:  (?<=@)(.*)(?=.com)

 

You can use the RegEx tool with either the Tokenize or Parse option.  Both will get you what you need.

Alteryx Certified Partner

Hey!

 

Thanks ! That works!

Pulsar
Pulsar

Welcome.

Meteoroid

We have a list of formulas being stored as strings, and I'm trying to tokenize all variables in the formula. For example, for the formula: DateTimeDiff([curr_date],[s_next_due_date],"months"), I want to tokenize and put [curr_date], [s_next_due_date] in two separate columns. What would be the right regex expression in this case?

 

Thanks.

Labels