Engine Works

ned_blog · 03-01-2010

Adam Riley talked about the Regex parse documentation seeming incorrect. His understanding of the Regex Tokenize is in fact correct; our help is clearly missing an example and is potentially misleading as well. It is, however, fairly straightforward to tokenize on a delimiter or set of characters. The part that is missing from the documentation is that Tokenize extracts either the entire match or the 1st marked part of a match. This allows you to extract just part of a match.
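As a rough illustration of that rule outside the tool, Python's re.findall happens to behave in a similar way: with no marked group it returns the entire match, and with one marked group it returns only the marked part. This is a sketch in Python, not the Tokenize tool itself, and the sample text and patterns are made up for the example.

```python
import re

text = "red,green,blue"  # made-up sample input

# No marked group: each result is the entire match, trailing comma included.
whole = re.findall(r"[a-z]+,?", text)

# One marked group: each result is only the part inside the parentheses.
marked = re.findall(r"([a-z]+),?", text)

print(whole)   # ['red,', 'green,', 'blue']
print(marked)  # ['red', 'green', 'blue']
```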

Since the tool outputs the part that matches, we have to mark the part in between the delimiters for output. If we are matching everything, we have to be careful either to exclude the delimiter from what we match or to do a non-greedy match. The non-greedy match ends at the 1st possible place a match can end, whereas a default (greedy) match takes the longest match possible, which might include multiple delimiters. We then want our match to end with a delimiter or the end of the line. Matching to the end of line is important; otherwise you might drop your last token.

Here is a regular expression that tokenizes based on commas.

(.+?)(?:,|$)

() - This is simply creating a marked expression. The part enclosed is what will be output in the separate fields or rows.
.+? - A non-greedy match of 1 or more characters. Since it isn't greedy, it will terminate at the 1st match of what follows.
(?:) - A non-marking group. We need this so that we can select between our delimiter and the end of line ($).
,|$ - Matches either a comma OR the end of line.
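
The pieces above can be tried out in Python's re module, which handles these particular constructs the same way for a simple case like this. It is only a sketch for experimenting, not the tool itself; the sample line and the comparison patterns (greedy, no end of line, excluded delimiter) are made up to show what each piece of the expression is for.

```python
import re

line = "a,b,c"  # made-up sample input

# The pattern from above: non-greedy token, then a comma or end of line.
tokens = re.findall(r"(.+?)(?:,|$)", line)

# Greedy variant: .+ runs to the end of the line and swallows the commas.
greedy = re.findall(r"(.+)(?:,|$)", line)

# Without the end-of-line alternative the last token is dropped.
no_eol = re.findall(r"(.+?),", line)

# Excluding the delimiter from the match instead of going non-greedy.
excluded = re.findall(r"([^,]+)(?:,|$)", line)

print(tokens)    # ['a', 'b', 'c']
print(greedy)    # ['a,b,c']
print(no_eol)    # ['a', 'b']
print(excluded)  # ['a', 'b', 'c']
```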

Getting this far would be really easy with the Text To Columns tool. It only gets interesting when we do more. Now that we have the pattern down for Tokenize, it is really easy to change it to match other things:

(.+?)(?:[[:punct:]]|$) - tokenizes on any punctuation characters
(.+?)(?:[[:punct:]]|[0-9]|$) - tokenizes on any punctuation characters or digits
(.+?)(?:[a@]|$) - tokenizes on an a or @
(.+?)(?:def|$) - tokenizes on def. You need all 3 characters together to break a token.
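
These variations can be sketched in Python as well, with one caveat: Python's re module does not understand [[:punct:]], so the example substitutes a hand-built class of ASCII punctuation, which is an assumption and may not cover exactly the same characters as the tool's class. The tokenize helper and the sample strings are hypothetical, made up for the illustration.

```python
import re
import string

# Python's re has no [[:punct:]], so build an equivalent ASCII class by hand.
PUNCT = "[" + re.escape(string.punctuation) + "]"

def tokenize(delimiter, text):
    """Return the tokens of text split on the given delimiter pattern."""
    return re.findall(r"(.+?)(?:" + delimiter + r"|$)", text)

print(tokenize(PUNCT, "a;b!c"))                # ['a', 'b', 'c']
print(tokenize(PUNCT + "|[0-9]", "ab1cd,ef"))  # ['ab', 'cd', 'ef']
print(tokenize("[a@]", "user@example"))        # ['user', 'ex', 'mple']
print(tokenize("def", "abcdefghidexyz"))       # ['abc', 'ghidexyz']
```

In the last call the "de" in "ghidexyz" does not break the token, because all 3 characters of def have to appear together.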

etc... You can see how easy it is to create a custom tokenizer. A module to demonstrate and start playing with these regular expressions can be found at the end of this post.

On an amusing note (or not so much): while writing this blog I found a bug in the Tokenize method. It will let you create a 0-character token, which of course generates an infinite number of results. Not a good thing.
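
How the tool loops internally is not described here, so the following is only a guess at the general failure mode, sketched in Python with a hypothetical tokenize function. A hand-written loop that keeps searching from the end of the previous match will never advance if a match consumes no characters, and a simple guard is to stop when that happens.

```python
import re

def tokenize(pattern, text):
    """Collect tokens in a manual loop, stopping on a match that makes no progress."""
    regex = re.compile(pattern)
    tokens = []
    pos = 0
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        # A match that ends where the search began consumed nothing, so the
        # next search would find it again forever. Stop instead of looping.
        if m.end() == pos:
            break
        # Output the 1st marked part if there is one, else the entire match.
        tokens.append(m.group(1) if regex.groups else m.group(0))
        pos = m.end()
    return tokens

# .*? (zero or more) allows an empty match at the end of the line.
print(tokenize(r"(.*?)(?:,|$)", "a,,b"))  # ['a', '', 'b']
```

Without the guard, the empty match at the end of "a,,b" would be found over and over. The empty token between the two commas is harmless, since the comma itself is still consumed.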

Download the mentioned files here.