nav[aria-label="Primary Navigation"] { padding: 0; & ul { list-style: none; width: 100%; display: flex; flex-direction: row; justify-content: start; align-items: start; gap: 30px; padding: 0; & li { margin: 0; } & ul li { list-style: none; } } }

Tool Mastery | RegEx

AlexKo

This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the RegEx Tool on our way to mastering the Alteryx Designer:

The RegEx tool is kind of like the Swiss Army Knife of parsing in Alteryx; there are a whole lot of ways you can use it to do things faster or more effectively, but even if you just use the blade it's still immensely useful. Sometimes that's all you need, but if you do take the time to figure outhow to use a few other tools in that knife, you'll start to see that there isn't much you can't do with it.

mr bean.png

Before and after using the RegEx tool.

RegEx: What is it good for?

RegEx is an abbreviation of Regular Expression,and you can essentially think of it as anotherlanguage. It uses symbols just like any other language, but in regular expressions these symbols are used to match sequences of letters, numbers, or characters in text. It's a language that's all aboutrecognizing patterns.

Humans are really good at this sort of thing - let's say I gave you this block of text:

3345 Michelson Drive, Suite 400,Irvine, CA 92612

12303 Airport Way, Suite 250, Broomfield, CO 80021

Two North Riverside Plaza, Suite 1430, Chicago, IL 60606

You'd have no problem telling me these are addresses, and what part is a street number or a city name. But a computer would just see a block of text, and it wouldn't care to check if it was an address or not. RegEx is one way we can 'recognize' useful data in text. Let's 'translate' this to a RegEx version:

3345	^\d+	The ^signifies the beginning of a line in RegEx, so it's good practice to include it with your initial pattern. Here, our pattern is\d which means 'any numerical character' (0-9). The+signifies that we want to match theprevious expression oneor more times. Since the first part of the address is a street number, this allows us to have a number of any length.
Michelson Drive	[^\d]+	To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is bywhat we don't expectusing[^...]. This grouping notation matches 'any character not listed here'. What we end up with is matchone or more times any character that isnot a number.
Suite 400	.*	The next part of the address is a suite number, which may or may not be present, and could potentially take on various naming conventions. In order to define a flexible expression to match anything we see there, we can use a . to match 'any character'. The* then signifies that we can match any characterzero or more times.
Irvine	[^\d]+	As before, this just means 'any character that is not a number'.
CA	\u{2}	To match the state we can make use of \uto signify 'any uppercaseletter'. Since we're expecting this to always be a two-letter sequence, we can also specify the length of the match by using{...} after our expression, or 'match any two uppercase letters'.
92612	\d{5}$	The zip code will likewise come in as 5 digits, so we can do something similar to say 'match five number characters'. Then, we can tack on $to signify that we're expecting this to be the end of the current line.

Tools of the trade

There are quite a few symbols used to build RegEx expressions, but Alteryx provides a nice little 'cheat-sheet' right in the tool for you.

You don't really need much more than this to get rolling with using RegEx, and much like in Alteryx, you'll find that there are many many different ways you can create an expression to match a pattern.So don't sweat the details too much, and don't be afraid to spend some time with the good old fashioned "guess-and-test" method of learning.

trial and error.jpg

Don't worry, I'm classically trained.

For more complete guides on RegEx, you can also turn to the Boost-Extended Format String Syntax Guide, as well as the RegEx Perl Syntax Guide.

Alteryx has structured RegEx functionality into four methods: Match, Parse, Replace & Tokenize. Our help documentation for these methods is very good as well.

Match

TheMatchmethod simply checks to see whether a string can be described by the given regular expression, and gives you aTrueorFalse . Let's use the expression elements in the table above to match our addresses. Youcan create one long expression by just putting them together and including spaces\s and commas, wherever theyappear.

The first two addresses match just fine using this expression, butthe third address strays from what we expected to see and fails. Due to an obscure rule about buildings that share a name with their address, the street number is spelled out and our regular expression is unable to match it.

The key to writing a good RegEx is foreseeingthese exceptions in your data, and accounting for them within the expression. In order to match the 'Two' in this address, let'sput in another expression to check if and only ifthis one fails to match. Most addresses will start with numerical characters, but if they don't then this expression will check for a word instead. Here's how it looks:

(?:^\d+)|(?:^\w+)

This is a lot simpler than it looks, and really just uses two additional RegEx symbols. Thepipe symbolImeans 'or'. For example,a|b would just check ifamatches, and if it doesn't, ifbmatches. The second symbol is (?:...), also known as anunmarked group. This is just a way for us to group these things together for theoroperation.

In summary, the first group here (?:^\d+)is just doing the same thing as before, but when it fails the expression tries to match the second group(?:^\w+). This lets us match the word version of our address above without an issue.

For more onMatch:

This excellent post by Adam Riley on the Chaos Reigns Within blog.
How To: Get Started With Regular Expressions
The Regex_Match function for the Formula tool.

Extra Credit:There are many ways to structure RegEx; comment below with a better alternative for(?:^\d+)|(?:^\w+). Why is it better?

Parse

Matching is nice in that you can use it for validation, but theParsemethod is really when RegEx comes into its own, allowing you toextractuseful information out of a block of text. The RegEx tool makes this easy for us - all we need is to place parentheses (...)around each thing we want to pull out. These are calledmarked groups - the counterpart to ourunmarked groups above (?:...).

As you type in the parentheses, you'll see these pop into a Select tool style Output Fieldswindow, which allows you to rename your fields and change their datatypes.

For more onParse:

Part 2 of Adam Riley's RegEx and Alteryx.

Replace

As you can see, theParsemethod is really just an extension ofMatchthat allows us to pull stuff out of text and place it neatly in a new column. Well, what happens when we extend this concept and ask ourselves, 'How can I put stuff back in?'That's where theReplace method comes in.

WithReplace, we can parse components of our string, replace them, and even rearrange them. We can do thisby specifyingmarked groupsto tell the RegEx toolwhat to replace and where, in a language anybody can understand...

dollah bills

The dapper sloth is absolutely right of course, we can use dollar signs$ along with numbers to specify each exact marked group. For example, $1, $2, and $3refer to thefirst, second,andthirdmarked groups, respectively. So for our list of addresses, if we just wanted to parse out a list of city-states, we cantype inthe expression$4, $5.

Note we were able to add in our own little comma there, and a space, just by typing in that text box. TheReplacemethod is very flexible, and you can also use it from the Formula tool withthe function Regex_Replace.

For more on Replace:

Tokenize

RegEx is pretty nifty when things are given to us in a nice, neat list, but what happens if all of these addresses are just unceremoniously dumped out by some API call or database as a single block of text?

3345 Michelson Drive, Suite 400, Irvine, CA 92612,12303 Airport Way, Suite 250, Broomfield, CO 80021,Two North Riverside Plaza, Suite 1430, Chicago, IL 60606

Well, this isn't very useful - it's all comma-delimited, but we have no easy way of knowing when one address ends and another begins. What we need here is theTokenize method, which will take that chunk of text and split it into columns or rows, much like the Text to Columnstool. There are two important differences that set RegEx Tokenize apart:

Instead of matchingon what you don't want (like a comma), you match on what you do want (everything else).
You have the option of choosing what's split out and what is ignored by using amarked group.

This may sound topsy-turvy, but it actually gives you a lot more flexibility in what/how you split your data. To illustrate, let's split our address text blurb up into multiple rows, using the comma as our delimiter.

Since we have to match everything that we want, we need to use the expression (.+?)(?:,|$). Let's break this down:

.+meansany one character.matched one or more times +
?is how we tell this match to be lazyrather thangreedy. This is a really useful distinctionthat may be a bit difficult to understand at first, but for the purposes of this crash course let's just focus on what the? means:match whatever is before zero times (not at all) or exactly once.
The plus sign+is actually the opposite: it's agreedysymbol, so the previousone character .will try to matchone or more times. So what does it mean when we tell something greedy to be lazy? Well, it actually modifies just how greedy it can be by forcing it to look at the next match. So what.+?really means is:
'match one character one or more times until you can match whatever comes next'.
What comes next is anunmarked group(?:,|$)that essentially functions as a STOP sign for the previous (.+?). Since it's not in the marked group, it won't be present in the result. It's just in anunmarked group so that we can tell it to look foreithera comma ,orthe end-of-line $at the end of the text blurb.

Phew - that's certainly a lot. Feel free to take this opportunity to stretch, go for a walk, or meditate.

line noise.gif

Obviously, the Text to Columnstool can split on a comma way easier than the above, but the flexibility ofTokenize comes into the fore when we try to do something a tad more useful.For instance, we can use it to split that block of address information into the original three addresses.

In this case, we are just using our last matching expression for the zip code \d{5}to mark out where each match ends. As above, we know that each line will either end with a comma or the end of line, and we could use the(?:,|$)here to split these successfully. In the example above, I chose to showcase the?ability to matchzero or onetime to replace this - so we can split on an 'optional comma' after the marked groupmatch.

For more onTokenize:

Part 2 of Adam Riley's RegEx and Alteryxand an Engine Works blog response from Alteryx co-founder Ned Harding.

Extra Credit:The ,?won't work in the previous case, splitting by comma alone - (.+?),?instead of(.+?)(?:,|$). Why?

Comment with an explanation below for eternal glory and bragging rights.

By now, you should have expert-level proficiency with the RegExTool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.

Stay tuned with our latest posts everyTool Tuesdayby followingAlteryxon Twitter! If you want to master all the Designer tools, considersubscribingfor email notifications.

Additional Information

Click on the corresponding language link below to access this article in another language -

regex.yxmd

Regex

Parse

Expression

Tool Mastery

Accepted answers

All comments

Ken_Black

Alex,

Thanks for this excellent article. As time goes by and I develop more and more sophisticated workflows, my usage of regex continues to grow. After reading this article, I took a look at a recent bit of work and found a whole series of regex tools I strung together to do a bunch of things. I just wanted to say things for producing this well-documented piece of work.

Ken

AlexKo

Happy to hear you found this useful, @Ken_Black! Happy RegEx'ing!

Alex

brad_j_crep

Very informative. How would it work if you had an address like 123 12th St NE Apt# 201 #201? I need to get rid of the second #201, remove the # after the Apt, still let the 123 12th St NE Apt 201 go through to the output.

Still new to regex, any help would be appreciated.

Thanks,

Brad

DanM

@brad_j_crep one idea is that you can use the parse Output Method and use this expression to remove the #201 (.*)\s#. You can then use a Formula tool and use the replacechar expression to remove the other # that will be left.

AlexKo

Thanks for reading, @brad_j_crep. I think @DanM has a nice solution there; I'll just add another here that uses the replace method...

Sometimes when building these out the question is really 'how often will this work?' - in this case I'm not sure if you'll see a dataset with this odd repeated suite name. If this is peppered in the data, it may be best to just use a Formula tool or RegEx tool upstream of your main parsing step that replaces any # in the dataset and cleans unnecessary stuff up.

Alex

JohnHG

Hi, I tweak around a bit to understand what every bit does. I've opened the build in RegEx example. In the 3rd example (In "Parse" mode, partial matches will be considered, however only the first match will be returned.) when I put in some square brackets like: -([\d{4}]) it returns in the zip4 ,column 4,6,1,3 etc.

The explanation about sq brackets states:

To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is by what we don't expect using [^...]. This grouping notation matches 'any character not listed here'. What we end up with is match one or more times any character that is not a number.

I expected all to be [null]

Can anyone explain to me what happened here?

Thanks,

John

AlexKo

Hi @JohnHG -

This is a failure on my part to differentiate the difference between:

matching characters in a set using [...]
matching characters not in a set using [^...]

In the first, we're matching any character that is there, and in the second any character that isn't. Take a look at this link for a more eloquent explanation:

https://regexone.com/lesson/matching_characters

https://regexone.com/lesson/excluding_characters

philipmannering

Anybody know how you can do named groups in Alteryx? For example you could have two date formats:

13-Nov-2017 and Nov 15, 2017.

Can you use something like (\d+)-(\w+)-(\d+)|(\w+)\s(\d+),\s(\d+)

but assign the first group (the 13) and the fifth group (the 15) to the same output field day?

Maybe something like (?P<day>\d+)-(\w+)-(\d+)|(\w+)\s(?P<day>\d+),\s(\d+) ... except these named groups don't work in Alteryx.

DanM

@PhilipMannering at this time the REGEX tool does not do grouping. The tool looks for specific patterns, so if the string does not match the entire pattern you will not get a return of data. I would suggest using two REGEX tools and then using the Formula tool to combine the data back using an IF statement to fill in the Nulls from the other fields.

vfranklin

I am trying to parse a string of text:

<a href="/123abc"target="_self">I want this string

I want to parse everthing prior to and including > into one column and everything after > into a second column. How do I do that?

girisri

Lets say I want to parse Age,Age,Age to reflect only Age i.e parsing it out from where the first delimiter ',' starts, how do I achieve this?

JoeS

@girisri - You can use the Regex Match and an lazy quantifier (AKA ?)

This will take the first part that matches the string, before your comma

girisri

Thank you, That worked.

However in some rows I have values without repetition,So instead of Age, Age, Age I have only Age in some rows

So in this case if I apply (.*?), it replaces Age with [Null], how to avoid this? as I need the value Age

JoeS

I am sure you could handle it in the Regex, but rather than racking my brains, I would use a formula tool after and use:

IIF(ISNULL([NewField]),[OriginalField],[NewField])

That will then replace the null value with the value from the original field.

The reason it doesnt work is that it's looking for a comma, and not finding that in the field with only one value.

girisri

That of course works! Thanks

Just was curious how to achieve that thru Regex, will explore, Thanks

dcappellino

Is there anyway that I can pull all characters after a number? So basically any field that starts with a number I want that entire field, including the numbers, in a new column?

AntonioGonzales

Dear @girisri,

Hope you are well.

Have you tried to parse the last occurrence of Age instead of the first one?

(\d+)$

Explanation:

() --> marked group (what you want to pull out)

\d --> any digit

+ --> one or more repetition

$ --> end of line

In my tests worked fine, please see the picture below.

Hope it can help you.

Looking forward to hearing from you.

Cheers!

Antonio Alves

AntonioGonzales

Dear @dcappellino,

Hope you are well.

I couldn't understand very well your question, sorry.

It is possible to share an example of your line to me have a try?

Cheers,

Antonio Alves

AntonioGonzales

Dear @vfranklin,

Hope you are well.

Have you tried this:

(<.+>)(.+)$

Explanation:

()() --> you need to parse to groups as you said

(<.+>) --> this group try to parse everything between < and >

(.+)$ --> the second group try to parse everything after the first group until the end of the line

In your given example it worked fine.

I hope that this example could help you.

Cheers!

Antonio Alves

dcappellino

AntonioGonzales,

Basically I'm trying to split items in one column, into two based on if they are account numbers vs account names, but can't seem to figure it out. So any cell that starts with a number, I want to move the entire cell into a new column. I tried text to column, but when I use that the delimiter is not transferred to the next cell(i.e. the 4th cell would transfer over as "0010:Cash" instead of "10010: Cash" when I use 1 as the delimiter). I think the RegEx tool is my best bet, but cant figure out the exact string that would perform this task.

I hope this is more clear.

Thanks!

AntonioGonzales

Dear @dcappellino,

Thanks for share more details, now I guess that I got it.

In my tests the regex

^(\d+.*)$

worked well.

Version 1)

Let me explain this regex in more details.

^ --> this indicate start the line

() --> we need just 1 marked group (account)

\d+ --> 1 number or more

.* --> any caracters

$ --> end of the line

So, if I mount it in parts:

^ --> start of my line

^()$ --> everything in my line

^(\d+)$ --> Line just with numbers

^(\d+.*)$ --> Line starting with numbers followed by anything

I did a second version where I remove this numbers from the first column as the text-to-column normally do.

The idea is the same but in this case, I had to use the formula tool with regex_match and regex_replace.

Hope this can help you.

Let me know if you would like to have my workflow or more explanation.

Cheers.

Antonio Alves

JoeS

Hi @dcappellino

You can use the Regex Match function and then filter for the values to be true.

If you use "\d.+" This means there needs to be a number and then at least one other thing.

Edit: not sure why, but the comment above didn't load when I replied, so it's now covered twice!

mdkhusru

Hi Alex,

This was an excellent article. As a newbie, This really helped in understanding the usage of Regex so neatly.

I am working on an exercise to sort out mapping data for addresses in various formats The sample is as below

Addressline	type	unit	loc	street
L572,31A LIBERTY GROVE	L	572	31A	LIBERTY GROVE
12 BAYER STREET		12		BAYER STREET
3/350 TORRENS RD		3	350	TORRENS RD
202/12-14 THE SHORES		202	12-14	THE SHORES
Flat 6, 77 Kensington Road	flat	6	77	Kensington Road
CNR TAYLORS RDUN45 HUB DR		45		CNR TAYLORS RDUN HUB DR
SALFORD RET ESTATE31A LYN ST		31A		SALFORD RET ESTATE LYN ST
SALFORD RETMT ESTATE U 3 / 1A LYN ST	U	3	1A	SALFORD RETMT ESTATE LYN ST
UNIT 2/164 HUB DRIVE	UNIT	2	164	HUB DRIVE
40 MOORFIELD MEWS, LOT 3	LOT	40	3	MOORFIELD MEWS

each of these address have a unit number and/or a Location number. some of the location number could be in ranges (12-14) The address to parse is in the first column and the result I am looking out is in other columns. I would really appreciate if you could help me with some hints or solution.

Regards,

Ali

AlexKo

Hi @mdkhusru,

I'm glad you enjoyed the article. Your example was much tougher than I originally expected! It definitely touched on a lot of different concepts. I've attached a solution I came up with. I hope you'll find it helpful.

regex 6Feb2019.PNG

Best,

Alex

Here is the workflow xml if you want to open it. Just copy this to a text editor and save it as a yxmd file.

Spoiler

<?xml version="1.0"?>
<AlteryxDocument yxmdVer="11.0">
  <Nodes>
    <Node ToolID="4">
      <GuiSettings Plugin="AlteryxBasePluginsGui.BrowseV2.BrowseV2">
        <Position x="1074" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <TempFile>C:\Users\AKOSZY~1\AppData\Local\Temp\Engine_9208_3243a9519d4042838d360d78cb4d7d17_\Engine_21724_535649afd86349a68f5168ed62000f64_.yxdb</TempFile>
          <TempFileDataProfiling />
          <Layout>
            <ViewMode>Single</ViewMode>
            <ViewSize value="100" />
            <View1>
              <DefaultTab>Profile</DefaultTab>
              <Hints>
                <Table />
              </Hints>
            </View1>
            <View2 />
          </Layout>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxBrowseV2" />
    </Node>
    <Node ToolID="6">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="174" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>Addressline</Field>
          <RegExExpression value="[^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?" />
          <CaseInsensitve value="True" />
          <Method>ParseComplex</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="False" />
            <RootName />
            <NumFields value="3" />
            <ErrorHandling>Warn</ErrorHandling>
          </ParseSimple>
          <ParseComplex>
            <Field field="unit" type="V_WString" size="254" />
            <Field field="loc" type="V_WString" size="254" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Parse:
[^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?</AnnotationText>
          <DefaultAnnotationText>Parse:
[^\d]*(\d+\w?)[^\d]+(\d+\w?...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="7">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="330" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>Addressline</Field>
          <RegExExpression value="((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]" />
          <CaseInsensitve value="True" />
          <Method>ParseComplex</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="False" />
            <RootName />
            <NumFields value="3" />
            <ErrorHandling>Warn</ErrorHandling>
          </ParseSimple>
          <ParseComplex>
            <Field field="type" type="V_WString" size="254" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Parse:
((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]</AnnotationText>
          <DefaultAnnotationText>Parse:
((?:LOT)|(?:UNIT)|(?:FLAT)|...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="8">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="642" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>parseme</Field>
          <RegExExpression value="(?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+" />
          <CaseInsensitve value="True" />
          <Method>ParseSimple</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="True" />
          </ParseSimple>
          <ParseComplex>
            <Field field="No Marked Groups Found" type="No Marked Groups Found" size="No Marked Groups Found" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Tokenize:
(?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+</AnnotationText>
          <DefaultAnnotationText>Tokenize:
(?:\d+\w)|(?:\w\d+)|[^,\s\d...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="9">
      <GuiSettings Plugin="AlteryxBasePluginsGui.Formula.Formula">
        <Position x="450" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <FormulaFields>
            <FormulaField expression="[Addressline]" field="parseme" size="1073741823" type="V_WString" />
          </FormulaFields>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText><![CDATA[parseme = [Addressline]
]]></DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFormula" />
    </Node>
    <Node ToolID="10">
      <GuiSettings Plugin="AlteryxBasePluginsGui.Filter.Filter">
        <Position x="762" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Expression>[type]!=[parseme] &amp;&amp; !REGEX_Match([parseme],'^.*?\d.*?')</Expression>
          <Mode>Custom</Mode>
          <Simple>
            <Operator>=</Operator>
            <Field>
            </Field>
            <Operands>
              <IgnoreTimeInDateTime>True</IgnoreTimeInDateTime>
              <DateType>fixed</DateType>
              <PeriodDate>2019-02-06 12:17:48</PeriodDate>
              <PeriodType>
              </PeriodType>
              <PeriodCount>0</PeriodCount>
              <Operand>
              </Operand>
              <StartDate>2019-02-06 12:17:48</StartDate>
              <EndDate>2019-02-06 12:17:48</EndDate>
            </Operands>
          </Simple>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText>[type]!=[parseme] &amp;&amp; !REGEX_Match([parseme],'^.*?\d.*?')</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFilter" />
    </Node>
    <Node ToolID="11">
      <GuiSettings Plugin="AlteryxSpatialPluginsGui.Summarize.Summarize">
        <Position x="870" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <SummarizeFields>
            <SummarizeField field="RecordID" action="GroupBy" rename="RecordID" />
            <SummarizeField field="Addressline" action="First" rename="Addressline" />
            <SummarizeField field="type" action="First" rename="type" />
            <SummarizeField field="unit" action="First" rename="unit" />
            <SummarizeField field="loc" action="First" rename="loc" />
            <SummarizeField field="parseme" action="Concat" rename="street">
              <Concat_Start />
              <Separator>\s</Separator>
              <Concat_End />
            </SummarizeField>
          </SummarizeFields>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxSpatialPluginsEngine.dll" EngineDllEntryPoint="AlteryxSummarize" />
    </Node>
    <Node ToolID="12">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RecordID.RecordID">
        <Position x="546" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <FieldName>RecordID</FieldName>
          <StartValue>1</StartValue>
          <FieldType>Int32</FieldType>
          <FieldSize>6</FieldSize>
          <Position>0</Position>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRecordID" />
    </Node>
    <Node ToolID="14">
      <GuiSettings Plugin="AlteryxBasePluginsGui.TextInput.TextInput">
        <Position x="54" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <NumRows value="10" />
          <Fields>
            <Field name="Addressline" />
          </Fields>
          <Data>
            <r>
              <c>L572,31A LIBERTY GROVE</c>
            </r>
            <r>
              <c>12 BAYER STREET</c>
            </r>
            <r>
              <c>3/350 TORRENS RD</c>
            </r>
            <r>
              <c>202/12-14 THE SHORES</c>
            </r>
            <r>
              <c>Flat 6, 77 Kensington Road</c>
            </r>
            <r>
              <c><![CDATA[CNR TAYLORS RDUN45 HUB DR ]]></c>
            </r>
            <r>
              <c><![CDATA[SALFORD RET ESTATE31A LYN ST ]]></c>
            </r>
            <r>
              <c>SALFORD RETMT ESTATE U 3 / 1A LYN ST</c>
            </r>
            <r>
              <c><![CDATA[UNIT 2/164 HUB DRIVE ]]></c>
            </r>
            <r>
              <c><![CDATA[40 MOORFIELD MEWS, LOT 3 ]]></c>
            </r>
          </Data>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxTextInput" />
    </Node>
    <Node ToolID="15">
      <GuiSettings>
        <Position x="978" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Value name="List Box (11)">"RecordID","Addressline","type","unit","loc","street"</Value>
          <Value name="Check Box (84)">True</Value>
          <Value name="Check Box (117)">False</Value>
          <Value name="Check Box (15)">True</Value>
          <Value name="Check Box (109)">False</Value>
          <Value name="Check Box (122)">False</Value>
          <Value name="Check Box (53)">False</Value>
          <Value name="Check Box (58)">False</Value>
          <Value name="Check Box (70)">False</Value>
          <Value name="Check Box (77)">True</Value>
          <Value name="Drop Down (81)">upper</Value>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
        <Dependencies>
          <Implicit />
        </Dependencies>
      </Properties>
      <EngineSettings Macro="Cleanse.yxmc" />
    </Node>
    <Node ToolID="16">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="66" y="30" width="192" height="132" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"unit" and "loc" are tough, because the order they appear in matters and they can be anywhere in the address string. Pull these out first.

Main concept:
A lot of things are optional, so you need to be clever with "?"</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
    <Node ToolID="17">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="294" y="30" width="360" height="144" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"type" is also tough because it could easily look like a street name.

For this, the best bet is probably to look for exact matches. This expression looks for the words "FLAT", "UNIT","LOT" and their abbreviations.

Main concept:
The OR operator | is used to parse out these exact words, and the unmarked group helps break apart the matches. Order matters here - you want to pull out "LOT" before checking for "L".</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
    <Node ToolID="19">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="690" y="30" width="444" height="252" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"street" is really just everything else, so the trick is to scoop that stuff up without grabbing everything you've already parsed out.

One way to do this is to split out all the words into their own rows and then concatenating them back together.

To do this, I used the Tokenize option and matched on the things I want to pull out using the OR operator | along with unmarked groups. The expressions are looking for either:
1) a number followed by a single letter
2) a single letter followed by a number
3) or anything else that is between a comma, space, slash, or dash

I used a REGEX_Match function in the Filter tool to take out anything with numbers in it. I also removed any of the "type" matches that resulted from the Tokenize.

Then, I combined all the individual words using the Concatenate option in the Summarize tool, using "\s" as the separator.</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
  </Nodes>
  <Connections>
    <Connection>
      <Origin ToolID="6" Connection="Output" />
      <Destination ToolID="7" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="7" Connection="Output" />
      <Destination ToolID="9" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="8" Connection="Output" />
      <Destination ToolID="10" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="9" Connection="Output" />
      <Destination ToolID="12" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="10" Connection="True" />
      <Destination ToolID="11" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="11" Connection="Output" />
      <Destination ToolID="15" Connection="Input2" />
    </Connection>
    <Connection>
      <Origin ToolID="12" Connection="Output" />
      <Destination ToolID="8" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="14" Connection="Output" />
      <Destination ToolID="6" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="15" Connection="Output26" />
      <Destination ToolID="4" Connection="Input" />
    </Connection>
  </Connections>
  <Properties>
    <Memory default="True" />
    <GlobalRecordLimit value="0" />
    <TempFiles default="True" />
    <Annotation on="True" includeToolName="False" />
    <ConvErrorLimit value="10" />
    <ConvErrorLimit_Stop value="False" />
    <CancelOnError value="False" />
    <DisableBrowse value="False" />
    <EnablePerformanceProfiling value="False" />
    <DisableAllOutput value="False" />
    <ShowAllMacroMessages value="False" />
    <ShowConnectionStatusIsOn value="True" />
    <ShowConnectionStatusOnlyWhenRunning value="True" />
    <ZoomLevel value="0" />
    <LayoutType>Horizontal</LayoutType>
    <MetaInfo>
      <NameIsFileName value="True" />
      <Name>regex 6Feb2019</Name>
      <Description />
      <RootToolName />
      <ToolVersion />
      <ToolInDb value="False" />
      <CategoryName />
      <SearchTags />
      <Author />
      <Company />
      <Copyright />
      <DescriptionLink actual="" displayed="" />
      <Example>
        <Description />
        <File />
      </Example>
    </MetaInfo>
    <Events>
      <Enabled value="True" />
    </Events>
  </Properties>
</AlteryxDocument>

mdkhusru

Hi Alex,

Thanks a lot. That was wonderful. This is exactly what I was looking out.

Regards,

Ali Khusru

HuwHallam

Soooo handy! Thank you!!

On your extra credit point (though someone's bound to have said this already...): it has to be that when you don't include the end-of-line ($) alternative in the unmarked group, the third address, which isn't followed by a comma, won't match the regex formula. Only the first two addresses, followed by commas, will be tokenised.

AlexKo

Glad you enjoyed the article @HuwHallam!

You almost have the answer to the extra credit! While you're correct that it won't match, the reason you give isn't quite right.. I'll give you a hint, it has something to do with how the proceeding marked group ends...

(It may be easier to understand if you download the attached workflow and change the RegEx tool (#25) around to see the result.)

HuwHallam

Oh I see - I think I misread the question... Yes, if the formula is (.+?),? then the second ? means that the marked group doesn't have to be followed by a comma (it can be followed by anything or nothing). Without this restriction, the first 'greedy' ? causes the regex to return a result as soon as it can satisfy .+ (any character, one or more times). Hence you get a series of single character outputs.

Thanks for pushing me in the right direction!

jhosmer002

Hi! I am looking to use the Tokenize function in the Regex tool. I am trying to split text from 1 column into 2 columns, without a common delimiter. I figure that I can split at the word "at" but I am at a loss on how to pull the company name out. I have about 700 fields all with different information and character lengths but this common setup.

An example of the field I am trying to split into title and company currently looks like this:
Manager, Strategy at ABC Corp. Go to ABC Corp. account page1year 15 months