community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx Knowledge Base

Definitive answers from Designer experts.
Upgrade Alteryx Designer in 10 Steps

Debating whether or not to upgrade to the latest version of Alteryx Designer?

LEARN MORE

Tool Mastery | RegEx

Alteryx
Alteryx
Created on

This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the RegEx Tool on our way to mastering the Alteryx Designer:

 RegEx.png

The RegEx tool is kind of like the Swiss Army Knife of parsing in Alteryx; there are a whole lot of ways you can use it to do things faster or more effectively, but even if you just use the blade it's still immensely useful. Sometimes that's all you need, but if you do take the time to figure out how to use a few other tools in that knife, you'll start to see that there isn't much you can't do with it.

 

mr bean.pngmacgyver.jpg

Before and after using the RegEx tool.

 

RegEx: What is it good for?

 

RegEx is an abbreviation of Regular Expression, and you can essentially think of it as another language. It uses symbols just like any other language, but in regular expressions these symbols are used to match sequences of letters, numbers, or characters in text. It's a language that's all about recognizing patterns.

 

Humans are really good at this sort of thing - let's say I gave you this block of text:

 

3345 Michelson Drive, Suite 400, Irvine, CA 92612

12303 Airport Way, Suite 250, Broomfield, CO 80021

Two North Riverside Plaza, Suite 1430, Chicago, IL 60606

 

You'd have no problem telling me these are addresses, and what part is a street number or a city name. But a computer would just see a block of text, and it wouldn't care to check if it was an address or not. RegEx is one way we can 'recognize' useful data in text. Let's 'translate' this to a RegEx version:

 

3345 ^\d+  The ^ signifies the beginning of a line in RegEx, so it's good practice to include it with your initial pattern. Here, our pattern is \d which means 'any numerical character' (0-9). The signifies that we want to match the previous expression one or more times. Since the first part of the address is a street number, this allows us to have a number of any length.
Michelson Drive [^\d]+ To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is by what we don't expect using [^...]. This grouping notation matches 'any character not listed here'. What we end up with is match one or more times any character that is not a number.
Suite 400 .* The next part of the address is a suite number, which may or may not be present, and could potentially take on various naming conventions. In order to define a flexible expression to match anything we see there, we can use a . to match 'any character'. The * then signifies that we can match any character zero or more times.
Irvine [^\d]+ As before, this just means 'any character that is not a number'. 
CA \u{2} To match the state we can make use of \u to signify 'any uppercase letter'. Since we're expecting this to always be a two-letter sequence, we can also specify the length of the match by using {...} after our expression, or 'match any two uppercase letters'.
92612 \d{5}$ The zip code will likewise come in as 5 digits, so we can do something similar to say 'match five number characters'. Then, we can tack on $ to signify that we're expecting this to be the end of the current line

 

cheat.png

Tools of the trade

 

There are quite a few symbols used to build RegEx expressions, but Alteryx provides a nice little 'cheat-sheet' right in the tool for you.

 

You don't really need much more than this to get rolling with using RegEx, and much like in Alteryx, you'll find that there are many many different ways you can create an expression to match a pattern. So don't sweat the details too much, and don't be afraid to spend some time with the good old fashioned "guess-and-test" method of learning.

 

trial and error.jpg

Don't worry, I'm classically trained.

 

For more complete guides on RegEx, you can also turn to the Boost-Extended Format String Syntax Guide, as well as the RegEx Perl Syntax Guide.

 

Alteryx has structured RegEx functionality into four methods: Match, Parse, Replace & Tokenize. Our help documentation for these methods is very good as well.

 

Match

 

The Match method simply checks to see whether a string can be described by the given regular expression, and gives you a True or False. Let's use the expression elements in the table above to match our addresses. You can create one long expression by just putting them together and including spaces \s and commas , wherever they appear.

 

regex1.JPG

 

The first two addresses match just fine using this expression, but the third address strays from what we expected to see and fails. Due to an obscure rule about buildings that share a name with their address, the street number is spelled out and our regular expression is unable to match it.

 

The key to writing a good RegEx is foreseeing these exceptions in your data, and accounting for them within the expression. In order to match the 'Two' in this address, let's put in another expression to check if and only if this one fails to match. Most addresses will start with numerical characters, but if they don't then this expression will check for a word instead. Here's how it looks:

 

(?:^\d+)|(?:^\w+)

 

This is a lot simpler than it looks, and really just uses two additional RegEx symbols. The pipe symbol means 'or'. For example, a|b would just check if matches, and if it doesn't, if matches. The second symbol is (?:...), also known as an unmarked group. This is just a way for us to group these things together for the or operation.

 

 

regex2.JPG

 

In summary, the first group here (?:^\d+is just doing the same thing as before, but when it fails the expression tries to match the second group (?:^\w+). This lets us match the word version of our address above without an issue.

 

For more on Match:

Extra Credit: There are many ways to structure RegEx; comment below with a better alternative for (?:^\d+)|(?:^\w+). Why is it better?

 

Parse

 

Matching is nice in that you can use it for validation, but the Parse method is really when RegEx comes into its own, allowing you to extract useful information out of a block of text. The RegEx tool makes this easy for us - all we need is to place parentheses (...around each thing we want to pull out. These are called marked groups - the counterpart to our unmarked groups above (?:...).

 

regex3.JPG

 

As you type in the parentheses, you'll see these pop into a Select tool style Output Fields window, which allows you to rename your fields and change their datatypes.

 

For more on Parse:

 

Replace

 

As you can see, the Parse method is really just an extension of Match that allows us to pull stuff out of text and place it neatly in a new column. Well, what happens when we extend this concept and ask ourselves, 'How can I put stuff back in?' That's where the Replace method comes in.

 

With Replace, we can parse components of our string, replace them, and even rearrange them. We can do this by specifying marked groups to tell the RegEx tool what to replace and where, in a language anybody can understand...

 

dollah bills

 

The dapper sloth is absolutely right of course, we can use dollar signs $ along with numbers to specify each exact marked group. For example, $1, $2, and $3 refer to the first, second, and third marked groups, respectively. So for our list of addresses, if we just wanted to parse out a list of city-states, we can type in the expression $4, $5.

 

regex4.JPG

 

Note we were able to add in our own little comma there, and a space, just by typing in that text box. The Replace method is very flexible, and you can also use it from the Formula tool with the function Regex_Replace.

 

For more on Replace:

 

Tokenize

 

RegEx is pretty nifty when things are given to us in a nice, neat list, but what happens if all of these addresses are just unceremoniously dumped out by some API call or database as a single block of text?

 

3345 Michelson Drive, Suite 400, Irvine, CA 92612,12303 Airport Way, Suite 250, Broomfield, CO 80021,Two North Riverside Plaza, Suite 1430, Chicago, IL 60606

 

Well, this isn't very useful - it's all comma-delimited, but we have no easy way of knowing when one address ends and another begins. What we need here is the Tokenize method, which will take that chunk of text and split it into columns or rows, much like the Text to Columns tool. There are two important differences that set RegEx Tokenize apart:

  1. Instead of matching on what you don't want (like a comma), you match on what you do want (everything else).
  2. You have the option of choosing what's split out and what is ignored by using a marked group

This may sound topsy-turvy, but it actually gives you a lot more flexibility in what/how you split your data. To illustrate, let's split our address text blurb up into multiple rows, using the comma as our delimiter.

 

regex5.JPG

 

Since we have to match everything that we want, we need to use the expression (.+?)(?:,|$). Let's break this down:

  • .+ means any one character . matched one or more times +
  • ? is how we tell this match to be lazy rather than greedy. This is a really useful distinction that may be a bit difficult to understand at first, but for the purposes of this crash course let's just focus on what the ? means: match whatever is before zero times (not at all) or exactly once.
  • The plus sign + is actually the opposite: it's a greedy symbol, so the previous one character . will try to match one or more times. So what does it mean when we tell something greedy to be lazy? Well, it actually modifies just how greedy it can be by forcing it to look at the next match. So what .+? really means is:
    'match one character one or more times until you can match whatever comes next'.
  • What comes next is an unmarked group (?:,|$) that essentially functions as a STOP sign for the previous (.+?). Since it's not in the marked group, it won't be present in the result. It's just in an unmarked group so that we can tell it to look for either a comma , or the end-of-line $ at the end of the text blurb.

Phew - that's certainly a lot. Feel free to take this opportunity to stretch, go for a walk, or meditate.

 

line noise.gif

 

Obviously, the Text to Columns tool can split on a comma way easier than the above, but the flexibility of Tokenize comes into the fore when we try to do something a tad more useful. For instance, we can use it to split that block of address information into the original three addresses.

 

regex6.JPG

 

In this case, we are just using our last matching expression for the zip code \d{5} to mark out where each match ends. As above, we know that each line will either end with a comma or the end of line, and we could use the (?:,|$here to split these successfully. In the example above, I chose to showcase the ability to match zero or one time to replace this - so we can split on an 'optional comma' after the marked group match.

 

For more on Tokenize:

Extra Credit: The ,? won't work in the previous case, splitting by comma alone - (.+?),instead of (.+?)(?:,|$). Why?

Comment with an explanation below for eternal glory and bragging rights.

 

regex.png

 

By now, you should have expert-level proficiency with the RegEx Tool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.

 

Stay tuned with our latest posts every Tool Tuesday by following Alteryx on Twitter! If you want to master all the Designer tools, consider subscribing for email notifications.

Attachments
Comments
Asteroid
Asteroid

 Alex,

 

Thanks for this excellent article. As time goes by and I develop more and more sophisticated workflows, my usage of regex continues to grow. After reading this article, I took a look at a recent bit of work and found a whole series of regex tools I strung together to do a bunch of things. I just wanted to say things for producing this well-documented piece of work.

 

Ken

 

regex.JPG

Alteryx
Alteryx

Happy to hear you found this useful, @Ken_Black! Happy RegEx'ing!

 

Alex

Asteroid

Very informative.  How would it work if you had an address like 123 12th St NE Apt# 201 #201?  I need to get rid of the second #201, remove the # after the Apt, still let the 123 12th St NE Apt 201 go through to the output. 

Still new to regex, any help would be appreciated.

 

Thanks,

Brad

Community Operations Manager
Community Operations Manager

@brad_j_crep one idea is that you can use the parse Output Method and use this expression to remove the #201 (.*)\s#. You can then use a Formula tool and use the replacechar expression to remove the other # that will be left.

Alteryx
Alteryx

Thanks for reading, @brad_j_crep. I think @DanM has a nice solution there; I'll just add another here that uses the replace method...

 

address.PNG

 

Sometimes when building these out the question is really 'how often will this work?' - in this case I'm not sure if you'll see a dataset with this odd repeated suite name. If this is peppered in the data, it may be best to just use a Formula tool or RegEx tool upstream of your main parsing step that replaces any # in the dataset and cleans unnecessary stuff up.

 

Alex

Meteoroid

Hi, I tweak around a bit to understand what every bit does. I've opened the build in RegEx example. In the 3rd example (In "Parse" mode, partial matches will be considered, however only the first match will be returned.) when I put in some square brackets like: -([\d{4}]) it returns in the zip4 ,column 4,6,1,3 etc. 

 

The explanation about sq brackets states: 

 

To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is by what we don't expect using [^...]. This grouping notation matches 'any character not listed here'. What we end up with is match one or more times any character that is not a number.

 

I expected all to be [null]

 

Can anyone explain to me what happened here?

Thanks,

 

John

Alteryx
Alteryx

Hi @JohnHG -

 

This is a failure on my part to differentiate the difference between:

  • matching characters in a set using [...]
  • matching characters not in a set using [^...]

In the first, we're matching any character that is there, and in the second any character that isn't. Take a look at this link for a more eloquent explanation:

https://regexone.com/lesson/matching_characters

https://regexone.com/lesson/excluding_characters

Alteryx Certified Partner

Anybody know how you can do named groups in Alteryx? For example you could have two date formats:

13-Nov-2017 and Nov 15, 2017.

Can you use something like (\d+)-(\w+)-(\d+)|(\w+)\s(\d+),\s(\d+)

but assign the first group (the 13) and the fifth group (the 15) to the same output field day?

Maybe something like (?P<day>\d+)-(\w+)-(\d+)|(\w+)\s(?P<day>\d+),\s(\d+) ... except these named groups don't work in Alteryx.

Community Operations Manager
Community Operations Manager

@PhilipMannering at this time the REGEX tool does not do grouping. The tool looks for specific patterns, so if the string does not match the entire pattern you will not get a return of data. I would suggest using two REGEX tools and then using the Formula tool to combine the data back using an IF statement to fill in the Nulls from the other fields. 

Meteoroid

I am trying to parse a string of text:

<a href="/123abc"target="_self">I want this string

I want to parse everthing prior to and including > into one column and everything after > into a second column.  How do I do that?

Meteor

 Lets say I want to parse  Age,Age,Age  to reflect only Age i.e parsing it out from where the first delimiter ',' starts, how do I achieve this?

Alteryx
Alteryx

@girisri - You can use the Regex Match and an lazy quantifier (AKA ?)

 

RegexUngreedy.png 

 

This will take the first part that matches the string, before your comma

Meteor

Thank you, That worked.

However in some rows I have values without repetition,So instead of Age, Age, Age I have only Age in some rows

So in this case if I apply (.*?), it replaces Age with [Null], how to avoid this? as I need the value Age

Alteryx
Alteryx

I am sure you could handle it in the Regex, but rather than racking my brains, I would use a formula tool after and use:

 

IIF(ISNULL([NewField]),[OriginalField],[NewField])

That will then replace the null value with the value from the original field.

 

The reason it doesnt work is that it's looking for a comma, and not finding that in the field with only one value.

Meteor

That of course works! Thanks

Just was curious how to achieve that thru Regex, will explore, Thanks

Is there anyway that I can pull all characters after a number?  So basically any field that starts with a number I want that entire field, including the numbers, in a new column? 

Dear @girisri,

 

Hope you are well.

 

Have you tried to parse the last occurrence of Age instead of the first one?

(\d+)$

 

Explanation:

() --> marked group (what you want to pull out)

\d --> any digit

+ --> one or more repetition

$ --> end of line

 

In my tests worked fine, please see the picture below.

regex.png

 

Hope it can help you.

Looking forward to hearing from you.

 

Cheers!

Antonio Alves

Dear @dcappellino,

 

Hope you are well.

 

I couldn't understand very well your question, sorry.

It is possible to share an example of your line to me have a try?

 

Cheers,

Antonio Alves

Dear @vfranklin,

 

Hope you are well.

 

Have you tried this:

(<.+>)(.+)$

 

Explanation:

()() --> you need to parse to groups as you said

(<.+>) --> this group try to parse everything between < and >

(.+)$ --> the second group try to parse everything after the first group until the end of the line

 

In your given example it worked fine.

regex2.png

 

I hope that this example could help you.

Cheers!

 

Antonio Alves

Hi @AntonioGonzales,

 

Basically I'm trying to split items in one column, into two based on if they are account numbers vs account names, but can't seem to figure it out.  So any cell that starts with a number, I want to move the entire cell into a new column. I tried text to column, but when I use that the delimiter is not transferred to the next cell(i.e. the 4th cell would transfer over as "0010:Cash" instead of "10010: Cash" when I use 1 as the delimiter).    I think the RegEx tool is my best bet, but cant figure out the exact string that would perform this task.

 

Capture.PNG

 

I hope this is more clear.

 

Thanks!

Dear @dcappellino,

 

Thanks for share more details, now I guess that I got it.

 

In my tests the regex 

^(\d+.*)$

worked well.

 

Version 1)

regex3.png

 

Let me explain this regex in more details.

 

^ --> this indicate start the line

() --> we need just 1 marked group (account)

\d+ --> 1 number or more

.* --> any caracters

$ --> end of the line

 

So, if I mount it in parts:

^ --> start of my line

^()$ --> everything in my line

^(\d+)$ --> Line just with numbers

^(\d+.*)$ --> Line starting with numbers followed by anything

 

I did a second version where I remove this numbers from the first column as the text-to-column normally do.

regex4.png

The idea is the same but in this case, I had to use the formula tool with regex_match and regex_replace.

Hope this can help you.

 

Let me know if you would like to have my workflow or more explanation.

 

Cheers.

Antonio Alves

 

Alteryx
Alteryx

Hi @dcappellino

 

You can use the Regex Match function and then filter for the values to be true.

 

If you use "\d.+" This means there needs to be a number and then at least one other thing.

 

NumberText.pngData.png

 

Edit: not sure why, but the comment above didn't load when I replied, so it's now covered twice!

Atom

Hi Alex,

 

This was an excellent article. As a newbie, This really helped in understanding the usage of Regex so neatly. 

I am working on an exercise to sort out mapping data for addresses in various formats The sample is as below

 

Addresslinetypeunitlocstreet
L572,31A LIBERTY GROVEL57231ALIBERTY GROVE
12 BAYER STREET 12 BAYER STREET
3/350 TORRENS RD 3350TORRENS RD
202/12-14 THE SHORES 20212-14THE SHORES
Flat 6, 77 Kensington Roadflat677Kensington Road
CNR TAYLORS RDUN45 HUB DR  45 CNR TAYLORS RDUN HUB DR 
SALFORD RET ESTATE31A LYN ST  31A SALFORD RET ESTATE LYN ST
SALFORD RETMT ESTATE U 3 / 1A LYN STU31ASALFORD RETMT ESTATE  LYN ST
UNIT 2/164 HUB DRIVE UNIT2164HUB DRIVE 
40 MOORFIELD MEWS, LOT 3 LOT403MOORFIELD MEWS

 

each of these address have a unit number and/or a Location number. some of the location number could be in ranges (12-14) The address to parse is in the first column and the result I am looking out is in other columns. I would really appreciate if you could help me with some hints or solution.

 

 

Regards,

Ali

Alteryx
Alteryx

Hi @mdkhusru,

 

I'm glad you enjoyed the article. Your example was much tougher than I originally expected! It definitely touched on a lot of different concepts. I've attached a solution I came up with. I hope you'll find it helpful.

 

 regex 6Feb2019.PNG

 

 

Best,

Alex

 

Here is the workflow xml if you want to open it. Just copy this to a text editor and save it as a yxmd file.

 

Spoiler
<?xml version="1.0"?>
<AlteryxDocument yxmdVer="11.0">
  <Nodes>
    <Node ToolID="4">
      <GuiSettings Plugin="AlteryxBasePluginsGui.BrowseV2.BrowseV2">
        <Position x="1074" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <TempFile>C:\Users\AKOSZY~1\AppData\Local\Temp\Engine_9208_3243a9519d4042838d360d78cb4d7d17_\Engine_21724_535649afd86349a68f5168ed62000f64_.yxdb</TempFile>
          <TempFileDataProfiling />
          <Layout>
            <ViewMode>Single</ViewMode>
            <ViewSize value="100" />
            <View1>
              <DefaultTab>Profile</DefaultTab>
              <Hints>
                <Table />
              </Hints>
            </View1>
            <View2 />
          </Layout>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxBrowseV2" />
    </Node>
    <Node ToolID="6">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="174" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>Addressline</Field>
          <RegExExpression value="[^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?" />
          <CaseInsensitve value="True" />
          <Method>ParseComplex</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="False" />
            <RootName />
            <NumFields value="3" />
            <ErrorHandling>Warn</ErrorHandling>
          </ParseSimple>
          <ParseComplex>
            <Field field="unit" type="V_WString" size="254" />
            <Field field="loc" type="V_WString" size="254" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Parse:
[^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?</AnnotationText>
          <DefaultAnnotationText>Parse:
[^\d]*(\d+\w?)[^\d]+(\d+\w?...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="7">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="330" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>Addressline</Field>
          <RegExExpression value="((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]" />
          <CaseInsensitve value="True" />
          <Method>ParseComplex</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="False" />
            <RootName />
            <NumFields value="3" />
            <ErrorHandling>Warn</ErrorHandling>
          </ParseSimple>
          <ParseComplex>
            <Field field="type" type="V_WString" size="254" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Parse:
((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]</AnnotationText>
          <DefaultAnnotationText>Parse:
((?:LOT)|(?:UNIT)|(?:FLAT)|...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="8">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="642" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>parseme</Field>
          <RegExExpression value="(?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+" />
          <CaseInsensitve value="True" />
          <Method>ParseSimple</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="True" />
          </ParseSimple>
          <ParseComplex>
            <Field field="No Marked Groups Found" type="No Marked Groups Found" size="No Marked Groups Found" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Tokenize:
(?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+</AnnotationText>
          <DefaultAnnotationText>Tokenize:
(?:\d+\w)|(?:\w\d+)|[^,\s\d...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="9">
      <GuiSettings Plugin="AlteryxBasePluginsGui.Formula.Formula">
        <Position x="450" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <FormulaFields>
            <FormulaField expression="[Addressline]" field="parseme" size="1073741823" type="V_WString" />
          </FormulaFields>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText><![CDATA[parseme = [Addressline]
]]></DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFormula" />
    </Node>
    <Node ToolID="10">
      <GuiSettings Plugin="AlteryxBasePluginsGui.Filter.Filter">
        <Position x="762" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Expression>[type]!=[parseme] &amp;&amp; !REGEX_Match([parseme],'^.*?\d.*?')</Expression>
          <Mode>Custom</Mode>
          <Simple>
            <Operator>=</Operator>
            <Field>
            </Field>
            <Operands>
              <IgnoreTimeInDateTime>True</IgnoreTimeInDateTime>
              <DateType>fixed</DateType>
              <PeriodDate>2019-02-06 12:17:48</PeriodDate>
              <PeriodType>
              </PeriodType>
              <PeriodCount>0</PeriodCount>
              <Operand>
              </Operand>
              <StartDate>2019-02-06 12:17:48</StartDate>
              <EndDate>2019-02-06 12:17:48</EndDate>
            </Operands>
          </Simple>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText>[type]!=[parseme] &amp;&amp; !REGEX_Match([parseme],'^.*?\d.*?')</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFilter" />
    </Node>
    <Node ToolID="11">
      <GuiSettings Plugin="AlteryxSpatialPluginsGui.Summarize.Summarize">
        <Position x="870" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <SummarizeFields>
            <SummarizeField field="RecordID" action="GroupBy" rename="RecordID" />
            <SummarizeField field="Addressline" action="First" rename="Addressline" />
            <SummarizeField field="type" action="First" rename="type" />
            <SummarizeField field="unit" action="First" rename="unit" />
            <SummarizeField field="loc" action="First" rename="loc" />
            <SummarizeField field="parseme" action="Concat" rename="street">
              <Concat_Start />
              <Separator>\s</Separator>
              <Concat_End />
            </SummarizeField>
          </SummarizeFields>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxSpatialPluginsEngine.dll" EngineDllEntryPoint="AlteryxSummarize" />
    </Node>
    <Node ToolID="12">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RecordID.RecordID">
        <Position x="546" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <FieldName>RecordID</FieldName>
          <StartValue>1</StartValue>
          <FieldType>Int32</FieldType>
          <FieldSize>6</FieldSize>
          <Position>0</Position>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRecordID" />
    </Node>
    <Node ToolID="14">
      <GuiSettings Plugin="AlteryxBasePluginsGui.TextInput.TextInput">
        <Position x="54" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <NumRows value="10" />
          <Fields>
            <Field name="Addressline" />
          </Fields>
          <Data>
            <r>
              <c>L572,31A LIBERTY GROVE</c>
            </r>
            <r>
              <c>12 BAYER STREET</c>
            </r>
            <r>
              <c>3/350 TORRENS RD</c>
            </r>
            <r>
              <c>202/12-14 THE SHORES</c>
            </r>
            <r>
              <c>Flat 6, 77 Kensington Road</c>
            </r>
            <r>
              <c><![CDATA[CNR TAYLORS RDUN45 HUB DR ]]></c>
            </r>
            <r>
              <c><![CDATA[SALFORD RET ESTATE31A LYN ST ]]></c>
            </r>
            <r>
              <c>SALFORD RETMT ESTATE U 3 / 1A LYN ST</c>
            </r>
            <r>
              <c><![CDATA[UNIT 2/164 HUB DRIVE ]]></c>
            </r>
            <r>
              <c><![CDATA[40 MOORFIELD MEWS, LOT 3 ]]></c>
            </r>
          </Data>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxTextInput" />
    </Node>
    <Node ToolID="15">
      <GuiSettings>
        <Position x="978" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Value name="List Box (11)">"RecordID","Addressline","type","unit","loc","street"</Value>
          <Value name="Check Box (84)">True</Value>
          <Value name="Check Box (117)">False</Value>
          <Value name="Check Box (15)">True</Value>
          <Value name="Check Box (109)">False</Value>
          <Value name="Check Box (122)">False</Value>
          <Value name="Check Box (53)">False</Value>
          <Value name="Check Box (58)">False</Value>
          <Value name="Check Box (70)">False</Value>
          <Value name="Check Box (77)">True</Value>
          <Value name="Drop Down (81)">upper</Value>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
        <Dependencies>
          <Implicit />
        </Dependencies>
      </Properties>
      <EngineSettings Macro="Cleanse.yxmc" />
    </Node>
    <Node ToolID="16">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="66" y="30" width="192" height="132" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"unit" and "loc" are tough, because the order they appear in matters and they can be anywhere in the address string. Pull these out first.

Main concept:
A lot of things are optional, so you need to be clever with "?"</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
    <Node ToolID="17">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="294" y="30" width="360" height="144" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"type" is also tough because it could easily look like a street name.

For this, the best bet is probably to look for exact matches. This expression looks for the words "FLAT", "UNIT","LOT" and their abbreviations.

Main concept:
The OR operator | is used to parse out these exact words, and the unmarked group helps break apart the matches. Order matters here - you want to pull out "LOT" before checking for "L".</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
    <Node ToolID="19">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="690" y="30" width="444" height="252" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"street" is really just everything else, so the trick is to scoop that stuff up without grabbing everything you've already parsed out.

One way to do this is to split out all the words into their own rows and then concatenating them back together.

To do this, I used the Tokenize option and matched on the things I want to pull out using the OR operator | along with unmarked groups. The expressions are looking for either:
1) a number followed by a single letter
2) a single letter followed by a number
3) or anything else that is between a comma, space, slash, or dash

I used a REGEX_Match function in the Filter tool to take out anything with numbers in it. I also removed any of the "type" matches that resulted from the Tokenize.

Then, I combined all the individual words using the Concatenate option in the Summarize tool, using "\s" as the separator.</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
  </Nodes>
  <Connections>
    <Connection>
      <Origin ToolID="6" Connection="Output" />
      <Destination ToolID="7" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="7" Connection="Output" />
      <Destination ToolID="9" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="8" Connection="Output" />
      <Destination ToolID="10" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="9" Connection="Output" />
      <Destination ToolID="12" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="10" Connection="True" />
      <Destination ToolID="11" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="11" Connection="Output" />
      <Destination ToolID="15" Connection="Input2" />
    </Connection>
    <Connection>
      <Origin ToolID="12" Connection="Output" />
      <Destination ToolID="8" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="14" Connection="Output" />
      <Destination ToolID="6" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="15" Connection="Output26" />
      <Destination ToolID="4" Connection="Input" />
    </Connection>
  </Connections>
  <Properties>
    <Memory default="True" />
    <GlobalRecordLimit value="0" />
    <TempFiles default="True" />
    <Annotation on="True" includeToolName="False" />
    <ConvErrorLimit value="10" />
    <ConvErrorLimit_Stop value="False" />
    <CancelOnError value="False" />
    <DisableBrowse value="False" />
    <EnablePerformanceProfiling value="False" />
    <DisableAllOutput value="False" />
    <ShowAllMacroMessages value="False" />
    <ShowConnectionStatusIsOn value="True" />
    <ShowConnectionStatusOnlyWhenRunning value="True" />
    <ZoomLevel value="0" />
    <LayoutType>Horizontal</LayoutType>
    <MetaInfo>
      <NameIsFileName value="True" />
      <Name>regex 6Feb2019</Name>
      <Description />
      <RootToolName />
      <ToolVersion />
      <ToolInDb value="False" />
      <CategoryName />
      <SearchTags />
      <Author />
      <Company />
      <Copyright />
      <DescriptionLink actual="" displayed="" />
      <Example>
        <Description />
        <File />
      </Example>
    </MetaInfo>
    <Events>
      <Enabled value="True" />
    </Events>
  </Properties>
</AlteryxDocument>

 

 

Atom

Hi Alex,

 

Thanks a lot. That was wonderful. This is exactly what I was looking out. 

 

Regards,

 

Ali Khusru

 

 

 

Asteroid

Soooo handy! Thank you!!

On your extra credit point (though someone's bound to have said this already...): it has to be that when you don't include the end-of-line ($) alternative in the unmarked group, the third address, which isn't followed by a comma, won't match the regex formula. Only the first two addresses, followed by commas, will be tokenised.

Alteryx
Alteryx

Glad you enjoyed the article @HuwHallam!

You almost have the answer to the extra credit! While you're correct that it won't match, the reason you give isn't quite right.. I'll give you a hint, it has something to do with how the proceeding marked group ends...

 

(It may be easier to understand if you download the attached workflow and change the RegEx tool (#25) around to see the result.)

Asteroid

Oh I see - I think I misread the question... Yes, if the formula is (.+?),? then the second ? means that the marked group doesn't have to be followed by a comma (it can be followed by anything or nothing). Without this restriction, the first 'greedy' ? causes the regex to return a result as soon as it can satisfy .+ (any character, one or more times). Hence you get a series of single character outputs.

 

Thanks for pushing me in the right direction! Smiley Happy