Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Tool Mastery

Explore a diverse compilation of articles that take an in-depth look at Designer tools.
Become a Tool Master

Learn how you can share your expertise with the Community

LEARN MORE

Tool Mastery | RegEx

AlexKo
Alteryx Alumni (Retired)
Created

This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the RegEx Tool on our way to mastering the Alteryx Designer:

RegEx.png

The RegEx tool is kind of like the Swiss Army Knife of parsing in Alteryx; there are a whole lot of ways you can use it to do things faster or more effectively, but even if you just use the blade it's still immensely useful. Sometimes that's all you need, but if you do take the time to figure outhow to use a few other tools in that knife, you'll start to see that there isn't much you can't do with it.

 

mr bean.pngmacgyver.jpg

Before and after using the RegEx tool.

 

RegEx: What is it good for?

 

RegEx is an abbreviation of Regular Expression,and you can essentially think of it as anotherlanguage. It uses symbols just like any other language, but in regular expressions these symbols are used to match sequences of letters, numbers, or characters in text. It's a language that's all aboutrecognizing patterns.

 

Humans are really good at this sort of thing - let's say I gave you this block of text:

 

3345 Michelson Drive, Suite 400,Irvine, CA 92612

12303 Airport Way, Suite 250, Broomfield, CO 80021

Two North Riverside Plaza, Suite 1430, Chicago, IL 60606

 

You'd have no problem telling me these are addresses, and what part is a street number or a city name. But a computer would just see a block of text, and it wouldn't care to check if it was an address or not. RegEx is one way we can 'recognize' useful data in text. Let's 'translate' this to a RegEx version:

 

3345^\d+The ^signifies the beginning of a line in RegEx, so it's good practice to include it with your initial pattern. Here, our pattern is\d which means 'any numerical character' (0-9). The+signifies that we want to match theprevious expression oneor more times. Since the first part of the address is a street number, this allows us to have a number of any length.
Michelson Drive[^\d]+To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is bywhat we don't expectusing[^...]. This grouping notation matches 'any character not listed here'. What we end up with is matchone or more times any character that isnot a number.
Suite 400.*The next part of the address is a suite number, which may or may not be present, and could potentially take on various naming conventions. In order to define a flexible expression to match anything we see there, we can use a . to match 'any character'. The* then signifies that we can match any characterzero or more times.
Irvine[^\d]+As before, this just means 'any character that is not a number'.
CA\u{2}To match the state we can make use of \uto signify 'any uppercaseletter'. Since we're expecting this to always be a two-letter sequence, we can also specify the length of the match by using{...} after our expression, or 'match any two uppercase letters'.
92612\d{5}$The zip code will likewise come in as 5 digits, so we can do something similar to say 'match five number characters'. Then, we can tack on $to signify that we're expecting this to be the end of the current line.

 

cheat.png

Tools of the trade

 

There are quite a few symbols used to build RegEx expressions, but Alteryx provides a nice little 'cheat-sheet' right in the tool for you.

 

You don't really need much more than this to get rolling with using RegEx, and much like in Alteryx, you'll find that there are many many different ways you can create an expression to match a pattern.So don't sweat the details too much, and don't be afraid to spend some time with the good old fashioned "guess-and-test" method of learning.

 

trial and error.jpg

Don't worry, I'm classically trained.

 

For more complete guides on RegEx, you can also turn to the Boost-Extended Format String Syntax Guide, as well as the RegEx Perl Syntax Guide.

 

Alteryx has structured RegEx functionality into four methods: Match, Parse, Replace & Tokenize. Our help documentation for these methods is very good as well.

 

Match

 

TheMatchmethod simply checks to see whether a string can be described by the given regular expression, and gives you aTrueorFalse. Let's use the expression elements in the table above to match our addresses. Youcan create one long expression by just putting them together and including spaces\s and commas, wherever theyappear.

 

regex1.PNG

 

The first two addresses match just fine using this expression, butthe third address strays from what we expected to see and fails. Due to anobscure ruleabout buildings that share a name with their address, the street number is spelled out and our regular expression is unable to match it.

 

The key to writing a good RegEx is foreseeingthese exceptions in your data, and accounting for them within the expression. In order to match the 'Two' in this address, let'sput in another expression to check if and only ifthis one fails to match. Most addresses will start with numerical characters, but if they don't then this expression will check for a word instead. Here's how it looks:

 

(?:^\d+)|(?:^\w+)

 

This is a lot simpler than it looks, and really just uses two additional RegEx symbols. Thepipe symbolImeans 'or'. For example,a|b would just check ifamatches, and if it doesn't, ifbmatches. The second symbol is (?:...), also known as anunmarked group. This is just a way for us to group these things together for theoroperation.

 

 

regex2.PNG

 

In summary, the first group here (?:^\d+)is just doing the same thing as before, but when it fails the expression tries to match the second group(?:^\w+). This lets us match the word version of our address above without an issue.

 

For more onMatch:

Extra Credit:There are many ways to structure RegEx; comment below with a better alternative for(?:^\d+)|(?:^\w+). Why is it better?

 

Parse

 

Matching is nice in that you can use it for validation, but theParsemethod is really when RegEx comes into its own, allowing you toextractuseful information out of a block of text. The RegEx tool makes this easy for us - all we need is to place parentheses (...)around each thing we want to pull out. These are calledmarked groups - the counterpart to ourunmarked groups above (?:...).

 

regex3.PNG

 

As you type in the parentheses, you'll see these pop into a Select tool style Output Fieldswindow, which allows you to rename your fields and change their datatypes.

 

For more onParse:

 

Replace

 

As you can see, theParsemethod is really just an extension ofMatchthat allows us to pull stuff out of text and place it neatly in a new column. Well, what happens when we extend this concept and ask ourselves, 'How can I put stuff back in?'That's where theReplace method comes in.

 

WithReplace, we can parse components of our string, replace them, and even rearrange them. We can do thisby specifyingmarked groupsto tell the RegEx toolwhat to replace and where, in a language anybody can understand...

 

dollah bills

 

The dapper sloth is absolutely right of course, we can use dollar signs$ along with numbers to specify each exact marked group. For example, $1, $2, and $3refer to thefirst, second,andthirdmarked groups, respectively. So for our list of addresses, if we just wanted to parse out a list of city-states, we cantype inthe expression$4, $5.

 

regex4.PNG

 

Note we were able to add in our own little comma there, and a space, just by typing in that text box. TheReplacemethod is very flexible, and you can also use it from the Formula tool withthe function Regex_Replace.

 

For more on Replace:

 

Tokenize

 

RegEx is pretty nifty when things are given to us in a nice, neat list, but what happens if all of these addresses are just unceremoniously dumped out by some API call or database as a single block of text?

 

3345 Michelson Drive, Suite 400, Irvine, CA 92612,12303 Airport Way, Suite 250, Broomfield, CO 80021,Two North Riverside Plaza, Suite 1430, Chicago, IL 60606

 

Well, this isn't very useful - it's all comma-delimited, but we have no easy way of knowing when one address ends and another begins. What we need here is theTokenize method, which will take that chunk of text and split it into columns or rows, much like the Text to Columnstool. There are two important differences that set RegEx Tokenize apart:

  1. Instead of matchingon what you don't want (like a comma), you match on what you do want (everything else).
  2. You have the option of choosing what's split out and what is ignored by using amarked group.

This may sound topsy-turvy, but it actually gives you a lot more flexibility in what/how you split your data. To illustrate, let's split our address text blurb up into multiple rows, using the comma as our delimiter.

 

regex5.PNG

 

Since we have to match everything that we want, we need to use the expression (.+?)(?:,|$). Let's break this down:

  • .+meansany one character.matched one or more times +
  • ?is how we tell this match to be lazyrather thangreedy. This is a really useful distinctionthat may be a bit difficult to understand at first, but for the purposes of this crash course let's just focus on what the? means:match whatever is before zero times (not at all) or exactly once.
  • The plus sign+is actually the opposite: it's agreedysymbol, so the previousone character .will try to matchone or more times. So what does it mean when we tell something greedy to be lazy? Well, it actually modifies just how greedy it can be by forcing it to look at the next match. So what.+?really means is:
    'match one character one or more times until you can match whatever comes next'.
  • What comes next is anunmarked group(?:,|$)that essentially functions as a STOP sign for the previous (.+?). Since it's not in the marked group, it won't be present in the result. It's just in anunmarked group so that we can tell it to look foreithera comma ,orthe end-of-line $at the end of the text blurb.

Phew - that's certainly a lot. Feel free to take this opportunity to stretch, go for a walk, or meditate.

 

line noise.gif

 

Obviously, the Text to Columnstool can split on a comma way easier than the above, but the flexibility ofTokenize comes into the fore when we try to do something a tad more useful.For instance, we can use it to split that block of address information into the original three addresses.

 

regex6.PNG

 

In this case, we are just using our last matching expression for the zip code \d{5}to mark out where each match ends. As above, we know that each line will either end with a comma or the end of line, and we could use the(?:,|$)here to split these successfully. In the example above, I chose to showcase the?ability to matchzero or onetime to replace this - so we can split on an 'optional comma' after the marked groupmatch.

 

For more onTokenize:

Extra Credit:The ,?won't work in the previous case, splitting by comma alone - (.+?),?instead of(.+?)(?:,|$). Why?

Comment with an explanation below for eternal glory and bragging rights.

 

regex.png

 

By now, you should have expert-level proficiency with the RegExTool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.

 

Stay tuned with our latest posts everyTool Tuesdayby followingAlteryxon Twitter! If you want to master all the Designer tools, considersubscribingfor email notifications.

Additional Information

Click on the corresponding language link below to access this article in another language -

Attachments
Comments
Ken_Black
9 - Comet
9 - Comet

 Alex,

 

Thanks for this excellent article. As time goes by and I develop more and more sophisticated workflows, my usage of regex continues to grow. After reading this article, I took a look at a recent bit of work and found a whole series of regex tools I strung together to do a bunch of things. I just wanted to say things for producing this well-documented piece of work.

 

Ken

 

regex.JPG

AlexKo
Alteryx Alumni (Retired)

Happy to hear you found this useful, @Ken_Black! Happy RegEx'ing!

 

Alex

brad_j_crep
8 - Asteroid

Very informative.  How would it work if you had an address like 123 12th St NE Apt# 201 #201?  I need to get rid of the second #201, remove the # after the Apt, still let the 123 12th St NE Apt 201 go through to the output. 

Still new to regex, any help would be appreciated.

 

Thanks,

Brad

DanM
Alteryx Community Team
Alteryx Community Team

@brad_j_crep one idea is that you can use the parse Output Method and use this expression to remove the #201 (.*)\s#. You can then use a Formula tool and use the replacechar expression to remove the other # that will be left.

AlexKo
Alteryx Alumni (Retired)

Thanks for reading, @brad_j_crep. I think @DanM has a nice solution there; I'll just add another here that uses the replace method...

 

address.PNG

 

Sometimes when building these out the question is really 'how often will this work?' - in this case I'm not sure if you'll see a dataset with this odd repeated suite name. If this is peppered in the data, it may be best to just use a Formula tool or RegEx tool upstream of your main parsing step that replaces any # in the dataset and cleans unnecessary stuff up.

 

Alex

JohnHG
6 - Meteoroid

Hi, I tweak around a bit to understand what every bit does. I've opened the build in RegEx example. In the 3rd example (In "Parse" mode, partial matches will be considered, however only the first match will be returned.) when I put in some square brackets like: -([\d{4}]) it returns in the zip4 ,column 4,6,1,3 etc. 

 

The explanation about sq brackets states: 

 

To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is by what we don't expect using [^...]. This grouping notation matches 'any character not listed here'. What we end up with is match one or more times any character that is not a number.

 

I expected all to be [null]

 

Can anyone explain to me what happened here?

Thanks,

 

John

AlexKo
Alteryx Alumni (Retired)

Hi @JohnHG -

 

This is a failure on my part to differentiate the difference between:

  • matching characters in a set using [...]
  • matching characters not in a set using [^...]

In the first, we're matching any character that is there, and in the second any character that isn't. Take a look at this link for a more eloquent explanation:

https://regexone.com/lesson/matching_characters

https://regexone.com/lesson/excluding_characters

PhilipMannering
16 - Nebula
16 - Nebula

Anybody know how you can do named groups in Alteryx? For example you could have two date formats:

13-Nov-2017 and Nov 15, 2017.

Can you use something like (\d+)-(\w+)-(\d+)|(\w+)\s(\d+),\s(\d+)

but assign the first group (the 13) and the fifth group (the 15) to the same output field day?

Maybe something like (?P<day>\d+)-(\w+)-(\d+)|(\w+)\s(?P<day>\d+),\s(\d+) ... except these named groups don't work in Alteryx.

DanM
Alteryx Community Team
Alteryx Community Team

@PhilipMannering at this time the REGEX tool does not do grouping. The tool looks for specific patterns, so if the string does not match the entire pattern you will not get a return of data. I would suggest using two REGEX tools and then using the Formula tool to combine the data back using an IF statement to fill in the Nulls from the other fields. 

vfranklin
6 - Meteoroid

I am trying to parse a string of text:

<a href="/123abc"target="_self">I want this string

I want to parse everthing prior to and including > into one column and everything after > into a second column.  How do I do that?

girisri
7 - Meteor

 Lets say I want to parse  Age,Age,Age  to reflect only Age i.e parsing it out from where the first delimiter ',' starts, how do I achieve this?

JoeS
Alteryx
Alteryx

@girisri - You can use the Regex Match and an lazy quantifier (AKA ?)

 

RegexUngreedy.png 

 

This will take the first part that matches the string, before your comma

girisri
7 - Meteor

Thank you, That worked.

However in some rows I have values without repetition,So instead of Age, Age, Age I have only Age in some rows

So in this case if I apply (.*?), it replaces Age with [Null], how to avoid this? as I need the value Age

JoeS
Alteryx
Alteryx

I am sure you could handle it in the Regex, but rather than racking my brains, I would use a formula tool after and use:

 

IIF(ISNULL([NewField]),[OriginalField],[NewField])

That will then replace the null value with the value from the original field.

 

The reason it doesnt work is that it's looking for a comma, and not finding that in the field with only one value.

girisri
7 - Meteor

That of course works! Thanks

Just was curious how to achieve that thru Regex, will explore, Thanks

dcappellino
5 - Atom

Is there anyway that I can pull all characters after a number?  So basically any field that starts with a number I want that entire field, including the numbers, in a new column? 

AntonioGonzales
8 - Asteroid

Dear @girisri,

 

Hope you are well.

 

Have you tried to parse the last occurrence of Age instead of the first one?

(\d+)$

 

Explanation:

() --> marked group (what you want to pull out)

\d --> any digit

+ --> one or more repetition

$ --> end of line

 

In my tests worked fine, please see the picture below.

regex.png

 

Hope it can help you.

Looking forward to hearing from you.

 

Cheers!

Antonio Alves

AntonioGonzales
8 - Asteroid

Dear @dcappellino,

 

Hope you are well.

 

I couldn't understand very well your question, sorry.

It is possible to share an example of your line to me have a try?

 

Cheers,

Antonio Alves

AntonioGonzales
8 - Asteroid

Dear @vfranklin,

 

Hope you are well.

 

Have you tried this:

(<.+>)(.+)$

 

Explanation:

()() --> you need to parse to groups as you said

(<.+>) --> this group try to parse everything between < and >

(.+)$ --> the second group try to parse everything after the first group until the end of the line

 

In your given example it worked fine.

regex2.png

 

I hope that this example could help you.

Cheers!

 

Antonio Alves

dcappellino
5 - Atom

Hi @AntonioGonzales,

 

Basically I'm trying to split items in one column, into two based on if they are account numbers vs account names, but can't seem to figure it out.  So any cell that starts with a number, I want to move the entire cell into a new column. I tried text to column, but when I use that the delimiter is not transferred to the next cell(i.e. the 4th cell would transfer over as "0010:Cash" instead of "10010: Cash" when I use 1 as the delimiter).    I think the RegEx tool is my best bet, but cant figure out the exact string that would perform this task.

 

Capture.PNG

 

I hope this is more clear.

 

Thanks!

AntonioGonzales
8 - Asteroid

Dear @dcappellino,

 

Thanks for share more details, now I guess that I got it.

 

In my tests the regex 

^(\d+.*)$

worked well.

 

Version 1)

regex3.png

 

Let me explain this regex in more details.

 

^ --> this indicate start the line

() --> we need just 1 marked group (account)

\d+ --> 1 number or more

.* --> any caracters

$ --> end of the line

 

So, if I mount it in parts:

^ --> start of my line

^()$ --> everything in my line

^(\d+)$ --> Line just with numbers

^(\d+.*)$ --> Line starting with numbers followed by anything

 

I did a second version where I remove this numbers from the first column as the text-to-column normally do.

regex4.png

The idea is the same but in this case, I had to use the formula tool with regex_match and regex_replace.

Hope this can help you.

 

Let me know if you would like to have my workflow or more explanation.

 

Cheers.

Antonio Alves

 

JoeS
Alteryx
Alteryx

Hi @dcappellino

 

You can use the Regex Match function and then filter for the values to be true.

 

If you use "\d.+" This means there needs to be a number and then at least one other thing.

 

NumberText.pngData.png

 

Edit: not sure why, but the comment above didn't load when I replied, so it's now covered twice!

mdkhusru
5 - Atom

Hi Alex,

 

This was an excellent article. As a newbie, This really helped in understanding the usage of Regex so neatly. 

I am working on an exercise to sort out mapping data for addresses in various formats The sample is as below

 

Addresslinetypeunitlocstreet
L572,31A LIBERTY GROVEL57231ALIBERTY GROVE
12 BAYER STREET 12 BAYER STREET
3/350 TORRENS RD 3350TORRENS RD
202/12-14 THE SHORES 20212-14THE SHORES
Flat 6, 77 Kensington Roadflat677Kensington Road
CNR TAYLORS RDUN45 HUB DR  45 CNR TAYLORS RDUN HUB DR 
SALFORD RET ESTATE31A LYN ST  31A SALFORD RET ESTATE LYN ST
SALFORD RETMT ESTATE U 3 / 1A LYN STU31ASALFORD RETMT ESTATE  LYN ST
UNIT 2/164 HUB DRIVE UNIT2164HUB DRIVE 
40 MOORFIELD MEWS, LOT 3 LOT403MOORFIELD MEWS

 

each of these address have a unit number and/or a Location number. some of the location number could be in ranges (12-14) The address to parse is in the first column and the result I am looking out is in other columns. I would really appreciate if you could help me with some hints or solution.

 

 

Regards,

Ali

AlexKo
Alteryx Alumni (Retired)

Hi @mdkhusru,

 

I'm glad you enjoyed the article. Your example was much tougher than I originally expected! It definitely touched on a lot of different concepts. I've attached a solution I came up with. I hope you'll find it helpful.

 

 regex 6Feb2019.PNG

 

 

Best,

Alex

 

Here is the workflow xml if you want to open it. Just copy this to a text editor and save it as a yxmd file.

 

Spoiler
<?xml version="1.0"?>
<AlteryxDocument yxmdVer="11.0">
  <Nodes>
    <Node ToolID="4">
      <GuiSettings Plugin="AlteryxBasePluginsGui.BrowseV2.BrowseV2">
        <Position x="1074" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <TempFile>C:\Users\AKOSZY~1\AppData\Local\Temp\Engine_9208_3243a9519d4042838d360d78cb4d7d17_\Engine_21724_535649afd86349a68f5168ed62000f64_.yxdb</TempFile>
          <TempFileDataProfiling />
          <Layout>
            <ViewMode>Single</ViewMode>
            <ViewSize value="100" />
            <View1>
              <DefaultTab>Profile</DefaultTab>
              <Hints>
                <Table />
              </Hints>
            </View1>
            <View2 />
          </Layout>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxBrowseV2" />
    </Node>
    <Node ToolID="6">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="174" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>Addressline</Field>
          <RegExExpression value="[^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?" />
          <CaseInsensitve value="True" />
          <Method>ParseComplex</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="False" />
            <RootName />
            <NumFields value="3" />
            <ErrorHandling>Warn</ErrorHandling>
          </ParseSimple>
          <ParseComplex>
            <Field field="unit" type="V_WString" size="254" />
            <Field field="loc" type="V_WString" size="254" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Parse:
[^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?</AnnotationText>
          <DefaultAnnotationText>Parse:
[^\d]*(\d+\w?)[^\d]+(\d+\w?...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="7">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="330" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>Addressline</Field>
          <RegExExpression value="((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]" />
          <CaseInsensitve value="True" />
          <Method>ParseComplex</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="False" />
            <RootName />
            <NumFields value="3" />
            <ErrorHandling>Warn</ErrorHandling>
          </ParseSimple>
          <ParseComplex>
            <Field field="type" type="V_WString" size="254" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Parse:
((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]</AnnotationText>
          <DefaultAnnotationText>Parse:
((?:LOT)|(?:UNIT)|(?:FLAT)|...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="8">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx">
        <Position x="642" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Field>parseme</Field>
          <RegExExpression value="(?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+" />
          <CaseInsensitve value="True" />
          <Method>ParseSimple</Method>
          <Replace expression="">
            <CopyUnmatched value="True" />
          </Replace>
          <ParseSimple>
            <SplitToRows value="True" />
          </ParseSimple>
          <ParseComplex>
            <Field field="No Marked Groups Found" type="No Marked Groups Found" size="No Marked Groups Found" />
          </ParseComplex>
          <Match>
            <Field>Addressline_Matched</Field>
            <ErrorUnmatched value="False" />
          </Match>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <AnnotationText>Tokenize:
(?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+</AnnotationText>
          <DefaultAnnotationText>Tokenize:
(?:\d+\w)|(?:\w\d+)|[^,\s\d...</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" />
    </Node>
    <Node ToolID="9">
      <GuiSettings Plugin="AlteryxBasePluginsGui.Formula.Formula">
        <Position x="450" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <FormulaFields>
            <FormulaField expression="[Addressline]" field="parseme" size="1073741823" type="V_WString" />
          </FormulaFields>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText><![CDATA[parseme = [Addressline]
]]></DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFormula" />
    </Node>
    <Node ToolID="10">
      <GuiSettings Plugin="AlteryxBasePluginsGui.Filter.Filter">
        <Position x="762" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Expression>[type]!=[parseme] &amp;&amp; !REGEX_Match([parseme],'^.*?\d.*?')</Expression>
          <Mode>Custom</Mode>
          <Simple>
            <Operator>=</Operator>
            <Field>
            </Field>
            <Operands>
              <IgnoreTimeInDateTime>True</IgnoreTimeInDateTime>
              <DateType>fixed</DateType>
              <PeriodDate>2019-02-06 12:17:48</PeriodDate>
              <PeriodType>
              </PeriodType>
              <PeriodCount>0</PeriodCount>
              <Operand>
              </Operand>
              <StartDate>2019-02-06 12:17:48</StartDate>
              <EndDate>2019-02-06 12:17:48</EndDate>
            </Operands>
          </Simple>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText>[type]!=[parseme] &amp;&amp; !REGEX_Match([parseme],'^.*?\d.*?')</DefaultAnnotationText>
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFilter" />
    </Node>
    <Node ToolID="11">
      <GuiSettings Plugin="AlteryxSpatialPluginsGui.Summarize.Summarize">
        <Position x="870" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <SummarizeFields>
            <SummarizeField field="RecordID" action="GroupBy" rename="RecordID" />
            <SummarizeField field="Addressline" action="First" rename="Addressline" />
            <SummarizeField field="type" action="First" rename="type" />
            <SummarizeField field="unit" action="First" rename="unit" />
            <SummarizeField field="loc" action="First" rename="loc" />
            <SummarizeField field="parseme" action="Concat" rename="street">
              <Concat_Start />
              <Separator>\s</Separator>
              <Concat_End />
            </SummarizeField>
          </SummarizeFields>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxSpatialPluginsEngine.dll" EngineDllEntryPoint="AlteryxSummarize" />
    </Node>
    <Node ToolID="12">
      <GuiSettings Plugin="AlteryxBasePluginsGui.RecordID.RecordID">
        <Position x="546" y="330" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <FieldName>RecordID</FieldName>
          <StartValue>1</StartValue>
          <FieldType>Int32</FieldType>
          <FieldSize>6</FieldSize>
          <Position>0</Position>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRecordID" />
    </Node>
    <Node ToolID="14">
      <GuiSettings Plugin="AlteryxBasePluginsGui.TextInput.TextInput">
        <Position x="54" y="174" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <NumRows value="10" />
          <Fields>
            <Field name="Addressline" />
          </Fields>
          <Data>
            <r>
              <c>L572,31A LIBERTY GROVE</c>
            </r>
            <r>
              <c>12 BAYER STREET</c>
            </r>
            <r>
              <c>3/350 TORRENS RD</c>
            </r>
            <r>
              <c>202/12-14 THE SHORES</c>
            </r>
            <r>
              <c>Flat 6, 77 Kensington Road</c>
            </r>
            <r>
              <c><![CDATA[CNR TAYLORS RDUN45 HUB DR ]]></c>
            </r>
            <r>
              <c><![CDATA[SALFORD RET ESTATE31A LYN ST ]]></c>
            </r>
            <r>
              <c>SALFORD RETMT ESTATE U 3 / 1A LYN ST</c>
            </r>
            <r>
              <c><![CDATA[UNIT 2/164 HUB DRIVE ]]></c>
            </r>
            <r>
              <c><![CDATA[40 MOORFIELD MEWS, LOT 3 ]]></c>
            </r>
          </Data>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
      <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxTextInput" />
    </Node>
    <Node ToolID="15">
      <GuiSettings>
        <Position x="978" y="318" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Value name="List Box (11)">"RecordID","Addressline","type","unit","loc","street"</Value>
          <Value name="Check Box (84)">True</Value>
          <Value name="Check Box (117)">False</Value>
          <Value name="Check Box (15)">True</Value>
          <Value name="Check Box (109)">False</Value>
          <Value name="Check Box (122)">False</Value>
          <Value name="Check Box (53)">False</Value>
          <Value name="Check Box (58)">False</Value>
          <Value name="Check Box (70)">False</Value>
          <Value name="Check Box (77)">True</Value>
          <Value name="Drop Down (81)">upper</Value>
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
        <Dependencies>
          <Implicit />
        </Dependencies>
      </Properties>
      <EngineSettings Macro="Cleanse.yxmc" />
    </Node>
    <Node ToolID="16">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="66" y="30" width="192" height="132" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"unit" and "loc" are tough, because the order they appear in matters and they can be anywhere in the address string. Pull these out first.

Main concept:
A lot of things are optional, so you need to be clever with "?"</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
    <Node ToolID="17">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="294" y="30" width="360" height="144" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"type" is also tough because it could easily look like a street name.

For this, the best bet is probably to look for exact matches. This expression looks for the words "FLAT", "UNIT","LOT" and their abbreviations.

Main concept:
The OR operator | is used to parse out these exact words, and the unmarked group helps break apart the matches. Order matters here - you want to pull out "LOT" before checking for "L".</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
    <Node ToolID="19">
      <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox">
        <Position x="690" y="30" width="444" height="252" />
      </GuiSettings>
      <Properties>
        <Configuration>
          <Text>"street" is really just everything else, so the trick is to scoop that stuff up without grabbing everything you've already parsed out.

One way to do this is to split out all the words into their own rows and then concatenating them back together.

To do this, I used the Tokenize option and matched on the things I want to pull out using the OR operator | along with unmarked groups. The expressions are looking for either:
1) a number followed by a single letter
2) a single letter followed by a number
3) or anything else that is between a comma, space, slash, or dash

I used a REGEX_Match function in the Filter tool to take out anything with numbers in it. I also removed any of the "type" matches that resulted from the Tokenize.

Then, I combined all the individual words using the Concatenate option in the Summarize tool, using "\s" as the separator.</Text>
          <Font name="Arial" size="8.25" style="0" />
          <TextColor name="Black" />
          <FillColor name="White" />
          <Shape shape="0" />
          <Justification Justification="0" />
        </Configuration>
        <Annotation DisplayMode="0">
          <Name />
          <DefaultAnnotationText />
          <Left value="False" />
        </Annotation>
      </Properties>
    </Node>
  </Nodes>
  <Connections>
    <Connection>
      <Origin ToolID="6" Connection="Output" />
      <Destination ToolID="7" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="7" Connection="Output" />
      <Destination ToolID="9" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="8" Connection="Output" />
      <Destination ToolID="10" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="9" Connection="Output" />
      <Destination ToolID="12" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="10" Connection="True" />
      <Destination ToolID="11" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="11" Connection="Output" />
      <Destination ToolID="15" Connection="Input2" />
    </Connection>
    <Connection>
      <Origin ToolID="12" Connection="Output" />
      <Destination ToolID="8" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="14" Connection="Output" />
      <Destination ToolID="6" Connection="Input" />
    </Connection>
    <Connection>
      <Origin ToolID="15" Connection="Output26" />
      <Destination ToolID="4" Connection="Input" />
    </Connection>
  </Connections>
  <Properties>
    <Memory default="True" />
    <GlobalRecordLimit value="0" />
    <TempFiles default="True" />
    <Annotation on="True" includeToolName="False" />
    <ConvErrorLimit value="10" />
    <ConvErrorLimit_Stop value="False" />
    <CancelOnError value="False" />
    <DisableBrowse value="False" />
    <EnablePerformanceProfiling value="False" />
    <DisableAllOutput value="False" />
    <ShowAllMacroMessages value="False" />
    <ShowConnectionStatusIsOn value="True" />
    <ShowConnectionStatusOnlyWhenRunning value="True" />
    <ZoomLevel value="0" />
    <LayoutType>Horizontal</LayoutType>
    <MetaInfo>
      <NameIsFileName value="True" />
      <Name>regex 6Feb2019</Name>
      <Description />
      <RootToolName />
      <ToolVersion />
      <ToolInDb value="False" />
      <CategoryName />
      <SearchTags />
      <Author />
      <Company />
      <Copyright />
      <DescriptionLink actual="" displayed="" />
      <Example>
        <Description />
        <File />
      </Example>
    </MetaInfo>
    <Events>
      <Enabled value="True" />
    </Events>
  </Properties>
</AlteryxDocument>

 

 

mdkhusru
5 - Atom

Hi Alex,

 

Thanks a lot. That was wonderful. This is exactly what I was looking out. 

 

Regards,

 

Ali Khusru

 

 

 

HuwHallam
8 - Asteroid

Soooo handy! Thank you!!

On your extra credit point (though someone's bound to have said this already...): it has to be that when you don't include the end-of-line ($) alternative in the unmarked group, the third address, which isn't followed by a comma, won't match the regex formula. Only the first two addresses, followed by commas, will be tokenised.

AlexKo
Alteryx Alumni (Retired)

Glad you enjoyed the article @HuwHallam!

You almost have the answer to the extra credit! While you're correct that it won't match, the reason you give isn't quite right.. I'll give you a hint, it has something to do with how the proceeding marked group ends...

 

(It may be easier to understand if you download the attached workflow and change the RegEx tool (#25) around to see the result.)

HuwHallam
8 - Asteroid

Oh I see - I think I misread the question... Yes, if the formula is (.+?),? then the second ? means that the marked group doesn't have to be followed by a comma (it can be followed by anything or nothing). Without this restriction, the first 'greedy' ? causes the regex to return a result as soon as it can satisfy .+ (any character, one or more times). Hence you get a series of single character outputs.

 

Thanks for pushing me in the right direction! :)

jhosmer002
7 - Meteor

Hi! I am looking to use the Tokenize function in the Regex tool. I am trying to split text from 1 column into 2 columns, without a common delimiter. I figure that I can split at the word "at" but I am at a loss on how to pull the company name out. I have about 700 fields all with different information and character lengths but this common setup. 

An example of the field I am trying to split into title and company currently looks like this:
Manager, Strategy at ABC Corp. Go to ABC Corp. account page1year 15 months


I want my output to be:
Title (Column 1): Manager, Strategy
Company (Column 2): ABC Corp.

 

Thank you!

AlexKo
Alteryx Alumni (Retired)

@jhosmer002 This should give you a good start.

 

(.*?)(?:(?:\sat\s)|$)

 

(.*?)    - Look for any number of single characters until you match what is next.

(?:this|that)     - Use an OR inside an unmarked group to specify a match on the left or the right of the pipe |.

\sat\s     - Matches " at ".

$     - Matches the end of the string.

 

Cheers,

Alex

jhosmer002
7 - Meteor

@AlexKo

Thank you! That worked! If I wanted to delete everything after (and including) the word Go, what would I need to add to the expression? 

Manager, Strategy at ABC Corp. Go to ABC Corp. account page1year 15 months

 

Anuj
8 - Asteroid

hi , i am trying to parse the below but cant get it to work completely 

 

7925 Kaumualii Hwy|Kekaha, HI 96752

 

((?:^\d+)|(?:^\w+))(.*?)(?:(?:\s\|\s)|\|)\s(\u{2})\s(\d{5}$)

 

any help of what i am doing  wrong 

CTubbs
6 - Meteoroid

Great article, super helpful!

izamryan
8 - Asteroid

I know I'm coming late to the party! But why not, here's a stab at the "Extra Credit" questions @AlexKo / @HuwHallam 

I think only the first one hasn't been resolved in this thread yet, while the second one has been addressed by Huw.

I give my workings below so you can work through it the same way I did.

 

 

Spoiler
Extra Credit #1

this expression 
(?:^\d+)|(?:^\w+)

Is saying: match at the beginning of the field, a DIGIT or a WORD.
Well, this is SORT OF the same as saying: match at the beginning of the field, a NON-WHITESPACE character, which in PCRE RegEx is

(?:^\S+)

but then I realised, if we want to have a mutually exclusive list of the token-space, you can think of the entire token-space as being comprised of ALL of the Digits, Words, Punctuation and Special Characters.

NON-WHITESPACE is a subset of that entire token-space: it includes Punctuation AND Special Characters. And matching on \S could produce unexpected results, for example if you have addresses that look like this:

#twentytwo Northumberland Drive, Apostrophe 55, Moon, MN 13145
!Variety Coffe Cup, Potraiture 247, Mars, MS 88888
.* Everywhere, Nowhere 123, Eternity, ET 12345

So to exclude the Punctuation and Special Characters, and we could just list out the range of "valid" characters we want to match on like so:
(?:^[0-9A-Za-z]+)

This makes it easier to read (you don't have to do the | and pipe in your head) and it literally lists out the set of the characters you want the RegEx to match on.

If it helps you visualise it, I contained everything here:
https://regex101.com/r/oSHN21/1 (the proposed solution which excludes punctuation)
https://regex101.com/r/SYuJgE/1 (the first solution but note that it matches punctuation so if you have a place that is "#twentyTwo it'll match)


Extra Credit #2:

These two expressions:

(.+?
),?

(.+?)(?:,|$)

the first one is problematic because:

https://regex101.com/r/fJ6C48/2
the first one reads: Match any character, between one and unlimited times in a Lazy way
you kind of get an infinite nibble: it's going to nibble away ant match every single Word, Digit, Non-whitespace, Whitespace EXCEPT for line terminators, one-at-a-time (lazy) until you get to the end ... total 295 matches, one for each character

https://regex101.com/r/9C2abE/1
the second one reads: Nibble the line and add each nibble to the group until you get to the Non-capturing group, which is defined as the FIRST comma or end-of-line character. Only the second one gives the expected result.

 

 

ChrisTX
15 - Aurora

In the first example, I don't think the explanation "\u{2} will extract 2 uppercase letters for state" is correct.

 

Here is documentation from the Boost-Extended Format String Syntax Guide

   https://www.boost.org/doc/libs/1_56_0/libs/regex/doc/html/boost_regex/format/boost_format_syntax.htm...

 

u

Causes the next character to be outputted, to be output in upper case.

\U

Causes all subsequent characters to be output in upper case, until a \E is found.

 

 

rafamsba
5 - Atom

Thanks for sharing! Really appreciate the contribution to our community!

ASpicer12
5 - Atom

Thanks for sharing! I'm new to Parse & this really helped me get started. I am stuck on a special characters issue. 

 

I need to convert: 

 

From: 

 

{"city":"Aberdeen","country":"USA","geocodeAccuracy":null,"latitude":null,"longitude":null,"postalCode":"21001","state":"Maryland","street":"513 Heathland Trail"}

 

To: 

 

StreetCityStateZip
513 Heatherland TrailAberdeenMD21001

 

I was able to follow along with the example above if this was in a standard address format. However, all the special characters here are tripping me up. Help?! 

 

Thanks! 

 

alexnajm
16 - Nebula
16 - Nebula

What a fantastic article! The only thing I'd say is to change [^\d]+ to [^\n]+ to include any streets that may have a number (i.e. 16th Street) 😊

ctrl_kudorje
8 - Asteroid

Happy Regex Year to yall!

rohityaman
8 - Asteroid

Just an attempt for the alternate Regex formula for Address Strings -- ([^,]+),([^,]+),([^,]+),\s(\u{2})\s(\d{5}) 

 

(?:^\d+)|(?:^\w+) can be replaced with ([^,]+), however this new regex will accept any nonword characters.

 

rohityaman_0-1686612747651.png

 

SachinG
8 - Asteroid

This is really insightful. thanks for posting in verry easy words.