on 11-16-2016 03:53 PM - edited on 04-30-2024 08:12 AM by olacey
This article is part of the Tool Mastery Series, a compilation of Knowledge Base contributions to introduce diverse working examples for Designer Tools. Here we’ll delve into uses of the RegEx Tool on our way to mastering the Alteryx Designer:
The RegEx tool is kind of like the Swiss Army Knife of parsing in Alteryx; there are a whole lot of ways you can use it to do things faster or more effectively, but even if you just use the blade it's still immensely useful. Sometimes that's all you need, but if you do take the time to figure outhow to use a few other tools in that knife, you'll start to see that there isn't much you can't do with it.
Before and after using the RegEx tool.
RegEx is an abbreviation of Regular Expression,and you can essentially think of it as anotherlanguage. It uses symbols just like any other language, but in regular expressions these symbols are used to match sequences of letters, numbers, or characters in text. It's a language that's all aboutrecognizing patterns.
Humans are really good at this sort of thing - let's say I gave you this block of text:
3345 Michelson Drive, Suite 400,Irvine, CA 92612
12303 Airport Way, Suite 250, Broomfield, CO 80021
Two North Riverside Plaza, Suite 1430, Chicago, IL 60606
You'd have no problem telling me these are addresses, and what part is a street number or a city name. But a computer would just see a block of text, and it wouldn't care to check if it was an address or not. RegEx is one way we can 'recognize' useful data in text. Let's 'translate' this to a RegEx version:
3345 | ^\d+ | The ^signifies the beginning of a line in RegEx, so it's good practice to include it with your initial pattern. Here, our pattern is\d which means 'any numerical character' (0-9). The+signifies that we want to match theprevious expression oneor more times. Since the first part of the address is a street number, this allows us to have a number of any length. |
Michelson Drive | [^\d]+ | To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is bywhat we don't expectusing[^...]. This grouping notation matches 'any character not listed here'. What we end up with is matchone or more times any character that isnot a number. |
Suite 400 | .* | The next part of the address is a suite number, which may or may not be present, and could potentially take on various naming conventions. In order to define a flexible expression to match anything we see there, we can use a . to match 'any character'. The* then signifies that we can match any characterzero or more times. |
Irvine | [^\d]+ | As before, this just means 'any character that is not a number'. |
CA | \u{2} | To match the state we can make use of \uto signify 'any uppercaseletter'. Since we're expecting this to always be a two-letter sequence, we can also specify the length of the match by using{...} after our expression, or 'match any two uppercase letters'. |
92612 | \d{5}$ | The zip code will likewise come in as 5 digits, so we can do something similar to say 'match five number characters'. Then, we can tack on $to signify that we're expecting this to be the end of the current line. |
There are quite a few symbols used to build RegEx expressions, but Alteryx provides a nice little 'cheat-sheet' right in the tool for you.
You don't really need much more than this to get rolling with using RegEx, and much like in Alteryx, you'll find that there are many many different ways you can create an expression to match a pattern.So don't sweat the details too much, and don't be afraid to spend some time with the good old fashioned "guess-and-test" method of learning.
Don't worry, I'm classically trained.
For more complete guides on RegEx, you can also turn to the Boost-Extended Format String Syntax Guide, as well as the RegEx Perl Syntax Guide.
Alteryx has structured RegEx functionality into four methods: Match, Parse, Replace & Tokenize. Our help documentation for these methods is very good as well.
TheMatchmethod simply checks to see whether a string can be described by the given regular expression, and gives you aTrueorFalse . Let's use the expression elements in the table above to match our addresses. Youcan create one long expression by just putting them together and including spaces\s and commas, wherever theyappear.
The first two addresses match just fine using this expression, butthe third address strays from what we expected to see and fails. Due to an obscure rule about buildings that share a name with their address, the street number is spelled out and our regular expression is unable to match it.
The key to writing a good RegEx is foreseeingthese exceptions in your data, and accounting for them within the expression. In order to match the 'Two' in this address, let'sput in another expression to check if and only ifthis one fails to match. Most addresses will start with numerical characters, but if they don't then this expression will check for a word instead. Here's how it looks:
This is a lot simpler than it looks, and really just uses two additional RegEx symbols. Thepipe symbolImeans 'or'. For example,a|b would just check ifamatches, and if it doesn't, ifbmatches. The second symbol is (?:...), also known as anunmarked group. This is just a way for us to group these things together for theoroperation.
In summary, the first group here (?:^\d+)is just doing the same thing as before, but when it fails the expression tries to match the second group(?:^\w+). This lets us match the word version of our address above without an issue.
For more onMatch:
Extra Credit:There are many ways to structure RegEx; comment below with a better alternative for(?:^\d+)|(?:^\w+). Why is it better?
Matching is nice in that you can use it for validation, but theParsemethod is really when RegEx comes into its own, allowing you toextractuseful information out of a block of text. The RegEx tool makes this easy for us - all we need is to place parentheses (...)around each thing we want to pull out. These are calledmarked groups - the counterpart to ourunmarked groups above (?:...).
As you type in the parentheses, you'll see these pop into a Select tool style Output Fieldswindow, which allows you to rename your fields and change their datatypes.
For more onParse:
As you can see, theParsemethod is really just an extension ofMatchthat allows us to pull stuff out of text and place it neatly in a new column. Well, what happens when we extend this concept and ask ourselves, 'How can I put stuff back in?'That's where theReplace method comes in.
WithReplace, we can parse components of our string, replace them, and even rearrange them. We can do thisby specifyingmarked groupsto tell the RegEx toolwhat to replace and where, in a language anybody can understand...
The dapper sloth is absolutely right of course, we can use dollar signs$ along with numbers to specify each exact marked group. For example, $1, $2, and $3refer to thefirst, second,andthirdmarked groups, respectively. So for our list of addresses, if we just wanted to parse out a list of city-states, we cantype inthe expression$4, $5.
Note we were able to add in our own little comma there, and a space, just by typing in that text box. TheReplacemethod is very flexible, and you can also use it from the Formula tool withthe function Regex_Replace.
For more on Replace:
RegEx is pretty nifty when things are given to us in a nice, neat list, but what happens if all of these addresses are just unceremoniously dumped out by some API call or database as a single block of text?
3345 Michelson Drive, Suite 400, Irvine, CA 92612,12303 Airport Way, Suite 250, Broomfield, CO 80021,Two North Riverside Plaza, Suite 1430, Chicago, IL 60606
Well, this isn't very useful - it's all comma-delimited, but we have no easy way of knowing when one address ends and another begins. What we need here is theTokenize method, which will take that chunk of text and split it into columns or rows, much like the Text to Columnstool. There are two important differences that set RegEx Tokenize apart:
This may sound topsy-turvy, but it actually gives you a lot more flexibility in what/how you split your data. To illustrate, let's split our address text blurb up into multiple rows, using the comma as our delimiter.
Since we have to match everything that we want, we need to use the expression (.+?)(?:,|$). Let's break this down:
Phew - that's certainly a lot. Feel free to take this opportunity to stretch, go for a walk, or meditate.
Obviously, the Text to Columnstool can split on a comma way easier than the above, but the flexibility ofTokenize comes into the fore when we try to do something a tad more useful.For instance, we can use it to split that block of address information into the original three addresses.
In this case, we are just using our last matching expression for the zip code \d{5}to mark out where each match ends. As above, we know that each line will either end with a comma or the end of line, and we could use the(?:,|$)here to split these successfully. In the example above, I chose to showcase the?ability to matchzero or onetime to replace this - so we can split on an 'optional comma' after the marked groupmatch.
For more onTokenize:
Extra Credit:The ,?won't work in the previous case, splitting by comma alone - (.+?),?instead of(.+?)(?:,|$). Why?
Comment with an explanation below for eternal glory and bragging rights.
By now, you should have expert-level proficiency with the RegExTool! If you can think of a use case we left out, feel free to use the comments section below! Consider yourself a Tool Master already? Let us know at community@alteryx.com if you’d like your creative tool uses to be featured in the Tool Mastery Series.
Stay tuned with our latest posts everyTool Tuesdayby followingAlteryxon Twitter! If you want to master all the Designer tools, considersubscribingfor email notifications.
Additional Information
Click on the corresponding language link below to access this article in another language -
Alex,
Thanks for this excellent article. As time goes by and I develop more and more sophisticated workflows, my usage of regex continues to grow. After reading this article, I took a look at a recent bit of work and found a whole series of regex tools I strung together to do a bunch of things. I just wanted to say things for producing this well-documented piece of work.
Ken
Very informative. How would it work if you had an address like 123 12th St NE Apt# 201 #201? I need to get rid of the second #201, remove the # after the Apt, still let the 123 12th St NE Apt 201 go through to the output.
Still new to regex, any help would be appreciated.
Thanks,
Brad
@brad_j_crep one idea is that you can use the parse Output Method and use this expression to remove the #201 (.*)\s#. You can then use a Formula tool and use the replacechar expression to remove the other # that will be left.
Thanks for reading, @brad_j_crep. I think @DanM has a nice solution there; I'll just add another here that uses the replace method...
Sometimes when building these out the question is really 'how often will this work?' - in this case I'm not sure if you'll see a dataset with this odd repeated suite name. If this is peppered in the data, it may be best to just use a Formula tool or RegEx tool upstream of your main parsing step that replaces any # in the dataset and cleans unnecessary stuff up.
Alex
Hi, I tweak around a bit to understand what every bit does. I've opened the build in RegEx example. In the 3rd example (In "Parse" mode, partial matches will be considered, however only the first match will be returned.) when I put in some square brackets like: -([\d{4}]) it returns in the zip4 ,column 4,6,1,3 etc.
The explanation about sq brackets states:
To match the street, we have to allow our expression to pick up multiple words of characters, including any number of spaces, since streets will often be longer than one word. One way to match this is by what we don't expect using [^...]. This grouping notation matches 'any character not listed here'. What we end up with is match one or more times any character that is not a number.
I expected all to be [null]
Can anyone explain to me what happened here?
Thanks,
John
Hi @JohnHG -
This is a failure on my part to differentiate the difference between:
In the first, we're matching any character that is there, and in the second any character that isn't. Take a look at this link for a more eloquent explanation:
Anybody know how you can do named groups in Alteryx? For example you could have two date formats:
13-Nov-2017 and Nov 15, 2017.
Can you use something like (\d+)-(\w+)-(\d+)|(\w+)\s(\d+),\s(\d+)
but assign the first group (the 13) and the fifth group (the 15) to the same output field day?
Maybe something like (?P<day>\d+)-(\w+)-(\d+)|(\w+)\s(?P<day>\d+),\s(\d+) ... except these named groups don't work in Alteryx.
@PhilipMannering at this time the REGEX tool does not do grouping. The tool looks for specific patterns, so if the string does not match the entire pattern you will not get a return of data. I would suggest using two REGEX tools and then using the Formula tool to combine the data back using an IF statement to fill in the Nulls from the other fields.
I am trying to parse a string of text:
<a href="/123abc"target="_self">I want this string
I want to parse everthing prior to and including > into one column and everything after > into a second column. How do I do that?
Lets say I want to parse Age,Age,Age to reflect only Age i.e parsing it out from where the first delimiter ',' starts, how do I achieve this?
Thank you, That worked.
However in some rows I have values without repetition,So instead of Age, Age, Age I have only Age in some rows
So in this case if I apply (.*?), it replaces Age with [Null], how to avoid this? as I need the value Age
I am sure you could handle it in the Regex, but rather than racking my brains, I would use a formula tool after and use:
IIF(ISNULL([NewField]),[OriginalField],[NewField])
That will then replace the null value with the value from the original field.
The reason it doesnt work is that it's looking for a comma, and not finding that in the field with only one value.
That of course works! Thanks
Just was curious how to achieve that thru Regex, will explore, Thanks
Is there anyway that I can pull all characters after a number? So basically any field that starts with a number I want that entire field, including the numbers, in a new column?
Dear @girisri,
Hope you are well.
Have you tried to parse the last occurrence of Age instead of the first one?
(\d+)$
Explanation:
() --> marked group (what you want to pull out)
\d --> any digit
+ --> one or more repetition
$ --> end of line
In my tests worked fine, please see the picture below.
Hope it can help you.
Looking forward to hearing from you.
Cheers!
Antonio Alves
Dear @dcappellino,
Hope you are well.
I couldn't understand very well your question, sorry.
It is possible to share an example of your line to me have a try?
Cheers,
Antonio Alves
Dear @vfranklin,
Hope you are well.
Have you tried this:
(<.+>)(.+)$
Explanation:
()() --> you need to parse to groups as you said
(<.+>) --> this group try to parse everything between < and >
(.+)$ --> the second group try to parse everything after the first group until the end of the line
In your given example it worked fine.
I hope that this example could help you.
Cheers!
Antonio Alves
Hi @AntonioGonzales,
Basically I'm trying to split items in one column, into two based on if they are account numbers vs account names, but can't seem to figure it out. So any cell that starts with a number, I want to move the entire cell into a new column. I tried text to column, but when I use that the delimiter is not transferred to the next cell(i.e. the 4th cell would transfer over as "0010:Cash" instead of "10010: Cash" when I use 1 as the delimiter). I think the RegEx tool is my best bet, but cant figure out the exact string that would perform this task.
I hope this is more clear.
Thanks!
Dear @dcappellino,
Thanks for share more details, now I guess that I got it.
In my tests the regex
^(\d+.*)$
worked well.
Version 1)
Let me explain this regex in more details.
^ --> this indicate start the line
() --> we need just 1 marked group (account)
\d+ --> 1 number or more
.* --> any caracters
$ --> end of the line
So, if I mount it in parts:
^ --> start of my line
^()$ --> everything in my line
^(\d+)$ --> Line just with numbers
^(\d+.*)$ --> Line starting with numbers followed by anything
I did a second version where I remove this numbers from the first column as the text-to-column normally do.
The idea is the same but in this case, I had to use the formula tool with regex_match and regex_replace.
Hope this can help you.
Let me know if you would like to have my workflow or more explanation.
Cheers.
Antonio Alves
Hi @dcappellino
You can use the Regex Match function and then filter for the values to be true.
If you use "\d.+" This means there needs to be a number and then at least one other thing.
Edit: not sure why, but the comment above didn't load when I replied, so it's now covered twice!
Hi Alex,
This was an excellent article. As a newbie, This really helped in understanding the usage of Regex so neatly.
I am working on an exercise to sort out mapping data for addresses in various formats The sample is as below
Addressline | type | unit | loc | street |
L572,31A LIBERTY GROVE | L | 572 | 31A | LIBERTY GROVE |
12 BAYER STREET | 12 | BAYER STREET | ||
3/350 TORRENS RD | 3 | 350 | TORRENS RD | |
202/12-14 THE SHORES | 202 | 12-14 | THE SHORES | |
Flat 6, 77 Kensington Road | flat | 6 | 77 | Kensington Road |
CNR TAYLORS RDUN45 HUB DR | 45 | CNR TAYLORS RDUN HUB DR | ||
SALFORD RET ESTATE31A LYN ST | 31A | SALFORD RET ESTATE LYN ST | ||
SALFORD RETMT ESTATE U 3 / 1A LYN ST | U | 3 | 1A | SALFORD RETMT ESTATE LYN ST |
UNIT 2/164 HUB DRIVE | UNIT | 2 | 164 | HUB DRIVE |
40 MOORFIELD MEWS, LOT 3 | LOT | 40 | 3 | MOORFIELD MEWS |
each of these address have a unit number and/or a Location number. some of the location number could be in ranges (12-14) The address to parse is in the first column and the result I am looking out is in other columns. I would really appreciate if you could help me with some hints or solution.
Regards,
Ali
Hi @mdkhusru,
I'm glad you enjoyed the article. Your example was much tougher than I originally expected! It definitely touched on a lot of different concepts. I've attached a solution I came up with. I hope you'll find it helpful.
Best,
Alex
Here is the workflow xml if you want to open it. Just copy this to a text editor and save it as a yxmd file.
<?xml version="1.0"?> <AlteryxDocument yxmdVer="11.0"> <Nodes> <Node ToolID="4"> <GuiSettings Plugin="AlteryxBasePluginsGui.BrowseV2.BrowseV2"> <Position x="1074" y="318" /> </GuiSettings> <Properties> <Configuration> <TempFile>C:\Users\AKOSZY~1\AppData\Local\Temp\Engine_9208_3243a9519d4042838d360d78cb4d7d17_\Engine_21724_535649afd86349a68f5168ed62000f64_.yxdb</TempFile> <TempFileDataProfiling /> <Layout> <ViewMode>Single</ViewMode> <ViewSize value="100" /> <View1> <DefaultTab>Profile</DefaultTab> <Hints> <Table /> </Hints> </View1> <View2 /> </Layout> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxBrowseV2" /> </Node> <Node ToolID="6"> <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx"> <Position x="174" y="174" /> </GuiSettings> <Properties> <Configuration> <Field>Addressline</Field> <RegExExpression value="[^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?" /> <CaseInsensitve value="True" /> <Method>ParseComplex</Method> <Replace expression=""> <CopyUnmatched value="True" /> </Replace> <ParseSimple> <SplitToRows value="False" /> <RootName /> <NumFields value="3" /> <ErrorHandling>Warn</ErrorHandling> </ParseSimple> <ParseComplex> <Field field="unit" type="V_WString" size="254" /> <Field field="loc" type="V_WString" size="254" /> </ParseComplex> <Match> <Field>Addressline_Matched</Field> <ErrorUnmatched value="False" /> </Match> </Configuration> <Annotation DisplayMode="0"> <Name /> <AnnotationText>Parse: [^\d]*(\d+\w?)[^\d]+(\d+\w?(?:-\d+\w?)?)?</AnnotationText> <DefaultAnnotationText>Parse: [^\d]*(\d+\w?)[^\d]+(\d+\w?...</DefaultAnnotationText> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" /> </Node> <Node ToolID="7"> <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx"> <Position x="330" y="174" /> </GuiSettings> <Properties> <Configuration> <Field>Addressline</Field> <RegExExpression value="((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]" /> <CaseInsensitve value="True" /> <Method>ParseComplex</Method> <Replace expression=""> <CopyUnmatched value="True" /> </Replace> <ParseSimple> <SplitToRows value="False" /> <RootName /> <NumFields value="3" /> <ErrorHandling>Warn</ErrorHandling> </ParseSimple> <ParseComplex> <Field field="type" type="V_WString" size="254" /> </ParseComplex> <Match> <Field>Addressline_Matched</Field> <ErrorUnmatched value="False" /> </Match> </Configuration> <Annotation DisplayMode="0"> <Name /> <AnnotationText>Parse: ((?:LOT)|(?:UNIT)|(?:FLAT)|(?:L)|(?:U)|(?:F))[^A-Z]</AnnotationText> <DefaultAnnotationText>Parse: ((?:LOT)|(?:UNIT)|(?:FLAT)|...</DefaultAnnotationText> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" /> </Node> <Node ToolID="8"> <GuiSettings Plugin="AlteryxBasePluginsGui.RegEx.RegEx"> <Position x="642" y="330" /> </GuiSettings> <Properties> <Configuration> <Field>parseme</Field> <RegExExpression value="(?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+" /> <CaseInsensitve value="True" /> <Method>ParseSimple</Method> <Replace expression=""> <CopyUnmatched value="True" /> </Replace> <ParseSimple> <SplitToRows value="True" /> </ParseSimple> <ParseComplex> <Field field="No Marked Groups Found" type="No Marked Groups Found" size="No Marked Groups Found" /> </ParseComplex> <Match> <Field>Addressline_Matched</Field> <ErrorUnmatched value="False" /> </Match> </Configuration> <Annotation DisplayMode="0"> <Name /> <AnnotationText>Tokenize: (?:\d+\w)|(?:\w\d+)|[^,\s\d/-]+</AnnotationText> <DefaultAnnotationText>Tokenize: (?:\d+\w)|(?:\w\d+)|[^,\s\d...</DefaultAnnotationText> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRegEx" /> </Node> <Node ToolID="9"> <GuiSettings Plugin="AlteryxBasePluginsGui.Formula.Formula"> <Position x="450" y="330" /> </GuiSettings> <Properties> <Configuration> <FormulaFields> <FormulaField expression="[Addressline]" field="parseme" size="1073741823" type="V_WString" /> </FormulaFields> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText><![CDATA[parseme = [Addressline] ]]></DefaultAnnotationText> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFormula" /> </Node> <Node ToolID="10"> <GuiSettings Plugin="AlteryxBasePluginsGui.Filter.Filter"> <Position x="762" y="330" /> </GuiSettings> <Properties> <Configuration> <Expression>[type]!=[parseme] && !REGEX_Match([parseme],'^.*?\d.*?')</Expression> <Mode>Custom</Mode> <Simple> <Operator>=</Operator> <Field> </Field> <Operands> <IgnoreTimeInDateTime>True</IgnoreTimeInDateTime> <DateType>fixed</DateType> <PeriodDate>2019-02-06 12:17:48</PeriodDate> <PeriodType> </PeriodType> <PeriodCount>0</PeriodCount> <Operand> </Operand> <StartDate>2019-02-06 12:17:48</StartDate> <EndDate>2019-02-06 12:17:48</EndDate> </Operands> </Simple> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText>[type]!=[parseme] && !REGEX_Match([parseme],'^.*?\d.*?')</DefaultAnnotationText> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxFilter" /> </Node> <Node ToolID="11"> <GuiSettings Plugin="AlteryxSpatialPluginsGui.Summarize.Summarize"> <Position x="870" y="318" /> </GuiSettings> <Properties> <Configuration> <SummarizeFields> <SummarizeField field="RecordID" action="GroupBy" rename="RecordID" /> <SummarizeField field="Addressline" action="First" rename="Addressline" /> <SummarizeField field="type" action="First" rename="type" /> <SummarizeField field="unit" action="First" rename="unit" /> <SummarizeField field="loc" action="First" rename="loc" /> <SummarizeField field="parseme" action="Concat" rename="street"> <Concat_Start /> <Separator>\s</Separator> <Concat_End /> </SummarizeField> </SummarizeFields> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxSpatialPluginsEngine.dll" EngineDllEntryPoint="AlteryxSummarize" /> </Node> <Node ToolID="12"> <GuiSettings Plugin="AlteryxBasePluginsGui.RecordID.RecordID"> <Position x="546" y="330" /> </GuiSettings> <Properties> <Configuration> <FieldName>RecordID</FieldName> <StartValue>1</StartValue> <FieldType>Int32</FieldType> <FieldSize>6</FieldSize> <Position>0</Position> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxRecordID" /> </Node> <Node ToolID="14"> <GuiSettings Plugin="AlteryxBasePluginsGui.TextInput.TextInput"> <Position x="54" y="174" /> </GuiSettings> <Properties> <Configuration> <NumRows value="10" /> <Fields> <Field name="Addressline" /> </Fields> <Data> <r> <c>L572,31A LIBERTY GROVE</c> </r> <r> <c>12 BAYER STREET</c> </r> <r> <c>3/350 TORRENS RD</c> </r> <r> <c>202/12-14 THE SHORES</c> </r> <r> <c>Flat 6, 77 Kensington Road</c> </r> <r> <c><![CDATA[CNR TAYLORS RDUN45 HUB DR ]]></c> </r> <r> <c><![CDATA[SALFORD RET ESTATE31A LYN ST ]]></c> </r> <r> <c>SALFORD RETMT ESTATE U 3 / 1A LYN ST</c> </r> <r> <c><![CDATA[UNIT 2/164 HUB DRIVE ]]></c> </r> <r> <c><![CDATA[40 MOORFIELD MEWS, LOT 3 ]]></c> </r> </Data> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> </Properties> <EngineSettings EngineDll="AlteryxBasePluginsEngine.dll" EngineDllEntryPoint="AlteryxTextInput" /> </Node> <Node ToolID="15"> <GuiSettings> <Position x="978" y="318" /> </GuiSettings> <Properties> <Configuration> <Value name="List Box (11)">"RecordID","Addressline","type","unit","loc","street"</Value> <Value name="Check Box (84)">True</Value> <Value name="Check Box (117)">False</Value> <Value name="Check Box (15)">True</Value> <Value name="Check Box (109)">False</Value> <Value name="Check Box (122)">False</Value> <Value name="Check Box (53)">False</Value> <Value name="Check Box (58)">False</Value> <Value name="Check Box (70)">False</Value> <Value name="Check Box (77)">True</Value> <Value name="Drop Down (81)">upper</Value> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> <Dependencies> <Implicit /> </Dependencies> </Properties> <EngineSettings Macro="Cleanse.yxmc" /> </Node> <Node ToolID="16"> <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox"> <Position x="66" y="30" width="192" height="132" /> </GuiSettings> <Properties> <Configuration> <Text>"unit" and "loc" are tough, because the order they appear in matters and they can be anywhere in the address string. Pull these out first. Main concept: A lot of things are optional, so you need to be clever with "?"</Text> <Font name="Arial" size="8.25" style="0" /> <TextColor name="Black" /> <FillColor name="White" /> <Shape shape="0" /> <Justification Justification="0" /> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> </Properties> </Node> <Node ToolID="17"> <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox"> <Position x="294" y="30" width="360" height="144" /> </GuiSettings> <Properties> <Configuration> <Text>"type" is also tough because it could easily look like a street name. For this, the best bet is probably to look for exact matches. This expression looks for the words "FLAT", "UNIT","LOT" and their abbreviations. Main concept: The OR operator | is used to parse out these exact words, and the unmarked group helps break apart the matches. Order matters here - you want to pull out "LOT" before checking for "L".</Text> <Font name="Arial" size="8.25" style="0" /> <TextColor name="Black" /> <FillColor name="White" /> <Shape shape="0" /> <Justification Justification="0" /> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> </Properties> </Node> <Node ToolID="19"> <GuiSettings Plugin="AlteryxGuiToolkit.TextBox.TextBox"> <Position x="690" y="30" width="444" height="252" /> </GuiSettings> <Properties> <Configuration> <Text>"street" is really just everything else, so the trick is to scoop that stuff up without grabbing everything you've already parsed out. One way to do this is to split out all the words into their own rows and then concatenating them back together. To do this, I used the Tokenize option and matched on the things I want to pull out using the OR operator | along with unmarked groups. The expressions are looking for either: 1) a number followed by a single letter 2) a single letter followed by a number 3) or anything else that is between a comma, space, slash, or dash I used a REGEX_Match function in the Filter tool to take out anything with numbers in it. I also removed any of the "type" matches that resulted from the Tokenize. Then, I combined all the individual words using the Concatenate option in the Summarize tool, using "\s" as the separator.</Text> <Font name="Arial" size="8.25" style="0" /> <TextColor name="Black" /> <FillColor name="White" /> <Shape shape="0" /> <Justification Justification="0" /> </Configuration> <Annotation DisplayMode="0"> <Name /> <DefaultAnnotationText /> <Left value="False" /> </Annotation> </Properties> </Node> </Nodes> <Connections> <Connection> <Origin ToolID="6" Connection="Output" /> <Destination ToolID="7" Connection="Input" /> </Connection> <Connection> <Origin ToolID="7" Connection="Output" /> <Destination ToolID="9" Connection="Input" /> </Connection> <Connection> <Origin ToolID="8" Connection="Output" /> <Destination ToolID="10" Connection="Input" /> </Connection> <Connection> <Origin ToolID="9" Connection="Output" /> <Destination ToolID="12" Connection="Input" /> </Connection> <Connection> <Origin ToolID="10" Connection="True" /> <Destination ToolID="11" Connection="Input" /> </Connection> <Connection> <Origin ToolID="11" Connection="Output" /> <Destination ToolID="15" Connection="Input2" /> </Connection> <Connection> <Origin ToolID="12" Connection="Output" /> <Destination ToolID="8" Connection="Input" /> </Connection> <Connection> <Origin ToolID="14" Connection="Output" /> <Destination ToolID="6" Connection="Input" /> </Connection> <Connection> <Origin ToolID="15" Connection="Output26" /> <Destination ToolID="4" Connection="Input" /> </Connection> </Connections> <Properties> <Memory default="True" /> <GlobalRecordLimit value="0" /> <TempFiles default="True" /> <Annotation on="True" includeToolName="False" /> <ConvErrorLimit value="10" /> <ConvErrorLimit_Stop value="False" /> <CancelOnError value="False" /> <DisableBrowse value="False" /> <EnablePerformanceProfiling value="False" /> <DisableAllOutput value="False" /> <ShowAllMacroMessages value="False" /> <ShowConnectionStatusIsOn value="True" /> <ShowConnectionStatusOnlyWhenRunning value="True" /> <ZoomLevel value="0" /> <LayoutType>Horizontal</LayoutType> <MetaInfo> <NameIsFileName value="True" /> <Name>regex 6Feb2019</Name> <Description /> <RootToolName /> <ToolVersion /> <ToolInDb value="False" /> <CategoryName /> <SearchTags /> <Author /> <Company /> <Copyright /> <DescriptionLink actual="" displayed="" /> <Example> <Description /> <File /> </Example> </MetaInfo> <Events> <Enabled value="True" /> </Events> </Properties> </AlteryxDocument>
Hi Alex,
Thanks a lot. That was wonderful. This is exactly what I was looking out.
Regards,
Ali Khusru
Soooo handy! Thank you!!
On your extra credit point (though someone's bound to have said this already...): it has to be that when you don't include the end-of-line ($) alternative in the unmarked group, the third address, which isn't followed by a comma, won't match the regex formula. Only the first two addresses, followed by commas, will be tokenised.
Glad you enjoyed the article @HuwHallam!
You almost have the answer to the extra credit! While you're correct that it won't match, the reason you give isn't quite right.. I'll give you a hint, it has something to do with how the proceeding marked group ends...
(It may be easier to understand if you download the attached workflow and change the RegEx tool (#25) around to see the result.)
Oh I see - I think I misread the question... Yes, if the formula is (.+?),? then the second ? means that the marked group doesn't have to be followed by a comma (it can be followed by anything or nothing). Without this restriction, the first 'greedy' ? causes the regex to return a result as soon as it can satisfy .+ (any character, one or more times). Hence you get a series of single character outputs.
Thanks for pushing me in the right direction! :)
Hi! I am looking to use the Tokenize function in the Regex tool. I am trying to split text from 1 column into 2 columns, without a common delimiter. I figure that I can split at the word "at" but I am at a loss on how to pull the company name out. I have about 700 fields all with different information and character lengths but this common setup.
An example of the field I am trying to split into title and company currently looks like this:
Manager, Strategy at ABC Corp. Go to ABC Corp. account page1year 15 months
I want my output to be:
Title (Column 1): Manager, Strategy
Company (Column 2): ABC Corp.
Thank you!
@jhosmer002 This should give you a good start.
(.*?)(?:(?:\sat\s)|$)
(.*?) - Look for any number of single characters until you match what is next.
(?:this|that) - Use an OR inside an unmarked group to specify a match on the left or the right of the pipe |.
\sat\s - Matches " at ".
$ - Matches the end of the string.
Cheers,
Alex
Thank you! That worked! If I wanted to delete everything after (and including) the word Go, what would I need to add to the expression?
Manager, Strategy at ABC Corp. Go to ABC Corp. account page1year 15 months
hi , i am trying to parse the below but cant get it to work completely
7925 Kaumualii Hwy|Kekaha, HI 96752
((?:^\d+)|(?:^\w+))(.*?)(?:(?:\s\|\s)|\|)\s(\u{2})\s(\d{5}$)
any help of what i am doing wrong
Great article, super helpful!
I know I'm coming late to the party! But why not, here's a stab at the "Extra Credit" questions @AlexKo / @HuwHallam
I think only the first one hasn't been resolved in this thread yet, while the second one has been addressed by Huw.
I give my workings below so you can work through it the same way I did.
In the first example, I don't think the explanation "\u{2} will extract 2 uppercase letters for state" is correct.
Here is documentation from the Boost-Extended Format String Syntax Guide
u | Causes the next character to be outputted, to be output in upper case. |
\U | Causes all subsequent characters to be output in upper case, until a \E is found. |
Thanks for sharing! Really appreciate the contribution to our community!
Thanks for sharing! I'm new to Parse & this really helped me get started. I am stuck on a special characters issue.
I need to convert:
From:
{"city":"Aberdeen","country":"USA","geocodeAccuracy":null,"latitude":null,"longitude":null,"postalCode":"21001","state":"Maryland","street":"513 Heathland Trail"}
To:
Street | City | State | Zip |
513 Heatherland Trail | Aberdeen | MD | 21001 |
I was able to follow along with the example above if this was in a standard address format. However, all the special characters here are tripping me up. Help?!
Thanks!
What a fantastic article! The only thing I'd say is to change [^\d]+ to [^\n]+ to include any streets that may have a number (i.e. 16th Street) 😊
Happy Regex Year to yall!
Just an attempt for the alternate Regex formula for Address Strings -- ([^,]+),([^,]+),([^,]+),\s(\u{2})\s(\d{5})
(?:^\d+)|(?:^\w+) can be replaced with ([^,]+), however this new regex will accept any nonword characters.
This is really insightful. thanks for posting in verry easy words.