ALTERYX INSPIRE | Join us this May for for a multi-day virtual analytics + data science experience like no other! Register Now

Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
NeilR
Alteryx Community Team
Alteryx Community Team

Cleanse icon

 

People say you can spend upwards of 60% of your time “preparing” your data rather than analyzing it. Alteryx Designer has great, user-friendly tools to try to cut into that estimate, but the tools are purposefully generic so that they can be applied to as many situations as possible. If you find yourself applying the same sequence of tools over and over to get your data clean, perhaps a macro is in order.

 

I set out to create a macro to perform some basic data cleansing tasks in a single, easy-to-use tool. The macro performs various tasks that fall into three overall categories:

 

 

  1. Cleanse: Alter data records based on various criteria (e.g. remove punctuation from phone number field)
  2. Parse: Extract certain data elements from a larger body of data (e.g. identify IP addresses from a semi-structured security log)
  3. Validate: Determine whether a data element conforms to a certain pattern (e.g. is the format of a social security number valid) 

 

What it does…

I would guess that the General Cleanse section will be the more universally useful part of the tool. Here you can check a box to remove punctuation/letters/numbers/whitespace, to uppercase/lowercase, to trim whitespace/quotes from the beginning and end of a field, to compress multiple sequential whitespaces into a single whitespace, and to remove HTML tags. Expert Alteryx users will note that most of these operations can be performed easily using functions in the Formula tool. The advantage here is that you can perform multiple operations by simply clicking checkboxes without having to build a complicated expression.

     Cleanse General Options configuration

The tool also cleanses phone numbers and dates by letting the user select from a range of input formats and outputting the cleansed field to a specific output format. Essentially you’re standardizing the phone number or date format here.

 

Use the parsing options when you have unstructured or semi-structured data and you’re looking to pull out certain data elements into separate rows. The Cleanse tool can parse email and IP addresses, phone numbers and dates. It also has a “Custom” parse option where it looks for a user-defined start and end string, and returns what it finds in between (think HTML tags). 

 

Finally, the tool can validate email, IP, phone, date, and SSN fields, returning True or False. It validates both structure, and to a certain extent, content. For example, SSNs must be of the form nnn-nn-nnnn (structure), and no subgroup can contain all zeros (content).

 

If you want to use this macro (or make it better)…

If you find this tool useful, download below or head over to the Macro District where you’ll find instructions on how to install it so that it appears in the Preparation section of your Tool Palette. Then you can drag it into your canvas just like any other tool and cleanse your data in a single step. If you frequently use other methods within Alteryx to get your data clean that you aren’t finding in the Cleanse tool, try editing the macro so that it contains all the means necessary to get your data cleansing done quickly and easily!

 

Note…

If you do end up perusing the guts of the macro, perhaps with the goal of augmenting it to fit your cleansing needs, you’ll note the liberal use of user-defined constants. Because I often use regular expressions more than once within the macro (I use the same expression to both parse and validate email addresses, for example) I wanted to keep all regular expressions in a central location, so that if I decided to change the expression, the change would be applied throughout the macro in all the different places that it is used. You can find the user constants in the Module Properties tab. To resolve a user constant in your workflow, wrap the full constant name in the % symbol (i.e. %User.validate-ssn%).

     User Constants

Neil Ryan
Sr Program Manager, Community Content

Neil Ryan is the Sr Program Manager, Community Content, responsible for the content in the Alteryx Community. He held previous roles at Alteryx including Advanced Analytics Product Manager and Content Engineer, and had prior gigs doing fraud detection analytics consulting and creating actuarial pricing models. Neil's industry experience and technical skills are wide ranging and well suited to drive compelling content tailored for Community members to rank up in their careers.

Neil Ryan is the Sr Program Manager, Community Content, responsible for the content in the Alteryx Community. He held previous roles at Alteryx including Advanced Analytics Product Manager and Content Engineer, and had prior gigs doing fraud detection analytics consulting and creating actuarial pricing models. Neil's industry experience and technical skills are wide ranging and well suited to drive compelling content tailored for Community members to rank up in their careers.

Comments
DataGrrl
7 - Meteor

So this Macro is exactly what I needed, but I have one issue. For the parse it defaults to the field length of the Source field, but for the cleanse it defaults to 64,and I can't change it. If anyone can help me with this I would appreciate it.

 

*Answered*---Thanks to Neil for helping me figure out where to modify the Macro for the output field. Works like a charm.

GeoSinam
6 - Meteoroid

Very handy macro - much obliged.

NeilR
Alteryx Community Team
Alteryx Community Team

Here was my reply to DataGrrl...

 

There are multiple Formula tools throughout the macro that are creating length 64 strings. If you are attempting the Cleanse -> General -> Remove HTML Tags operation, I think the formula tools that you want to alter are #17 and #128, upping the size from 64 to something high enough to suit your needs. Let me know if that works. See screenshot below...

original.png

brooklynrose
5 - Atom

Hey I've been using this tool for cleansing phone numbers, but there are occasionally cases where it will filter out things that appear to be perfectly valid 10-digit phone numbers. Does anyone know why this would happen?

DataGrrl
7 - Meteor

I haven't used it for phone number, but having done my fair amount of phone number cleansing over the years, is it possible that they have included some of the validation rules for us numbers? Basically valid area code and exchanges?

NeilR
Alteryx Community Team
Alteryx Community Team

It uses variations of the following regex expression for phone numbers:

[2-9]\d{2}-[2-9]\d{2}-\d{4}

Which means the first digit in the area code and the exchange code (first and second groups of three numbers) can't start with 0 or 1. See here:

https://en.wikipedia.org/wiki/North_American_Numbering_Plan#Modern_plan

angelevan
5 - Atom

Very useful macro, can't wait to try it out. I thought this type of data cleansing work was only possible using more expensive tools, so glad to see there is a lower-cost option too. 

mattamidus
7 - Meteor

Hello,

 

Im trying to change the punctuation part of the tool to only remove ".  Ive tried just about every combination on the below logic but cant seem to not get malformed if statement when ever i use the " character.  Ive highlighted my change in bold.  Any help would be most appreciated.

 

 

IF [#1]=='True' THEN
"IF [_CurrentFieldName_] IN(" + [#2] + ") && !IsNull([_CurrentField_]) THEN
REGEX_REPLACE([_CurrentField_],'/"','')
ELSE [_CurrentField_]
ENDIF"
ELSE [Destination] ENDIF

NeilR
Alteryx Community Team
Alteryx Community Team

@mattamidus I was able to alter the tool to only remove the " character when the Cleanse -> General -> Remove punctuation option is configured by changing the Formula configuration of Action tool #74 from...

IF FindString([#1], 'A') >= 0 
THEN "REGEX_REPLACE("+[Destination]+",'[[:punct:]]','')" 
ELSE [Destination] 
ENDIF 

to...

IF FindString([#1], 'A') >= 0 
THEN "REGEX_REPLACE("+[Destination]+",'\x{0022}','')" 
ELSE [Destination] 
ENDIF 

...essentially using the Unicode representation of the " character so as not to have to deal with mismatched quotes.

mattamidus
7 - Meteor

@NeilR That seemed to get rid of the error in the programming section.  However the tool doesn't seem to be removing character x{0022} when I run the workflow.  

NeilR
Alteryx Community Team
Alteryx Community Team

@mattamidus it seems to work for me (see below). To be clear, this is the " character. There are other characters that looks similar.

 

mattamidus
7 - Meteor

Thanks, that video helped.  All 400 columns cleared of character x{0022}.

brendafos
10 - Fireball
I'm using this macro to clean up SSN. THANK YOU! It would be even better if you went a bit deeper into SSN clean up and removed some of the impossible numbers, like in this link https://www.searchbug.com/peoplefinder/invalid-social-security-numbers.aspx This is a really great macro tool.
AshishD
7 - Meteor

Thank you!

lees
5 - Atom

Is this macro still available? I went to the Macro District web page but the Cleanse macro is not listed. 

NeilR
Alteryx Community Team
Alteryx Community Team

@lees the macro is embedded in the file attached to the post: Cleanse example.yxzp

juandiegocuervo
6 - Meteoroid

How can I use this macro dynamically?

NeilR
Alteryx Community Team
Alteryx Community Team

Hi @juandiegocuervo, can you explain a little more what you mean by dynamically? What are you trying to do?

juandiegocuervo
6 - Meteoroid

Hi @NeilR ,I am trying to use The Cleanse Macro on multiple dynamically changing fields, like Multi-Field Formula Tool.

NeilR
Alteryx Community Team
Alteryx Community Team

@juandiegocuervo I can't think of a straightforward way to do this without rewriting the macro.

MarqueeCrew
19 - Altair
19 - Altair

@juandiegocuervo ,

 

I'd open the macro (right-click and OPEN) and find the formula that you're interested in.  Once you find the formula, you can modify as needed and embed in your formula tool.

 

Cheers,

 

Mark

juandiegocuervo
6 - Meteoroid

Thanks friends! I owe you one!