People say you can spend upwards of 60% of your time “preparing” your data rather than analyzing it. Alteryx Designer has great, user-friendly tools to try to cut into that estimate, but the tools are purposefully generic so that they can be applied to as many situations as possible. If you find yourself applying the same sequence of tools over and over to get your data clean, perhaps a macro is in order.
I set out to create a macro to perform some basic data cleansing tasks in a single, easy-to-use tool. The macro performs various tasks that fall into three overall categories:
- Cleanse: Alter data records based on various criteria (e.g. remove punctuation from phone number field)
- Parse: Extract certain data elements from a larger body of data (e.g. identify IP addresses from a semi-structured security log)
- Validate: Determine whether a data element conforms to a certain pattern (e.g. is the format of a social security number valid)
What it does…
I would guess that the General Cleanse section will be the more universally useful part of the tool. Here you can check a box to remove punctuation/letters/numbers/whitespace, to uppercase/lowercase, to trim whitespace/quotes from the beginning and end of a field, to compress multiple sequential whitespaces into a single whitespace, and to remove HTML tags. Expert Alteryx users will note that most of these operations can be performed easily using functions in the Formula tool. The advantage here is that you can perform multiple operations by simply clicking checkboxes without having to build a complicated expression.
The tool also cleanses phone numbers and dates by letting the user select from a range of input formats and outputting the cleansed field to a specific output format. Essentially you’re standardizing the phone number or date format here.
Use the parsing options when you have unstructured or semi-structured data and you’re looking to pull out certain data elements into separate rows. The Cleanse tool can parse email and IP addresses, phone numbers and dates. It also has a “Custom” parse option where it looks for a user-defined start and end string, and returns what it finds in between (think HTML tags).
Finally, the tool can validate email, IP, phone, date, and SSN fields, returning True or False. It validates both structure, and to a certain extent, content. For example, SSNs must be of the form nnn-nn-nnnn (structure), and no subgroup can contain all zeros (content).
If you want to use this macro (or make it better)…
If you find this tool useful, download below. If you frequently use other methods within Alteryx to get your data clean that you aren’t finding in the Cleanse tool, try editing the macro so that it contains all the means necessary to get your data cleansing done quickly and easily!
Note…
If you do end up perusing the guts of the macro, perhaps with the goal of augmenting it to fit your cleansing needs, you’ll note the liberal use of user-defined constants. Because I often use regular expressions more than once within the macro (I use the same expression to both parse and validate email addresses, for example) I wanted to keep all regular expressions in a central location, so that if I decided to change the expression, the change would be applied throughout the macro in all the different places that it is used. You can find the user constants in the Module Properties tab. To resolve a user constant in your workflow, wrap the full constant name in the % symbol (i.e. %User.validate-ssn%).