Cleanse Macro
Given a choice between the delivered macro and the CReW macro, I’ll choose the CReW macro for both speed and functionality. Wikipedia says, “Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the data and then replacing, modifying, or deleting the dirty or coarse data.” If Alteryx were to convert the macro to a true tool, here is my feature request list:
Performance:
- AMP compatible – Fast!
- Faster than the CReW macro for deleting empty fields/rows
- Resolve time it takes to load the tool (current macro versions are slow), html is faster.
Feature Enhancement:
- Allow selection of fields based on data type
- Include incoming/outgoing SELECT functionality
- Allow for PREFIX functionality (like multi-field formula), but NOT default
- Read incoming metadata to provide color coding of fields to indicate where potential problems exist (e.g. NULL, Whitespace) – part of browse everywhere currently
- Allow for Nulls to convert to 0/blank or 0/blank to convert to Null
- When removing punctuation, provide for exceptions (e.g. Numeric set of negative, comma and period).
- Include HTML tag removal
- Support internationalization (character sets)
Going the extra mile:
- Display or opt for output, cleanup metrics. How dirty was my data? Potentially, allow for ERROR to stop workflow if garbage is detected.
- Optional: Detect outliers in numeric data. I’ve got an outlier detection macro that we can review, but while you are passing all of the data for numeric values, explaining or tagging outliers would be useful. Could be a box-whisker on numeric values maybe?
- Make outlier actionable
- Identify in data (new field indicator)
- Remove
- Modify/Impute
- Test/Preview against metadata: (pre-run), see what the incoming/outgoing results would be on *all of the metadata before I run the workflow.
- camelCase: https://en.wikipedia.org/wiki/Camel_case
- Identify/Replace unknown values (e.g. N/A, Not Applicable, #) with Null() or other?
- Identify/Remove duplicate values within a cell
- See also: https://en.wikipedia.org/wiki/Data_cleansing
- Option to point to a “personal” dictionary for spelling or validation
- Provide “smart” annotation on tool