Using Alteryx to Perform Data Dictionary Checks

Question

When creating reports for clients there can be financial penalties if they don’t meet the agreed standards, particularly in the financial sector. Alteryx can help in this scenario by making the checks easier and faster using the data dictionary, which the client typically provides or approves with the format of the data they expect to receive.

Overview Alteryx Solution

An Alteryx data dictionary macro can tackle this and will require two inputs: the first the data that is being tested, the second the Data Dictionary with the standards it will check the data against.

The Data Dictionary can be comprised of different fields to test the data. However in this scenario it is comprised of 6 columns. What is particularly important is if the data needs to match a certain format to be valid, then regex can greatly help. The Data format column has regex to be used in the checking process (e.g., CUSIP where the nine characters must be either numeric or alphanumeric depending on the position). Using regex in this way can enhance the data quality checking beyond just checking the length of a field or the data type.

Example of Data Dictionary

Closer look at the Macro

The Macro can perform a number of checks to the validity of your data. This macro provides a detailed list of all the anomalies. It also provides a summary report of the total errors, performs a check on the column names, and also generates a list of all the data that correctly meets the standards defined in the Data Dictionary.

Data Dictionary Macro Overview

The section that checks the data quality works by pivoting the data vertically and joining with the Data Dictionary. Once this is done the checks can be performed.

Field checks overview

For ease of understanding and updating, the checks have been made into individual formulas, so they can easily be adjusted or added to. The regex defined in the Data Dictionary is used in the Format check and uses the function Regex match, to test the contents of the Value field match the defined regex.

Data Quality Checks

The remainder of this sections then pivots the checks vertically, so they can be concatenated into one field for each anomaly. The rationale for this is that the output will be dynamic and have listed all the issues the checks have found and not generate duplicates of the data.

Output: Detailed anomalies identified

The output from this section provides a detailed list of the anomalies identified and provides the column, row and value identified that needs to check in the original data. This can be sent to the producer of the data and should help them easily find the issues that have been identified.

The check Field Names section checks if the Header names in the data match those from the Data Dictionary and also if they are in the correct position.

Field Names Checks

The output clearly highlights the field names that match the data dictionary and those that don’t match it.

Output: Field Name check

The Summary section counts all the anomalies and totals all the fields in the data, in order to calculate the percentage of errors across all of the data.

Summary Section

The summary report highlights the total percent errors across the entire dataset. This can be helpful to identify if a few records are causing the anomalies, or if a more significant problem is occurring across the data.

Output: Summary Report

The fourth output of the macro produces a list of all the correct data, which matches the Data Dictionary. Depending on your requirements you could choose to process this data while you waited for the anomalies to be checked and reviewed.

Output: Data that Matches Data Dictionary Standards

Conclusion

The benefits of creating a Data Dictionary macro in Alteryx is that it can easily identify any anomalies and ensure you remain compliant. The macro is repeatable and can be used to test on multiple versions of the same file and will consistently perform the same checks on the data being generated. Another benefit of creating a checking process like this in Alteryx is any changes that are requested can easily be modified and would not require changes in code, like if these checks were done outside of Alteryx using scripting tools.

KilianL · Answer

Great use case of data checks!

@vasumurugan5459, there is a trick to open workflows not supported by your version: right click the yxmd and edit in notepad++ or similar - simply change the version number in the xml to your version. But keep in mind that your version does not support any functionality of newer versions in case they were used in the workflow.

vasumurugan5459 · Answer

This one is excellent and very beneficial, I am having an issue in running the workflow since it's not supporting for the prior version of Alteryx, I am using 2020.3

HarleyBarnes · Answer

Great approach, we do something similar with performing series of validation checks on our datasets.

We have been playing around with the Visual Layout tool to get results like this into PDF reports which can really help with keeping trails of data irregularities.