Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Assessing availability and quality of data.

dbyfield
8 - Asteroid

Hi All,

 

I've got an upcoming project that I'm sure Alteryx would be ideal for, but I'm at a bit of a roadblock.

 

My organisation receives a quarterly data set from another entity to whom we grant funding; we then use this data for evaluation and research, which drives improvements, monitors performance, etc.

 

What I wish to do has two stages:

 

1. Evaluate each column within the data set to output a volume and percentage figure based on whether each field in the column has an entry or not (basically CountA vs CountBlank). The idea is there should be *something* entered in each row of each column.

 

2. Evaluate each row within selected columns to assess the quality of the data within. For example, an Age field should only contain a numeric entry, a Gender field should contain text from a limited set of accepted values, etc.

This is the same principle as stage 1, but goes deeper into assessing the quality of the data we are being supplied with and whether or not it meets our requirements.

Ideally a simple output in this scenario of "valid" or "not valid" would be useful in giving an overall rating.
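To make the two stages concrete, here's a rough plain-Python sketch of what I'm after (the sample rows, column names, and rules are all made up for illustration, not our actual data):

```python
# Hypothetical sample of the quarterly data set.
rows = [
    {"Age": "34",  "Gender": "Female"},
    {"Age": "",    "Gender": "Male"},
    {"Age": "abc", "Gender": "Unknown"},
]

def completeness(rows):
    """Stage 1: per column, how many rows have an entry, and the percentage."""
    total = len(rows)
    report = {}
    for col in rows[0]:
        filled = sum(1 for r in rows if r[col] != "")
        report[col] = (filled, round(100 * filled / total, 1))
    return report

# Stage 2 rules -- assumed for illustration only.
rules = {
    "Age": lambda v: v.isdigit(),                       # numeric entries only
    "Gender": lambda v: v in {"Male", "Female", "Other"},  # limited criteria
}

def validity(rows):
    """Stage 2: per column, 'valid' / 'not valid' for each row."""
    return {col: ["valid" if rule(r[col]) else "not valid" for r in rows]
            for col, rule in rules.items()}

print(completeness(rows))  # {'Age': (2, 66.7), 'Gender': (3, 100.0)}
print(validity(rows)["Gender"])  # ['valid', 'valid', 'not valid']
```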

 

I've experimented with the Field Summary tool and that does parts of the job but may not go as deep as I'd like.

 

Anyone have any ideas on how to approach this?

 

Many thanks

 

D

6 REPLIES
danrh
13 - Pulsar

Without having seen the data, here's a workflow that could spark some ideas:

[image: example workflow]

The output of this gives you some high-level counts to determine the quality of the data.  Obviously you would need to add/remove checks as relevant, possibly adding a Formula tool at the end to calculate percentages if you need the answer in that format.

 

Hope it gets you moving in the right direction!

dbyfield
8 - Asteroid
Hi danrh,

Thank you for this - it looks like the right path; I think I just needed a steer in the right direction.
If I'm understanding correctly, the Formula tool is splitting the text and numeric fields and comparing each text field against a separate file of accepted values, then summarising?

Thanks

D
danrh
13 - Pulsar

The Formula isn't so much splitting as it is performing a validation on every field.  Let me walk through step by step:

 

The Transpose tool puts everything in your data into two columns: Name and Value.  Name is the original name of the column, and Value is the data that was in those columns.
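In plain Python, the Transpose step works roughly like this (stand-in rows, not your actual data):

```python
def transpose(rows):
    """One record per original cell: Name = column name, Value = cell contents."""
    return [{"Name": col, "Value": val}
            for row in rows
            for col, val in row.items()]

long_form = transpose([{"Age": "34", "Gender": "Female"}])
# [{'Name': 'Age', 'Value': '34'}, {'Name': 'Gender', 'Value': 'Female'}]
```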

The Formula tool then performs the IsNum check - it determines whether each individual field consists entirely of numbers and, if so, outputs a 1.  If there are any letters or punctuation, it gets a 0.
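The equivalent check in plain Python would be something like this (note it treats decimals and negatives as non-numeric too, which you'd loosen if your fields need them):

```python
def is_num(value):
    """1 if the value is made up entirely of digits; letters or punctuation give 0."""
    return 1 if value.isdigit() else 0

# is_num("34") gives 1; is_num("34a") and is_num("") give 0.
```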

The Text Input tool houses your list of acceptable values, per field.  The first column in this tool is the field name (matching what is in the Name column that comes out of the Transpose) while the second column in this tool lists all acceptable values.

The Join tool then matches your pre-determined list of acceptable values with the actual data.  If there's a match, that record comes out of the "J" output and gets assigned a 1 in the AllowableValue Formula tool.  If there isn't a match, it comes out of the "L" output and ends up with a null as AllowableValue.
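Sketched in plain Python, the Text Input plus the Join boil down to a lookup keyed by field name (the allowed list here is just an example):

```python
# Stand-in for the Text Input tool: field name -> set of acceptable values.
allowed = {"Gender": {"Male", "Female", "Other"}}

def tag_allowable(records, allowed):
    """Matched Name/Value pairs (the 'J' output) get AllowableValue = 1;
    unmatched ones (the 'L' output) get None, i.e. null."""
    return [{**rec,
             "AllowableValue": (1 if rec["Value"] in allowed.get(rec["Name"], set())
                                else None)}
            for rec in records]
```

Because this single pass keeps every record, it also stands in for the Union step that follows.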

The Union simply stacks your "J" and "L" outputs back together to make sure you still have all your data.

Finally, the Summarize sums up all of those indicators we've been putting in place - specifically, the IsNum and AllowableValue fields.  Additionally, I added in Count, CountNull, and CountNonNull to see the total picture.
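Rolled into plain Python, the Summarize step looks roughly like this (field names follow the workflow's, and the null convention is an assumption):

```python
def summarize(records):
    """One summary row per field: record counts plus sums of the indicators."""
    summary = {}
    for rec in records:
        s = summary.setdefault(rec["Name"], {
            "Count": 0, "CountNull": 0, "CountNonNull": 0,
            "Sum_IsNum": 0, "Sum_AllowableValue": 0,
        })
        s["Count"] += 1
        if rec["Value"] in ("", None):
            s["CountNull"] += 1
        else:
            s["CountNonNull"] += 1
        s["Sum_IsNum"] += rec.get("IsNum", 0)
        s["Sum_AllowableValue"] += rec.get("AllowableValue") or 0
    return summary
```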

 

What all of this produces is one record per field that was in your original dataset, with fields for the following:

- Total number of records

- Number of records that weren't null

- Number of records that were null

- Number of records that were entirely numeric

- Number of records that matched the allowable values list

[image: summary output]

The hope is that at this point you can glance at the high-level counts and determine how clean the data is.  In this example, the Age and Gender columns each had two values that weren't filled out, 3 records in the Age column weren't numbers, and 3 records in the Gender column weren't on the allowable list.  To make it even more user-friendly, you can add some additional formulas to show error counts, percentages, etc.
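For instance, a percentage field could be added along these lines (the FilledPct name is just an assumption for illustration):

```python
def add_percentages(row):
    """Hypothetical follow-on Formula: percentage of rows that were filled in."""
    out = dict(row)
    out["FilledPct"] = round(100 * out["CountNonNull"] / out["Count"], 1)
    return out

add_percentages({"Count": 8, "CountNonNull": 6, "CountNull": 2})
# FilledPct comes out as 75.0
```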

 

Hopefully that's a little clearer, though I wouldn't be surprised if not :)

dbyfield
8 - Asteroid

Ah I see, that makes perfect sense.

 

I will give this a try at some point this week and update accordingly with an amended workflow.

 

Thank you for your help so far, it's been really useful.

dbyfield
8 - Asteroid

So, I'm finally back working on this after being pulled in a different direction.

 

I'm wondering if, within this workflow, I can also get a simple output of the data that was entered where the criteria (i.e. the "allowed values") were not met?

 

The reason for asking is that I've now got a request for details on what data is being entered that isn't on the allowed values list and also isn't just blank.

 

This will enable my colleagues to better identify any gaps or issues in this process.
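In plain-Python terms (field names borrowed from danrh's workflow), I think what I'm after is something like a Filter on the records where AllowableValue ended up null but the value wasn't blank:

```python
def invalid_entries(records):
    """Values that were entered (not blank) but failed the allowed-values
    check, i.e. AllowableValue was still null after the Join."""
    return [r for r in records
            if r["Value"] not in ("", None) and r.get("AllowableValue") is None]
```

Presumably I'd also need to restrict this to fields that actually have an allowed-values list, since fields like Age never get an AllowableValue.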

 

Thanks

dbyfield
8 - Asteroid

Also, attached is the current state of the workflow.

Thanks for any assistance you can offer.
