Alteryx Designer Knowledge Base

Definitive answers from Designer experts.

Pre-Predictive: Using the Data Investigation Tools - Part 1 of 4

Alteryx Community Team

You want to impress your managers, so you decide to try some predictions on your data – forecasting, scoring potential marketing campaigns, finding new customers…  That's great! Welcome to the addictive world of predictive analytics.  We have the perfect platform for you to start exploring your data.

 

I know you want to dive right in and start testing models.  It's tempting to just pull some data and start trying out tools, but the first and most important part of any statistical analysis is the data investigation.

 

Your models won't mean much unless you understand your data.  Here's where the Data Investigation Tools come in!  You can get a statistical breakdown of each of your variables, both string and numeric, check for outliers (categorical and continuous), test correlations to slim down your predictors, and visualize the frequency and dispersion within each of your variables.

 

Part 1 of this article will give you an overview of the Field Summary Tool (never leave home without it!).  Part 2 will touch on the Contingency and Frequency Tables, and Distribution Analysis; Part 3 will cover the Association Analysis Tool, and the Pearson and Spearman Correlations; and Part 4 will cover all the cool plotting tools.

 

[Image: Field Summary]

 

Always, every day, literally every time you acquire a new data set, you will start with the Field Summary Tool.  I cannot emphasize this enough, and I promise it will save you headaches.

 

There are three outputs to this tool: a data table containing your fields and their descriptive statistics, a static report, and an interactive visualization dashboard that provides a visual profile of your variables.  From the interactive dashboard, you can select subsets to view, sort each of the panels, view and zoom in on specific values, and see a visual indicator of data quality.

 

You'll get a nifty report with plots and descriptive statistics for each of your variables.  Likely the most important part of this report is '% Missing' – ideally, you want 0.0% missing.  If you are missing values, don't fret.  You can remove these records or impute those values (another reason knowing your data is so important).
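If it helps to see the idea behind '% Missing' outside of Designer, here is a minimal Python sketch. It is only an illustration, not how the Field Summary Tool computes anything: the records, the field names, and the helper functions are all hypothetical, and the median is just one of several reasonable imputation choices.

```python
from statistics import median

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000.0},
    {"age": None, "income": 61000.0},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000.0},
]

def percent_missing(rows, field):
    """Share of rows, in percent, where `field` is None."""
    missing = sum(1 for row in rows if row[field] is None)
    return 100.0 * missing / len(rows)

def drop_missing(rows, field):
    """Option 1: remove the records that lack a value for `field`."""
    return [row for row in rows if row[field] is not None]

def impute_median(rows, field):
    """Option 2: fill missing values with the median of the observed ones."""
    fill = median(row[field] for row in rows if row[field] is not None)
    return [
        {**row, field: fill} if row[field] is None else row
        for row in rows
    ]

pct_age = percent_missing(records, "age")
dropped = drop_missing(records, "age")
imputed = impute_median(records, "age")
```

Which option is right depends on your data: dropping records shrinks your sample, while imputing keeps every record but injects an estimate, so it pays to know why the values are missing before choosing.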

 

Also check 'Unique Values' – if you have a single unique value in one of your variables, that won't add anything useful to your model, so consider deselecting that variable. 
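The same check is easy to picture in code. The sketch below is an assumption-laden illustration in plain Python, not the tool's implementation: the sample records and the `unique_counts` helper are hypothetical.

```python
# Hypothetical records: `country` has the same value in every row.
records = [
    {"region": "West", "country": "US", "score": 0.7},
    {"region": "East", "country": "US", "score": 0.4},
    {"region": "West", "country": "US", "score": 0.9},
]

def unique_counts(rows):
    """Number of distinct values per field, using the first row's field names."""
    return {field: len({row[field] for row in rows}) for field in rows[0]}

counts = unique_counts(records)

# Fields with a single unique value carry no information for a model.
constant_fields = [field for field, n in counts.items() if n == 1]
```

Here `constant_fields` flags `country` as a candidate to deselect before modeling.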

 

The Remarks field is also very useful – it will suggest field-type changes for fields with a small number of unique values (perhaps such a field should be a string field).  Or, if some values of a field occur only a few times, you may consider combining those value levels together.
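Combining sparse levels can be sketched as follows. This is a rough Python illustration under stated assumptions, not something the Field Summary Tool does for you: the `combine_rare_levels` helper, the `min_count` threshold, and the "Other" label are all hypothetical choices.

```python
from collections import Counter

def combine_rare_levels(values, min_count=2, other_label="Other"):
    """Replace levels that occur fewer than `min_count` times with one label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other_label for v in values]

# Hypothetical categorical field with two levels that appear only once.
colors = ["red", "red", "blue", "blue", "blue", "teal", "mauve"]

combined = combine_rare_levels(colors)
level_counts = Counter(combined)
```

The threshold is a judgment call: set it too high and you lose meaningful categories, too low and the sparse levels stay in your model.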

 

The better YOU know your data, the more efficient and accurate your models will be.  Only you know your data, your use case, and how your results are going to be applied.  But we're here to help you get as familiar as you can with whatever data you have.

 

Stay tuned for subsequent articles – these tools will be your new best friends.  Happy Alteryx-ing!

Comments
5 - Atom

Excuse me in advance if this is well-known!

 

Can the output of the data investigation tools be written to a PDF file?

 

Could a long PDF file of all of the investigations for every field be generated?

 

I found a recommendation to use the Render function to render R tables.  What about flattening the graphical output (HTML)?

Alteryx

@akrinsky - you can use the Render tool to create PDF and HTML!

6 - Meteoroid

Very helpful. Thank you! It prompted me to create an infographic 🙂

[Attachment: 02.Data Investigation Infographics.001.jpeg]

Alteryx Partner

Using the Field Summary tool, I see the plots... but they are not labeled.  What do the X and Y axes actually represent??

 

Also, the third output anchor "I" produces a Field Summary with a single bar for each of the fields when examining Census data that AutoField has turned into Doubles. Hovering over that bar pops up either "0.0 to 100000.0" on some or "0.0 to 50000.0" on others.  What is this supposed to tell me?

 

Thanks.

8 - Asteroid

@Newt - Think of the X axis like a record ID.  It's basically the line number of the record as shown in your original data.  So the first record will have an X coordinate of 1, the second record will have an X coordinate of 2, and so on.  The Y coordinate for each point will be the value in the numeric field.  Does that help on your first question?

Alteryx Partner

Yes, that is exactly what I need on the first question.  Thanks.