Alteryx Designer Desktop Discussions

DMRoss79 · 10-11-2023

I am seeking guidance on using Alteryx to solve a common problem in more advanced machine learning and deep learning modelling: a lack of labelled observations from which to create a supervised model.

Background

Several projects I am working on, including the Alteryx for Good 2022 Award Winner, have reached a critical point. Each project uses unsupervised learning models to identify key classifications for datasets ranging in size from 100,000 to 2,000,000 observations (rows or records). Note that the source data is unlabelled text. In each case, we need to find and confirm the classification of some of these observations so that we can create supervised models to assess the accuracy of, and explain the results from, the unsupervised models’ classifications. Our challenge is that manually verifying the initial classifications is slow: each knowledgeable (and expensive) reviewer confirms only 50 to 200 labelled observations per hour. To create effective supervised models, we likely need 10,000 to 30,000 confirmed classification labels.

Opportunity

We do have an opportunity to accelerate this review so as to reach the critical volume noted above. We have two data scientists who are also knowledgeable domain experts. As they review the observations, including the unsupervised model tags, they often find patterns that would allow them to rapidly verify the tentative classifications of tens to hundreds of observations in one action. Unfortunately, each must switch tools to do so, losing 10 to 15 minutes per mass verification action. I am requesting your help to find a better approach.

Process Issue

Here are the details of the challenge. Excel provides a very capable analytical template: a knowledgeable reviewer/verifier can load a dataset of several hundred thousand observations (rows) within a minute or two, set multiple column filters, and manually update a classification confirmation column at a pace of about 10 to 30 observations a minute, provided the spreadsheet is formatted correctly. During this process our data scientists often spot a filtered pattern that would support confirming multiple observations at once, from 10 to 500 in one action. However, they cannot programmatically change all values in the confirmation column on a filtered Excel spreadsheet without also affecting the confirmation column on the rows hidden by the filter, so programmatic changes are not an option within Excel.

This leaves the data scientists with the alternative of dropping out of the spreadsheet, loading the current spreadsheet into Python or R, applying the same filters there (for example, on a pandas DataFrame), making the programmatic change, and saving the updated dataset to Excel. They then reopen Excel, reset many of the same filters, and resume the analysis. The entire drop-out-and-reset process can take 10 to 15 minutes per programmatic update.
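
For reference, here is a minimal sketch of the programmatic step described above, assuming Python with pandas. The column names (`text`, `model_tag`, `confirmed`), the toy rows, and the file name in the comments are hypothetical stand-ins, not our actual data. The point is that the update touches only the rows matching the filter conditions and leaves every other row unchanged.

```python
import pandas as pd


def confirm_matching(df: pd.DataFrame, mask: pd.Series, label_col: str = "confirmed") -> int:
    """Mark only the rows selected by `mask` as confirmed; return how many rows matched."""
    df.loc[mask, label_col] = True
    return int(mask.sum())


# Toy stand-in for the review spreadsheet (hypothetical columns and values).
# In practice the data would be loaded with something like:
#   df = pd.read_excel("review.xlsx")   # hypothetical file name
df = pd.DataFrame({
    "text": ["refund request", "refund overdue", "login failure", "refund status"],
    "model_tag": ["billing", "billing", "access", "access"],
    "confirmed": [False, False, False, False],
})

# The same conditions a reviewer would otherwise set as Excel column filters.
mask = (df["model_tag"] == "billing") & df["text"].str.contains("refund", regex=False)

n_confirmed = confirm_matching(df, mask)
print(f"{n_confirmed} rows confirmed")

# Saving back for further review in Excel would then be something like:
#   df.to_excel("review.xlsx", index=False)
```

The round trip through `read_excel` and `to_excel`, plus resetting the filters in Excel afterwards, is what costs us the 10 to 15 minutes each time.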

Request/Question

We think it might be possible to perform the entire process in Alteryx Designer and Intelligence Suite. Can someone suggest how we might perform both the manual and programmatic confirmation process in Alteryx?

We are guessing we might be able to use the Browse tool to replicate a (full-screen) filtered Excel spreadsheet, and the Formula tool or another tool to update the values in one column based on conditions/filters in multiple other columns. Can you confirm that these are the correct tools to use in Alteryx, and point us to some tutorials on how to use them for this purpose?
