Data Science

Machine learning & data science for beginners and experts alike.
gpwasserman
Alteryx

Identifying objects in images by hand is time-consuming and error-prone. The costs of this work add up: time, money, and confidence are all on the line. As technology advances, many industries are turning to image analysis software, which analyzes pictures to identify colors, shapes, measurements, and more. From marketing to security to supply chain, image analysis has a wide range of applications.

 

Image analysis technology, in tandem with Machine Learning, presents a new opportunity for analysts to classify and identify images faster. The combination of quantitative and qualitative image data and the power of Machine Learning can unlock reliable object classification and unleash your organization’s potential.

 

Use Case

 

Alex is an Operations Analyst for a major packager and distributor of dry goods, including dried beans.  Her company is exploring imaging sensor technology to help them direct incoming shipments of dried beans to appropriate processing lines with minimal human involvement.  They've asked her to determine the best way to leverage this technology to automatically identify the 7 bean varieties the company processes. She thinks Machine Learning might help.

 

Alex’s goal is to predict the type of bean in an image based on its physical characteristics so her company can package foods accurately.

 

Her plan is to use the sensor data to build a machine learning model and see whether it classifies the beans well enough to justify moving forward with the technology on their processing lines. Follow along to see how Alex’s machine learning journey unfolds.

 

Data Acquisition

 

She asks the image sensor vendor to help her capture measurements from thousands of samples of known bean varieties. Together they compile a dataset that pairs the details collected from each image with the corresponding bean type, or bean Class. These bean classes were previously labeled by experts.

 

Her data contains the following information for each record (each bean):

  • Area
  • Perimeter
  • MajorAxisLength
  • MinorAxisLength
  • AspectRation
  • Eccentricity
  • ConvexArea
  • EquivDiameter
  • Extent
  • Solidity
  • Roundness
  • Compactness
  • ShapeFactor1
  • ShapeFactor2
  • ShapeFactor3
  • ShapeFactor4
  • Class

 

Data Preparation and Exploration

 

Alex trusts Alteryx Machine Learning (AYX ML) to help her create the best model for predicting the bean type based on dimensions, so she opens AYX ML in her favorite browser and uploads her data.

 

image001.png

 

She first turns on data profiling by clicking on the bar chart icon above the dataset to get a closer glimpse of the distribution and quality of her data. She is glad to see that her data is looking sound, with no null values to worry about and mostly normal distributions of each column.
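For readers who want to see what a profiling pass like this checks, here is a minimal sketch in plain Python. It is not how AYX ML computes its profile; the function name `profile_column` and the sample values are assumptions made for illustration. It counts missing values and computes skewness, which sits near 0 for a roughly symmetric (normal-looking) column.

```python
import math
import statistics


def profile_column(values):
    """Summarize one numeric column: null count, mean, stdev, and skewness."""
    # Treat None and NaN as missing values.
    present = [v for v in values if v is not None and not math.isnan(v)]
    nulls = len(values) - len(present)
    mean = statistics.fmean(present)
    stdev = statistics.pstdev(present)
    # Population skewness: close to 0 when the distribution is symmetric.
    if stdev > 0:
        skew = sum((v - mean) ** 3 for v in present) / (len(present) * stdev ** 3)
    else:
        skew = 0.0
    return {"nulls": nulls, "mean": mean, "stdev": stdev, "skew": skew}


# Illustrative values only, not taken from Alex's dataset.
area = [10.0, 12.0, 14.0, 16.0, 18.0]
report = profile_column(area)
print(report)
```

A column with a large positive or negative skew, or a nonzero null count, would be worth a closer look before modeling.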

 

AYX ML runs automatic checks on the quality of the data for machine learning. In this case, there is an ID Column warning, which means the Bean ID column might contain an identifier. Since IDs do not have any predictive value, the Fix Data recommendation is to drop this column before modeling. She follows the prompts to apply this fix in a few swift clicks.
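One simple way to spot an ID-like column is to look for integer or string columns in which every value is distinct. The sketch below illustrates that idea only; the rule AYX ML actually applies is not described here, and the helper `looks_like_id` and the sample records are hypothetical.

```python
def looks_like_id(values):
    """Flag integer or string columns in which every value is distinct."""
    if not values or any(isinstance(v, float) for v in values):
        # Continuous measurements are often all distinct but are not IDs.
        return False
    return len(set(values)) == len(values)


# Hypothetical records; only the column names come from the walkthrough.
columns = {
    "Bean ID": [1, 2, 3, 4, 5, 6],
    "Area": [101.5, 98.2, 110.7, 95.1, 102.3, 99.9],
    "Class": ["Sira", "Sira", "Other", "Other", "Sira", "Other"],
}
id_columns = [name for name, vals in columns.items() if looks_like_id(vals)]
print(id_columns)
```

A unique value per row carries no pattern a model can generalize from, which is why dropping such a column is the usual fix.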

 

Now it's time to “Choose a Target Column.” This forces her to recall the business problem at hand – she is aiming to identify the right type of bean to package goods accurately. For this reason, her target column must be the class of bean, which corresponds to the “Class” column.

 

image003.png

 

To Alex’s delight, Alteryx Machine Learning automatically suggests that Classification is the appropriate Machine Learning Method based on her chosen Target Column. And with the added explanation and details, she is confident that this method matches her use case. She feels ready to move on to the next step, Data Insights.
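The intuition behind that suggestion can be sketched in a few lines: a target made of labels, or of only a handful of distinct values, points to classification, while a numeric target with many distinct values points to regression. The function `suggest_method` and its `max_classes` cutoff are assumptions for illustration, not AYX ML's actual rule.

```python
def suggest_method(target_values, max_classes=20):
    """Suggest 'classification' or 'regression' from the target column."""
    distinct = set(target_values)
    all_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in distinct
    )
    # Labels, or a small set of distinct values, suggest classification.
    if not all_numeric or len(distinct) <= max_classes:
        return "classification"
    return "regression"


# A text label column such as "Class" is treated as a classification target.
print(suggest_method(["Sira", "Other", "Sira"]))
# A numeric column with many distinct values is treated as a regression target.
print(suggest_method([i * 0.5 for i in range(100)]))
```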

 

Data Insights

 

Alex proceeds to the next step to find detailed visuals of feature correlations and potential outliers in her data.

 

image004.png

 

As she inspects the correlation matrix, she notices a sea of dark blue, indicating that some of her potential model features are highly correlated. This could signal collinearity, which is when two explanatory variables have a strong linear relationship. This is a problem because if two features are too similar, it’s difficult to identify their unique predictive contributions to the model. In other words, we won’t be able to separate one feature’s impact on the model from another’s.
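To make the collinearity check concrete, here is a small sketch that computes the Pearson correlation for every pair of columns and lists the pairs above a cutoff. The 0.9 threshold, the helper names, and the sample values are assumptions for illustration; only the column names come from Alex's data.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs with |r| at or above the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs


# Illustrative values only: ConvexArea is Area plus a constant, so the two
# columns are perfectly linearly related.
features = {
    "Area": [1.0, 2.0, 3.0, 4.0, 5.0],
    "ConvexArea": [1.1, 2.1, 3.1, 4.1, 5.1],
    "Extent": [0.7, 0.2, 0.9, 0.1, 0.5],
}
pairs = correlated_pairs(features)
print(pairs)
```

Each flagged pair is a candidate for dropping one of its two columns, since the second adds little information the first does not already carry.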

 

Because of this, she decides to drop a few columns. She goes back to the problem setup step and clicks the edit columns button above the dataset to drop the following columns:

 

image005.png

 

  • EquivDiameter
  • Eccentricity
  • ShapeFactor1
  • ShapeFactor2
  • ShapeFactor3
  • MajorAxisLength
  • Compactness
  • ConvexArea
  • Perimeter
  • MinorAxisLength

This leaves her with:

  • Area
  • AspectRation
  • Extent
  • Solidity
  • Roundness
  • ShapeFactor4
  • Class

 

She returns to the data insights step to find an updated correlation matrix with significantly less collinearity, which is exactly what she is looking for.

 

image007.png

 

Data Prep – Outliers

 

She also looks at the Outliers tab to see if there are any outliers significantly skewing the data. None of the outliers look too alarming, and they are all legitimate data points in this context, so she decides to leave them as they are.
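A common rule of thumb for flagging outliers is the interquartile range (IQR) rule: anything more than 1.5 IQRs below the first quartile or above the third quartile gets flagged. The sketch below applies that rule to one column. Whether AYX ML uses this rule is not stated here, and the values are illustrative rather than real bean measurements.

```python
import statistics


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k IQRs outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Illustrative Solidity values with one unusually low reading.
solidity = [0.98, 0.99, 0.97, 0.98, 0.99, 0.98, 0.97, 0.99, 0.60]
outliers = iqr_outliers(solidity)
print(outliers)
```

A flagged value is only a prompt to investigate. As in Alex's case, a legitimate measurement can stay in the data.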

 

image008.png

 

Auto Model

 

Now it’s time for Alex to kick off the Auto Modeling process. At the Auto Model step, AYX ML tests a series of modeling algorithms on the training data (70% of the original dataset) and ranks them by performance so that she doesn’t have to compare them by hand. Once Auto Modeling is complete, the Random Forest Classifier is her highest-ranked model, meaning it performed best on the accuracy metric. With an accuracy score of 91% and a reported 65.393% improvement over the baseline, she is feeling good about the model quality.
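The idea of testing several algorithms on the training split and ranking them can be approximated with scikit-learn. This is a stand-in sketch, not AYX ML's internals: the candidate list, the 3-fold cross-validation, and the synthetic data from `make_classification` (6 features, 7 classes) are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the bean data: 6 numeric features, 7 classes.
X, y = make_classification(
    n_samples=1400, n_features=6, n_informative=6, n_redundant=0,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)

# Keep 70% for training and set aside 30% as the holdout.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

candidates = {
    "baseline (most frequent class)": DummyClassifier(strategy="most_frequent"),
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score each candidate by mean cross-validated accuracy on the training split.
scores = {
    name: cross_val_score(model, X_train, y_train, cv=3, scoring="accuracy").mean()
    for name, model in candidates.items()
}

# Rank from best to worst.
leaderboard = sorted(scores.items(), key=lambda item: item[1], reverse=True)
for name, score in leaderboard:
    print(f"{name}: {score:.3f}")
```

The baseline entry always predicts the most common class, which gives a floor to compare real models against. The holdout split is left untouched here so it can be used later for an honest evaluation.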

 

image009.png

 

Alex also has the option to leverage automated feature engineering and tweak advanced model settings to make further adjustments to enhance her modeling. In this case, Alex feels good about her model, so she continues on.

 

Evaluate Model

 

She moves on to the Evaluate Model step, where AYX ML will use the selected model to predict the class of each bean in the holdout data (the remaining 30% of the data that was not used to train the model). This will reveal how well the model can perform in practice.

 

In the General tab, Alex notes the general model overview, its performance, and pipeline highlights. She also takes a look at Feature Importance, which reveals the most predictive features or columns from the data. Area and AspectRation come in at the highest Importance, so now she knows those are a bean’s most predictive qualities for classifying its type.
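For a rough picture of holdout scoring and feature importance, the sketch below trains a scikit-learn random forest and reads its impurity-based `feature_importances_`. The data is synthetic, so the ranking it prints says nothing about real beans, and AYX ML may compute importance differently. Only the six feature names are taken from Alex's reduced dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

feature_names = [
    "Area", "AspectRation", "Extent", "Solidity", "Roundness", "ShapeFactor4",
]

# Synthetic stand-in data with one column per feature name and 7 classes.
X, y = make_classification(
    n_samples=1400, n_features=6, n_informative=6, n_redundant=0,
    n_classes=7, n_clusters_per_class=1, random_state=0,
)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Accuracy on data the model never saw during training.
holdout_accuracy = model.score(X_holdout, y_holdout)
print(f"holdout accuracy: {holdout_accuracy:.3f}")

# Impurity-based importances sum to 1; higher means more predictive.
importances = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, weight in importances:
    print(f"{name}: {weight:.3f}")
```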

 

Evaluate model.png

 

She takes a look at the Advanced Insights tab, where she sees a matrix complete with the actual vs. predicted values for the data. 
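A matrix of actual vs. predicted values is usually called a confusion matrix. Here is a minimal sketch of how one is tallied, with rows as actual classes and columns as predicted classes. The helper names and the six sample predictions are made up for illustration, and "Other" is a placeholder for any class that is not Sira.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) label pair occurs."""
    return Counter(zip(actual, predicted))


def as_matrix(counts, labels):
    """Lay the counts out with rows = actual class, columns = predicted class."""
    return [[counts[(a, p)] for p in labels] for a in labels]


# Made-up labels for six beans.
actual = ["Sira", "Sira", "Sira", "Other", "Other", "Other"]
predicted = ["Sira", "Sira", "Other", "Other", "Other", "Sira"]
labels = ["Sira", "Other"]

matrix = as_matrix(confusion_counts(actual, predicted), labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

Correct predictions land on the diagonal, so every off-diagonal cell shows one specific kind of mix-up between two classes.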

 

image011.png

 

For the first pass at modeling this data, Alex’s expectations are exceeded. A 91% accuracy score for a first-time model is an impressive starting point.

 

As she digs deeper into the metrics, she notices that for bean type Sira, the model misclassified 64 of the beans as another class. This is her least accurate category, with 88% of Sira beans classified correctly. Since her company prioritizes accuracy, this error rate is close to their standard but not quite there yet. Alex knows that machine learning models are only as good as the data used to train them. She decides to partner with the image sensor vendor to capture more data so she can retrain the model and improve its quality. Once she obtains the new data, she will repeat these steps and compare the new model to the original.
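The per-class figure Alex reads off is the share of each class's beans that were classified correctly: the diagonal cell of the confusion matrix divided by its row total. The sketch below computes that for every class and flags those under a target. The counts and the 0.90 target are hypothetical, since the company's actual standard is not stated.

```python
def per_class_recall(matrix, labels):
    """Share of each actual class that was predicted correctly."""
    recalls = {}
    for i, label in enumerate(labels):
        total = sum(matrix[i])  # all beans whose actual class is this label
        recalls[label] = matrix[i][i] / total if total else 0.0
    return recalls


def below_target(recalls, target):
    """Classes whose recall falls short of the target."""
    return sorted(label for label, r in recalls.items() if r < target)


# Hypothetical counts: rows = actual class, columns = predicted class.
labels = ["Sira", "Other"]
matrix = [[88, 12], [5, 95]]

recalls = per_class_recall(matrix, labels)
flagged = below_target(recalls, target=0.90)
print(recalls)
print(flagged)
```

Running the same check on the retrained model would make the before-and-after comparison a matter of comparing two short lists.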

 

Export Model

 

It’s time to share the model and her progress with her boss. She decides to export the model visuals as a PowerPoint, a no-fuss option that lets her capture the important points right away for their next meeting.

 

image012.png

 

Conclusion

 

After Alex meets with her boss, they decide that the accuracy of this model is nearly sufficient for their packaging process needs. At a score of 0.91, the model is already a powerful asset to her team. With some more tweaking and data collection, they will be ready to predict bean types (and future additional food items) with high accuracy in no time.

 

In just a few short minutes with Alteryx Machine Learning, Alex has positioned her team for increased efficiency and success.