gpwasserman · 05-19-2023

Identifying objects in images by hand (also known as object identification) is time-consuming and prone to inaccuracies. The costs of this work add up: time, money, and confidence are all on the line. As technology advances, many industries are turning to image analysis technology: software that can analyze pictures to identify colors, shapes, measurements, and more. From marketing to security to supply chain, image analysis technology has a wide range of applications.

Image analysis technology, in tandem with Machine Learning, presents a new opportunity for analysts to classify and identify images faster. The combination of quantitative and qualitative image data and the power of Machine Learning can unlock reliable object classification and unleash your organization’s potential.

Use Case

Alex is an Operations Analyst for a major packager and distributor of dry goods, including dried beans. Her company is exploring imaging sensor technology to help them direct incoming shipments of dried beans to appropriate processing lines with minimal human involvement. They've asked her to determine the best way to leverage this technology to automatically identify the 7 bean varieties the company processes. She thinks Machine Learning might help.

Alex’s goal is to predict the type of bean in an image based on its physical characteristics so her company can package foods accurately.

Her plan is to use the sensor data to build a machine learning model and see whether it classifies the beans accurately enough to justify moving forward with the technology in their processing lines. Follow along to see how Alex’s machine learning journey unfolds.

Data Acquisition

She asks the image sensor vendor to help her capture metrics for thousands of samples of known bean varieties. Together they compile a dataset from the measurements collected for each image, along with the corresponding bean type, or bean Class. These bean classes were previously labeled by experts.

Her data contains the following information for each record (each bean):

Area
Perimeter
MajorAxisLength
MinorAxisLength
AspectRation
Eccentricity
ConvexArea
EquivDiameter
Extent
Solidity
Roundness
Compactness
ShapeFactor1
ShapeFactor2
ShapeFactor3
ShapeFactor4
Class

Data Preparation and Exploration

Alex trusts Alteryx Machine Learning (AYX ML) to help her create the best model for predicting the bean type based on its dimensions, so she opens AYX ML in her favorite browser and uploads her data.

She first turns on data profiling by clicking on the bar chart icon above the dataset to get a closer glimpse of the distribution and quality of her data. She is glad to see that her data is looking sound, with no null values to worry about and mostly normal distributions of each column.
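The kind of check that data profiling performs can be sketched in a few lines of Python. This is only an illustration of the idea, not how AYX ML implements profiling; the function name `profile_column` and the sample Area values are made up for the example.

```python
from statistics import mean, stdev

def profile_column(values):
    """Summarize one column: null count plus basic stats of the non-null values."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "nulls": len(values) - len(present),
        "mean": mean(present),
        "stdev": stdev(present),
        "min": min(present),
        "max": max(present),
    }

# Made-up Area values for illustration only (not taken from Alex's dataset).
area = [28395, 28734, 29380, 30008, 30140, None]
summary = profile_column(area)
print(summary["nulls"], summary["min"], summary["max"])  # 1 28395 30140
```

A column with zero nulls and a mean close to the middle of its range is the sort of "sound" result Alex sees in the profiling view.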

AYX ML automatically checks the quality of the data for machine learning. In this case, it raises an ID Column warning, which means the Bean ID column appears to contain a unique identifier for each record. Since IDs have no predictive value, the Fix Data recommendation is to drop this column before modeling. She follows the prompts to apply this fix in a few swift clicks.
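To see why a column gets flagged as an ID, here is a rough heuristic in Python. It is an assumption for illustration, not the rule AYX ML actually applies: a column is treated as ID-like when every value is unique and the values are either strings or consecutive integers. The helper name `looks_like_id` and the sample values are hypothetical.

```python
def looks_like_id(values):
    """Heuristic: unique values that are consecutive integers or all strings."""
    if len(values) < 2 or len(set(values)) != len(values):
        return False
    if all(isinstance(v, int) for v in values):
        # Distinct integers spanning exactly n-1 form a consecutive run.
        return max(values) - min(values) == len(values) - 1
    return all(isinstance(v, str) for v in values)

# Illustrative values only.
columns = {
    "Bean ID": [1, 2, 3, 4, 5],
    "Area": [28395, 28734, 29380, 30008, 30140],
}
flagged = [name for name, values in columns.items() if looks_like_id(values)]
print(flagged)  # ['Bean ID']
```

A heuristic like this can misfire on real measurements that happen to be unique, which is one reason the tool presents it as a warning to review rather than dropping the column silently.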

Now it's time to “Choose a Target Column.” This forces her to recall the business problem at hand – she is aiming to identify the right type of bean to package goods accurately. For this reason, her target column must be the class of bean, which corresponds to the “Class” column.

To Alex’s delight, Alteryx Machine Learning automatically suggests that Classification is the appropriate Machine Learning Method based on her chosen Target Column. And with the added explanation and details, she is confident that this method matches her use case. She feels ready to move on to the next step, Data Insights.

Data Insights

Alex proceeds to the next step to find detailed visuals of feature correlations and potential outliers in her data.

As she inspects the correlation matrix, she notices a sea of dark blue, indicating that some of her potential model features are highly correlated. This could signal collinearity, which is when two explanatory variables have a strong linear relationship. This is a problem because if two features are too similar, it’s difficult to identify their unique predictive contributions to the model. In other words, we won’t be able to separate one feature’s effect on the model from the other’s.
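For readers who want to see what sits behind a correlation matrix, the following sketch computes the Pearson correlation for every pair of columns and lists the pairs above a cutoff. The 0.9 threshold, the function names, and the toy values are assumptions chosen for the example; the article does not state which cutoff, if any, Alex used.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for pairs whose |r| meets the threshold."""
    pairs = []
    for a, b in combinations(columns, 2):
        r = pearson(columns[a], columns[b])
        if abs(r) >= threshold:
            pairs.append((a, b, round(r, 3)))
    return pairs

# Toy values for illustration only.
columns = {
    "Area": [10.0, 20.0, 30.0, 40.0, 50.0],
    "ConvexArea": [11.0, 21.0, 32.0, 41.0, 52.0],
    "Extent": [0.70, 0.75, 0.69, 0.74, 0.71],
}
pairs = highly_correlated_pairs(columns)
print([(a, b) for a, b, _ in pairs])  # [('Area', 'ConvexArea')]
```

In this toy data, Area and ConvexArea move almost in lockstep, so keeping both would add little new information to a model.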

Because of this, she decides to drop a few columns. She goes back to the problem setup step and clicks the edit columns button above the dataset to drop the following columns:

- EquivDiameter
- Eccentricity
- ShapeFactor1
- ShapeFactor2
- ShapeFactor3
- MajorAxisLength
- Compactness
- ConvexArea
- Perimeter
- MinorAxisLength

This leaves her with:
- Area
- AspectRation
- Extent
- Solidity
- Roundness
- ShapeFactor4
- Class
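Outside the AYX ML interface, the same column selection could be expressed as a small filter over each record. This is a sketch only: the column names follow the field list earlier in the article (with the minor-axis column spelled MinorAxisLength), and the placeholder values are not real measurements.

```python
ALL_COLUMNS = [
    "Area", "Perimeter", "MajorAxisLength", "MinorAxisLength", "AspectRation",
    "Eccentricity", "ConvexArea", "EquivDiameter", "Extent", "Solidity",
    "Roundness", "Compactness", "ShapeFactor1", "ShapeFactor2",
    "ShapeFactor3", "ShapeFactor4", "Class",
]

# Columns Alex drops because they are highly correlated with others.
DROPPED = {
    "EquivDiameter", "Eccentricity", "ShapeFactor1", "ShapeFactor2",
    "ShapeFactor3", "MajorAxisLength", "Compactness", "ConvexArea",
    "Perimeter", "MinorAxisLength",
}

def drop_columns(record, dropped=DROPPED):
    """Return a copy of the record without the dropped columns."""
    return {name: value for name, value in record.items() if name not in dropped}

# Placeholder record: every numeric field is 0.0, for illustration only.
record = dict.fromkeys(ALL_COLUMNS, 0.0)
record["Class"] = "Sira"

kept = drop_columns(record)
print(list(kept))
# ['Area', 'AspectRation', 'Extent', 'Solidity', 'Roundness', 'ShapeFactor4', 'Class']
```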

She returns to the data insights step to find an updated correlation matrix with significantly less collinearity, which is exactly what she is looking for.

Data Prep – Outliers

She also looks at the Outliers tab to see if any outliers are significantly skewing the data. None of the outliers look too alarming, and they are all legitimate data points in this context, so she decides to leave them as is.
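One common way to flag outliers is the interquartile range (IQR) rule, sketched below. The article does not say which method the Outliers tab uses, so treat this as an assumed stand-in; the multiplier of 1.5 and the Roundness values are illustrative.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Made-up Roundness values with one unusually low reading.
roundness = [0.86, 0.87, 0.88, 0.88, 0.89, 0.90, 0.91, 0.55]
print(iqr_outliers(roundness))  # [0.55]
```

Whether a flagged value should be removed is a judgment call. As in Alex's case, a point can be statistically unusual and still be a legitimate measurement worth keeping.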

Auto Model

Now it’s time for Alex to kick off the Auto Modeling process. At the Auto Model step, AYX ML tests a series of modeling algorithms on the training data (70% of the original dataset) and ranks them by performance so that she doesn’t have to. Once Auto Modeling is complete, the Random Forest Classifier is her highest-ranked model, meaning it performed best on the accuracy metric. With performance 65.393% better than the baseline and an accuracy score of 91%, the model leaves her feeling good about its quality.
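A baseline gives the accuracy score something to be compared against. One simple baseline, assumed here for illustration, always predicts the most common class. The article does not specify which baseline or formula AYX ML uses for its percentage, so this sketch shows only the idea; the class names other than Sira are hypothetical.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most frequent class."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

# Toy labels for illustration only.
labels = ["TypeA"] * 4 + ["Sira"] * 3 + ["TypeB"] * 3
baseline = majority_baseline_accuracy(labels)
print(baseline)  # 0.4
```

With seven bean classes, a guess-the-majority baseline is likely to score far below 91%, which helps put the model's accuracy in context.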

Alex also has the option to leverage automated feature engineering and tweak advanced model settings to make further adjustments to enhance her modeling. In this case, Alex feels good about her model, so she continues on.

Evaluate Model

She moves on to the Evaluate Model step, where AYX ML will use the selected model to predict the class of each bean in the holdout data (the remaining 30% of the data, which was not used to train the model). This will reveal how well the model can perform in practice.
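The 70/30 division between training and holdout data can be sketched as a shuffled split. This is a minimal illustration, assuming a simple random split with a fixed seed; the article does not describe how AYX ML performs the split (for example, whether it stratifies by class).

```python
import random

def train_holdout_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it into training and holdout parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in rows: 100 integers instead of real bean records.
rows = list(range(100))
train, holdout = train_holdout_split(rows)
print(len(train), len(holdout))  # 70 30
```

Because the holdout rows never touch training, accuracy measured on them is a fairer estimate of how the model will behave on new shipments.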

In the General tab, Alex notes the general model overview, its performance, and pipeline highlights. She also takes a look at Feature Importance, which reveals the most predictive features or columns from the data. Area and AspectRation come in at the highest Importance, so now she knows those are a bean’s most predictive qualities for classifying its type.
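Feature importance can be estimated in several ways, and the article does not say which one AYX ML uses. One model-agnostic approach, shown here as an assumption, is permutation importance: shuffle one feature's values and measure how much accuracy drops. The toy `predict` function and the data are hypothetical.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose prediction matches the label."""
    return sum(predict(r) == y for r, y in zip(rows, labels)) / len(rows)

def permutation_importance(predict, rows, labels, feature, seed=0):
    """Accuracy lost when one feature's values are shuffled across rows."""
    base = accuracy(predict, rows, labels)
    shuffled_values = [r[feature] for r in rows]
    random.Random(seed).shuffle(shuffled_values)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled_values)]
    return base - accuracy(predict, permuted, labels)

# Hypothetical model that looks only at Area.
def predict(row):
    return "Large" if row["Area"] > 100 else "Small"

rows = [
    {"Area": 50, "Extent": 0.7},
    {"Area": 60, "Extent": 0.8},
    {"Area": 150, "Extent": 0.7},
    {"Area": 160, "Extent": 0.8},
]
labels = ["Small", "Small", "Large", "Large"]

extent_importance = permutation_importance(predict, rows, labels, "Extent")
area_importance = permutation_importance(predict, rows, labels, "Area")
print(extent_importance)  # 0.0, because the toy model ignores Extent
```

A feature the model relies on will tend to show a positive drop when shuffled, while a feature it ignores shows none.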

She takes a look at the Advanced Insights tab, where she sees a matrix complete with the actual vs. predicted values for the data.

For the first pass at modeling this data, Alex’s expectations are exceeded. A 91% accuracy score for a first-time model is an impressive starting point.

As she digs deeper into the metrics, she notices that for bean type Sira, the model misclassified 64 of the beans as another class. This is her least accurate category, with 88% correctly classified. Since her company prioritizes accuracy, this result is close to their standard but does not yet meet it. Alex knows that machine learning models are only as good as the data used to train them. She decides to partner with the image sensor vendor to capture more data so she can retrain the model and improve its quality. Once she obtains the new data, she will repeat these steps and compare the new model to the original.
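The per-class figures Alex reads off the actual vs. predicted matrix can be reproduced from two lists of labels. The sketch below uses a tiny made-up example rather than her holdout data; the function names and the "Other" class are illustrative.

```python
from collections import Counter

def confusion_counts(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual, predicted))

def per_class_recall(actual, predicted):
    """Share of each actual class that the model classified correctly."""
    counts = confusion_counts(actual, predicted)
    totals = Counter(actual)
    return {label: counts[(label, label)] / totals[label] for label in totals}

# Toy labels for illustration only.
actual = ["Sira"] * 5 + ["Other"] * 5
predicted = ["Sira", "Sira", "Sira", "Sira", "Other"] + ["Other"] * 5

recall = per_class_recall(actual, predicted)
overall = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print(recall)   # {'Sira': 0.8, 'Other': 1.0}
print(overall)  # 0.9
```

As in the article, overall accuracy can look strong while one class lags behind, which is why checking each class separately is worthwhile.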

Export Model

It’s time to share the model and her progress with her boss. She decides to export the model visuals as a PowerPoint, a no-fuss option that lets her capture the important points right away and present them in their next meeting.

Conclusion

After Alex meets with her boss, they decide that the accuracy of this model is nearly sufficient for their packaging process needs. With an accuracy score of 0.91, the model is already a powerful asset to her team. With some more tweaking and data collection, they will soon be ready to predict bean types (and, in the future, additional food items) with high accuracy.

In just a few short minutes of using Alteryx Machine Learning, Alex has positioned her team for increased efficiency and success.