Data Science

nlane · 12-12-2022

After spending the last three weeks at your company’s office in Bora Bora, you’ve returned to your home office for your first day of work back in the “real world.” Not quite ready to jump into your inbox, which is screaming “341 unread messages” in an awful red notification bubble, you try to think of something productive you can do that doesn’t feel overwhelming. You’ve got it! Time to submit those piles of receipts you collected during your time abroad for reimbursement. Yes, something that needs to be done but allows you to reminisce about your time in Bora Bora a little bit longer. Perfect!

As you start shuffling through the receipts, though, that inbox is looking more and more appealing. How did you collect so many receipts in just three weeks? And none of them are standardized! You’ve got physical receipts, pictures of physical receipts, screenshots of email receipts, PDFs sent over email, and one napkin that has a handwritten note that says: “I owe you $5.” Not sure where the napkin came from, you throw it out, but the rest stays. You know that you are pretty savvy with Alteryx Designer and Intelligence Suite (thanks to their free trials!) and could build a couple of workflows to process the receipts once digitized, but you’ll need to know from inside your workflow which images and documents are digital receipts and which are pictures of physical receipts. The physical receipts will need slightly different logic for extracting totals since there is often a handwritten tip. You could manually go through and label each receipt as “email” vs. “picture,” but you’ve collected too many receipts to do that. Plus, you have another business trip to Tahiti in a couple of weeks, and you’d hate to have to start from scratch with those receipts. There’s got to be a way in Alteryx Designer to sort those receipts for you.

You photograph every receipt you don’t already have a picture of, then put those photos, the screenshots, and the email PDFs into one folder called “Receipts_And_Things_Bora_Bora_2022”. Time to get to work!

The Approach

You have a feeling that the Image Recognition tool could be the ticket here. Given a good training set, this tool could create a model smart enough to look at an image and classify it as a picture of a receipt or an email receipt. So you decide to go with this approach:

1. Train Model
  1. Pull in training data (you have receipts from your trip last month to Fiji already categorized).
  2. Add their appropriate labels as a column (e.g., “picture” or “email”).
  3. Standardize those images to get them ready for processing.
  4. Split the data into two sets: training and validation.
  5. Filter for three-channel images.
  6. Feed those images with their classifications into the Image Recognition tool.
  7. Run the workflow and analyze your results. If the results are less than optimal, you can tweak your Image Recognition configurations (e.g., epochs and batch size - more on these later) and run again to fine-tune your results until you’re happy to move on to step 2.
2. Input Data
  1. Import your receipts from the “Receipts_And_Things_Bora_Bora_2022” folder.
  2. Filter for three-channel images.
  3. Standardize the new receipts.
3. Predict
  1. Feed the model from the output anchor of the Image Recognition tool to a Predict tool.
  2. Feed the image data from the Input Data step to the same Predict tool to score your Bora Bora data.
  3. Click run!

[Screenshot: the complete workflow]

Image Recognition: Deep Dive

Let’s talk more about the Image Recognition tool since it isn’t one you use every day as an “Exotic Vacation Resort Reviewer” (or whatever job you have that lets you go to Bora Bora, Tahiti, and Fiji all within the same quarter 🤔).

The Image Recognition tool builds a model that classifies images into the categories it was trained on. For example, you can feed the tool mixed training data that includes pictures of irises and other flowers to create a model. You can then use that model with the Predict tool to score new images. In other words, the model will categorize previously unseen images as looking like an iris or not, with a certain degree of accuracy. In your case, you want to recognize whether a receipt image is a picture of a physical receipt or a digital receipt.

There are a couple of rules to be aware of when using Image Recognition. The first is that all images going into the tool need to be a standardized size. It doesn’t matter what the dimensions are set to as long as the same dimensions are used across the entire data set for training data, validation data, and your new data. This makes sense because the model needs to be able to focus on what’s going on in the image and would get overburdened by trying to account for varying sizes.

Luckily you remember that the Image Processing tool allows you to easily scale all your images to the same size in one step. Below you can see an example configuration of the Image Processing tool that will get your images to a standardized height and width - note that your dimension configurations will be specific to your use case, and the numbers shown here are just an example. Don’t worry too much about locking the aspect ratio; the model is good at dealing with distorted (e.g., stretched or squashed) images.

[Screenshot: example Image Processing tool configuration]
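To make the "same dimensions, aspect ratio unlocked" idea concrete, here is a minimal pure-Python sketch that scales a tiny grid of RGB values to a fixed width and height. It only illustrates what resizing without locking the aspect ratio does; it is not how the Image Processing tool works internally, and the function name `resize_nearest` and the sample image are invented for this example.

```python
def resize_nearest(pixels, new_width, new_height):
    """Scale a row-major grid of pixels to new_width x new_height.

    Width and height are scaled independently, so the aspect ratio is
    not preserved (the result may look stretched or squashed).
    """
    old_height = len(pixels)
    old_width = len(pixels[0])
    resized = []
    for y in range(new_height):
        src_y = y * old_height // new_height  # source row to sample
        row = []
        for x in range(new_width):
            src_x = x * old_width // new_width  # source column to sample
            row.append(pixels[src_y][src_x])
        resized.append(row)
    return resized


# A made-up 4-wide, 2-tall "image" of RGB tuples.
RED, GREEN, BLUE, WHITE = (255, 0, 0), (0, 255, 0), (0, 0, 255), (255, 255, 255)
wide_image = [
    [RED, GREEN, BLUE, WHITE],
    [WHITE, BLUE, GREEN, RED],
]

# Squash the wide image into a 2 x 2 square.
square_image = resize_nearest(wide_image, new_width=2, new_height=2)
print(len(square_image[0]), "x", len(square_image))
```

Every image that goes through the same call comes out with the same dimensions, which is the only property the rule above asks for.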

The second rule is a bit trickier to understand. The Image Recognition tool requires the images to be something known as “three-channel.” This means that each pixel in the image file is defined by three values: red, green, and blue. This usually isn’t an issue, but you may come across some one-channel images (e.g., grayscale), two-channel images (e.g., some types of x-rays), and four-channel or “RGBA” images (like a normal RGB image but with an extra transparency channel).

Luckily (again!) you have a tool to help you filter out the non-three-channel images. The Image Profile tool extracts helpful metadata such as the mode of the image, which (as you might have guessed) tells you whether the image is three-channel. The mode you want to see for images getting processed by Image Recognition is “RGB.” So you add a quick Filter tool to make sure any images that are not three-channel are removed.
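As a rough illustration of that filter logic outside Designer, the sketch below keeps records whose mode is “RGB” and sets the rest aside for later handling. The mode strings follow the naming used by the Pillow imaging library (“L”, “LA”, “RGB”, “RGBA”, “CMYK”); that the Image Profile tool reports the same strings for every case is an assumption, and the field names and file names are made up.

```python
# Channel counts for common Pillow-style image modes (assumed naming).
CHANNELS_BY_MODE = {"L": 1, "LA": 2, "RGB": 3, "RGBA": 4, "CMYK": 4}


def split_by_mode(profiles, wanted_mode="RGB"):
    """Return (kept, set_aside) lists based on each record's mode."""
    kept, set_aside = [], []
    for profile in profiles:
        if profile["mode"] == wanted_mode:
            kept.append(profile)
        else:
            set_aside.append(profile)
    return kept, set_aside


# Hypothetical metadata rows; file names and field names are invented.
profiles = [
    {"file": "taxi.jpg", "mode": "RGB"},
    {"file": "hotel_scan.png", "mode": "L"},
    {"file": "email_screenshot.png", "mode": "RGBA"},
    {"file": "dinner.jpg", "mode": "RGB"},
]

kept, set_aside = split_by_mode(profiles)
print("kept:", [p["file"] for p in kept])
for p in set_aside:
    channels = CHANNELS_BY_MODE.get(p["mode"], "unknown")
    print(f"set aside: {p['file']} (mode {p['mode']}, channels: {channels})")
```

Keeping the set-aside list, rather than silently dropping those rows, makes it easy to come back and handle the excluded receipts separately.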

When creating a model such as the one output from Image Recognition, you will want to give it data for both training and validation. Imagine you are giving the Image Recognition tool a set of flashcards (training set) to go through. The front of the flashcard has the receipt image, and the back has the label “picture” or “email.” It keeps going through the flashcards until it learns really well how to identify each. You then give the tool a set of new flashcards (validation set) to test if it really knows how to identify receipts well. Now of course you aren’t actually making flashcards in this case, but you are taking the old labeled data from Fiji receipts and splitting it into “training” and “validation” piles, being sure to include a good mix of “picture” and “email” images in both.
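One way to get “a good mix” in both piles is a stratified split, where each label is split separately in the same proportion. The sketch below is a minimal standard-library version, not the method used in the workflow; the 20% validation fraction, the fixed seed, and the record layout are assumptions chosen for illustration, while the 200 “picture” and 40 “email” counts mirror the data set described later in this article.

```python
import random
from collections import defaultdict


def stratified_split(records, validation_fraction=0.2, seed=42):
    """Split labeled records so every label lands in both sets proportionally."""
    by_label = defaultdict(list)
    for record in records:
        by_label[record["label"]].append(record)

    rng = random.Random(seed)  # fixed seed so the split is repeatable
    training, validation = [], []
    for group in by_label.values():
        rng.shuffle(group)
        # Hold out at least one example of every label for validation.
        n_validation = max(1, round(len(group) * validation_fraction))
        validation.extend(group[:n_validation])
        training.extend(group[n_validation:])
    return training, validation


# Placeholder records standing in for labeled receipt images.
records = [{"file": f"picture_{i}.jpg", "label": "picture"} for i in range(200)]
records += [{"file": f"email_{i}.png", "label": "email"} for i in range(40)]

training, validation = stratified_split(records)
print(len(training), "training images,", len(validation), "validation images")
```

Splitting per label keeps the smaller “email” class from ending up almost entirely in one pile by chance.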

Ideally, you want to have several hundred images in this data set so the model has enough data to train and validate from. You set up your Image Recognition config panel to take from the two data streams and leave most of the options as their defaults. Let’s talk more in depth about the options.

Image Recognition Tool Configuration

There are three more advanced options in the tool’s configuration panel: “Epochs,” “Pre-Trained Model,” and “Batch Size.”

The term “epoch” here means the number of times the training set is passed through the model. Going back to the flashcard example, the more times you go through a set of flashcards, the better you’ll remember them. But, if you go through the cards too many times, you’ll have a much longer study session. Here with epochs, we want to strike a balance between running the data enough times that we create a good model but not so many times that it’s too computationally heavy. We recommend starting with the default “10” and tweaking from there if needed.

The middle option, “Pre-Trained Model,” lets you start from a model that has already been trained on other images instead of training one from scratch, which saves a lot of time. Since a lot of image recognition problems overlap, we choose to use a pre-trained model on the data set. Of the dropdown options listed, we choose the model that will run the fastest (InceptionV3).

Batch size is a configuration similar to epochs as it strikes a balance between accuracy and timing. Smaller batches are faster but less accurate. The default “32” is good for most cases. All three of these configuration options are well-documented in the tooltips on the Image Recognition tool. Next time you have the tool on your canvas, I recommend reading those tooltips to get a better understanding of the configurations. See below for our configuration of the Image Recognition tool.

[Screenshot: our Image Recognition tool configuration]
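To see how epochs and batch size interact, it helps to count batches. The sketch below assumes standard mini-batch training, where each epoch walks through the training set once in chunks of `batch_size`; the tool’s internals are not described here, so treat this as back-of-the-envelope arithmetic. The defaults of 32 and 10 come from the text above, while the figure of 192 training images is a hypothetical example.

```python
import math


def training_steps(num_training_images, batch_size=32, epochs=10):
    """Return (batches per epoch, total batches) for mini-batch training."""
    batches_per_epoch = math.ceil(num_training_images / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs


# Hypothetical training set of 192 images with the default settings.
per_epoch, total = training_steps(192)
print(per_epoch, "batches per epoch,", total, "batches in total")

# Halving the batch size doubles the number of batches per epoch.
per_epoch_small, total_small = training_steps(192, batch_size=16)
print(per_epoch_small, "batches per epoch,", total_small, "batches in total")
```

Raising the epoch count multiplies the total work directly, which is why it is the first setting to revisit when a run takes too long.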

Image Recognition Output

The Image Recognition tool outputs to three anchors representing the model itself, model evaluation metrics, and a model report. You can use a Browse tool to look at the model report and get a sense of how accurate your model is after each epoch. This is a good way to assess if you should tweak your configurations (e.g., batch size, epochs, and model selection) before moving on.

You split your receipts from Fiji into training and validation sets, run the Image Recognition tool, and see the result below in your model report. You feel confident enough to move on to the Bora Bora data!

[Screenshot: model report from the Image Recognition tool]

Using the Model for Scoring

You pull in the new files from the Bora Bora folder and apply the same standardization steps of scaling and filtering for RGB images only. You make a side note to go back and handle any receipts that may have gotten filtered out in the RGB step. Then you connect your standardized receipt images to the other input anchor of the Predict tool. You can click run and let Designer and Intelligence Suite do their thing!

[Screenshot: scoring the new receipts with the Predict tool]

An Important Caveat

In the above screenshots and example workflow, we put the model training (Image Recognition tool) and final scoring (Predict tool) all in the same workflow, but that’s only for ease of demonstration. In your real workflows, you will want to write the model output from Image Recognition to a .yxdb file, then pull it into your workflow with your new data to feed into the Predict tool. This is to avoid having to retrain your model every time you click run! See below for the ideal configuration you should follow in future workflows.

[Screenshots: a training workflow that writes the model to a .yxdb file, and a scoring workflow that reads the saved model]
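The same “train once, score many times” pattern shows up in code as a load-or-train helper. The sketch below is only an analogy: it uses Python’s `pickle` as a stand-in for writing the model to a .yxdb file, and `train_model` is a dummy placeholder rather than real image-model training. All names are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path


def train_model(training_rows):
    """Stand-in for an expensive training step; returns a trivial 'model'."""
    train_model.calls += 1
    labels = sorted({row["label"] for row in training_rows})
    return {"labels": labels}


train_model.calls = 0  # counts how often training actually runs


def load_or_train(model_path, training_rows):
    """Reuse a saved model if one exists; otherwise train once and save it."""
    if model_path.exists():
        with model_path.open("rb") as handle:
            return pickle.load(handle)
    model = train_model(training_rows)
    with model_path.open("wb") as handle:
        pickle.dump(model, handle)
    return model


rows = [{"label": "picture"}, {"label": "email"}]
with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "receipt_model.pkl"
    first = load_or_train(path, rows)   # trains and writes the file
    second = load_or_train(path, rows)  # loads the saved file instead
    print("training runs:", train_model.calls)
```

The second call never touches the training step, which is the behavior you want from a scoring workflow that reads a saved model.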

Our Workflow’s Performance on New Data

Unfortunately, I wasn’t invited to Bora Bora this year, but I was able to test this scenario-driven workflow on some publicly found datasets. We used 240 images total with 40 e-receipts and 200 pictures of receipts to train and validate the model (i.e., the “Fiji” data). Note that ideally you want to have a more even distribution between your two labels (i.e., “picture” and “email”) for training, but in our case, it was quite difficult to find a large data set of email receipts. Then we ran the workflow outlined above on a new data set containing 32 files (i.e., the “Bora Bora” data) and got results that were about 85-90% accurate, depending on whether you count each page of a multi-page file as a separate prediction or count the file as one. Not too bad!

[Screenshot: prediction results on the new data set]
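The two ways of counting can give different numbers, as this small sketch shows. The predictions are invented purely for illustration and are not the results reported above; the rule that a multi-page file counts as correct only when every page is correct is also an assumption, since other aggregation rules (such as a majority vote) are possible.

```python
from collections import defaultdict


def page_accuracy(predictions):
    """Fraction of pages whose predicted label matches the true label."""
    correct = sum(p["predicted"] == p["actual"] for p in predictions)
    return correct / len(predictions)


def file_accuracy(predictions):
    """Fraction of files in which every page was labeled correctly (assumed rule)."""
    all_pages_correct = defaultdict(lambda: True)
    for p in predictions:
        all_pages_correct[p["file"]] &= p["predicted"] == p["actual"]
    return sum(all_pages_correct.values()) / len(all_pages_correct)


# Made-up predictions: one three-page PDF plus three single-page files.
predictions = [
    {"file": "invoice.pdf", "page": 1, "predicted": "email", "actual": "email"},
    {"file": "invoice.pdf", "page": 2, "predicted": "email", "actual": "email"},
    {"file": "invoice.pdf", "page": 3, "predicted": "picture", "actual": "email"},
    {"file": "taxi.jpg", "page": 1, "predicted": "picture", "actual": "picture"},
    {"file": "lunch.jpg", "page": 1, "predicted": "picture", "actual": "picture"},
    {"file": "hotel.pdf", "page": 1, "predicted": "email", "actual": "email"},
]

print(f"per page: {page_accuracy(predictions):.0%}")
print(f"per file: {file_accuracy(predictions):.0%}")
```

A single wrong page in a long PDF barely moves the per-page number but flips the whole file to incorrect under this rule, which is why the two figures can diverge.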

Final Thoughts

While we may not all be lucky enough to travel to exotic places for work, I think most of us have experienced a situation where we have had to sort documents or images into different categories. Keep the Image Recognition tool and the entire Intelligence Suite in mind the next time you come across a problem like this - who knows, maybe this type of analysis wins you a spot at the company offsite to Hawaii next year 😉

Automated Receipt Identification with Alteryx Intelligence Suite's Image Recognition Tool
