Data Science

nlane · ‎03-10-2022

As we kick off the new year 2022, many will make the conscious effort to start living healthier. This usually is composed of trying new exercise programs, focusing on mental health, and watching what you eat. To help with the latter, here is an Alteryx workflow and annotation template to help you identify exactly how much sugar, fat, and calories are in the food you’re eating. Disclaimer: I am not a doctor and would advise you consult with one before making any drastic changes to your diet. This article is just for fun and is not meant to influence your eating patterns, instead I hope it inspires you to make more Alteryx workflows.

Annotating a Nutrition Label

If you are in the US, I bet you’ve seen the “Nutrition Facts” label found on food packaging. Many foods that are covered by the FDA are required to have this iconic label to help consumers make educated decisions about what to eat.

Standard “Nutrition Facts” label Standard “Nutrition Facts” label

Because this format is standardized, it’s a perfect use case to implement the Image Template tool from the Alteryx Intelligence Suite (to learn more about how to use these tools you can take the Intelligence Suite Learning Path). I’ve split up the nutrition facts label into three sections: header, calories, and daily values. Now with this template, I can process multiple nutrition facts labels and split them up by these sections.

Image Template tool annotations for nutrition facts label Image Template tool annotations for nutrition facts label

Extracting the Text

The Image to Text tool can now do its magic by taking in the image inputs (I created a folder with a few nutrition label images to test) and our above annotation. It creates a data set of the text extracted from the labels split by our annotations. You can see in the below image that the columns “header”, “calories”, and “daily values” are the text extraction pieces. Each of those cells has multiple lines of text for us to parse in the next section. You may notice some seemingly random lines and odd characters in some of the cells, but don’t fear! The boundaries of my annotations align with physical lines on the labels so the Image to Text tool tries to interpret that as text; we will clean it up in the next two sections. To help avoid issues like this it is recommended that you use high resolution images (specifically the AIS tools are optimized to work with images of 300 DPI resolution).

Step 1: Input images and convert to text Step 1: Input images and convert to text

Pulling the Key Data Points

Next I use parsing tools to grab the exact keywords needed out of each section. For this part of the workflow, I focus on grabbing the serving size, calories, and sugar content for each label.

In simple cases you can use the Formula tool with string functions to parse out the content needed. In more complex cases I suggest the RegEx tool. For my example, I found that the wording for sugar content was pretty consistent across my data set, however the serving size and calories varied in wording and placement. I ended up using regex for the latter two to yield a better output.

Step 2: Extract serving size, sugars, and calories Step 2: Extract serving size, sugars, and calories

Working with the Daily Values

The “daily values” section of the nutrition labels is the trickiest. While all my example food labels have standard fields such as fat and sodium content, there is some variety in what gets included in the rest of the label. For example take the two below images. The one on the left provides four breakdowns for the type of fat whereas the one on the right only has two. The one on the right provides values for potassium whereas the left image has none. With differences like these, we have to take a bit more caution when processing the daily value fields.

Because the data for daily values is limited to one column, I need to split it out to actually see the individual nutritional values. The Text to Column tool gets the job done by taking the “daily values” column and splitting the data at each new line. This is best shown after transposing the data and filtering out the original “daily values” column (see below).

Step 3: Split the daily values column Step 3: Split the daily values column

We are getting close! Now it’s time to split the column “value” so the nutrient name (e.g. “Total Fat”) is separate from the value (e.g. “5g”). The Regex tools are perfect for this type of operation so I use two instances of the tool: one to grab the initial text and another to grab the following number. Finally I use the Cross Tab tool to find the overlapping nutrient names across the various labels. This is where the problem from above comes to light: not all nutrition labels will show the same nutrients! You can see the below columns “Soluble Fiber” and “Potassium” are very uncommon nutrient fields in our data set but “Saturated Fat” and “Sodium” are standard fields found in all the labels. Note: for those latter two columns you may notice a couple of null fields but in the next column a value of “0%”. This happens when the labels lists 0g for the nutrient but the Image to Text tool reads in “0” as the letter “O”. This is confirmed by the fact that you see the column name “Saturated Fat Og” and “Sodium Omg”. This edge case could be handled by adding a formula tool with a little logic to look for “Og” and “Omg” and convert them to “0”.

Step 4: Find the overlapping nutrient names Step 4: Find the overlapping nutrient names

Results

To see the results, let’s take our data and view it through the reporting tools.

Step 5: Output to reporting tools Step 5: Output to reporting tools

I’ve split the output into two formats so that we can see the individual facts for each label as well as a summarized view. The below is the data set for all the key nutrient values by label. It’s amazing that after starting with pictures of labels we are able to extract all this key data with an impressive accuracy!

Output 1: Nutrition info by label Output 1: Nutrition info by label

The second view is the summed values. Let’s say this data set was everything I ate in one day, what is the total number of calories I consumed? How much fat? Well the below values would give you a great estimate for those types of questions.

Output 2: Summed nutritional info from all labels Output 2: Summed nutritional info from all labels

Final Thoughts

What excites me most about this workflow is the many opportunities to expand on it in the future. Through this process of taking standard food packaging, placing it into a template, and extracting the needed text, you could grab any data you need to know from food packaging: nutrition information, ingredients, allergens, serving sizes, cooking instructions, and more. As someone who has a serious food allergy, being able to quickly scan the ingredients list or the “Contains” label on food would be a huge time-saver and potentially a life-saver! So the next time you pick up that box of crackers at the grocery store and see the nutrition facts label, just think of all the things Alteryx can help you do.

Running the Workflow

In order to run the workflow you’ll need Alteryx Designer running on your machine with the Alteryx Intelligence Suite installed. If you don’t have Designer you can download a free trial. Similarly if you need the Intelligence Suite, there’s a free trial for that too. Once your machine is set-up you can run the processing nutrition labels workflow available from Gallery.

kelvin_law1 · ‎03-10-2022

Hi @nlane

Thanks for this interesting sharing and I have downloaded the workflow from gallery.

I have one question though, these labels come in different sizes for each section, for example facts4 calories section position cannot be highlighted correctly using the facts2 template, therefore the calories cannot be extracted. Is there any way to improve this?

Thanks very much again!!

nlane · ‎03-11-2022

Hey @kelvin_law1! Thanks for downloading :)

This was something I fiddled around with a bit when creating my workflow: getting the annotation boxes in a sweet spot where they would capture the data in most use cases. If we are striving for perfection we have a few options in this situation:

Pre-process the images to make them all the same size.
Use bigger boundaries with some extra regex to capture all the needed data (think of one big box that covers the whole label for example).
Create a separate template (or templates) for the differently sized labels. This would mean using another instance of the Image Template tool and the Image to Text tools and then combining the data sets together before doing the rest of the processing.

Hope this helps!!

Data Science

Using Computer Vision to Extract Data From Nutrition Labels

Annotating a Nutrition Label

Extracting the Text

Pulling the Key Data Points

Working with the Daily Values

Results

Final Thoughts

Running the Workflow

Computer vision/IS on server

Re: Efficient Extraction of Company Insights: 10-K...

SAP Purchasing - Extract Invoice Receipt Data (Tab...

SAP Purchasing - Extract Purchasing Document Data ...

Extracting Data Help!!