Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Data Science

Machine learning & data science for beginners and experts alike.
nlane
Alteryx
Alteryx

As we kick off the new year 2022, many will make the conscious effort to start living healthier. This usually is composed of trying new exercise programs, focusing on mental health, and watching what you eat. To help with the latter, here is an Alteryx workflow and annotation template to help you identify exactly how much sugar, fat, and calories are in the food you’re eating. Disclaimer: I am not a doctor and would advise you consult with one before making any drastic changes to your diet. This article is just for fun and is not meant to influence your eating patterns, instead I hope it inspires you to make more Alteryx workflows.

 

gif.gif

 

Annotating a Nutrition Label

 

If you are in the US, I bet you’ve seen the “Nutrition Facts” label found on food packaging. Many foods that are covered by the FDA are required to have this iconic label to help consumers make educated decisions about what to eat.

 

Standard “Nutrition Facts” labelStandard “Nutrition Facts” label

 

 

Because this format is standardized, it’s a perfect use case to implement the Image Template tool from the Alteryx Intelligence Suite (to learn more about how to use these tools you can take the Intelligence Suite Learning Path). I’ve split up the nutrition facts label into three sections: header, calories, and daily values. Now with this template, I can process multiple nutrition facts labels and split them up by these sections.

 

Image Template tool annotations for nutrition facts labelImage Template tool annotations for nutrition facts label

 

 

Extracting the Text

 

The Image to Text tool can now do its magic by taking in the image inputs (I created a folder with a few nutrition label images to test) and our above annotation. It creates a data set of the text extracted from the labels split by our annotations. You can see in the below image that the columns “header”, “calories”, and “daily values” are the text extraction pieces. Each of those cells has multiple lines of text for us to parse in the next section. You may notice some seemingly random lines and odd characters in some of the cells, but don’t fear! The boundaries of my annotations align with physical lines on the labels so the Image to Text tool tries to interpret that as text; we will clean it up in the next two sections. To help avoid issues like this it is recommended that you use high resolution images (specifically the AIS tools are optimized to work with images of 300 DPI resolution).

 

Step 1: Input images and convert to textStep 1: Input images and convert to text

 

Pulling the Key Data Points

 

Next I use parsing tools to grab the exact keywords needed out of each section. For this part of the workflow, I focus on grabbing the serving size, calories, and sugar content for each label.

 

In simple cases you can use the Formula tool with string functions to parse out the content needed. In more complex cases I suggest the RegEx tool. For my example, I found that the wording for sugar content was pretty consistent across my data set, however the serving size and calories varied in wording and placement. I ended up using regex for the latter two to yield a better output.

 

Step 2: Extract serving size, sugars, and caloriesStep 2: Extract serving size, sugars, and calories

 

 

Working with the Daily Values

 

The “daily values” section of the nutrition labels is the trickiest. While all my example food labels have standard fields such as fat and sodium content, there is some variety in what gets included in the rest of the label. For example take the two below images. The one on the left provides four breakdowns for the type of fat whereas the one on the right only has two. The one on the right provides values for potassium whereas the left image has none. With differences like these, we have to take a bit more caution when processing the daily value fields.

 

NeilR_0-1646951824862.png

 

Because the data for daily values is limited to one column, I need to split it out to actually see the individual nutritional values. The Text to Column tool gets the job done by taking the “daily values” column and splitting the data at each new line. This is best shown after transposing the data and filtering out the original “daily values” column (see below).

 

Step 3: Split the daily values columnStep 3: Split the daily values column

 

We are getting close! Now it’s time to split the column “value” so the nutrient name (e.g. “Total Fat”) is separate from the value (e.g. “5g”). The Regex tools are perfect for this type of operation so I use two instances of the tool: one to grab the initial text and another to grab the following number. Finally I use the Cross Tab tool to find the overlapping nutrient names across the various labels. This is where the problem from above comes to light: not all nutrition labels will show the same nutrients! You can see the below columns “Soluble Fiber” and “Potassium” are very uncommon nutrient fields in our data set but “Saturated Fat” and “Sodium” are standard fields found in all the labels. Note: for those latter two columns you may notice a couple of null fields but in the next column a value of “0%”. This happens when the labels lists 0g for the nutrient but the Image to Text tool reads in “0” as the letter “O”. This is confirmed by the fact that you see the column name “Saturated Fat Og” and “Sodium Omg”. This edge case could be handled by adding a formula tool with a little logic to look for “Og” and “Omg” and convert them to “0”.

 

Step 4: Find the overlapping nutrient namesStep 4: Find the overlapping nutrient names

 

Results

 

To see the results, let’s take our data and view it through the reporting tools.

 

Step 5: Output to reporting toolsStep 5: Output to reporting tools

 

I’ve split the output into two formats so that we can see the individual facts for each label as well as a summarized view. The below is the data set for all the key nutrient values by label. It’s amazing that after starting with pictures of labels we are able to extract all this key data with an impressive accuracy!

 

Output 1: Nutrition info by labelOutput 1: Nutrition info by label

 

The second view is the summed values. Let’s say this data set was everything I ate in one day, what is the total number of calories I consumed? How much fat? Well the below values would give you a great estimate for those types of questions.

 

Output 2: Summed nutritional info from all labelsOutput 2: Summed nutritional info from all labels

 

Final Thoughts

 

What excites me most about this workflow is the many opportunities to expand on it in the future. Through this process of taking standard food packaging, placing it into a template, and extracting the needed text, you could grab any data you need to know from food packaging: nutrition information, ingredients, allergens, serving sizes, cooking instructions, and more. As someone who has a serious food allergy, being able to quickly scan the ingredients list or the “Contains” label on food would be a huge time-saver and potentially a life-saver! So the next time you pick up that box of crackers at the grocery store and see the nutrition facts label, just think of all the things Alteryx can help you do.

 

gif.gif

 

Running the Workflow

 

In order to run the workflow you’ll need Alteryx Designer running on your machine with the Alteryx Intelligence Suite installed. If you don’t have Designer you can download a free trial. Similarly if you need the Intelligence Suite, there’s a free trial for that too. Once your machine is set-up you can run the processing nutrition labels workflow available from Gallery.

Comments