Encoding Variables: Translating Your Data so the Computer Understands It

Question

Humans and computers don't understand data in the same way, and an active area of research in AI is determining how AI "thinks" about data. For example, the recent Quanta article "Where We See Shapes, AI Sees Textures" discusses an inherent disconnect between how humans and computer vision AI interpret images. The article addresses the implicit assumption many people have that when AI works with an image, it interprets the contents of the image the same way people do: by identifying the shapes of the objects. However, because most AI interprets images at the pixel level, it is more natural for the AI to label images by texture than by shape (more pixels in an image represent an object's texture than its outline or border).

Another useful example is language. Humans communicate with one another using complex languages in which words can take on multiple meanings, whereas machines, at their most basic level, operate on machine code, a strictly numerical language that allows a person to program highly specific tasks on a machine, such as storing data. Machine code is tedious at best; for most of the population (even programmers), it is entirely unreadable.

Programming languages used in data science like Python or R are a mid-point. They are not as flexible or dynamic as spoken languages, but they are far easier for people to understand than machine code, and still have meaning for a computer. Programming languages need to be more rigid than spoken languages because the computer can’t interpret ambiguity. A computer will do exactly what it is told to and nothing else – anything left to interpretation can’t be guaranteed. This is why it is important to have clean, clear data with correctly typed variables when working with machine learning algorithms.

Categorical variables are among the things that seem simple to humans but are less clear to computers. A categorical variable takes its values from a set of discrete labels. Machine learning algorithms are based on mathematical equations, meaning that they (typically) work entirely in numbers. A text value in a categorical variable doesn’t inherently mean anything to a computer or machine learning algorithm.

Some machine learning algorithms are implemented to handle categorical variables, but some implementations (like the algorithms in scikit-learn in Python) leave it up to the person on the keyboard-end of the interaction to convert the variables into an appropriate format.

There are many different strategies for converting categorical or string data into a format that is meaningful to an algorithm. These strategies fall under a group of processes called encoding, which means to convert something to code (specifically, computer code). Here is a sampling of some common strategies for encoding your data, ranging from the simplest to the more advanced.

Label Encoding

Label encoding is the process of assigning a numeric label for each categorical label.

The process is simple: every value in your categorical variable gets assigned a number to represent it.

Although this is probably the most straightforward approach to encoding categorical variables, there is an important side effect to consider: the encoded categorical values inherit the relationships between numbers. For example, if blue is encoded as 1, red as 2, and green as 3, then one red is equal to two blues, blue and red combined equal green, and green has three times the weight of blue.

In a lot of cases, these relationships don’t make a ton of sense for the data and can cause goofy outcomes in a trained model. The model might leverage the relationship between the encodings, where no relationship actually exists in the data.
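Here is a minimal sketch of label encoding in plain Python. It assumes a hypothetical color column and assigns codes in order of first appearance; both the data and the numbering scheme are illustrative, not a prescribed method. The last line shows the side effect described above: the codes obey arithmetic whether or not the colors do.

```python
# Hypothetical color column; the integer codes assigned below are arbitrary.
colors = ["blue", "red", "green", "red", "blue"]

# Assign each distinct label an integer, in order of first appearance.
label_to_code = {}
for color in colors:
    if color not in label_to_code:
        label_to_code[color] = len(label_to_code) + 1

encoded = [label_to_code[color] for color in colors]
print(label_to_code)  # {'blue': 1, 'red': 2, 'green': 3}
print(encoded)        # [1, 2, 3, 2, 1]

# The codes now obey ordinary arithmetic, meaningful or not: blue + red == green.
print(label_to_code["blue"] + label_to_code["red"] == label_to_code["green"])  # True
```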

One-Hot Encoding

One-hot encoding is a strategy where, instead of simply converting each value of a variable to a number, a new column is created for each category in the variable. The values in each new column are either one or zero, indicating whether a given row belongs to that category.

The advantage of one-hot encoding is that all categories are equidistant from one another. The encoding introduces no notion of order or magnitude between the categories.
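As a rough illustration, the sketch below one-hot encodes a hypothetical color column by hand and then checks the equidistance claim with Euclidean distance. The helper name one_hot and the sample data are assumptions made for this example.

```python
import math

colors = ["blue", "red", "green", "red"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

def one_hot(value, categories):
    """Return a list with a 1 in the position of `value` and 0 everywhere else."""
    return [1 if value == category else 0 for category in categories]

rows = [one_hot(color, categories) for color in colors]
for color, row in zip(colors, rows):
    print(color, row)

# Measure the Euclidean distance between every pair of distinct categories.
vectors = {category: one_hot(category, categories) for category in categories}
distances = {
    (a, b): math.dist(vectors[a], vectors[b])
    for a in categories
    for b in categories
    if a < b
}
print(distances)  # every pair is the same distance apart (the square root of 2)
```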

One-hot encoding is essentially the same as dummy coding, an older term that comes from the field of statistics. The encoding process is identical, but creating dummy variables typically also includes dropping one of the encoded columns, because if the values are all zero in the remaining columns, the row must belong to the dropped category. This works because both one-hot encoding and dummy variables assume that the values of the category are mutually exclusive (e.g., if an observation is blue, it cannot also be green or red).
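To make the dropped column concrete, here is a small sketch that dummy-codes the same kind of hypothetical color column, treating the first category alphabetically as the dropped reference. That choice, and the decode helper, are assumptions for illustration. No information is lost, since an all-zero row can only mean the reference category.

```python
colors = ["blue", "red", "green", "red"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']
reference, *kept = categories     # drop 'blue' and treat it as the reference

dummy_rows = [[1 if color == category else 0 for category in kept] for color in colors]
print(kept)        # ['green', 'red']
print(dummy_rows)  # [[0, 0], [0, 1], [1, 0], [0, 1]]

def decode(row):
    """Recover the label; an all-zero row means the dropped reference category."""
    for category, flag in zip(kept, row):
        if flag == 1:
            return category
    return reference

print([decode(row) for row in dummy_rows])  # ['blue', 'red', 'green', 'red']
```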

A limitation of one-hot encoding (and dummy coding) is that the number of columns grows with the number of values in a given categorical variable, and you can end up with a sparse dataset, particularly if you have high cardinality (many different values) in your variable(s).

Embeddings in Deep Learning

Deep learning encounters the same limitations of working with categorical or text data as other machine learning algorithms. To a neural network, a string of text doesn’t mean anything, making it difficult to identify meaningful patterns or relationships between words in a body of text, or different categories. Some deep learning models will use one-hot encoding; however, a more sophisticated set of strategies called embedding has been developed.

Embeddings are learned vector (series of numbers) representations of data. What makes embeddings powerful is that the vector representations capture the relationships between words (in natural language processing) or individual categories or items (e.g., songs, movies, or fruit).

For example, if you wanted to create 2-dimensional embeddings for fruit, each type of fruit would be represented by two numbers (think of them as coordinates) that place the fruit along two axes.

Embeddings are learned with a neural network. The neural network is fed a dataset and given a fake task that (hopefully) emphasizes the relationships between the values in the data that you want to create embeddings for. For example, one variation of word2vec embeddings trains a neural network model to predict a word based on the other words in a sentence. The neural network then works on solving its assigned (fake) task. Once it's done, instead of looking at the outputs of the task, the weights from the hidden layer of the neural network are taken for each word or category in the dataset and used as a vector to represent that word or category.
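To give a feel for the "fake task", the sketch below builds the training examples for a predict-the-word-from-its-neighbors setup. It only prepares (context, target) pairs and does not train a network. The function name, the window parameter, and the sample sentence are all assumptions for illustration, and real word2vec implementations involve additional details not shown here.

```python
def context_target_pairs(tokens, window=2):
    """Yield (context_words, target_word) pairs: predict a word from its neighbors."""
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        yield left + right, target

sentence = "i ate a ripe banana".split()
pairs = list(context_target_pairs(sentence, window=1))
for context, target in pairs:
    print(context, "->", target)
# ['ate'] -> i
# ['i', 'a'] -> ate
# ['ate', 'ripe'] -> a
# ['a', 'banana'] -> ripe
# ['ripe'] -> banana
```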

With embeddings, words or categories that are similar to one another will have similar vectors, so if we were to create embeddings for food, fruits would be clustered together in vector space, and vegetables would be clustered together in vector space. Pretty neat, right?
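One common way to measure "similar vectors" is cosine similarity. The sketch below uses made-up 2-dimensional vectors (real embeddings would be learned, and these numbers were chosen by hand purely to illustrate clustering) and compares a fruit with another fruit and with a vegetable.

```python
import math

# Made-up 2-dimensional vectors for illustration only; real embeddings are learned.
embeddings = {
    "apple":   [0.9, 0.8],
    "banana":  [0.8, 0.9],
    "carrot":  [-0.7, 0.6],
    "spinach": [-0.8, 0.5],
}

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: near 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

fruit_vs_fruit = cosine_similarity(embeddings["apple"], embeddings["banana"])
fruit_vs_vegetable = cosine_similarity(embeddings["apple"], embeddings["carrot"])
print(fruit_vs_fruit > fruit_vs_vegetable)  # True
```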

In addition to capturing relationships, embeddings can reduce the dimensionality of the data compared with methods like one-hot encoding. Instead of a sparse vector whose length equals the number of unique words or categories in a dataset, you specify the length of the vectors you'll use to represent your data.
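The size difference can be sketched with a toy vocabulary. Here the embedding table is just a randomly initialized lookup table standing in for learned values; the vocabulary, the chosen length of 2, and the helper names are assumptions for the example.

```python
import random

vocabulary = ["apple", "banana", "carrot", "spinach", "grape", "kale"]
index = {word: i for i, word in enumerate(vocabulary)}

def one_hot(word):
    """Sparse representation: one slot per vocabulary entry."""
    vector = [0] * len(vocabulary)
    vector[index[word]] = 1
    return vector

embedding_dim = 2  # chosen by the modeler, independent of vocabulary size
random.seed(0)
# Randomly initialized lookup table; training would adjust these values.
embedding_table = [
    [random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocabulary
]

def embed(word):
    """Dense representation: look up the word's row in the table."""
    return embedding_table[index[word]]

print(len(one_hot("kale")))  # 6
print(len(embed("kale")))    # 2
```

With six words the saving is trivial, but the one-hot length keeps growing with the vocabulary while the embedding length stays whatever you set it to.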

Unsolicited Advice for Talking to Your Computer with Data

One-hot encoding is the most common strategy. For most use cases, you will use one-hot encoding or dummy variables to represent your categorical data.

If you are performing a regression, it is typically better to drop one of your one-hot encoded/dummy variables to avoid what is known as the dummy variable trap. Including all the encoded variables creates perfect multicollinearity in your training dataset: one predictor variable can be perfectly predicted from the other predictor variables, which makes it redundant. There are also a variety of encoding methods developed in the field of statistics specifically for regression that you can read about here.
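The redundancy behind the dummy variable trap is easy to see on a tiny, hypothetical color column: when every encoded column is kept, each row sums to exactly one, so any single column can be reconstructed from the rest. This sketch only demonstrates that redundancy; it does not fit a regression.

```python
colors = ["blue", "red", "green", "red", "blue"]
categories = sorted(set(colors))  # ['blue', 'green', 'red']

one_hot_rows = [
    [1 if color == category else 0 for category in categories] for color in colors
]

# With all columns kept, every row sums to 1, duplicating a constant (intercept) column.
row_sums = [sum(row) for row in one_hot_rows]
print(row_sums)  # [1, 1, 1, 1, 1]

# So any one column is fully determined by the others: blue = 1 - green - red.
blue_from_others = [1 - row[1] - row[2] for row in one_hot_rows]
print(blue_from_others == [row[0] for row in one_hot_rows])  # True
```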

A consideration to make with one-hot encoding is that each value in a category creates a new column (dimension) in your dataset, which can set you up on the fast track to having to deal with the curse of dimensionality. Stay conscious of the cardinality of the categorical variables you want to use in your machine learning problem. It might make sense to combine or remove some categories that don’t represent a large chunk of your data.
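One simple way to combine small categories is to fold anything below a frequency threshold into a catch-all label before encoding. The function name, the threshold, the "other" label, and the sample data below are all assumptions for illustration; the right cutoff depends on your data.

```python
from collections import Counter

def collapse_rare(values, min_count, other_label="other"):
    """Replace categories seen fewer than `min_count` times with `other_label`."""
    counts = Counter(values)
    return [value if counts[value] >= min_count else other_label for value in values]

cities = ["denver", "denver", "boulder", "denver", "aspen", "boulder", "vail"]
collapsed = collapse_rare(cities, min_count=2)
print(collapsed)
# ['denver', 'denver', 'boulder', 'denver', 'other', 'boulder', 'other']
print(len(set(cities)), "->", len(set(collapsed)))  # 4 -> 3
```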

In principle, decision trees can handle categorical variables without any encoding, but whether this is supported depends on the specific implementation of the algorithm. If it is not handled by the algorithm, one-hot encoding might not be the best choice for dealing with categorical variables. You can read further in the articles "Visiting: Categorical Features and Encoding in Decision Trees" and "Are categorical variables getting lost in your random forests?"

Label encoding is typically reserved for ordinal data, where one value is inherently “more” or “better” than another. Think about a ranking variable with the values high, medium, and low. In this case, saying low plus medium equals high might make sense, so using label encoding is a logical option.
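For ordinal data, it helps to spell out the order yourself rather than rely on whatever order the labels happen to arrive in. A minimal sketch, assuming a hypothetical ratings column:

```python
# Explicit order, lowest to highest; the labels and data are illustrative.
levels = ["low", "medium", "high"]
level_to_code = {level: code for code, level in enumerate(levels, start=1)}

ratings = ["medium", "low", "high", "low"]
encoded = [level_to_code[rating] for rating in ratings]
print(level_to_code)  # {'low': 1, 'medium': 2, 'high': 3}
print(encoded)        # [2, 1, 3, 1]

# Sorting the labels alphabetically would scramble the ranking.
print(sorted(levels))  # ['high', 'low', 'medium']
```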

As for embeddings, they are very powerful, but they require huge amounts of data and training time to learn. If you have the data and your use case demands it, go for it, but in most scenarios that don’t involve deep learning, training embeddings is probably not necessary or feasible.

zakellaoui · Answer

Great post Sydney!