Introducing Woodwork - An Open Source Python Library for Rich Semantic Data Typing

Question

At Alteryx, we aim to create tools for advancing machine learning capabilities. To help everyone solve impactful problems, we're building innovative open source tools for each step of the machine learning pipeline, automating all parts of the machine learning process, and making it easy for anyone to gain insights and build predictive models.

We are excited to announce a new addition to our open-source projects: Woodwork, a Python library that provides robust methods for managing and communicating data typing information. Woodwork can be used as a part of your existing workflow to properly type your data and communicate your data types to downstream processes.

Background

In machine learning and other data analysis processes, data is often stored in collections known as DataFrames and Series. DataFrames are tables that organize data into rows and columns, similar to a spreadsheet. In addition to the data, DataFrames normally contain information such as labels or names to identify each column or row, and information about what the data represents, such as integers, floating point numbers or strings/text. A Series is similar to a DataFrame, except that a Series contains only a single column of data. DataFrames and Series provide simple and convenient methods to store, access and manipulate data, making them very common in data science and machine learning applications.

Figure: A DataFrame (left) is a structure used to store a tabular collection of data, while a Series (right) is a similar structure used to store only a single column of data.

Alteryx has built several open-source libraries that leverage information stored in DataFrames and Series to help automate the machine learning process. The first library we built was Featuretools, a library for automated feature engineering. Featuretools helps users transform their temporal and relational datasets into feature matrices for machine learning.
While building Featuretools, we realized that data scientists often start their modeling process by writing a specific labeling script to find the outcome that they want to predict. The labeling script contains many lines of code for processing raw data while also including any constraints to deal with challenges as they arise. This led us to create Compose, a library for automated prediction engineering. Compose allows data scientists to structure prediction problems and generate labels. Data scientists can use Compose to efficiently generate training examples for many prediction problems, even across different domains. Using these two libraries, data scientists can use Compose to generate training examples, which can be passed to Featuretools to generate a feature matrix.

After labeling data and generating a feature matrix, the next step in the modeling process is to create a model. Depending on the data, column types, and problem type to be solved, there may be a significant amount of preprocessing required. For example, data scientists may need to encode certain categorical columns, impute missing values or generate a baseline model. Most importantly, data scientists may want to optimize their model search for a domain-specific objective, such as minimizing the cost of fraud. This process can involve significant time and effort, which led us to wonder - How can we automate the modeling process to allow data scientists to build machine learning models in a simple and flexible way?

This led us to create EvalML, a library for automated modeling. EvalML is an AutoML library that helps users build, optimize, and evaluate machine learning pipelines. Data scientists can use EvalML to automatically find the best pipeline for their prediction problem. Furthermore, EvalML helps users understand how the model behaves on their data, and examine the key factors influencing its predictions.

Why Is Type Representation Important?
During the development of EvalML, several data typing issues were encountered:

* Unexpected input data types resulted in search algorithm failures
* Imputer not working properly for certain missing value representations
* Improper operations applied to categorical columns

In short, we were struggling with data typing. As we developed Alteryx Machine Learning and integrated our open-source libraries, we wanted to bring our libraries together so that typing information could be passed seamlessly throughout the process. We needed a way to allow users to specify data types on the input data and then pass that to Featuretools for feature engineering and on to EvalML for AutoML.

We looked at Featuretools, where we had already solved the problem of data typing for feature generation, and realized this existing functionality could be improved and leveraged in other places to solve our problem. The same approach could be used across all of our libraries to standardize data typing in a way that didn’t exist before. With a new approach, our libraries would be able to communicate typing information with each other by speaking the same typing language. Out of this came the Woodwork library, which has now been released and fully integrated into Featuretools, EvalML and Alteryx Machine Learning.

Alteryx open-source tools can be used in every step of the machine learning process, starting with data typing in Woodwork, creating training labels with Compose, performing feature engineering with Featuretools and running AutoML with EvalML.

From Idea to Implementation

Background Investigation

When starting to develop Woodwork, we initially looked at the capabilities of existing libraries to determine if anything already available could provide the functionality needed. We first looked to pandas, but unfortunately the types are limited and don’t provide the level of detail needed to differentiate between similar but distinct types.
For example, data using the pandas string data type could represent natural language or an email address, so labeling a column with only the string type doesn’t allow for distinguishing between these two types of information.

We also looked to Featuretools, which had previously implemented a type system using a Featuretools Variable object that defined various types. However, Featuretools variables were set up using a parent-child relationship structure for all types. This generally worked well, but did introduce some problems. Take the Id variable type, for example. ID columns can contain different types of data, including strings and integers; however, not all string or integer columns are valid ID columns, since the values must be unique. Setting up a proper parent-child relationship in this situation is difficult.

Figure: The type system previously implemented in Featuretools, using an inheritance approach between types that made representing certain types difficult.

Based on these background studies, when implementing the Woodwork type system, we moved away from a strict parent-child hierarchy for types and also introduced a tagging-based approach to allow for more flexibility. For example, the "index" tag can be applied to integers, floats or strings to designate the column as an index column, and index columns are not restricted to deriving from a single parent type.

Basic Concepts in Woodwork

There are a few concepts in Woodwork that are critical to understand. Woodwork defines three different types/tags, each defined on a per-column basis:

* Physical Type: defines how the data is stored on disk or in memory.
* Logical Type: defines how the data should be parsed or interpreted.
* Semantic Tag(s): provide additional data about the meaning of the data or how it should be used.

Woodwork is set up so that each logical type uses only one physical type for storing the data.
When a column is set to a specific logical type, Woodwork will automatically convert the column physical type to the appropriate value. In this way, users can be assured that a column that is set or inferred as a Double logical type will use a pandas float64 physical type for storing the data, meaning the data in that column can be used for any operation that is valid for a float64.

Some logical types share the same physical type. Take the EmailAddress and URL logical types, for example. Both of these types store data with a string physical type. However, by labeling these columns with a different logical type, downstream operations can leverage this information. To extract the domain from these types, we need to parse them differently. The logical type provides that information to the application and can be used to apply the appropriate parsing method to a column to extract the desired information.

As an example of how semantic tags can be useful, consider a dataset with two date columns: a signup date and a user birth date. Both of these columns have the same physical type (datetime64[ns]), and both have the same logical type (Datetime). However, semantic tags can be used to differentiate these columns. For example, you might want to add the date_of_birth semantic tag to the user birth date column to indicate this column has special meaning and could be used to compute a user’s age. Computing an age from the signup date column would not make sense, so the semantic tag can be used to differentiate between what the dates in these columns mean.

Storing information about the allowed usage of each column means there is a single source of truth for how each column should be used by your software, making downstream code simpler and more consistent and system behavior easier to understand.

Why Woodwork Uses Custom Accessors

After the key concepts in Woodwork were identified, the implementation began.
Our first approach was to store the typing information on a DataTable class, and a reference to the DataFrame would be stored in this class. The limitations of this approach became apparent quickly. If users wanted to access the underlying DataFrame object they had to call a method or access a property on the DataTable class. This also meant that updating existing applications to use Woodwork would require swapping out all pandas.DataFrames with woodwork.DataTables. Even after doing that, accessing the underlying DataFrame methods could be cumbersome and we found that we needed to implement DataTable versions of DataFrame methods that in most cases just called the corresponding DataFrame method without adding any new functionality - an implementation and maintenance nightmare. Our goal for Woodwork was to focus on additive functionality for data typing, not wrapping DataFrame functions.

The solution we settled on was using custom accessors on the DataFrame. With this approach we could add a custom namespace to pandas, Dask and Spark DataFrames, and users would not have to modify their existing workflows. They could continue using their existing DataFrames, and start taking advantage of Woodwork data typing in only the places they need it. By simply importing Woodwork, a custom namespace is added to existing DataFrame objects, and users can start to add rich typing information without the need to swap out the DataFrames in their code with new Woodwork objects.

Benefits of Using Woodwork

Woodwork offers multiple benefits to users in applications where typing information is critical. Woodwork typing is consistent and dependable, meaning that when a column is assigned a certain type users can be assured that any operation that is valid for that type can be performed on the column.

Additionally, Woodwork offers several convenience methods that make selecting and manipulating data based on types easy. Need to access all the integer columns in a wide data table?
Woodwork lets users do that with a single, simple select method.

Woodwork’s default type inference system automatically identifies data types for columns in a DataFrame during initialization. No more combing through columns of data to manually identify the types - in most cases Woodwork can identify the proper type automatically, and when Woodwork doesn’t quite get it right, users can quickly and easily update the column to the proper type.

Finally, since Woodwork is an open source library, users can benefit from community-driven enhancements. Improvements driven from one specific application of Woodwork become available to all users.

Woodwork offers all these benefits with an easy to use interface. Woodwork extends DataFrames by using custom table and column accessors. As such, adding Woodwork’s rich typing information to existing DataFrame applications is often straightforward, and doesn’t require large-scale code updates. Woodwork ensures data is represented in a consistent manner across all rows based on the specified column logical type.

Demo: Basic Operations using Woodwork and Pandas

In this section we provide examples demonstrating the core functionality of Woodwork. Please note, this is not a comprehensive review, and users are encouraged to consult the Woodwork Documentation or one of our many Woodwork Guides for more thorough examples of how to use Woodwork.

Installation

Woodwork is fully released and available for use in your Python environment. Installation is simple and can be done with either pip or conda:

pip users

    python -m pip install woodwork

conda users

    conda install -c conda-forge woodwork

Since Woodwork is now fully integrated into both EvalML and Featuretools, it is shipped and installed by default during installation of either of those libraries. Additionally, Woodwork is now used extensively throughout Alteryx Machine Learning for managing data types.

Basic Initialization

To demonstrate how simple it is to use Woodwork on an existing pandas DataFrame, we will first read some example retail data into a DataFrame and then initialize Woodwork.

    import pandas as pd

    filename = "https://api.featurelabs.com/datasets/online-retail-logs-2018-08-28.csv"
    df = pd.read_csv(filename)

    # Add a unique identifier column to represent the index
    # since one does not already exist
    df.insert(0, "idx", range(df.shape[0]))
    df.head()

       idx  order_id  product_id                          description  quantity  ...  unit_price  customer_name         country   total  cancelled
    0    0    536365      85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  ...      4.2075   Andrea Brown  United Kingdom  25.245      False
    1    1    536365       71053                  WHITE METAL LANTERN         6  ...      5.5935   Andrea Brown  United Kingdom  33.561      False
    2    2    536365      84406B       CREAM CUPID HEARTS COAT HANGER         8  ...      4.5375   Andrea Brown  United Kingdom  36.300      False
    3    3    536365      84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  ...      5.5935   Andrea Brown  United Kingdom  33.561      False
    4    4    536365      84029E       RED WOOLLY HOTTIE WHITE HEART.         6  ...      5.5935   Andrea Brown  United Kingdom  33.561      False

This DataFrame contains several columns representing user orders. There are a variety of different columns including various id columns, customer names, product descriptions, order quantities and purchase price. Using a simple Woodwork initialization call, the types for all of these columns can automatically be inferred for us, saving time in manually studying each column and assigning a type. To infer the types with Woodwork we just need to import Woodwork and then run a simple initialization call.
    import woodwork as ww

    df.ww.init()  # Access Woodwork through the `ww` namespace

During the initialization call, Woodwork automatically inferred the logical type for each column in the DataFrame. Now we can take a look at the typing information added by Woodwork. The output shows the logical type that Woodwork inferred for each column, the pandas physical type being used to store the data, and any standard semantic tags added by Woodwork. As an example, notice that the Integer and Double columns have all been labeled with the numeric semantic tag, indicating these columns are all valid for numeric operations.

    df.ww

                    Physical Type  Logical Type  Semantic Tag(s)
    Column
    idx                     int64       Integer      ['numeric']
    order_id             category   Categorical     ['category']
    product_id           category   Categorical     ['category']
    description          category   Categorical     ['category']
    quantity                int64       Integer      ['numeric']
    order_date     datetime64[ns]      Datetime               []
    unit_price            float64        Double      ['numeric']
    customer_name        category   Categorical     ['category']
    country              category   Categorical     ['category']
    total                 float64        Double      ['numeric']
    cancelled                bool       Boolean               []

Woodwork also has the ability for users to identify a specific column as the index column for a DataFrame. This can be done during initialization or afterwards. In this case we would like the added idx column to be identified as the Woodwork index. We can set this as our index by using Woodwork’s set_index method. Note, when calling a Woodwork method, the .ww accessor suffix needs to be added after the DataFrame name. This instructs pandas to call the Woodwork method instead of calling a pandas method. Let’s set our index column and take another look at the Woodwork typing information.
    df.ww.set_index("idx")
    df.ww

                    Physical Type  Logical Type  Semantic Tag(s)
    Column
    idx                     int64       Integer        ['index']
    order_id             category   Categorical     ['category']
    product_id           category   Categorical     ['category']
    description          category   Categorical     ['category']
    quantity                int64       Integer      ['numeric']
    order_date     datetime64[ns]      Datetime               []
    unit_price            float64        Double      ['numeric']
    customer_name        category   Categorical     ['category']
    country              category   Categorical     ['category']
    total                 float64        Double      ['numeric']
    cancelled                bool       Boolean               []

After setting the index, note that the typing information for the idx column has changed. Woodwork has removed the numeric semantic tag and replaced it with an index semantic tag. Behind the scenes Woodwork also verified that the idx column is a valid index column, namely checking that the data in the column is unique and does not contain any missing values.

Updating Types and Tags

Because no inference process is perfect, Woodwork will, at times, fail to identify the best logical type for a specific column. In situations like this, Woodwork provides a method to allow users to manually specify the logical type for a column. Woodwork also provides methods for users to add and remove custom semantic tags, specific to their applications. For more information on these topics, please refer to the Working with Types and Tags guide in the Woodwork documentation.

Selecting and Manipulating Data

Now that we have Woodwork configured, we can use the typing information when selecting and manipulating the data. For example, let’s assume we want to select all of the numeric and boolean columns for some downstream operation. With Woodwork, this is simple, and we can do it using the select method.

    numeric_and_bool = df.ww.select(include=["numeric", "Boolean"])

If we view the contents of the numeric_and_bool DataFrame that is returned, we will see this includes only the columns that contain numeric or boolean values.
The include parameter can select based on semantic tags such as numeric, logical types such as Boolean, or a combination of semantic tags and logical types, as was done above.

         quantity  unit_price    total  cancelled
    0           6      4.2075  25.2450      False
    1           6      5.5935  33.5610      False
    2           8      4.5375  36.3000      False
    3           6      5.5935  33.5610      False
    4           6      5.5935  33.5610      False
    ...       ...         ...      ...        ...

With some methods, such as DataFrame.rename, you could call this directly on the pandas DataFrame, or through the Woodwork .ww accessor. If maintaining typing information on the returned value is important, be sure to call all methods through the accessor. Whenever possible, Woodwork will maintain the typing information on the returned object, provided it is a DataFrame or Series. You can call any pandas method through the Woodwork accessor, just as you would directly on the DataFrame.

To demonstrate, let’s assume we want to rename the total column in our DataFrame to total_price. We could do that directly through pandas or through Woodwork. Let’s demonstrate both.

    renamed_pandas = df.rename(columns={"total": "total_price"})

If we look at the renamed DataFrame, we see that total has been renamed to total_price as expected.

    renamed_pandas.head()

       idx  order_id  product_id                          description  quantity           order_date  unit_price  customer_name         country  total_price  cancelled
    0    0    536365      85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  2010-12-01 08:26:00      4.2075   Andrea Brown  United Kingdom       25.245      False
    1    1    536365       71053                  WHITE METAL LANTERN         6  2010-12-01 08:26:00      5.5935   Andrea Brown  United Kingdom       33.561      False
    2    2    536365      84406B       CREAM CUPID HEARTS COAT HANGER         8  2010-12-01 08:26:00      4.5375   Andrea Brown  United Kingdom       36.300      False
    3    3    536365      84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  2010-12-01 08:26:00      5.5935   Andrea Brown  United Kingdom       33.561      False
    4    4    536365      84029E       RED WOOLLY HOTTIE WHITE HEART.         6  2010-12-01 08:26:00      5.5935   Andrea Brown  United Kingdom       33.561      False

However, if we attempt to access the Woodwork typing information on this DataFrame, we see that the operation has invalidated our typing information and Woodwork is no longer initialized on the renamed_pandas DataFrame.

    renamed_pandas.ww

    woodwork.exceptions.WoodworkNotInitError: Woodwork not initialized for this DataFrame. Initialize by calling DataFrame.ww.init

Now, if the same operations are performed through the Woodwork accessor, we see that the column is renamed successfully, but our typing information is retained in this case.

    renamed_woodwork = df.ww.rename(columns={"total": "total_price"})
    renamed_woodwork.ww.head()

       idx  order_id  product_id                          description  quantity           order_date  unit_price  customer_name         country  total_price  cancelled
    0    0    536365      85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  2010-12-01 08:26:00      4.2075   Andrea Brown  United Kingdom       25.245      False
    1    1    536365       71053                  WHITE METAL LANTERN         6  2010-12-01 08:26:00      5.5935   Andrea Brown  United Kingdom       33.561      False
    2    2    536365      84406B       CREAM CUPID HEARTS COAT HANGER         8  2010-12-01 08:26:00      4.5375   Andrea Brown  United Kingdom       36.300      False
    3    3    536365      84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  2010-12-01 08:26:00      5.5935   Andrea Brown  United Kingdom       33.561      False
    4    4    536365      84029E       RED WOOLLY HOTTIE WHITE HEART.         6  2010-12-01 08:26:00      5.5935   Andrea Brown  United Kingdom       33.561      False

    renamed_woodwork.ww

                    Physical Type  Logical Type  Semantic Tag(s)
    Column
    idx                     int64       Integer        ['index']
    order_id             category   Categorical     ['category']
    product_id           category   Categorical     ['category']
    description          category   Categorical     ['category']
    quantity                int64       Integer      ['numeric']
    order_date     datetime64[ns]      Datetime               []
    unit_price            float64        Double      ['numeric']
    customer_name        category   Categorical     ['category']
    country              category   Categorical     ['category']
    total_price           float64        Double      ['numeric']
    cancelled                bool       Boolean               []

If retaining typing information is important, methods should always be called through the Woodwork .ww accessor.

Summary

Woodwork is a Python library that provides users with a wide variety of methods for managing and communicating data typing information. Woodwork can be easily integrated into existing workflows to add rich typing information to your data, helping to improve your data analysis process.

Want to learn more? Check out the Woodwork documentation for complete details on Woodwork, including several guides that explain the concepts discussed here in much more detail. If you need help, you can always reach out via Alteryx Open Source Slack or GitHub.

Contributions

Special thanks to the team that built Woodwork: Gaurav Sheni, Tamar Grey, Nate Parsons, and Jeff Hernandez. In addition, we would like to thank Max Kanter, Dylan Sherry, and Roy Wedge. None of this would have been possible without their help.

If you have ideas for enhancing or improving Woodwork, open source contributions are welcome. To get started, check out the Woodwork Contributing Guide on GitHub.

This article first appeared on the Innovation Labs blog.

IraWatt · Answer

Great to hear from you @veeliang thanks for the update 👍

veeliang · Answer

Hi @IraWatt,

Thanks for the feedback. Our team has been hard at work on the new Alteryx Machine Learning product to build a better, faster, stronger ML experience. The upcoming Intelligence Suite release will not include updates to libraries underlying the ML tools in Intelligence Suite but this may be revisited in the future.

IraWatt · Answer

Really interesting article, thanks @gaurav5! I came across Alteryx's open source ML projects after looking at the code generated by the assisted modelling tool. It would be great to have closer integration between these libraries and the Intelligence Suite, as their functionality is great. From my initial investigation, Intelligence Suite is two years out of date with the EvalML project, which does reduce some functionality.