Data Science

DrDan · 05-08-2013

Since the 7.0 release of Alteryx, the number and breadth of predictive tools have expanded with each subsequent version, and the 8.5 release is no exception. At this point, three objectives drive the development of the new predictive analytics capabilities in Alteryx:

Improve a data artisan's productivity in undertaking predictive analytics in Alteryx by providing tools that assist in data preparation and data investigation.
Incorporate new algorithms that allow data artisans to solve important business problems, and do so in a way that expands the range of business users who can make use of predictive analytics.
Increase the scale of data that predictive analytics in Alteryx can address, and do so in a way that also increases the speed of the analysis.

The new tools in the 8.5 release, and some "under the hood" changes we have made to a number of the existing predictive analytics tools, all reflect one or more of these three objectives. In addition, we have changed how we package the predictive tools in Alteryx. Making the predictive toolkit a bit more independent of the Alteryx Designer Desktop release schedule will let us be more flexible and nimble in rolling out new capabilities. At the same time, we have better integrated the installation of Alteryx Predictive into the Designer Desktop installation process.

Improving Data Artisan Productivity

Five of the new predictive tools in 8.5 were added explicitly to improve data artisan productivity. Two of these tools were added to directly improve users' efficiency in undertaking common data preparation tasks for predictive analytics, and three to improve users' efficiency in gaining an initial understanding of their data. The tools designed to improve data preparation efficiency are:

Multi-field Binning: A tool that allows a data artisan to quickly bin (break into a number of contiguous groups) multiple fields at the same time using either an equal records criterion (in which each bin has, as close as is possible, an equal number of records) or an equal intervals criterion (in which each bin spans an equally sized range of values of the original numeric field).
Impute Values: A tool that allows a data artisan to replace Null or some other value in multiple fields with either a user-specified value (e.g., 0 or "Missing") or, for numeric fields, the mean or median of the non-missing values in a field. In addition, the user has the option of automatically adding a "flag" field to accompany each field with imputed values, indicating which records have had values imputed for that field.
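
To make the difference between the two binning criteria and the imputation "flag" field concrete, here is a minimal sketch in plain Python. It is not the Alteryx implementation; the function names are hypothetical, and the handling of ties and of the maximum value are assumptions made for illustration.

```python
from statistics import median


def equal_interval_bins(values, n_bins):
    """Assign each value to one of n_bins bins that span equal-width ranges."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [0] * len(values)
    # The maximum value would compute to bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


def equal_records_bins(values, n_bins):
    """Assign each value to one of n_bins bins with roughly equal record counts.

    Assumption: tied values may be split across adjacent bins.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values)
    return bins


def impute_median(values):
    """Replace None with the median of the non-missing values.

    Returns the imputed list plus a parallel list of flags marking
    which records were imputed.
    """
    fill = median(v for v in values if v is not None)
    imputed = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return imputed, flags


sales = [1, 2, 3, 4, 100, 200]
print(equal_interval_bins(sales, 2))  # bins driven by the value range
print(equal_records_bins(sales, 2))   # bins driven by record counts
print(impute_median([10, None, 30, 20]))
```

With a long right-hand tail like the one in `sales`, equal intervals put almost every record in the first bin, while equal records split the data evenly.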

The new tools in 8.5 to help data artisans quickly gain an understanding of their data are:

Histogram: A tool that provides a histogram plot for a numeric field, which provides visual information about the distribution of the values in a numeric field (e.g., whether it is "bell" shaped, has a long right-hand tail, etc.). Optionally, it provides a smoothed empirical density plot. Frequencies are displayed when a density plot is not selected, and probabilities when this option is selected. The number of breaks can be set by the user, or determined automatically using the method of Sturges.
Contingency Table: The Contingency Table tool provides the empirical joint distribution of two to four categorical fields. For example, it allows a data artisan to look at the relationship between the state a customer lives in and the cell phone service plan they have selected. For contingency tables with two fields (technically, a "two-way table"), the user can select to test whether there is a statistically significant relationship between the different levels of the two fields.
Frequency Table: This tool produces a frequency distribution for selected categorical fields. The outputs include a summary of the selected field(s) with frequency counts and percentages for each value in a field.
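
The calculations behind these three tools are simple enough to sketch in a few lines of Python. This is only an illustration under stated assumptions, not the tools' actual code: the function names are hypothetical, and the Sturges formula used is the common form ceil(log2(n)) + 1.

```python
import math
from collections import Counter


def sturges_breaks(n_records):
    """Number of histogram bins suggested by Sturges' rule: ceil(log2(n)) + 1."""
    return math.ceil(math.log2(n_records)) + 1


def frequency_table(values):
    """Return (value, count, percent) rows for one categorical field, most frequent first."""
    counts = Counter(values)
    total = len(values)
    return [(v, c, 100.0 * c / total) for v, c in counts.most_common()]


def two_way_table(rows, cols):
    """Joint counts for two categorical fields, keyed by (row_level, col_level)."""
    return Counter(zip(rows, cols))


states = ["TX", "TX", "CA", "TX"]
plans = ["basic", "plus", "basic", "basic"]

print(sturges_breaks(100))
print(frequency_table(states))
print(two_way_table(states, plans))
```

Dividing each cell of the two-way table by the total record count gives the empirical joint distribution described above.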

New Algorithms to Solve Business Problems

The Alteryx 8.5 release contains a number of new tools for conducting particular types of predictive analytics analyses. In this release, we have put particular emphasis on tools for A/B testing and market basket analysis. The A/B testing tools assist users in conducting market experiments to look at the potential returns to new promotional programs, staffing changes, pricing changes, new marketing communications programs, store remodeling programs, and a large number of other business activities. The use of experiments allows an organization to "test drive" a potential action with a small sample of customers or locations before (potentially) rolling those changes out to all customers or locations. The market basket analysis tools allow an organization to explore patterns in which products and services customers tend to purchase together by extracting association rules and frequent itemsets from customer transaction data. The tools in this release consider only a single transaction at a time, but in the future we will be incorporating tools that look at patterns in the sequence of customer transactions. In upcoming blog posts and demonstration videos, I will cover the tools in these two areas in greater depth.

Another important addition is a tool to estimate count data regression models. These models apply in cases where the target field consists of an integer number of items (e.g., the number of visits a patient makes to a doctor's office in a year or the number of phone numbers assigned to a mobile telephone account), outcomes that are not consistent with the assumptions of either linear or logistic regression models. A number of other tools that are used within the A/B testing tools are also of value to data artisans as standalone tools, so they have been added to the predictive analytics toolbox in Alteryx. The new tools in this area are:

AB Treatments: The AB Treatments tool assists in selecting treatment units (e.g., stores or customers) for conducting A/B testing in cases where (for operational reasons) the treatment units are selected as a group. For example, selecting all customers or stores in a particular Dominant Market Area (or DMA) in order to implement a test that relies on the use of broadcast media to deliver marketing communications messages that will be seen or heard by all potential customers in the DMA.
AB Trend: This tool creates measures of trend and seasonal patterns that can be used in helping to match treatment to control units (e.g., stores or customers) for A/B testing. The trend measure is based on period to period percentage changes in the rolling average (taken over a one year period) in a performance measure of interest. The same measure is used to assess seasonal effects. In particular, the percentage of the total level of the measure in each reporting period is used to assess seasonal patterns.
AB Controls: This tool selects control observation units to match to treatment units on user specified criteria for conducting A/B tests based on how near (using a Euclidean distance metric) a treatment unit is to each of a set of possible control units.
AB Analysis: The AB Analysis tool performs an automated test of means analysis that compares treatment to control units in an A/B or market test. The analysis is based on comparing the percentage change in a performance measure (e.g., sales or traffic) to the same measure one year prior, and is done using both visual and statistical analysis methods.
MB Rules: Part of the market basket analysis set of tools, the MB Rules tool creates a base set of association rules and frequent itemsets for undertaking market basket analysis. Association rules imply a causation of the type "If items A and B are in the basket then item C is also more likely to be purchased," while frequent itemsets are sets of items that tend to be bought together in a group, but without the assumption that buying items A and B leads to buying item C.
MB Inspect: The MB Inspect tool helps the user to interpret and find "interesting" association rules or frequent itemsets from those created using the MB Rules tool by allowing rules to be filtered and ordered using the criteria of support (the percentage of transactions in a database that contain the rule or itemset), confidence (the percentage of times that an association rule that is believed to lead to the purchase of one or more other items actually does so), and lift (the ratio of times the "precursor" and "results" of an association rule occur over what would be expected under random chance).
Count Regression: Regression models for count data, which are integer in nature (e.g., the number of phone numbers assigned to a cell phone account or the number of visits a customer makes to our store in a given year). The tool uses Poisson or negative binomial based regression models.
Find Near Neighbors: The Find Near Neighbors tool finds the selected number of nearest neighbors in the "data" stream that correspond to each record in the "query" stream based on their Euclidean distance.
Test of Means: This tool compares the difference in mean values (using a Welch two sample t-test) for a numeric response field between different groups. For example, using this tool, a data artisan can test whether the average amount purchased from the company by its New York customers is statistically different from the average amount purchased by customers in Florida.
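
The support, confidence, and lift criteria used by MB Inspect can be computed directly from transaction data. The sketch below shows one way to do so in plain Python for a single rule. It is an illustration rather than the tool's implementation: the function name and the sample baskets are hypothetical, and the measures are returned as proportions rather than percentages.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent.

    support    = share of transactions containing every item in the rule
    confidence = share of antecedent transactions that also contain the consequent
    lift       = confidence divided by the overall share of consequent transactions
    """
    antecedent, consequent = set(antecedent), set(consequent)
    baskets = [set(t) for t in transactions]
    n = len(baskets)
    n_ante = sum(antecedent <= b for b in baskets)
    n_cons = sum(consequent <= b for b in baskets)
    n_both = sum((antecedent | consequent) <= b for b in baskets)
    if n_ante == 0 or n_cons == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    support = n_both / n
    confidence = n_both / n_ante
    lift = confidence / (n_cons / n)
    return support, confidence, lift


# Hypothetical transaction data: one list of items per basket.
baskets = [
    ["bread", "butter", "jam"],
    ["bread", "butter"],
    ["bread", "jam"],
    ["milk"],
]
print(rule_metrics(baskets, ["bread", "butter"], ["jam"]))
```

A lift near 1 means the "precursor" items tell you little about the "results" items beyond their baseline purchase rate; values well above 1 are the ones usually worth inspecting.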

Increasing Data Scale and Speed

Changes to increase the amount of data that can be analyzed using our R-based predictive analytic tools, and the speed of the needed analysis, have largely been made "under the hood." These changes have been made in two different ways. First, Alteryx Predictive in the 8.5 release makes use of the just-released R version 3 engine. The primary goal of the R core team in the first release of the version 3 series is to significantly improve internal memory management and to increase the maximum size of the base unit (an R vector) that can be addressed. The second change we are implementing is increasing the percentage of the workload, and, in particular, of the data volumes, handled by Alteryx as opposed to R. All of the new tools for the 8.5 release reflect this "Alteryx first" development philosophy, and many of the pre-8.5 release tools have been altered to reflect this philosophy as well. The changes in the existing tools have been carried out in a way that won't break existing modules, macros, and analytic apps that you and your organization may have already developed.

These really constitute our first steps in the area of speed and scaling. We are in the process of putting together a set of relationships that will allow Alteryx Predictive to scale to any data size with increased speed. This process is well enough developed at this point that I can safely say we are planning on making a number of important announcements in this area over the next couple of months.

Alteryx Predictive Analytics: The 8.5 Release and a Bit Beyond