Guide to Creating Your Own R-Based Macro - Develop a Workflow that Uses an R Package

Alteryx

This post is part of the "Guide to Creating Your Own R-Based Macro" series.

 

Now that we have the needed R packages installed, we can use them in an Alteryx workflow. The real purpose of this workflow is to begin to put together the macro itself. As a result, there will be some minor differences between this workflow and the one you would likely create if you didn't plan on using it as the basis for developing a macro. The starting workflow of the macro is shown in Figure 1.

 

Figure 1: The Initial Workflow

The data used in this macro (contained in a Text Input tool) is Fisher's well-known Iris data set. This data consists of the length and width of both the petals and sepals of individual flowers from three species of iris. In this instance we want to know how important these four measures are in determining which species a particular flower belongs to. While this data set is pretty far afield from a business application, it is a nice one to work with for creating this macro since it is small (150 rows and five fields) and represents the correct case (a categorical target, species, and numeric predictors, the length and width measurements).

 

The basic workflow consists of only six tools. A Text Input tool contains the Iris data, which feeds into two Select tools. The upper Select tool picks out the target field (the field Species), while the second selects the potential predictor fields to be examined. The downstream Join tool brings the data back together so that the first column contains the target and the subsequent columns contain the potential predictors to be examined.

 

This combination of three tools would be somewhat out of place in a standard (non-macro) workflow. In general, column position does not matter. Moreover, even if it did, a single Select tool could be used to alter column position. However, in this case we will alter the position of columns based on a user's choices in the final macro's user interface, and the use of two Select tools allows us to accomplish this task.
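For readers who think more readily in R than in tools, the effect of the two Select tools and the Join tool can be sketched in a few lines of base R. This is only an illustration of the resulting column order, not how the workflow itself does it. The names chosen_target and chosen_preds are hypothetical stand-ins for the user's interface choices, and R's built-in iris data frame (whose field names may differ from those in the Text Input tool) stands in for the workflow's data.

```r
# Hypothetical stand-ins for the choices a user would make in the macro's interface
chosen_target <- "Species"
chosen_preds <- c("Petal.Length", "Sepal.Width")

# R's built-in copy of the Iris data, used here in place of the Text Input tool
full_data <- iris

# Target first, then the chosen predictors, mirroring what the Join tool passes on
the_data <- full_data[, c(chosen_target, chosen_preds)]

# Inspect the resulting column order
print(names(the_data))
```

Whatever the user picks, the target always lands in the first column, which is the one assumption the R code in the R tool relies on.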

 

The data flowing into the R tool now consists of only the target field (the first column) and the selected numeric predictors in the remaining columns. The R tool contains the following lines of code:

# Load the FSelector package
suppressWarnings(library(FSelector))
# Read in the data from Alteryx into R
the_data <- read.Alteryx("#1")
# Create a string of the potential predictors separated by plus signs
the_preds <- paste(names(the_data)[-1], collapse = " + ")
# Get the name of the target field
the_target <- names(the_data)[1]
# Create a formula expression from the names of the target and predictors
the_form <- as.formula(paste(the_target, the_preds, sep = " ~ "))
# Get the information gain measures
out1 <- information.gain(the_form, the_data)
# Prepare the results for output
out <- data.frame(a = names(the_data)[-1], b = out1[[1]])
names(out) <- c("Field", "Information Gain")
# Output the results
write.Alteryx(out)

The R code is fairly straightforward, with the possible exception of how the locations of values are indexed. For example, the code snippet names(the_data)[-1] takes all the provided field names except the first one (the [-1] index), which is the target field. The code snippet out1[[1]] obtains the first (and only) column of the data frame returned by the information.gain R function.
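The indexing and formula-building steps can be tried outside of Alteryx in a plain R session. The sketch below assumes R's built-in iris data frame in place of read.Alteryx("#1"), and uses a made-up single-column data frame (stand_in, with placeholder values that are not real results) to show what [[1]] extracts, so that FSelector does not need to be installed.

```r
# Built-in Iris data, reordered so the target is the first column
the_data <- iris[, c("Species", "Sepal.Length", "Sepal.Width",
                     "Petal.Length", "Petal.Width")]

# The first name is the target; [-1] drops it and keeps the predictors
the_target <- names(the_data)[1]
the_preds <- paste(names(the_data)[-1], collapse = " + ")

# Glue the two pieces together and convert the string to a formula
the_form <- as.formula(paste(the_target, the_preds, sep = " ~ "))
print(the_form)

# Hypothetical one-column data frame; the values are placeholders, not results
stand_in <- data.frame(score = c(0.1, 0.2, 0.3, 0.4),
                       row.names = names(the_data)[-1])

# [[1]] pulls the first column out as a plain numeric vector
print(stand_in[[1]])
```

Because [[1]] returns a bare vector rather than a one-column data frame, it can be dropped straight into a new data.frame() call alongside the field names.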

 

The contents of the Browse tool (the sixth and last tool in the workflow) are the results of the analysis.
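To give a feel for what the Information Gain column measures, here is a rough base R sketch of the idea: the entropy of the target, plus the entropy of a binned predictor, minus their joint entropy. It is not the FSelector implementation. The helper functions entropy and info_gain are hypothetical, the three equal-width bins are a simple swap-in for whatever discretization the package applies, and the natural logarithm is an assumption, so the numbers will not match those in the Browse tool.

```r
# Shannon entropy of a categorical vector (natural log, an assumption of this sketch)
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]  # drop empty levels so log(0) never appears
  -sum(p * log(p))
}

# Information gain of a numeric predictor after equal-width binning
# (the binning rule is a stand-in, not what FSelector does internally)
info_gain <- function(target, predictor, bins = 3) {
  binned <- cut(predictor, breaks = bins)
  entropy(target) + entropy(binned) - entropy(paste(target, binned))
}

# Score each of the four measurements against Species using the built-in Iris data
gains <- sapply(iris[, 1:4], function(col) info_gain(iris$Species, col))

# Shape the result like the macro's output
out <- data.frame(a = names(gains), b = unname(gains))
names(out) <- c("Field", "Information Gain")
print(out)
```

A predictor whose bins line up closely with the species scores near the entropy of the target itself, while one whose bins mix the species together scores near zero.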

Comments
tshannon

So, I'm assuming the connection labeled "#1" corresponds to the script we see here? Does that imply that you can run up to 5 scripts (as suggested by the outputs)?