Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Data Science

Machine learning & data science for beginners and experts alike.
BridgetT
Alteryx Alumni (Retired)

After cranking out my first two blog posts, I was struggling a bit to settle on a topic for this one. I reviewed some of the previous blogs in the Engine Works series for some inspiration, and two ideas in particular struck me. The first was @TashaA’s mention of the popularity of Magic: The Gathering in our Broomfield office. She hit the nail on the head with that one: card and board games are quite popular with engineers in general, and engineers in our office in particular. The second was @RamnathV’s post about creating interactive Sankey charts using R and Alteryx. It seems that a lot of Alteryx users are eager to learn more about leveraging the power of external R packages for their own custom design purposes and would appreciate more guidance about how to use Alteryx and R together. So I decided to combine Alteryx developers’ fondness for strategy games and Alteryx users’ desires for more explanation about integrating R and Alteryx into a single post, which will focus on using the R packages plyr and ggplot2 in a fictional universe based on the board game Pandemic.

 

The packages

For those of you unfamiliar with R, it is a powerful statistical programming language. It even powers most of the Alteryx Predictive tools! The rest of this post assumes you have at least a passing familiarity with R, but here’s a good resource to get started with the language. Though base R can do some pretty interesting things, packages are what really allow R users to perform deep analysis. (Think of it this way: base R is like the Alteryx tools that appear in the “Favorites” section of the toolbar by default. Packages are all the other tools, as well as apps and macros you find on the Gallery.) Many of the most popular R packages today were written by Hadley Wickham, the chief scientist of RStudio. Hadley has a bit of a cult following in the R community, and some R programmers have dubbed his collection of packages “the Hadley-verse.” In this post, we’ll focus on two of Hadley’s packages: plyr (a package that allows you to split data into pieces, apply various functions to each piece, and then combine it back together) and ggplot (a plotting package that makes it easier to create beautiful plots).

 

BT_1.png

An example plot generated by ggplot2, from http://r-statistics.co/screenshots/ggplot_1.png

 

The game

BT_2.jpgPandemic is a collaborative strategy game where players cooperate to cure multiple simultaneous disease outbreaks before they spread too far. There are four different diseases (red, yellow, blue, and black). Each disease has a “home base” somewhere on the globe where its spread naturally begins, though it can spread outside this initial area. Additionally, each player has some sort of role (such as Scientist or Quarantine Specialist) that gives him or her additional powers beyond what the game ordinarily allows. The idea is that the players leverage their unique powers to achieve more together than they could on their own, just like Alteryx and R can work hand in hand to allow for even greater data insight! The rest of this post will be written from the perspective of the Pandemic universe.

 

The Pandemic game box, from https://cf.geekdo-images.com/images/pic1534148.jpg

 

 

The Data

Since our world has been thrown into chaos by the rapid spread of four new diseases, data about diseases no longer follows the same structure we previously took for granted. So we’ll have to do a bit of digging into our data and use Alteryx to shape it into a form we can use. We have two data sources we’ll need to join together. The first is a file with several columns coming from a hospital: patient ID, disease type, severity upon patient intake, survived, the number of people to whom the patient spread the disease, age, health before disease, sex, the presence of several different genes, and the concentration of Vitamin B12 and Vitamin D in the patient’s blood. The second is a table with a column for the patient ID and another for the city name.

 

We need to join these data sources together to provide the Researcher and the Quarantine Specialist with the information they need to do their jobs. We don’t know the exact modeling processes our two colleagues use, but they have each specifically requested certain aspects of the data to perform their modeling themselves. For each disease, the scientist has asked for a table with all of the information in the first table except the patient ID and the number of people who acquired the disease from the patient. The Quarantine Specialist requires the disease type, the severity of the disease, the number of people to whom the patient spread it, and the name of the city where the patient resides. Additionally, the Researcher requires graphs with the concentration of Vitamin B12 and Vitamin D on the X axis, and the severity of the disease on the Y axis. For each gene presence combination, there should be a separate curve. There should also be a separate graph for each disease type, and the data points should be colored by age group (in 15 year increments, and the maximum age group is 75+).

 

Blending the Data

The data blending step is relatively straightforward in this problem. We use two Input Data tools to bring in the two datasets. Each dataset has some fields that should be numeric types (double, int16, etc.) that are being read in as strings. So we then use an Auto Field tool on each to convert these types to their appropriate numeric types. Since each set has a Patient ID field (called pt_id in each), we can use a Join tool under its “Join by Specific Fields” configuration to join them. Then we use two Select tools; each one is used to select specific fields requested by each colleague. We connect each Select tool to an Output Data tool to write each desired dataset out to a csv. Finally, we connect the Select tool for the Researcher’s data to an R tool, which we’ll use for the graphing.

 

BT_3.png

Graphing the Data

Now it’s time to tackle the graphing portion! We’ll go through the attached R script section by section. The first section concerns the installation of the packages needed later in the script:

 

checkInstalls <- function(packages) {
  # See if the desired packages are installed, and install if they're not
  if (!all(packages %in% row.names(installed.packages()))) {
    # Use the IE based "Internet2" since it is most reliable for this action.
    # It will be switched back at the end
    setInternet2(use = TRUE)
    # Make sure the path to the users library is in place and create it if it
    # is not
    minor_ver <- strsplit(R.Version()$minor, "\\.")[[1]][1]
    R_ver <- paste(R.Version()$major, minor_ver, sep = ".")
    the_path <- paste0(normalizePath("~"), "\\R\\win-library\\", R_ver)
    # Create the user's personal folder if it doesn't already exist
    if (!dir.exists(the_path)) {
      dir.create(the_path, recursive = TRUE, showWarnings = FALSE)
    }
    # The set of possible repositories to use
    repos <- c("http://cran.revolutionanalytics.com", "https://cran.rstudio.com")
    # Select a particular repository
    repo <- sample(repos, 1)
    missingPackages <- packages[which(!(packages %in% row.names(installed.packages())))]
    install.packages(missingPackages, lib = the_path, repos = repo)
    setInternet2(use = FALSE)
  }
}

 

You don’t need to understand every line of this function to comprehend what it’s doing. The basic idea is that it’s a function that takes a vector of R package names, checks if the packages in the vector are installed on your machine, and installs any of the packages that are not already installed. The script is a bit more complicated than just calling install.packages on each one in order to account for machines with strict security permissions. But if your machine doesn’t have strict security requirements, you should just be able to call install.packages on each package you’d like to install.

The next section simply calls this function on all the packages we need for this script and then loads those packages using the library() function:

 

checkInstalls(c('plyr', 'ggplot2', 'AlteryxPredictive'))
library(plyr)
library(ggplot2)
library(AlteryxPredictive)

 

The next line of code might be a bit confusing if you’ve never used the AlteryxPredictive or AlteryxHelper packages before. @RamnathV created the AlteryxHelper package to ease the transition between working in R and working in the Alteryx R tool. It includes functions like read.Alteryx2 that can detect whether they’re being run in Alteryx; their behavior depends on whether they’re called in Alteryx or base R. The read.Alteryx2 function acts just like the read.Alteryx function when it’s run in Alteryx. However, it requires a default value when it’s run in R. In this case, we supply it with the .csv file we generated earlier in the workflow. (The R read.csv function reads in a .csv and loads it to R as a data.frame.)

 

input_data <- read.Alteryx2("#1", mode="data.frame", default = 
read.csv('Data_for_Researcher.csv', header = TRUE))

 

If you’re running this script in R, be sure to save the script and the .csv file to the same directory. Otherwise you’ll need to specify the path to the file.

 

The next few lines concern grouping the age variable as the researcher requested.

 

 

grouped_age <- cut(input_data$age, breaks = c(-Inf, 15, 30, 45, 60, 75, Inf), labels = 1:6)
input_data <- cbind(input_data, grouped_age)

 

Recall that we’ll have 6 age groups: under 16, 16-30, 31-45, 46-60, 61-75, and over 75. The cut function takes two arguments. The first is a numeric vector that we would like to “cut.” That is, we want to assign these numbers to subintervals. The second argument is a vector of breaks. So the first cut will be the interval from –infinity to 15, the next will be from 16 to 30, and so on. Finally, we use a cbind statement to bind the grouped_age vector to the input_data data.frame.

 

Now we finally have all the data in the form we need to graph it! The researcher requested one graph per disease/gene type/vitamin combination. So we’ll start with graphing the Yellow disease patients with gene type A1. We subset twice to get the patients in this group:

 

 

yellow_disease <- input_data[which(input_data$disease_type == 'yellow'),]
yellow_gene_type_a1 <- yellow_disease[which(yellow_disease$genetic_group == 'A1'),]

 

The first line subsets to get all the patients with the Yellow disease. The second line gives us all the Yellow disease patients who have the gene type A1.

 

Then we plot this subset using the ggplot function from the ggplot2 package. The basic idea of ggplot is to build your plot up in layers. Our first layer describes the data we’re plotting, and which variables of that data we’d like on which axis:

 

ggplot(data = yellow_gene_type_a1, aes(x = concentration_vitamin_b12, y = severity))

 

We then add another layer to specify which points we’d like to appear, and how we’d like them colored:

 

geom_point(colour = (yellow_gene_type_a1$grouped_age))

 

We add yet another layer to create a smoothed trend line. The “lm” specifies a linear trend line, as opposed to more complex options for generating trend curves:

 

geom_smooth(method = "lm", se = FALSE)

 

Finally, we add a layer to specify the title of the graph:

 

ggtitle('Yellow Disease Severity vs Vitamin B12 Concentration for gene type A1')

 

Putting it all together by adding “+” between each layer, we get this command:

 

 

plotObjYellowB12 <- ggplot(data = yellow_gene_type_a1, aes(x = concentration_vitamin_b12, y = severity)) +
  geom_point(colour = (yellow_gene_type_a1$grouped_age)) + geom_smooth(method = "lm", se = FALSE) +
  ggtitle('Yellow Disease Severity vs Vitamin B12 Concentration for gene type A1')

 

We then repeat the process for the graph with the Vitamin D concentration:

 

 

plotObjYellowD <- ggplot(data = yellow_gene_type_a1, aes(x = concentration_vitamin_d, y = severity)) +
  geom_point(colour = (yellow_gene_type_a1$grouped_age)) + geom_smooth(method = "lm", se = FALSE) +
  ggtitle('Yellow Disease Severity vs Vitamin D Concentration for gene type A1')

 

Finally, we use the AlteryxGraph2 function to actually plot these objects. The AlteryxGraph2 function is analogous to the read.Alteryx2 function. It also detects whether it’s being run in Alteryx or in R. When it’s run in Alteryx, it will output the graphs to the specified output. When it’s run in RStudio, it graphs the plots to the Plots panel.

 

Wow, that was exhausting! Do we really have to repeat this process for every disease/gene type combination? We have 4 diseases (Yellow, Black, Red, and Blue). We also have 8 gene type profiles (A1, A2, A3, A4, B1, B2, C1, and D1). So we’ll have to do this process 4*8 = 32 times! And what if another disease breaks out, or a new genetic profile is discovered? Then we’ll have to repeat this process even more times!

 

So what are our options at this point? We could loop through each disease/genetic profile combination using a for loop within another for loop. But remember: for loops are generally avoided among R programmers. Historically, they were considerably slower than other options. Now speed isn’t as much of an issue, but R programmers generally still prefer other methods. Why? Because the other methods often turn out to be much simpler once you master them! Have you ever programmed a complex for loop where you spent a lot of mental energy just keeping track of book-keeping issues? Something like needing to keep track of what row of a matrix/data.frame you should start on with each new loop iteration, where the process wasn’t straightforward? Luckily, the plyr function allows us to solve most of these problems without reverting to for loops!

 

The basic idea of plyr is that R has three main types of data structures: data.frames, lists, and arrays (which include matrices and vectors). Using plyr, you can start with one of these data structures, split it by some specified criterion, apply a function to each piece, and then combine the pieces and return a new data structure whose type you specify. So if you wanted to start with an array and return a list, you could use the alply function (the a refers to the array input, and the l refers to the list output). But in our problem, we don’t actually return any output! We just plot the graphs we create. Luckily, there is another type of function in the plyr package: functions that don’t return any output. We want the d_ply function, which takes a data.frame as input and returns no output.

 

But before we can actually call d_ply, we need to create the function with which we’re calling it. So we wrap the plotting process we just outlined in a function called create_graphs. This function takes in a data.frame and creates the same plots we discussed before using this data.frame. Since we’ll be calling it in d_ply where we’ll split by disease type and genetic profile, we can write create_graphs with the assumption that the data going into it all has the same disease type and genetic type. Here’s the function:

 

 

create_graphs <- function(data) {
  plotObjB12 <- ggplot(data = data, aes(x = concentration_vitamin_b12, y = severity)) +
    geom_point(colour = (data$grouped_age)) + geom_smooth(method = "lm", se = FALSE) +
    ggtitle(paste0(data$disease_type, ' Disease Severity vs Vitamin B12 Concentration for gene type ', data$genetic_group))
  AlteryxGraph2(plotObjB12, nOutput = 1)
  plotObjD <- ggplot(data = data, aes(x = concentration_vitamin_d, y = severity)) +
    geom_point(colour = (data$grouped_age)) + geom_smooth(method = "lm", se = FALSE) +
    ggtitle(paste0(data$disease_type, ' Disease Severity vs Vitamin D Concentration for gene type ', data$genetic_group))
  AlteryxGraph2(plotObjD, nOutput = 1)
}

 

There’s really nothing new here except the paste0 command with the title. Since we’re calling this function with different diseases and genetic profiles, we need to re-title the plot in each case appropriately. The paste0 function allows you to paste together multiple strings into one larger string. In this case, we want to paste the disease type, the main title block, and the genetic group together.

Now we’re finally ready to call the d_ply function. We’ll call it with three variables: the input dataset, the variables we’d like to split by, and the function we want to apply to each piece.

 

d_ply(.data = input_data, .variables = .(genetic_group, disease_type), .fun = create_graphs)

 

Well, that’s all I have for now! Feel free to look at the attached workflow. I covered a lot of material today, so please feel free to ask any questions you may have!

 

P.S. Here's the script I used to generate the data. Don't worry if you don't understand it right now! I'm just including it here in case anyone is interested because I can't attach R files separately. 

 

#call the necessary library.
library(plyr)
#Create the vector of cities in the game.
#Later, the script will randomly sample from this vector to get the city associated with each patient.
cities <- c('San Francisco', 'Chicago', 'Montreal', 'New York', 'Washington', 'Atlanta', 'Madrid', 'London', 'Paris', 'Essen', 'Milan', 'St. Petersburg', 
            'Algiers', 'Istanbul', 'Moscow', 'Cairo', 'Baghdad', 'Tehran', 'Delhi', 'Karachi', 'Riyadh', 'Mumbai', 'Chennai', 'Kolkata',
            'Beijing', 'Seoul', 'Tokyo', 'Shanghai', 'Hong Kong', 'Taipei', 'Osaka', "Bangkok", 'Ho Chi Minh City', 'Manila', 'Jarkarta',
            'Sydney', 'Khartoum', 'Johannesburg', 'Kinshasa', 'Lagos', 'Sao Paulo', 'Buenos Aires', 'Santiago', 'Lima', 'Bogota', 'Mexico City',
            'Los Angeles', 'Miami')
#Create the vectors of patient ID's, disease type, and age.
pt_id <- c(1:10000)
disease_type <- sample(c('red', 'yellow', 'blue', 'black'), size = length(pt_id), replace = TRUE)
age <- sample(c(0:90), size = length(pt_id), replace = TRUE)
#Create the function that will create the patient's health before getting sick.
#This function randomly determines the patient's health based on his/her age.
getHealth <- function(age) {
  if (age > 75) {
    min(max(0, rnorm(1, mean = 3, sd = 1)), 10)
  } else if (age > 60) {
    min(max(0, rnorm(1, mean = 5, sd = 1)), 10)
  } else if (age > 45) {
    min(max(0, rnorm(1, mean = 6.5, sd = 1)), 10)
  } else if (age > 30) {
    min(max(0, rnorm(1, mean = 8, sd = 1)), 10)
  } else {
    min(max(0, rnorm(1, mean = 9, sd = 1)), 10)
  }
}
#Use an sapply to get all the ages at once.
health <- sapply(age, getHealth)
#Create the vector of patient sexes.
sex <- sample(c('Male', 'Female'), size = length(pt_id), replace = TRUE)
#There are three genes under consideration. Create a vector for each one.
gene_a <- sample(c('Yes', 'No'), size = length(pt_id), replace = TRUE)
gene_b <- sample(c('Yes', 'No'), size = length(pt_id), replace = TRUE)
gene_c <- sample(c('Yes', 'No'), size = length(pt_id), replace = TRUE)
#Based on the presence/absence of genes A, B, and C, determine each patient's genetic group.
getGeneticGroup <- function(gene_a, gene_b, gene_c) {
  if (gene_a == 'Yes') {
    if (gene_b == 'Yes') {
      if (gene_c == 'Yes') {
        return('A1')
      } else {
        return('A2')
      }
    } else if (gene_c == 'Yes') {
      return('A3')
    } else {
      return('A4')
    }
  } else if (gene_b == 'Yes') {
    if (gene_c == 'Yes') {
      return('B1')
    } else {
      return('B2')
    }
  } else if (gene_c == 'Yes') {
    return('C1')
  } else {
    return('D1')
  }
}
#Use an mdply to get all genetic groups at once.
genetic_group <- mdply(.data = data.frame(gene_a = gene_a, gene_b = gene_b, gene_c = gene_c), .fun = getGeneticGroup)
colnames(genetic_group)[4] <- "Genetic_group"
#According to https://medlineplus.gov/ency/article/003705.htm, normal concentration of
#Vitamin B12 is between 200 to 900 pg/mL.
concentration_vitamin_b12 <- rnorm(length(pt_id), mean = 550, sd = 100)
concentration_vitamin_b12 <- sapply(concentration_vitamin_b12, function(concen) max(concen, 100))

#According to http://www.webmd.com/diet/guide/vitamin-d-deficiency#2, normal 
#concentration of Vitamin D is between 20 to 50 ng/mL
concentration_vitamin_d <- rnorm(length(pt_id), mean = 35, sd = 7)
concentration_vitamin_d <- sapply(concentration_vitamin_d, function(concen) max(concen, 8))

#Define the functions for severity for each disease. 
getBlueSeverity <- function(age, health, genetic_group, concen_b12, concen_d, sex) {
  if (genetic_group == 'A1') {
    return(max(min((age/20)  - (health/5) - (concen_b12/300) - (concen_d/35) + rnorm(1, mean = 8, sd = .75), 10), 0))
  } else if (genetic_group == 'B1') {
    if (sex == 'Female') {
      return(max(min((age/20)  - (health/4) - (concen_b12/350) - (concen_d/35) + rnorm(1, mean = 7, sd = .75), 10), 0))
    } else {
      return(max(min((age/22)  - (health/5) - (concen_b12/350) - (concen_d/35) + rnorm(1, mean = 6, sd = .75), 10), 0))
    }
  } else {
    return(max(min((age/25)  - (health/4) - (concen_b12/3000) - (concen_d/40) + rnorm(1, mean = 6, sd = .75), 10), 0))
  }
}

getYellowSeverity <- function(age, health, genetic_group, concen_b12, concen_d, sex) {
  if (genetic_group == 'A1') {
    return(max(min((age/24)  - (health/5) - (concen_b12/310) - (concen_d/35) + rnorm(1, mean = 7, sd = .75), 10), 0))
  } else if (genetic_group == 'B1') {
    if (sex == 'Male') {
      return(max(min((age/20)  - (health/4) - (concen_b12/250) - (concen_d/35) + rnorm(1, mean = 7, sd = .75), 10), 0))
    } else {
      return(max(min((age/22)  - (health/5) - (concen_b12/350) - (concen_d/35) + rnorm(1, mean = 6, sd = .75), 10), 0))
    }
  } else {
    return(max(min((age/20)  - (health/4) - (concen_b12/250) - (concen_d/40) + rnorm(1, mean = 8, sd = .75), 10), 0))
  }
}

getRedSeverity <- function(age, health, genetic_group, concen_b12, concen_d, sex) {
  if (genetic_group == 'D1') {
    return(max(min((age/21)  - (health/5) - (concen_b12/500) - (concen_d/35) + rnorm(1, mean = 8, sd = .75), 10), 0))
  } else if (genetic_group == 'A2') {
    if (sex == 'Female') {
      return(max(min((age/20)  - (health/4) - (concen_b12/550) - (concen_d/35) + rnorm(1, mean = 7, sd = .75), 10), 0))
    } else {
      return(max(min((age/22)  - (health/5) - (concen_b12/550) - (concen_d/35) + rnorm(1, mean = 6, sd = .75), 10), 0))
    }
  } else {
    return(max(min((age/22)  - (health/4) - (concen_b12/500) - (concen_d/30) + rnorm(1, mean = 6, sd = .75), 10), 0))
  }
}

getBlackSeverity <- function(age, health, genetic_group, concen_b12, concen_d, sex) {
  if (genetic_group == 'A4') {
    return(max(min((age/20)  - (health/5) - (concen_b12/100) - (concen_d/35) + rnorm(1, mean = 7, sd = .75), 10), 0))
  } else if (genetic_group == 'B3') {
    if (sex == 'Female') {
      return(max(min((age/20)  - (health/5) - (concen_b12/150) - (concen_d/35) + rnorm(1, mean = 9, sd = .75), 10), 0))
    } else {
      return(max(min((age/25)  - (health/5) - (concen_b12/150) - (concen_d/35) + rnorm(1, mean = 9, sd = .75), 10), 0))
    }
  } else {
    return(max(min((age/25)  - (health/4) - (concen_b12/150) - (concen_d/40) + rnorm(1, mean = 8, sd = .75), 10), 0))
  }
}
#The overall severity function. Note that we end up rounded the severities to the nearest integer.
getSeverity <- function(age, health, genetic_group, concen_b12, concen_d, sex, color) {
  if (color == 'red') {
    return(getRedSeverity(age, health, genetic_group, concen_b12, concen_d, sex))
  } else if (color == 'black') {
    return(getBlackSeverity(age, health, genetic_group, concen_b12, concen_d, sex))
  } else if (color == 'yellow') {
    return(getYellowSeverity(age, health, genetic_group, concen_b12, concen_d, sex))
  } else {
    return(getBlueSeverity(age, health, genetic_group, concen_b12, concen_d, sex))
  }
}
#Use an mdply to get all the severities at once.
severity <- mdply(.data = data.frame(age = age, health = health, genetic_group = genetic_group$Genetic_group, 
                                     concen_b12 = concentration_vitamin_b12, concen_d = concentration_vitamin_d, sex = sex, color = disease_type), 
                  getSeverity)
#Define the function that determines patient survival. As severity decreases, the probability of survival increases.
getSurvival <- function(severity) {
  sample(c('No', 'Yes'), size = 1, prob = c((severity/10), 1 - (severity/10)))
}
survival <- sapply(severity[,NCOL(severity)], getSurvival)
numberSpreadTo <- sample(c(0:7), size = length(pt_id), replace = TRUE, prob = c(.2, .2, .2, .1, .075, .075, .075, .075))
firstOutTable <- data.frame(pt_id = pt_id, disease_type = disease_type, severity = round(severity[,NCOL(severity)]), 
                            number_spread = numberSpreadTo, age = age, health_before = round(health), sex = sex, genetic_group = genetic_group$Genetic_group,
                            concentration_vitamin_b12 =  concentration_vitamin_b12, concentration_vitamin_d = concentration_vitamin_d)
write.csv(firstOutTable, file = "Big_table.csv", row.names = FALSE)
pts_to_cities <- data.frame(pt_id = pt_id, city = sample(cities, size = length(pt_id), replace = TRUE))
write.csv(pts_to_cities, file = "Pts_to_cities.csv", row.names = FALSE)


Comments