A fundamental concept in computer science, a data structure is a format with which to organize or store data. For the code-friendly tools in Alteryx Designer (both R and Python), the mighty data frame is the reigning data structure.
Data frames can be thought of as tables, made up of rows and columns where each row represents an "observation" of a given phenomenon (e.g., customer, purchase, bumblebee sighting) and each column represents a different feature or variable. If you've worked with Alteryx or Excel, data frames are probably a familiar and intuitive concept. Data frames are just a tabular way of organizing statistical data so that the records in a dataset are easily compared, analyzed, and aggregated.
Because Alteryx Designer uses a data-frame-like structure in its data streams, it makes sense to use something analogous when moving data in and out of our code-friendly tools: the R tool and the Python tool.
In this blog post, we will review the data frame structure in R as well as helpful functions for working with data frames. What's really exciting about this article is that we get to use the Data Camp Light plugin to create interactive code snippets. If you’re more of a Pythonista, there will be a sister article for you published in the coming weeks – stay tuned.
R Data Frames
From the R tool, there are two formats in which you can read data from Alteryx: a data frame or a list. The data frame is the default (and more commonly used) format.
In R, data frames are a special type of named list, where all the elements that make up the data frame (i.e., vectors) have the same length. Data frames can contain many different types of data, and each column in a data frame can have a different data type (this is distinct from matrices in R, which require all data types be the same).
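To make that concrete, here is a minimal sketch (the `pets` data frame and its columns are made up for this example) showing that a data frame is a list under the hood, that each column keeps its own type, and that converting to a matrix forces everything into a single type:

```r
# Hypothetical example data: three columns, three different types
pets <- data.frame(
  name = c("Rex", "Tom", "Bubbles"),
  legs = c(4L, 4L, 0L),
  indoor = c(FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# A data frame is a list, so list functions work on it
print(is.list(pets))

# Each column (list element) has its own class
col.types <- sapply(pets, class)
print(col.types)

# Every column has the same length, which is the number of rows
col.lengths <- lengths(pets)
print(col.lengths)

# A matrix can only hold one type, so mixed columns are coerced to character
pets.matrix <- as.matrix(pets)
print(typeof(pets.matrix))
```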
Data Structures in R Data Frames
Data frames are lists comprised of vectors. Vectors are the most basic data structure in R, and are effectively a group of data elements (e.g., strings or integers). Vectors exist as atomic vectors or lists. Atomic vectors must have values that are all the same data type; the most common types are logical (Boolean), integer, numeric (double), and character (string). Although lists are also a type of vector, it’s not common for people to refer to them that way. Lists are different from atomic vectors because the elements (values) in a list do not need to be the same type, and lists are also recursive, meaning lists can contain other lists (lists on lists on lists).
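The difference between atomic vectors and lists is easiest to see by asking R what it thinks each object is. This is a small sketch with made-up values; `typeof()` reports the underlying storage type:

```r
# Atomic vectors: every element has the same type
flags <- c(TRUE, FALSE)
counts <- c(1L, 2L, 3L)
amounts <- c(1.5, 2)
tags <- c("a", "b")
print(c(typeof(flags), typeof(counts), typeof(amounts), typeof(tags)))

# Mixing types inside c() coerces everything to the most flexible type
mixed <- c(1, "two", TRUE)
print(typeof(mixed))

# A list keeps each element's own type, and can contain other lists
nested <- list(1, "two", TRUE, list("inner", 42))
element.types <- sapply(nested, typeof)
print(element.types)

# Lists still count as vectors in R
print(is.vector(nested))
```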
Because a column in a data frame can be a list, it is possible to have a column of lists or arrays, where each cell is a list or matrix. Although this is allowed in R, it does not play nicely when writing out a data frame back to Alteryx.
If you see an error when writing a data frame back to Alteryx, you should check the structure of your data frame to make sure you don’t have any goofy, multi-layered columns.
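Here is one way you might hunt down a list column. The `scores` data frame is invented for illustration, and collapsing each cell into a comma-separated string is just one possible fix, not an official Alteryx recommendation:

```r
# Hypothetical data frame with an ordinary column and a list column
scores <- data.frame(id = 1:3)
scores$history <- list(c(1, 2), c(3, 4, 5), 6)  # each cell holds a whole vector

# Find the columns that are lists rather than atomic vectors
list.cols <- names(scores)[sapply(scores, is.list)]
print(list.cols)

# One possible fix: flatten each cell into a single string
scores$history <- sapply(scores$history, paste, collapse = ",")

# Confirm that no list columns remain
print(sapply(scores, is.list))
str(scores)
```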
Factors are another special type of vector worth taking specific note of. A factor is a vector that can only contain values from a predefined set (saved as an attribute of the vector) and is typically used to store categorical data. Factors have an attribute called “levels,” where the allowed values for the vector are defined.
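A short sketch of how levels constrain a factor (the team names are reused from this post; "Marketing" is a made-up value used only to trigger the behavior):

```r
# Create a factor with an explicit set of allowed values
team <- factor(
  c("Content", "Creative Services", "Content"),
  levels = c("Content", "Creative Services", "Community Platform")
)
print(levels(team))

# Assigning a value that is not a level produces NA (R also raises a warning,
# which is suppressed here to keep the output clean)
suppressWarnings(team[2] <- "Marketing")
was.na <- is.na(team[2])
print(was.na)

# To allow the new value, extend the levels first, then assign
levels(team) <- c(levels(team), "Marketing")
team[2] <- "Marketing"
print(team)
```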
If you’d like to know more details about data structures in R, check out this section from Advanced R by Hadley Wickham.
Creating a Data Frame in R
You can manually create a data frame in R using the function data.frame(). First, we'll create the input data for the data frame as a set of vectors (don’t worry, I’ve done this in the background for you, but here is the code I used):
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
The c() function in R combines a series of arguments into a vector or list (c is for combine). The arrow <- is the assignment operator. All this chunk of code does is create a handful of named vectors of data.
Also, a quick note on naming conventions in R – it’s totally kosher to include dots/periods/whatever you want to call them in variable names in R. In other languages (like Python), the period is reserved (for something called dot notation), and cannot be used for variable names.
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
# Create Data Frame with Vectors
content.team <- data.frame(user.name, occupation, current.team, solutions, stars, blogs)
# Print out Data Frame
print(content.team)
If you want to read in data from Alteryx instead of creating a data frame by hand (which you probably do), use the function read.Alteryx() and specify which connection you’d like to read in (by default, the first stream you connect to the R tool is "#1"). If you need to read in the data as a list instead of a data frame, add the argument mode = "list".
When you're ready to write a data frame back out to Alteryx, use the function write.Alteryx(). For additional help with Alteryx-specific R functions, please see A Cheat Sheet of Functions to Use in the R Tool.
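Put together, a read-and-write round trip might look like the sketch below. read.Alteryx() and write.Alteryx() only exist inside the Alteryx R tool, so this example checks for them and otherwise falls back to a tiny made-up data frame (the `incoming` name and the stand-in data are assumptions for illustration):

```r
# read.Alteryx() and write.Alteryx() are only defined inside the Alteryx R tool
in.alteryx <- exists("read.Alteryx")

if (in.alteryx) {
  # Read the first input connection as a data frame (the default mode)
  incoming <- read.Alteryx("#1", mode = "data.frame")
} else {
  # Hypothetical stand-in so the sketch also runs outside Alteryx
  incoming <- data.frame(id = 1:3, value = c(10, 20, 30))
}

# Wrap inspection calls in print() so they show up in the Results window
print(head(incoming))

if (in.alteryx) {
  # Send the data frame to output anchor 1 of the R tool
  write.Alteryx(incoming, 1)
}
```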
When data is read into a data frame, columns with characters are read in as factors by default in versions of R before 4.0.0 (since 4.0.0, data.frame() keeps them as character unless you set stringsAsFactors = TRUE). Factors are fine when the data is truly categorical and all possible values are present in the data, but less so for data like user.name, where each value in the column should be unique. If user.name has been stored as a factor, change its data type to character with the function as.character().
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
# Create Data Frame with Vectors
content.team <- data.frame(user.name, occupation, current.team, solutions, stars, blogs)
# Convert data type using as.character()
content.team$user.name <- as.character(content.team$user.name)
content.team$occupation <- as.character(content.team$occupation)
# Print updated data frame structure - note that user.name and occupation are now character vectors instead of factors
# (str() prints its own output, so wrapping it in print() would only add a stray NULL)
str(content.team)
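If you would rather control the conversion up front than fix it afterwards, data.frame() accepts a stringsAsFactors argument. This is a small sketch using a shortened version of the vectors above; the `as.factors` and `as.strings` names are just for illustration:

```r
user.name <- c("NeilR", "TaraM")
current.team <- c("Content", "Creative Services")

# Explicitly convert strings to factors (the default before R 4.0.0)
as.factors <- data.frame(user.name, current.team, stringsAsFactors = TRUE)
print(sapply(as.factors, class))

# Explicitly keep strings as character (the default since R 4.0.0)
as.strings <- data.frame(user.name, current.team, stringsAsFactors = FALSE)
print(sapply(as.strings, class))

# Then convert only the column that is truly categorical
as.strings$current.team <- factor(as.strings$current.team)
str(as.strings)
```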
Checking the Structure of your Data Frame
There are several helpful functions for examining the structure of your data frame.
The str() function in R displays the structure of an arbitrary R object, including, but not limited to, data frames! Seriously, str() is one of my favorite R functions in the land. It just gets a little tricky switching back and forth from Python, where str() is how you coerce something into a string (I’ve been fooled one too many times not to mention this).
The class() function can be used to identify the object type of something in R. This can be helpful for confirming the type of object you’re working with, particularly if you need to drill down into a given column. Testing specifically if an object is a data frame with is.data.frame() works too.
The function names() returns the column names of a data frame. You can also use it to set column names by assigning to it.
head() will return the first few rows of a data frame. This is another function I am a huge fan of for debugging purposes. Its counterpart is tail(), which will return the last few rows of a data frame. You can specify exactly how many rows you would like to see with the argument n, for example head(content.team, n = 2).
The last two functions that are helpful for understanding the structure of your data frame are ncol() and nrow(), which return the number of columns and number of rows in a data frame, respectively.
If you need to use any of these functions for debugging purposes in Alteryx, type them into the R tool wrapped with a print() command and the result will be output in the Results Window of your workflow.
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
# Create Data Frame with Vectors
content.team <- data.frame(user.name, occupation, current.team, solutions, stars, blogs)
content.team$user.name <- as.character(content.team$user.name)
content.team$occupation <- as.character(content.team$occupation)
# Check the structure of an object
str(content.team)
# Check class of an object
class(content.team)
cat("Is it a data frame?", is.data.frame(content.team), "\n")
# Check the column names in your data set
names(content.team)
# Check the first or last few rows with head() or tail()
head(content.team, n=2)
# Check dimensions of data frame
cat("Number of columns:", ncol(content.team), "\n")
cat("Number of rows:", nrow(content.team), "\n")
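Since names() can set column names as well as read them, here is a minimal sketch of renaming columns. The `kudos` data frame and the new column names are made up for this example:

```r
# Hypothetical two-column data frame
kudos <- data.frame(user.name = c("NeilR", "TaraM"), stars = c(1159, 897))
print(names(kudos))

# Rename a single column by position
names(kudos)[2] <- "star.count"
print(names(kudos))

# Rename every column at once by assigning a full vector of names
names(kudos) <- c("user", "stars.received")
print(names(kudos))

# dim() returns the number of rows and columns in one call
kudos.dim <- dim(kudos)
print(kudos.dim)
```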
Accessing and Modifying Values in an R Data Frame
Because a data frame is a list of vectors, it is a two-dimensional structure (rows and columns) while also being a list. This means that data frames share properties of both matrices and lists, and data can be accessed in a few different ways.
To select a column in a data frame, you can either use brackets or a dollar sign.
If you use brackets, you are accessing items in the data frame more like a matrix, so you need to consider both rows and columns, in that order. If you want to access all rows in a column, you simply leave the row argument blank. The bracket method allows you to access multiple rows or columns at once by providing a vector (using the c() function) of row indices or column names.
Something to note about R is that it is one-indexed, which means the first row in your data frame will have the index 1 (not 0, like it would in Python).
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
# Create Data Frame with Vectors
content.team <- data.frame(user.name, occupation, current.team, solutions, stars, blogs)
content.team$user.name <- as.character(content.team$user.name)
content.team$occupation <- as.character(content.team$occupation)
# Accessing values in data frame with brackets
# access one column
content.team[, "user.name"]
# access two columns
content.team[, c("user.name", "stars")]
# access a single row
content.team[2, ]
# access multiple rows
content.team[3:4, ]
# access a single cell by row and column
content.team[4, "user.name"]
The dollar sign accesses columns by their name and returns a vector instead of a data frame.
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
# Create Data Frame with Vectors
content.team <- data.frame(user.name, occupation, current.team, solutions, stars, blogs)
content.team$user.name <- as.character(content.team$user.name)
content.team$occupation <- as.character(content.team$occupation)
# Access column by column name
content.team$user.name
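Whether you get a vector or a data frame back depends on how you index, which can matter when a later step expects one or the other. The sketch below uses a shortened, made-up `team` data frame to compare the options; the drop = FALSE argument and the logical row filter are additions not covered above:

```r
team <- data.frame(
  user.name = c("NeilR", "TaraM", "MattD"),
  stars = c(1159, 897, 942),
  stringsAsFactors = FALSE
)

# These three all return a plain numeric vector
by.dollar <- team$stars
by.double.bracket <- team[["stars"]]
by.matrix.style <- team[, "stars"]

# These two return a one-column data frame
by.single.bracket <- team["stars"]
by.no.drop <- team[, "stars", drop = FALSE]
print(class(by.matrix.style))
print(class(by.no.drop))

# Indexing starts at 1, so this is the first user name
first.name <- team$user.name[1]
print(first.name)

# A logical vector in the row position keeps only the matching rows
top <- team[team$stars > 900, ]
print(top)
```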
You can reassign values in your data frame by indexing to the values you want to change and providing the new values with an assignment operator. A component of a data frame (such as a column) can be deleted by assigning NULL to it.
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
# Create Data Frame with Vectors
content.team <- data.frame(user.name, occupation, current.team, solutions, stars, blogs)
content.team$user.name <- as.character(content.team$user.name)
content.team$occupation <- as.character(content.team$occupation)
# Reassign value
content.team[3, "stars"] <- 1500
content.team
# Drop column by assigning the value as NULL
content.team[, "stars"] <- NULL
content.team
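The same assignment pattern can also create a brand-new column or update only the rows that meet a condition. This is a sketch with a shortened, made-up `team` data frame; the total.posts column and the replacement value are invented for the example:

```r
team <- data.frame(
  user.name = c("NeilR", "TaraM", "MattD"),
  solutions = c(25, 93, 7),
  blogs = c(62, 29, 33),
  stringsAsFactors = FALSE
)

# Assigning to a column name that does not exist yet creates the column
team$total.posts <- team$solutions + team$blogs

# Use a logical condition in the row position to update selected rows only
team[team$user.name == "MattD", "solutions"] <- 8

# The dollar sign works for dropping a column too
team$blogs <- NULL
print(team)
```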
You can bind data frames together either row-wise or column-wise with rbind() or cbind(), respectively.
user.name <- c("NeilR", "TaraM", "MattD", "CristonS", "SydneyF")
occupation <- c("Sr. Community Content Manager", "Creative Director", "Community Data Engineer", "Community Content Engineer", "Sr. Data Science Content Engineer")
current.team <- c("Content", "Creative Services", "Community Platform", "Content", "Content")
solutions <- c(25, 93, 7, 51, 57)
stars <- c(1159, 897, 942, 528, 824)
blogs <- c(62, 29, 33, 7, 30)
# Create Data Frame with Vectors
content.team <- data.frame(user.name, occupation, current.team, solutions, stars, blogs)
content.team$user.name <- as.character(content.team$user.name)
content.team$occupation <- as.character(content.team$occupation)
# Add a column with cbind
kb.articles <- c(20, 55, 16, 32, 55)
cbind(content.team, kb.articles)
# Add a row with rbind
intern <- data.frame(user.name = "AmoghG", occupation = "Data Science Intern", current.team = "Content", solutions = 0, stars = 0, blogs = 0)
rbind(content.team, intern)
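Two details are easy to miss when binding: neither function modifies the original data frame, so you need to assign the result, and rbind() expects the column names of both data frames to match. A minimal sketch with made-up data (the `bad.row` data frame exists only to trigger the mismatch):

```r
team <- data.frame(
  user.name = c("NeilR", "TaraM"),
  stars = c(1159, 897),
  stringsAsFactors = FALSE
)

# Assign the result back to keep the new column; name it in the call
team <- cbind(team, blogs = c(62, 29))

# A new row must have the same column names as the existing data frame
newcomer <- data.frame(user.name = "AmoghG", stars = 0, blogs = 0,
                       stringsAsFactors = FALSE)
team <- rbind(team, newcomer)
print(team)

# Mismatched column names make rbind() fail; tryCatch() captures the error
bad.row <- data.frame(name = "X", stars = 1, blogs = 1)
result <- tryCatch(rbind(team, bad.row), error = function(e) "rbind failed")
print(result)
```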
Additional Resources on Data Frames in R
If this whirlwind tour of R data frames has left you wanting to learn more, here are some resources I recommend:
CRAN (the Comprehensive R Archive Network) has some great documentation on Lists and Data Frames.
The wikibook on R Programming has a section devoted to working with (R) data frames.
Data Camp has a great tutorial on R data frames called 15 Easy Solutions for Data Frame Problems in R.
I’ve already mentioned it once (or twice?), but Hadley Wickham’s section on Data Structures in Advanced R is well worth mentioning again.
The R Nuts and Bolts section of the book R Programming for Data Science reviews data structures in R and other important principles.
This R tutorial on Data Frames from William King at Coastal Carolina University is comprehensive and covers working with R data frames in more detail.
A geographer by training and a data geek at heart, Sydney joined the Alteryx team as a Customer Support Engineer in 2017. She strongly believes that data and knowledge are most valuable when they can be clearly communicated and understood. She currently manages a team of data scientists that bring new innovations to the Alteryx Platform.