Data Science

SydneyF · ‎07-18-2019

A fundamental concept in computer science, a data structure is a format to organize or store data in. For the code friendly tools in Alteryx Designer (both R and Python), the mighty data frame is the reigning data structure.

Data frames are a tabular format of data, where rows are observations of data, and columns are the different variables that make up the dataset. If you've worked with Alteryx or Excel, data frames are probably a familiar and intuitive concept. Data frames are just a tabular way of organizing statistical data so that the records in a dataset are easily compared, analyzed, and aggregated.

Because Alteryx Designer uses a data-frame like structure in its data streams, it makes sense to use something analogous when moving data in and out of our code friendly tools; the R tool and the Python tool.

In this blog post, we will review the data frame structure in Python as well as helpful functions for working with data frames. What's really exciting about this article is that we get to use the Data Camp Light plugin to create interactive code snippets. If you’re more of an R Soul, check out last week’s Data Science Blog post: Code Friendly Data Structures: Data Frames in R.

Python Pandas DataFrames

R, built by statisticians, has a native data frame (tabular data format). Python does not. However, there is a very popular and powerful package called pandas that implements tabular data structures in Python.

Pandas (the name coming from the term “panel data”) is based around the numPy framework, which is used for linear algebra and matrix operations.

Pandas Data Structures

The most basic building block in pandas is a Series object. A Series is a one dimensional, labeled array that can hold any Python data type (e.g., integers, strings, floating point numbers, or Python objects). The labels are collectively called the index. When creating a series (or a data frame) you can specify a specific index, or you can let pandas generate one for you (fun fact, unlike R, Python indexes at 0).

In pandas, data alignment is intrinsic. The link between labels and data will not be broken unless you do so explicitly. The values of the Series is mutable, but the size of the Series is not.

A pandas DataFrame is a two-dimensional tabular structure that is comprised of Series. The Series that make up the DataFrame can have different data types.

You can think of the data structures in pandas as Russian Nesting Dolls. Individual numbers/objects (referred to as Scalars) go into Series, and Series go into DataFrames.

It is possible to have cells in a pandas Series or DataFrame that are lists or arrays. While this is allowed in Python, it will give you trouble if you try to write out a DataFrame like this back into Alteryx. If you get an error writing out a DataFrame from the Python tool, be sure to check the structure of your data.

Creating a Pandas DataFrame

To create a series, you use the function pandas.Series(data, index=index) where data can be a Python dictionary, list, n-dimensional array (ndarray), or a single (scalar) value.



import pandas as pd 

# Create a Series
userName = pd.Series( ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF'])

# Print out Series
print(userName)

Try it yourself!

DataFrames are composed of multiple series. A pandas DataFrame can be created using the function DataFrame() with a dictionary (dict) of series, tuples, n-dimensional arrays (ndarray), or lists, from a structured or record array, or from a list of dictionaries.

For the sake of simplicity, we will create an example where the DataFrame is constructed from a dictionary of series. I’ve made this dictionary in the background for you, but in case you're curious, here is the code:

import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx),
'test': pd.Series([[1, 2, 3], [2, 3, 4], [3, 4, 5], [4, 5, 6], [6, 7, 8]], index=indx)}

And here, you can execute the final step and create a pandas DataFrame!



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}



# Create Data Frame with the Dictionary of Series
contentTeam = pd.DataFrame(d)

# Print out Data Frame
print(contentTeam)

Try it yourself!

To read in data from Alteryx to the Python tool, you'll need to import the Alteryx package and use the command Alteryx.read(), providing the label of the data stream you'd like to read in in the parentheses. For more help working with the Python tool, check out the Python Tool Mastery article.

Checking the Structure of your DataFrame

The .head() and .tail() functions are analogous to their similarly named counterparts in R. You can use them to get the first or last few rows (respectively) of a DataFrame. By default, these functions will return 5 rows, but you can modify the number of rows returned by providing the number of rows you would like to be returned in the parentheses of the function.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Print the first three rows in the DataFrame contentTeam

print(contentTeam.head(3))

Try it yourself!

.shape describes the shape of a DataFrame, returning a tuple with the dimensions of the data frame in a number of rows, number of columns format.

.index displays the index of a Series or DataFrame, and similarly, .columns displays the column names.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Print the shape of the DataFrame contentTeam
print(contentTeam.shape)

# Print the index of the DataFrame contentTeam
print(contentTeam.index)

# Print the Columns
print(contentTeam.columns)

Try it yourself!

.info() provides a summary of the DataFrame including data types and memory usage, and .describe() is a cool function that creates a quick statistical summary of a DataFrame.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Print out info of DataFrame
print(contentTeam.info())

# Print the statistical summary of the DataFrame
print(contentTeam.describe())

Try it yourself!

Accessing and Modifying Values in a DataFrame

A lot of selection in pandas can be accomplished with brackets. If you want to select a column, you can provide a column name in brackets after the name of your DataFrame. If you'd like to select rows, you can do so by providing the range of rows (as integers) you'd like to select.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Select a Column
print(contentTeam['currentTeam'])


# Select Rows
print(contentTeam[0:2])

Try it yourself!

Note that when using single brackets a Series is returned, and when using double brackets, a DataFrame is returned:



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Series
print(type(contentTeam['occupation']))

# DataFrame
print(type(contentTeam[['occupation']]))

Try it yourself!

You can also you loc and iloc for data selection. loc does selection with labels, meaning you need to provide the name (labels) of the rows or columns you'd like to select. iloc is integer index function, meaning you can provide "coordinates" in integers of the data you'd like to select.

Data frames can be thought of as having two dimensions, or "axes" the rows of the DataFrame, and the columns. In pandas, these are referred to as axis 0 and axis 1. Axis 0 refers to the rows, and axis 1 refers to the columns. This StackOverflow answer has a helpful illustration for understanding the axes.

loc and iloc search both rows and columns. If you'd like to select a specific row, all you need to do is provide the index label or integer. To access a column and all rows, you'll need to provide a colon for the rows argument, indicating you want the entire range of rows, and the column(s) you'd like to select.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Select a row with a label
print(contentTeam.loc['CristonS'])

# Select a row with an index
print(contentTeam.iloc[2])

# Select a column with a label
print(contentTeam.loc[:, 'occupation'])

# Select a column with an integer index
print(contentTeam.iloc[:, 2])

# Select by row and column with labels
print(contentTeam.loc[['NeilR', 'TaraM'], ['occupation', 'stars']])

# Select row and column with an integer index
print(contentTeam.iloc[1, 2])

Try it yourself!

If you're selecting a single cell in your DataFrame, the functions .at and .iat are faster than .loc and .iloc.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Select a Cell by Labels
print(contentTeam.at['CristonS', 'occupation'])

# Select a Cell by an Integer Index
print(contentTeam.iat[1, 4])

Try it yourself!

To add a column to pandas, you can use loc and the name of the new column you'd like to add to specify the new column, and then use an equals sign to assign data to the column. You can reassign values in the same way.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Add new column to DataFrame
contentTeam['knowledgeBaseArticles'] = [10, 20, 15, 8, 7]

contentTeam.at['SydneyF', 'stars'] = 30

print(contentTeam)

Try it yourself!

To drop values in a pandas DataFrame, you will want to use the function .drop(). To let the function know whether you are trying to drop rows or columns, you can either specify 0 (for row) or 1 (columns) in the axis= argument, or you can specify index = or columns = when feeding in your index (row) or column labels.



import pandas as pd

indx = ['NeilR', 'TaraM', 'MattD', 'CristonS', 'SydneyF']

d = {'occupation': pd.Series(['Sr. Community Content Manager', 'Creative Director', 'Community Data Engineer', 'Community Content Engineer', 'Sr. Data Science Content Engineer'], index=indx), 
'currentTeam':pd.Series(['Content', 'Creative Services', 'Community Platform', 'Content', 'Content'], index=indx), 
'soutions': pd.Series([25, 93, 7, 51, 57], index=indx),
'stars': pd.Series([1159, 897, 942, 528,  824], index=indx),
'blogs': pd.Series([62, 29, 33, 7, 30], index=indx)}

contentTeam = pd.DataFrame(d)



# Drop a column from the pandas DataFrame
print(contentTeam.drop(['stars'], axis=1))

# Drop a column from the pandas DataFrame
print(contentTeam.drop(columns = ['blogs']))

Try it yourself!

For more detail on selecting rows and columns in pandas, check out this tutorial from KDnuggets (complete with associated Jupyter Notebook).

Additional Pandas Resources

Here are a few online resources I hope you will find helpful in your pandas journey:

The pandas documentation has a helpful tutorial called 10 Minutes to pandas that quickly reviews the basic functionality of the package. Essential Basic Functionality is also from the pandas documentation and is a thorough overview of the essential functionality of pandas data structures.

The paper pandas: a Foundational Python Library for Data Analysis and Statistics by Wes McKinney is a great overview of the package.

Python Pandas Tutorial: A Complete Introduction for Beginners and Codecademy's Introduction to Pandas are both solid pandas tutorials for beginners.

Data Science

Code Friendly Data Structures: DataFrames in Python (Pandas)