Grouping data is an integral part of many data analysis projects. The functionality for grouping in pandas is vast, but can be tough to grasp initially. Have no fear...we will get through a short introduction together using some data from NYC's beloved bike share program, Citi Bike.
Why would I want to group data?
Most of the time, you have a set of data that lends itself to being categorized or grouped. As a general example, let's say we have data on a wide variety of people. We may perform an analysis where we compare groups in the data based on age, gender, birth month, shoe size, or birth city; the options are as numerous as the data points!
The pandas groupby functionality draws from the Split-Apply-Combine method as described by Hadley Wickham from the land of R. It's a great approach to solving data analysis problems, and his paper on the subject is worth a read (it's linked in the resources section).
To summarize, he states that a common methodology for analyzing data comes from splitting the data into categories or groups based on some criteria, applying some aggregation function to each group (sum, mean, count), then combining the results for analysis, visualization or other means of better understanding.
Here's a graphic I came across illustrating the process:
From Data Analysis in Python by Wes McKinney
Sounds handy, but how do I do it in pandas?
Codes and Stuff
Just so we're on the same page, I’m running pandas 0.18.1.
I was looking around for an intriguing dataset and came across this data from Citi Bike, which is the NYC bike share program. It's pretty medium data at ~250MB CSV for one month's worth of data, and there was potential for some compelling findings with data grouping. Let's start down the rabbit hole...
df = pd.read_csv('data.csv')
start station id
start station name
start station latitude
start station longitude
end station id
end station name
end station latitude
end station longitude
1 Ave & E 30 St
E 17 St & Broadway
Allen St & Hester St
Madison St & Clinton St
E 2 St & Avenue B
Norfolk St & Broome St
1 Ave & E 68 St
E 11 St & 2 Ave
W 33 St & 7 Ave
E 48 St & 3 Ave
It looks like there is a good opportunity to break the data down into groups to look for some interesting trends. Some ideas are:
Group on the gender column and see if there are more male or female riders.
Do specific stations get used more than others? We can group on the station start or finish id.
Group the data on the day of the week, to see if there is more utilization for a particular day, on average.
How about a few examples?
If we want to group by just the gender, then we pass this key (column name) to the groupby function as the sole argument. This example is the simplest form of grouping, so please check out the docs to get all the options!
groupedGender = df.groupby('gender')print groupedGender
<pandas.core.groupby.DataFrameGroupByobject at 0x1133854d0>
The output shows that groupby returns a pandas DataFrameGroupBy object. Pandas has just made some internal calculations about the new gender groups and is ready to apply some operation on each of these groups.
We can take a look at the available methods with the docstring/tab complete functionality of Rodeo!
Counts of groups
Getting back to the data, if we use the count method, we can see the total number of entries for each gender group. For reference, here's what the website says for the gender codes - "Gender (Zero=unknown; 1=male; 2=female)"
017871017837232249847# look at the size as a percentage of the whole (using the trip)
total = df.gender.count()
groupedGender.size()/ total *100
It looks like males make up the majority of Citi Bike riders (~65%). I was pretty surprised to see that male riders outnumbered female riders 3 to 1. I wonder if that's true of commuters in general, or if there's some other factor, like females tending to own their own bikes. A question for another post...
Mean and Std Dev of groups
We can use a single column from the DataFrameGroupBy object and apply some aggregation function on it - how about the mean and standard deviation of the trip durations for all three groups?
035.923658113.778720216.198230# Don't have to use the bracket notation
Although males make up the majority of Citi Bike riders, there's not much of a difference in their trip durations. Interestingly, gender unknown riders take 2x as long of rides on average. These riders are likely single-use customers (when you purchase a one time pass at a Citi Bike kiosk you are not asked for your gender).
More summary statistics
So there are some summary statistics for these groups (as an aside, you can use the describe function to get these statistics and more in one call). That's a whole lot of spread around the mean, which probably means there are some outliers in the data (maybe people that kept the bike for days). Just a brief look at this even though it's outside the scope, because I'm sure you were all interested 😊
One last example is looking at which are the five favorite start and end stations. We'll group the data based on the start and end station names, apply the count function, and sort the values is descending order. Here's the code for that:
groupedStart = df.groupby('start station name')
groupedStart['start station name'].count().sort_values(ascending=False)[:5]
start station name
PershingSquareNorth12775WestSt&ChambersSt10128LafayetteSt& E 8St9246
groupedEnd = df.groupby('end station name')
groupedEnd['end station name'].count().sort_values(ascending=False)[:5]end station name
PershingSquareNorth12511WestSt&ChambersSt10189LafayetteSt& E 8St9459
Hopefully, the above examples helped introduce some basic uses for the grouping process in pandas to help enhance your analysis and whet your appetite for more! What ideas do you have for further analysis on this dataset? Can you conquer the last idea of looking at the days of the week? Please, take a look at the resources linked below for further investigation!