My name is Ross, and I'm addicted to counting steps. The walking kind. This obsession entails frequently opening the counting app on my iPhone to watch the step count climb and ensure I'm getting over 10,000 (my mom says that's the magic number). Luckily, living in NYC makes the goal easily achievable on most days.
In this post I'll show you how I analyzed my iPhone step data using pandas timeseries and ggplot. I did all of my data-sciencing in Python using Rodeo, Yhat's IDE for data science.
Like any legitimate data nerd, I wanted to be able to export my data for analysis outside my phone. The smart people over at Quantified Self Labs put out an app called QS Access that makes retrieving this data a cinch!
Here are a couple of screenshots of exporting the data.
The QS Access app exports a CSV containing three columns: a start timestamp, a finish timestamp, and the steps (count) during that period. There's an option to produce rows of hourly or daily data. Why not start with hours and see how it goes - bigger data is always better, right?
TO THE DATAS!
Our analysis will draw on the time series tools built into pandas. When Wes McKinney started the pandas project, he was working for an investment management company and this industry relies extensively on time series analysis. As a result, pandas ships with comprehensive functionality in this area.
A couple other notes about importing this data while we're here.
First, we already know that we have time series data, so we can let pandas know by using the parse_dates parameter.
The end time data in the CSV isn't particularly interesting because we have the start time and know that we have hourly frequency so we can omit it with usecols.
Last, setting the start time (col 0) to be the index column gives a DateTimeIndex and will make our job easier later.
df_hour = pd.read_csv('health_data_hour.csv', parse_dates=[0,1], names=['start_time','steps'], usecols=[0,2], skiprows=1, index_col=0)# ensure the steps col are ints - weirdness going on with this one
df_hour.steps = df_hour.steps.apply(lambda x:int(float(x)))
Notice that the type of the start_time column: pandas.tseries.index.DatetimeIndex. This is due to setting the index column during the data ingest, and it gives us access to all sorts of goodies - resampling for one, as we'll see later.
Hourly Step Count
How about a quick (gg)plot to explore the data we have here.
Yuck! That's a little too busy. How can we improve our visualization? I've got an idea - pandas has a function called resample that will allow us to aggregate the data over a longer duration.
More precisely, this is called downsampling when you reduce the sampling rate of a given signal. For this example, we will take the hourly data, and resample it on a daily, weekly, and monthly basis using the mean and sum aggregations.
Downsampling to Daily Step Count
Let's start with the daily totals (notice that you can pass the dataframe __index__ into the ggplot function):
Ah-ha! We're getting somewhere now. That's a much more readable plot. And it looks like there's a nice upward trend (we'll get to that later).
Downsampling to Weekly and Monthly Step Count
Armed with this, we're able to do weekly and monthly resampling easily as well. Just pass 'W' or 'M' into the resample function.
Because I'm most interested in the daily step total metric, we can start using an average aggregation function to get a daily average during the weekly and monthly samples (got to get those 10,000 a day!). That just takes changing the sum() function after the resample to a mean().
Brief aside: Pandas can also do the opposite of what we just did; called upsampling. Take a look at the docs if you need that for your project.
Going (slightly) deeper
I'm curious if I'm getting more steps during the weekend than during the week. We can use the tab suggestions in Rodeo to take a look at the methods we have available on the DateTimeIndex.
There happen to be weekday and weekday_name methods, which sound useful. The former will give an integer corresponding to a day of the week, while the latter will give the string name of that day. After we make a new column with that info, applying a helper function to it can return a boolean value for if that is a weekend or not.
We can also group the data on this weekend_bool and run some aggregation methods to see the differences in the data. Have a look at my previous post on grouping in pandas for an explanation of this functionality.
weekend_grouped = df_daily.groupby('weekend')
False count 479.000000479.000000
max 37360.0000004.000000True count 192.000000192.000000
With an average of 11,621 steps on the weekend (and a median of 10,228) versus 10,146 (median = 9,742) on the weekdays, it looks like a slight edge goes to the weekends!
Let's get back to the upward trend.
I moved from Charlotte, NC, to New York City at the beginning of April to work as a software engineer at Yhat.
I'm curious what effect that this location change had on my daily step count. We can apply the same methodology as the weekend data to figure this out.
I'll just give you the good stuff:
That plot makes it look like there's a significant change after moving to NYC. There are more variables than just the location change, like the fact that I just started running more seriously, which would add some skew, but controlling for that effect would require more data. Maybe for another post!
Hopefully, this analysis was enough to get you started with checking out your step count data, using Rodeo to do some data-sciencing, and exploring the time series functionality of pandas! Have a look at my repo for this project if you want to check out the source.