Data Science

virajparekh94 · ‎04-06-2017

This is a basic tutorial using pandas and a few other packages to build a simple datapipe for getting NBA data. Even though this tutorial is done using NBA data, you don't need to be an NBA fan to follow along. The same concepts and techniques can be applied to any project of your choosing.

This is meant to be used as a general tutorial for beginners with some experience in Python or R.

Step One: What data do we need?

The first step to any data project is getting an idea of what you want. We're going to focus on getting NBA data at a team level on a game by game basis. From my experience, these team level stats usually exist in different places, making them harder to compare across games.

Our goal is to build box scores across a team level to easily compare them against each other. Hopefully this will give some insight as to how a team's play has changed over the course of the season or make it easier to do any other type of analysis.

On a high level, this might look something like:

Next step: Where is the data coming from?

stats.nba.com has all the NBA data that's out there, but the harder part is finding a quick way to fetch and manipulate it into the form that's needed (and what most of this tutorial will be about).

Analytics is fun, but everything around it can be tough.

We're going to use the nba_py package

Huge shoutout to https://github.com/seemethere for putting this together.

This is going to focus on team stats, so lets play around a little bit to get a sense of what we're working with.

Start by importing the packages we'll need:

import pandas as pd
from nba_py import team

If you're using jupyter notebooks notebooks you can pip-install any packages you don't have straight from the notebook using:

%%bash
pip install nba_py

If you're using Yhat's Python IDE, Rodeo you can install nba_py in the packages tab.

Install packages in the Packages tab. No surprises here.

So referring to the docs, it looks like we'll need some sort of roster id to get data for each team. This api hits an endpoint on the NBA"s website, so the IDs are most likely in the URL:

(Unapologetic Knicks bias) Looking at the team page for the on stats.nba.com, here's the url: http://stats.nba.com/team/#!/1610612752/

That number at the end looks like a team ID. Let's see how the passing data works:

class nba_py.team.TeamPassTracking(team_id, measure_type='Base', per_mode='PerGame', plus_minus='N', pace_adjust='N', rank='N', league_id='00', season='2016-17', season_type='Regular Season', po_round='0', outcome='', location='', month='0', season_segment='', date_from='', date_to='', opponent_team_id='0', vs_conference='', vs_division='', game_segment='', period='0', shot_clock_range='', last_n_games='0')

passes_made() passes_recieved()

knicks = team.TeamPassTracking(1610612752)

All the info is stored in the knicks object:

#the dataframe.head(N) command returns the first N rows of a dataframe
knicks.passes_made().head(10)

	TEAM_ID	TEAM_NAME	PASS_TYPE	G	PASS_FROM	PASS_TEAMMATE_PLAYER_ID	FREQUENCY	PASS	AST	FGM	FGA	FG_PCT	FG2M	FG2A	FG2_PCT	FG3M	FG3A	FG3_PCT
0	1610612752	New York Knicks	made	64	Rose, Derrick	201565	0.144	56.73	4.42	6.30	14.02	0.449	4.34	8.64	0.503	1.95	5.38	0.363
1	1610612752	New York Knicks	made	58	Jennings, Brandon	201943	0.111	48.22	4.93	7.09	15.47	0.458	5.31	10.50	0.506	1.78	4.97	0.358
2	1610612752	New York Knicks	made	66	Porzingis, Kristaps	204001	0.106	40.61	1.47	3.29	7.65	0.430	2.56	5.50	0.466	0.73	2.15	0.338
3	1610612752	New York Knicks	made	46	Noah, Joakim	201149	0.073	40.20	2.24	4.17	8.85	0.472	3.43	6.93	0.495	0.74	1.91	0.386
4	1610612752	New York Knicks	made	72	Anthony, Carmelo	2546	0.102	35.83	2.88	4.18	9.65	0.433	3.13	6.99	0.447	1.06	2.67	0.396
5	1610612752	New York Knicks	made	73	Lee, Courtney	201584	0.090	30.92	2.33	3.92	8.42	0.465	3.01	5.97	0.505	0.90	2.45	0.369
6	1610612752	New York Knicks	made	68	Hernangomez, Willy	1626195	0.076	28.26	1.25	2.32	5.50	0.422	1.74	3.93	0.442	0.59	1.57	0.374
7	1610612752	New York Knicks	made	46	Baker, Ron	1627758	0.045	24.93	1.87	2.61	5.72	0.456	1.93	3.80	0.509	0.67	1.91	0.352
8	1610612752	New York Knicks	made	46	Thomas, Lance	202498	0.042	23.24	0.76	1.93	4.67	0.414	1.70	3.78	0.448	0.24	0.89	0.268
9	1610612752	New York Knicks	made	75	O'Quinn, Kyle	203124	0.068	22.93	1.49	2.35	4.87	0.482	1.93	3.63	0.533	0.41	1.24	0.333

Did you know you can inspect, copy and save data frames in the History tab in Rodeo?

Referring back to the docs, this looks like per game averages for passes. Definitely a lot that can be done with this, but let's try to get it for a specific game. Referring to the docs:

knicks_last_game = team.TeamPassTracking(1610612752, last_n_games = 1)
knicks_last_game.passes_made().head(10)

	TEAM_ID	TEAM_NAME	PASS_TYPE	G	PASS_FROM	PASS_TEAMMATE_PLAYER_ID	FREQUENCY	PASS	AST	FGM	FGA	FG_PCT	FG2M	FG2A	FG2_PCT	FG3M	FG3A	FG3_PCT
0	1610612752	New York Knicks	made	1	Baker, Ron	1627758	0.212	72.0	6.0	7.0	15.0	0.467	7.0	11.0	0.636	0.0	4.0	0.000
1	1610612752	New York Knicks	made	1	Ndour, Maurice	1626254	0.135	46.0	1.0	3.0	9.0	0.333	3.0	4.0	0.750	0.0	5.0	0.000
2	1610612752	New York Knicks	made	1	Anthony, Carmelo	2546	0.126	43.0	2.0	5.0	16.0	0.313	4.0	13.0	0.308	1.0	3.0	0.333
3	1610612752	New York Knicks	made	1	O'Quinn, Kyle	203124	0.118	40.0	5.0	5.0	6.0	0.833	4.0	4.0	1.000	1.0	2.0	0.500
4	1610612752	New York Knicks	made	1	Lee, Courtney	201584	0.118	40.0	3.0	6.0	8.0	0.750	2.0	4.0	0.500	4.0	4.0	1.000
5	1610612752	New York Knicks	made	1	Hernangomez, Willy	1626195	0.082	28.0	3.0	4.0	8.0	0.500	4.0	6.0	0.667	0.0	2.0	0.000
6	1610612752	New York Knicks	made	1	Holiday, Justin	203200	0.071	24.0	3.0	4.0	7.0	0.571	4.0	6.0	0.667	0.0	1.0	0.000
7	1610612752	New York Knicks	made	1	Kuzminskas, Mindaugas	1627851	0.059	20.0	2.0	2.0	6.0	0.333	2.0	5.0	0.400	0.0	1.0	0.000
8	1610612752	New York Knicks	made	1	Randle, Chasson	1626184	0.044	15.0	0.0	0.0	1.0	0.000	0.0	1.0	0.000	0.0	0.0	NaN
9	1610612752	New York Knicks	made	1	Vujacic, Sasha	2756	0.035	12.0	1.0	2.0	3.0	0.667	2.0	2.0	1.000	0.0	1.0	0.000

This looks clean enough to be wrangled into a form that can be worked with.

If we're trying to create a team level box score, we're more than likely going to need to join tables together down the line, just something to keep in mind.

Hitting the ShotTracking endpoint looks interesting:

knicks_id = 1610612752
knicks_shots = team.TeamShotTracking(knicks_id, last_n_games = 1)


knicks_shots.closest_defender_shooting()

	TEAM_ID	TEAM_NAME	SORT_ORDER	G	CLOSE_DEF_DIST_RANGE	FGA_FREQUENCY	FGM	FGA	FG_PCT	EFG_PCT	FG2A_FREQUENCY	FG2M	FG2A	FG2_PCT	FG3A_FREQUENCY	FG3M	FG3A	FG3_PCT
0	1610612752	New York Knicks	1	1	0-2 Feet - Very Tight	0.091	4.0	8.0	0.500	0.500	0.091	4.0	8.0	0.500	0.000	0.0	0.0	NaN
1	1610612752	New York Knicks	2	1	2-4 Feet - Tight	0.318	15.0	28.0	0.536	0.536	0.295	15.0	26.0	0.577	0.023	0.0	2.0	0.000
2	1610612752	New York Knicks	3	1	4-6 Feet - Open	0.409	16.0	36.0	0.444	0.500	0.250	12.0	22.0	0.545	0.159	4.0	14.0	0.286
3	1610612752	New York Knicks	4	1	6+ Feet - Wide Open	0.182	7.0	16.0	0.438	0.500	0.102	5.0	9.0	0.556	0.080	2.0	7.0	0.286

Following along in Rodeo? Your view should look something like this.

This looks interesting! We wanted EFG% (effective field goal percentage) in our original table, but it looks like we can get EFG% for open and covered shots. Let's group 'Open' and 'Wide Open' together, along with 'Tight' and 'Very Tight.'

Effective field goal percentage is a statistic that adjusts field goal percentage to account for the fact that three-point field goals count for three points while field goals only count for two points:

This might help answer questions like "Do teams hit more open shots when they win?"

df_grouped = knicks_shots.closest_defender_shooting()

df_grouped['OPEN'] = df_grouped['CLOSE_DEF_DIST_RANGE'].map(lambda x : True if 'Open' in x else False)
##This creates a new column OPEN, mapped from the 'CLOSE_DEF_DIST_RANGE' column.
##http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.map.html


df_grouped

	TEAM_ID	TEAM_NAME	SORT_ORDER	G	CLOSE_DEF_DIST_RANGE	FGA_FREQUENCY	FGM	FGA	FG_PCT	EFG_PCT	FG2A_FREQUENCY	FG2M	FG2A	FG2_PCT	FG3A_FREQUENCY	FG3M	FG3A	FG3_PCT	OPEN
0	1610612752	New York Knicks	1	1	0-2 Feet - Very Tight	0.091	4.0	8.0	0.500	0.500	0.091	4.0	8.0	0.500	0.000	0.0	0.0	NaN	False
1	1610612752	New York Knicks	2	1	2-4 Feet - Tight	0.318	15.0	28.0	0.536	0.536	0.295	15.0	26.0	0.577	0.023	0.0	2.0	0.000	False
2	1610612752	New York Knicks	3	1	4-6 Feet - Open	0.409	16.0	36.0	0.444	0.500	0.250	12.0	22.0	0.545	0.159	4.0	14.0	0.286	True
3	1610612752	New York Knicks	4	1	6+ Feet - Wide Open	0.182	7.0	16.0	0.438	0.500	0.102	5.0	9.0	0.556	0.080	2.0	7.0	0.286	True

The last column 'OPEN' gives us the information we need. Now we can aggregate based off of it. Let's get the total number of open shots.

total_open_shots = df_grouped.loc[df_grouped['OPEN'] == True, 'FGA'].sum()
print total_open_shots

52.0

That looks like it worked. Similarly, we can get the total number of "covered" shots taken (looks like it's a lot higher...nothing surprising there.)

Keep in mind, this is a bit misleading, as layups and other shots near the basket are more likely to have a nearby defender.

Referring to the definition for EFG%:

We definitely have all the information we need to compute this for open and covered shots:

#Mapping the formula above into a column:
open_efg = (df_grouped.loc[df_grouped['OPEN']== True, 'FGM'].sum() + (.5 * df_grouped.loc[df_grouped['OPEN']== True, 'FG3M'].sum()))/(df_grouped.loc[df_grouped['OPEN']== True, 'FGA'].sum())
covered_efg = (df_grouped.loc[df_grouped['OPEN']== False, 'FGM'].sum() + (.5 * df_grouped.loc[df_grouped['OPEN']== False, 'FG3M'].sum()))/(df_grouped.loc[df_grouped['OPEN']== False, 'FGA'].sum())

print open_efg
print covered_efg


0.5
0.527777777778

Interesting... shooting better when there's a defender nearby makes it look like there's more to the story. Then again, nothing about the Knicks ever seems to makes sense.

Referring back to the original plan, it looks like we have most of the stats we set out to get. However, we still haven't addressed:

1) Who was the game against? Who won?

2) How many days rest did each team have?

3) How are we going to get all this data together?

From the looks of it, there isn't anything in the nba_py team modules we're using that can be directly used as an identifier.

However, it looks like we can get stats for date ranges. To test this, let's look at a single game the Knicks played on Sunday, January 29th:

date = '2017-01-29'
knicks_jan = team.TeamShotTracking(knicks_id, date_from = date, date_to =  date)

knicks_jan_shots =  knicks_jan.closest_defender_shooting()

knicks_jan_shots

	TEAM_ID	TEAM_NAME	SORT_ORDER	G	CLOSE_DEF_DIST_RANGE	FGA_FREQUENCY	FGM	FGA	FG_PCT	EFG_PCT	FG2A_FREQUENCY	FG2M	FG2A	FG2_PCT	FG3A_FREQUENCY	FG3M	FG3A	FG3_PCT
0	1610612752	New York Knicks	1	1	0-2 Feet - Very Tight	0.156	6.0	20.0	0.300	0.300	0.148	6.0	19.0	0.316	0.008	0.0	1.0	0.000
1	1610612752	New York Knicks	2	1	2-4 Feet - Tight	0.344	23.0	44.0	0.523	0.591	0.258	17.0	33.0	0.515	0.086	6.0	11.0	0.545
2	1610612752	New York Knicks	3	1	4-6 Feet - Open	0.320	13.0	41.0	0.317	0.390	0.156	7.0	20.0	0.350	0.164	6.0	21.0	0.286
3	1610612752	New York Knicks	4	1	6+ Feet - Wide Open	0.180	9.0	23.0	0.391	0.522	0.039	3.0	5.0	0.600	0.141	6.0	18.0	0.333

A quick check of the box score confirms that the Knicks shot a total of 128, so it looks like adding a date field will work out. We'll just need to figure out which dates to pass in:

We still don't know what the outcome was, so let's jump back into the docs to see if another module will help out.

#Hitting another endpoint
knicks_log = team.TeamGameLogs(knicks_id)

knicks_log.info()

	Team_ID	Game_ID	GAME_DATE	MATCHUP	WL	W	L	W_PCT	MIN	FGM	...	FT_PCT	OREB	DREB	REB	AST	STL	BLK	TOV	PF	PTS
0	1610612752	0021601160	APR 04, 2017	NYK vs. CHI	W	30	48	0.385	240	42	...	0.625	16	37	53	26	5	7	15	22	100
1	1610612752	0021601145	APR 02, 2017	NYK vs. BOS	L	29	48	0.377	240	33	...	0.840	8	24	32	20	12	2	11	20	94
2	1610612752	0021601133	MAR 31, 2017	NYK @ MIA	W	29	47	0.382	240	38	...	0.941	8	31	39	25	9	5	14	18	98
3	1610612752	0021601115	MAR 29, 2017	NYK vs. MIA	L	28	47	0.373	240	33	...	0.810	17	35	52	19	2	6	14	16	88
4	1610612752	0021601098	MAR 27, 2017	NYK vs. DET	W	28	46	0.378	240	45	...	0.923	4	33	37	26	13	5	12	16	109
5	1610612752	0021601085	MAR 25, 2017	NYK @ SAS	L	27	46	0.370	240	41	...	0.867	12	33	45	24	6	5	16	16	98
6	1610612752	0021601071	MAR 23, 2017	NYK @ POR	L	27	45	0.375	240	36	...	0.900	9	31	40	23	5	9	11	20	95
7	1610612752	0021601066	MAR 22, 2017	NYK @ UTA	L	27	44	0.380	240	38	...	0.889	9	27	36	19	5	1	11	26	101
8	1610612752	0021601050	MAR 20, 2017	NYK @ LAC	L	27	43	0.386	240	40	...	0.792	14	34	48	24	6	1	12	19	105
9	1610612752	0021601016	MAR 16, 2017	NYK vs. BKN	L	27	42	0.391	240	41	...	0.962	5	29	34	20	6	4	7	26	110
10	1610612752	0021601001	MAR 14, 2017	NYK vs. IND	W	27	41	0.397	240	35	...	0.615	11	41	52	21	8	4	14	15	87
11	1610612752	0021600986	MAR 12, 2017	NYK @ BKN	L	26	41	0.388	240	39	...	0.813	11	32	43	22	5	8	9	20	112
12	1610612752	0021600975	MAR 11, 2017	NYK @ DET	L	26	40	0.394	240	36	...	0.636	8	36	44	26	4	7	18	18	92
13	1610612752	0021600952	MAR 08, 2017	NYK @ MIL	L	26	39	0.400	240	39	...	0.667	10	33	43	22	4	6	15	20	93
14	1610612752	0021600935	MAR 06, 2017	NYK @ ORL	W	26	38	0.406	240	40	...	0.964	12	33	45	26	6	1	9	23	113
15	1610612752	0021600928	MAR 05, 2017	NYK vs. GSW	L	25	38	0.397	240	39	...	0.800	12	35	47	18	5	6	15	20	105
16	1610612752	0021600909	MAR 03, 2017	NYK @ PHI	L	25	37	0.403	240	33	...	0.879	9	32	41	14	10	3	10	20	102
17	1610612752	0021600895	MAR 01, 2017	NYK @ ORL	W	25	36	0.410	240	34	...	0.806	13	37	50	21	9	3	11	16	101
18	1610612752	0021600882	FEB 27, 2017	NYK vs. TOR	L	24	36	0.400	240	33	...	0.842	8	32	40	17	10	6	17	19	91
19	1610612752	0021600868	FEB 25, 2017	NYK vs. PHI	W	24	35	0.407	240	43	...	0.783	10	34	44	21	6	7	11	22	110
20	1610612752	0021600853	FEB 23, 2017	NYK @ CLE	L	23	35	0.397	240	42	...	0.706	16	34	50	24	4	7	12	19	104
21	1610612752	0021600845	FEB 15, 2017	NYK @ OKC	L	23	34	0.404	240	41	...	0.857	6	33	39	19	8	12	15	21	105
22	1610612752	0021600817	FEB 12, 2017	NYK vs. SAS	W	23	33	0.411	240	34	...	0.810	5	39	44	18	5	8	19	19	94
23	1610612752	0021600800	FEB 10, 2017	NYK vs. DEN	L	22	33	0.400	240	52	...	0.600	10	23	33	36	10	5	10	14	123
24	1610612752	0021600791	FEB 08, 2017	NYK vs. LAC	L	22	32	0.407	240	46	...	0.833	12	29	41	25	9	5	11	22	115
25	1610612752	0021600768	FEB 06, 2017	NYK vs. LAL	L	22	31	0.415	240	37	...	0.788	6	34	40	16	4	4	16	24	107
26	1610612752	0021600759	FEB 04, 2017	NYK vs. CLE	L	22	30	0.423	240	39	...	0.500	13	29	42	23	9	7	10	20	104
27	1610612752	0021600733	FEB 01, 2017	NYK @ BKN	W	22	29	0.431	240	35	...	0.613	21	37	58	23	16	7	13	18	95
28	1610612752	0021600724	JAN 31, 2017	NYK @ WAS	L	21	29	0.420	240	34	...	0.800	22	29	51	18	8	2	12	17	101
29	1610612752	0021600711	JAN 29, 2017	NYK @ ATL	L	21	28	0.429	340	51	...	0.826	15	48	63	32	9	11	12	39	139
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
48	1610612752	0021600456	DEC 25, 2016	NYK vs. BOS	L	16	14	0.533	240	41	...	0.889	17	32	49	11	5	6	17	23	114
49	1610612752	0021600438	DEC 22, 2016	NYK vs. ORL	W	16	13	0.552	240	41	...	0.882	18	34	52	26	9	9	15	18	106
50	1610612752	0021600421	DEC 20, 2016	NYK vs. IND	W	15	13	0.536	240	44	...	0.810	4	41	45	24	6	8	14	18	118
51	1610612752	0021600404	DEC 17, 2016	NYK @ DEN	L	14	13	0.519	240	35	...	0.923	9	26	35	18	6	4	11	27	114
52	1610612752	0021600388	DEC 15, 2016	NYK @ GSW	L	14	12	0.538	240	38	...	0.474	14	35	49	19	10	5	11	10	90
53	1610612752	0021600372	DEC 13, 2016	NYK @ PHX	L	14	11	0.560	265	38	...	0.737	11	32	43	23	10	4	13	27	111
54	1610612752	0021600360	DEC 11, 2016	NYK @ LAL	W	14	10	0.583	240	41	...	0.839	8	36	44	21	9	11	10	15	118
55	1610612752	0021600345	DEC 09, 2016	NYK @ SAC	W	13	10	0.565	240	36	...	0.840	12	42	54	22	3	6	16	25	103
56	1610612752	0021600327	DEC 07, 2016	NYK vs. CLE	L	12	10	0.545	240	35	...	0.867	13	30	43	22	6	3	16	22	94
57	1610612752	0021600316	DEC 06, 2016	NYK @ MIA	W	12	9	0.571	240	48	...	0.688	18	35	53	22	6	6	10	18	114
58	1610612752	0021600302	DEC 04, 2016	NYK vs. SAC	W	11	9	0.550	240	39	...	0.708	14	44	58	20	5	10	18	25	106
59	1610612752	0021600285	DEC 02, 2016	NYK vs. MIN	W	10	9	0.526	240	41	...	0.800	11	32	43	26	9	6	15	18	118
60	1610612752	0021600271	NOV 30, 2016	NYK @ MIN	W	9	9	0.500	240	41	...	0.733	12	27	39	24	8	3	13	26	106
61	1610612752	0021600255	NOV 28, 2016	NYK vs. OKC	L	8	9	0.471	240	36	...	0.893	12	28	40	20	9	11	5	16	103
62	1610612752	0021600241	NOV 26, 2016	NYK @ CHA	L	8	8	0.500	240	37	...	0.800	11	36	47	26	9	6	8	27	102
63	1610612752	0021600228	NOV 25, 2016	NYK vs. CHA	W	8	7	0.533	265	45	...	0.923	13	42	55	26	9	8	16	21	113
64	1610612752	0021600208	NOV 22, 2016	NYK vs. POR	W	7	7	0.500	240	45	...	1.000	10	33	43	26	8	5	13	23	107
65	1610612752	0021600193	NOV 20, 2016	NYK vs. ATL	W	6	7	0.462	240	42	...	0.714	11	39	50	21	8	1	15	23	104
66	1610612752	0021600169	NOV 17, 2016	NYK @ WAS	L	5	7	0.417	240	41	...	0.900	10	26	36	23	9	1	13	20	112
67	1610612752	0021600162	NOV 16, 2016	NYK vs. DET	W	5	6	0.455	240	42	...	0.632	19	33	52	24	8	9	9	11	105
68	1610612752	0021600146	NOV 14, 2016	NYK vs. DAL	W	4	6	0.400	240	34	...	0.889	14	37	51	18	5	5	17	16	93
69	1610612752	0021600131	NOV 12, 2016	NYK @ TOR	L	3	6	0.333	240	44	...	0.750	17	32	49	19	3	2	16	23	107
70	1610612752	0021600125	NOV 11, 2016	NYK @ BOS	L	3	5	0.375	240	33	...	0.882	21	36	57	19	7	11	25	26	87
71	1610612752	0021600106	NOV 09, 2016	NYK vs. BKN	W	3	4	0.429	240	44	...	0.706	9	41	50	25	11	5	14	21	110
72	1610612752	0021600087	NOV 06, 2016	NYK vs. UTA	L	2	4	0.333	240	42	...	0.895	10	29	39	18	8	5	12	26	109
73	1610612752	0021600073	NOV 04, 2016	NYK @ CHI	W	2	3	0.400	240	46	...	0.762	11	29	40	32	7	2	5	23	117
74	1610612752	0021600058	NOV 02, 2016	NYK vs. HOU	L	1	3	0.250	240	37	...	0.680	7	27	34	18	10	6	16	22	99
75	1610612752	0021600050	NOV 01, 2016	NYK @ DET	L	1	2	0.333	240	35	...	0.800	8	35	43	18	6	9	11	20	89
76	1610612752	0021600028	OCT 29, 2016	NYK vs. MEM	W	1	1	0.500	240	40	...	0.641	6	35	41	24	4	4	12	25	111
77	1610612752	0021600001	OCT 25, 2016	NYK @ CLE	L	0	1	0.000	240	32	...	0.750	13	29	42	17	6	6	18	22	88

78 rows × 27 columns

Looks like this can be manipulated to get rest days:

df_game_log = knicks_log.info()

df_game_log['GAME_DATE']=pd.to_datetime(df_game_log['GAME_DATE'])  ##converting the columns datatype

df_game_log['DAYS_REST'] = df_game_log['GAME_DATE']- df_game_log['GAME_DATE'].shift(-1)

df_game_log.head()

	Team_ID	Game_ID	GAME_DATE	MATCHUP	WL	W	L	W_PCT	MIN	FGM	...	OREB	DREB	REB	AST	STL	BLK	TOV	PF	PTS	DAYS_REST
0	1610612752	0021601160	2017-04-04	NYK vs. CHI	W	30	48	0.385	240	42	...	16	37	53	26	5	7	15	22	100	2 days
1	1610612752	0021601145	2017-04-02	NYK vs. BOS	L	29	48	0.377	240	33	...	8	24	32	20	12	2	11	20	94	2 days
2	1610612752	0021601133	2017-03-31	NYK @ MIA	W	29	47	0.382	240	38	...	8	31	39	25	9	5	14	18	98	2 days
3	1610612752	0021601115	2017-03-29	NYK vs. MIA	L	28	47	0.373	240	33	...	17	35	52	19	2	6	14	16	88	2 days
4	1610612752	0021601098	2017-03-27	NYK vs. DET	W	28	46	0.378	240	45	...	4	33	37	26	13	5	12	16	109	2 days

5 rows × 28 columns

df_game_log.dtypes




Team_ID                int64
Game_ID               object
GAME_DATE     datetime64[ns]
MATCHUP               object
WL                    object
W                      int64
L                      int64
W_PCT                float64
MIN                    int64
FGM                    int64
FGA                    int64
FG_PCT               float64
FG3M                   int64
FG3A                   int64
FG3_PCT              float64
FTM                    int64
FTA                    int64
FT_PCT               float64
OREB                   int64
DREB                   int64
REB                    int64
AST                    int64
STL                    int64
BLK                    int64
TOV                    int64
PF                     int64
PTS                    int64
DAYS_REST    timedelta64[ns]
dtype: object

#We have the information we need, but it's not the right data type. To switch it back:
df_game_log['DAYS_REST'] =  df_game_log['DAYS_REST'].astype('timedelta64[D]')
df_game_log.dtypes

Team_ID               int64
Game_ID              object
GAME_DATE    datetime64[ns]
MATCHUP              object
WL                   object
W                     int64
L                     int64
W_PCT               float64
MIN                   int64
FGM                   int64
FGA                   int64
FG_PCT              float64
FG3M                  int64
FG3A                  int64
FG3_PCT             float64
FTM                   int64
FTA                   int64
FT_PCT              float64
OREB                  int64
DREB                  int64
REB                   int64
AST                   int64
STL                   int64
BLK                   int64
TOV                   int64
PF                    int64
PTS                   int64
DAYS_REST           float64
dtype: object

This looks like we'll get all the info for all games. We'll start by appending the information for a single game and then try to do it for all dates:

df_game_log.head()

	Team_ID	Game_ID	GAME_DATE	MATCHUP	WL	W	L	W_PCT	MIN	FGM	...	OREB	DREB	REB	AST	STL	BLK	TOV	PF	PTS	DAYS_REST
0	1610612752	0021601160	2017-04-04	NYK vs. CHI	W	30	48	0.385	240	42	...	16	37	53	26	5	7	15	22	100	2.0
1	1610612752	0021601145	2017-04-02	NYK vs. BOS	L	29	48	0.377	240	33	...	8	24	32	20	12	2	11	20	94	2.0
2	1610612752	0021601133	2017-03-31	NYK @ MIA	W	29	47	0.382	240	38	...	8	31	39	25	9	5	14	18	98	2.0
3	1610612752	0021601115	2017-03-29	NYK vs. MIA	L	28	47	0.373	240	33	...	17	35	52	19	2	6	14	16	88	2.0
4	1610612752	0021601098	2017-03-27	NYK vs. DET	W	28	46	0.378	240	45	...	4	33	37	26	13	5	12	16	109	2.0

5 rows × 28 columns

#Get the dates from the game logs and pass them into the other functions:

dates = df_game_log['GAME_DATE']

print len(dates)


78

#We have the first date, so now to get the relevant passing and shot info we need:
date = dates[2] ##picking a random date

game_info = team.TeamPassTracking(knicks_id, date_from =date, date_to = date).passes_made()
game_info['GAME_DATE'] = date

game_info

	TEAM_ID	TEAM_NAME	PASS_TYPE	G	PASS_FROM	PASS_TEAMMATE_PLAYER_ID	FREQUENCY	PASS	AST	FGM	FGA	FG_PCT	FG2M	FG2A	FG2_PCT	FG3M	FG3A	FG3_PCT	GAME_DATE
0	1610612752	New York Knicks	made	1	Baker, Ron	1627758	0.238	67.0	6.0	7.0	18.0	0.389	5.0	14.0	0.357	2.0	4.0	0.5	2017-03-31
1	1610612752	New York Knicks	made	1	Vujacic, Sasha	2756	0.149	42.0	7.0	8.0	15.0	0.533	5.0	9.0	0.556	3.0	6.0	0.5	2017-03-31
2	1610612752	New York Knicks	made	1	Hernangomez, Willy	1626195	0.131	37.0	2.0	3.0	5.0	0.600	3.0	5.0	0.600	0.0	0.0	NaN	2017-03-31
3	1610612752	New York Knicks	made	1	Porzingis, Kristaps	204001	0.113	32.0	3.0	4.0	9.0	0.444	4.0	7.0	0.571	0.0	2.0	0.0	2017-03-31
4	1610612752	New York Knicks	made	1	Lee, Courtney	201584	0.106	30.0	1.0	4.0	6.0	0.667	4.0	5.0	0.800	0.0	1.0	0.0	2017-03-31
5	1610612752	New York Knicks	made	1	O'Quinn, Kyle	203124	0.096	27.0	3.0	3.0	5.0	0.600	3.0	5.0	0.600	0.0	0.0	NaN	2017-03-31
6	1610612752	New York Knicks	made	1	Holiday, Justin	203200	0.082	23.0	2.0	4.0	5.0	0.800	3.0	3.0	1.000	1.0	2.0	0.5	2017-03-31
7	1610612752	New York Knicks	made	1	Randle, Chasson	1626184	0.057	16.0	0.0	0.0	2.0	0.000	0.0	2.0	0.000	0.0	0.0	NaN	2017-03-31
8	1610612752	New York Knicks	made	1	Ndour, Maurice	1626254	0.028	8.0	1.0	1.0	2.0	0.500	1.0	2.0	0.500	0.0	0.0	NaN	2017-03-31

##Sum everything by GAME_DATE, similar to SQL-style aggregation/groupby:
df_sum = game_info.groupby(['GAME_DATE']).sum()

df_sum.reset_index(level =  0,  inplace =  True)
df_sum

	GAME_DATE	TEAM_ID	G	PASS_TEAMMATE_PLAYER_ID	FREQUENCY	PASS	AST	FGM	FGA	FG_PCT	FG2M	FG2A	FG2_PCT	FG3M	FG3A	FG3_PCT
0	2017-03-31	14495514768	9	7321056	1.0	282.0	25.0	34.0	67.0	4.533	28.0	52.0	4.984	6.0	15.0	1.5

When we merge this row back up to the bigger dataframe, we can drop the columns we don't need.

shot_info = team.TeamShotTracking(knicks_id, date_from = date, date_to = date).closest_defender_shooting()

shot_info

	TEAM_ID	TEAM_NAME	SORT_ORDER	G	CLOSE_DEF_DIST_RANGE	FGA_FREQUENCY	FGM	FGA	FG_PCT	EFG_PCT	FG2A_FREQUENCY	FG2M	FG2A	FG2_PCT	FG3A_FREQUENCY	FG3M	FG3A	FG3_PCT
0	1610612752	New York Knicks	1	1	0-2 Feet - Very Tight	0.200	8.0	16.0	0.500	0.500	0.200	8.0	16.0	0.500	0.000	0.0	0.0	NaN
1	1610612752	New York Knicks	2	1	2-4 Feet - Tight	0.450	15.0	36.0	0.417	0.417	0.425	15.0	34.0	0.441	0.025	0.0	2.0	0.0
2	1610612752	New York Knicks	3	1	4-6 Feet - Open	0.263	11.0	21.0	0.524	0.619	0.138	7.0	11.0	0.636	0.125	4.0	10.0	0.4
3	1610612752	New York Knicks	4	1	6+ Feet - Wide Open	0.088	4.0	7.0	0.571	0.714	0.038	2.0	3.0	0.667	0.050	2.0	4.0	0.5

#From  earlier
shot_info['OPEN'] = shot_info['CLOSE_DEF_DIST_RANGE'].map(lambda  x: True if 'Open' in x else False)
df_sum['OPEN_SHOTS'] = shot_info.loc[shot_info['OPEN']== True, 'FGA'].sum()
df_sum['OPEN_EFG'] = (shot_info.loc[shot_info['OPEN']== True, 'FGM'].sum() + (.5 * shot_info.loc[shot_info['OPEN']== True, 'FG3M'].sum()))/(shot_info.loc[shot_info['OPEN']== True, 'FGA'].sum())
df_sum['COVERED_EFG']= (shot_info.loc[shot_info['OPEN']== False, 'FGM'].sum() + (.5 * shot_info.loc[shot_info['OPEN']== False, 'FG3M'].sum()))/(shot_info.loc[shot_info['OPEN']== False, 'FGA'].sum())

df_sum

	GAME_DATE	TEAM_ID	G	PASS_TEAMMATE_PLAYER_ID	FREQUENCY	PASS	AST	FGM	FGA	FG_PCT	FG2M	FG2A	FG2_PCT	FG3M	FG3A	FG3_PCT	OPEN_SHOTS	OPEN_EFG	COVERED_EFG
0	2017-03-31	14495514768	9	7321056	1.0	282.0	25.0	34.0	67.0	4.533	28.0	52.0	4.984	6.0	15.0	1.5	28.0	0.642857	0.442308

Now to append the columns we need back up. This is going to work like a SQL left-join.

df_custom_boxscore  =  pd.merge(df_game_log, df_sum[['PASS', 'FG2M', 'OPEN_SHOTS' ,'OPEN_EFG', 'COVERED_EFG']], how = 'left', left_on = df_game_log['GAME_DATE'],  right_on =  df_sum['GAME_DATE'])

df_custom_boxscore.head(10)

	Team_ID	Game_ID	GAME_DATE	MATCHUP	WL	W	L	W_PCT	MIN	FGM	...	BLK	TOV	PF	PTS	DAYS_REST	PASS	FG2M	OPEN_SHOTS	OPEN_EFG	COVERED_EFG
0	1610612752	0021601160	2017-04-04	NYK vs. CHI	W	30	48	0.385	240	42	...	7	15	22	100	2.0	NaN	NaN	NaN	NaN	NaN
1	1610612752	0021601145	2017-04-02	NYK vs. BOS	L	29	48	0.377	240	33	...	2	11	20	94	2.0	NaN	NaN	NaN	NaN	NaN
2	1610612752	0021601133	2017-03-31	NYK @ MIA	W	29	47	0.382	240	38	...	5	14	18	98	2.0	282.0	28.0	28.0	0.642857	0.442308
3	1610612752	0021601115	2017-03-29	NYK vs. MIA	L	28	47	0.373	240	33	...	6	14	16	88	2.0	NaN	NaN	NaN	NaN	NaN
4	1610612752	0021601098	2017-03-27	NYK vs. DET	W	28	46	0.378	240	45	...	5	12	16	109	2.0	NaN	NaN	NaN	NaN	NaN
5	1610612752	0021601085	2017-03-25	NYK @ SAS	L	27	46	0.370	240	41	...	5	16	16	98	2.0	NaN	NaN	NaN	NaN	NaN
6	1610612752	0021601071	2017-03-23	NYK @ POR	L	27	45	0.375	240	36	...	9	11	20	95	1.0	NaN	NaN	NaN	NaN	NaN
7	1610612752	0021601066	2017-03-22	NYK @ UTA	L	27	44	0.380	240	38	...	1	11	26	101	2.0	NaN	NaN	NaN	NaN	NaN
8	1610612752	0021601050	2017-03-20	NYK @ LAC	L	27	43	0.386	240	40	...	1	12	19	105	4.0	NaN	NaN	NaN	NaN	NaN
9	1610612752	0021601016	2017-03-16	NYK vs. BKN	L	27	42	0.391	240	41	...	4	7	26	110	2.0	NaN	NaN	NaN	NaN	NaN

10 rows × 33 columns

Looks like everything joined correctly for exactly the date we chose. Let's make some modifications and then work on a script to join the rest of the dates.

df_custom_boxscore['PASS_AST'] = df_custom_boxscore['PASS'] / df_custom_boxscore['AST']

#It's easier to work with dichotomos variables as binaries instead of letters
df_custom_boxscore['RESULT'] = df_custom_boxscore['WL'].map(lambda x: 1 if 'W' in x else 0)

We should be good to go!

Put all the steps above into a function:

def custom_boxscore(roster_id😞

    game_logs  = team.TeamGameLogs(roster_id)

    df_game_logs = game_logs.info()
    df_game_logs['GAME_DATE'] =  pd.to_datetime(df_game_logs['GAME_DATE'])
    df_game_logs['DAYS_REST'] =  df_game_logs['GAME_DATE'] - df_game_logs['GAME_DATE'].shift(-1)
    df_game_logs['DAYS_REST'] =  df_game_logs['DAYS_REST'].astype('timedelta64[D]')

    ##Just like before, that should get us the gamelogs we need and the rest days column

    ##Now to loop through the list of dates for our other stats

    ##This will build up a dataframe of the custom stats and join that to the gamelogs
    df_all =pd.DataFrame() ##blank dataframe

    dates = df_game_logs['GAME_DATE']

    for date in dates:

        game_info = team.TeamPassTracking(roster_id,  date_from=date, date_to=date).passes_made()
        game_info['GAME_DATE'] = date ## We need to append the date to this so we can  join back

        temp_df = game_info.groupby(['GAME_DATE']).sum()
        temp_df.reset_index(level =  0,  inplace =  True)

        ##now to get the shot info. For the most part, we're just reusing code we've already written
        open_info =  team.TeamShotTracking(roster_id,date_from =date,  date_to =  date).closest_defender_shooting()
        open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda x: True if 'Open' in x else False)

        temp_df['OPEN_SHOTS'] = open_info.loc[open_info['OPEN'] == True, 'FGA'].sum()
        temp_df['OPEN_EFG']= (open_info.loc[open_info['OPEN']== True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== True, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== True, 'FGA'].sum())
        temp_df['COVERED_EFG']= (open_info.loc[open_info['OPEN']== False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== False, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== False, 'FGA'].sum())

        ##append this to our bigger dataframe
        df_all = df_all.append(temp_df)

    df_boxscore =  pd.merge(df_game_logs, df_all[['PASS', 'FG2M', 'FG2_PCT', 'OPEN_SHOTS', 'OPEN_EFG', 'COVERED_EFG']], how = 'left', left_on = df_game_logs['GAME_DATE'], right_on = df_all['GAME_DATE'])
    df_boxscore['PASS_AST'] = df_boxscore['PASS'] /  df_boxscore['AST']
    df_boxscore['RESULT'] = df_boxscore['WL'].map(lambda x: 1 if 'W' in x else 0 )

    return df_boxscore

Let's see if this worked:

df_knicks_box_scores = custom_boxscore(knicks_id)

df_knicks_box_scores.head(10)

	Team_ID	Game_ID	GAME_DATE	MATCHUP	WL	W	L	W_PCT	MIN	FGM	...	PTS	DAYS_REST	PASS	FG2M	FG2_PCT	OPEN_SHOTS	OPEN_EFG	COVERED_EFG	PASS/ASSIST	RESULT
0	1610612752	0021600845	2017-02-15	NYK @ OKC	L	23	34	0.404	240	41	...	105	3.0	339.0	26.0	4.070	38.0	0.500000	0.572917	17.842105	0
1	1610612752	0021600817	2017-02-12	NYK vs. SAS	W	23	33	0.411	240	34	...	94	2.0	261.0	22.0	4.882	28.0	0.750000	0.437500	14.500000	1
2	1610612752	0021600800	2017-02-10	NYK vs. DEN	L	22	33	0.400	240	52	...	123	2.0	313.0	31.0	5.733	46.0	0.652174	0.638298	8.694444	0
3	1610612752	0021600791	2017-02-08	NYK vs. LAC	L	22	32	0.407	240	46	...	115	2.0	336.0	36.0	4.981	47.0	0.542553	0.544444	13.440000	0
4	1610612752	0021600768	2017-02-06	NYK vs. LAL	L	22	31	0.415	240	37	...	107	2.0	316.0	30.0	4.501	35.0	0.571429	0.445652	19.750000	0
5	1610612752	0021600759	2017-02-04	NYK vs. CLE	L	22	30	0.423	240	39	...	104	3.0	308.0	21.0	3.457	46.0	0.510870	0.486842	13.391304	0
6	1610612752	0021600733	2017-02-01	NYK @ BKN	W	22	29	0.431	240	35	...	95	1.0	305.0	23.0	4.150	37.0	0.297297	0.435484	13.260870	1
7	1610612752	0021600724	2017-01-31	NYK @ WAS	L	21	29	0.420	240	34	...	101	2.0	293.0	24.0	5.245	41.0	0.317073	0.460784	16.277778	0
8	1610612752	0021600711	2017-01-29	NYK @ ATL	L	21	28	0.429	340	51	...	139	2.0	479.0	32.0	4.137	64.0	0.437500	0.500000	14.968750	0
9	1610612752	0021600699	2017-01-27	NYK vs. CHA	W	21	27	0.438	240	46	...	110	2.0	350.0	31.0	6.167	37.0	0.594595	0.483051	15.909091	1

10 rows × 36 columns

I'm going to throw in a safeguard against divide by 0 errors just in case. This is a really janky, ugly fix, but it'll get the job done for the time being:

def custom_boxscore(roster_id😞
    game_logs  = team.TeamGameLogs(roster_id)

    df_game_logs = game_logs.info()

    df_game_logs['GAME_DATE'] =  pd.to_datetime(df_game_logs['GAME_DATE'])
    df_game_logs['days_rest'] =  df_game_logs['GAME_DATE'] - df_game_logs['GAME_DATE'].shift(-1)
    df_game_logs['days_rest'] =  df_game_logs['days_rest'].astype('timedelta64[D]')

    ##Just like before, that should get us the gamelogs we need and the rest days column

    ##Now to loop through the list of dates for our other stats

    ##Build up a  dataframe of our custom stats and join that to the gamelogs instead of joining  each individual row

    df_all =pd.DataFrame() ##blank dataframe

    dates = df_game_logs['GAME_DATE']

    for date in dates:

        game_info = team.TeamPassTracking(roster_id,  date_from=date, date_to=date).passes_made()
        game_info['GAME_DATE'] = date ## We need to append the date to this so we can  join back

        temp_df = game_info.groupby(['GAME_DATE']).sum()
        temp_df.reset_index(level =  0,  inplace =  True)

        ##now to get the shot info. For the most part, we're just reusing code we've already written
        open_info =  team.TeamShotTracking(roster_id,date_from =date,  date_to =  date).closest_defender_shooting()
        open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda x: True if 'Open' in x else False)

        temp_df['OPEN_SHOTS'] = open_info.loc[open_info['OPEN'] == True, 'FGA'].sum()
        temp_df['COVERED_SHOTS'] = open_info.loc[open_info['OPEN'] == False, 'FGA'].sum()

        if open_info.loc[open_info['OPEN']== True, 'FGA'].sum() > 0:
            temp_df['OPEN_EFG']= (open_info.loc[open_info['OPEN']== True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== True, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== True, 'FGA'].sum())
        else:
             temp_df['OPEN_EFG'] = 0

        if open_info.loc[open_info['OPEN']== False, 'FGA'].sum() > 0:
             temp_df['COVER_EFG']= (open_info.loc[open_info['OPEN']== False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== False, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== False, 'FGA'].sum())
        else:
            temp_df['COVER_EFG'] = 0
        ##append this to our bigger dataframe

        df_all = df_all.append(temp_df)

    df_boxscore =  pd.merge(df_game_logs, df_all[['PASS', 'FG2M', 'FG2_PCT', 'OPEN_SHOTS','COVERED_SHOTS', 'OPEN_EFG', 'COVER_EFG']], how = 'left', left_on = df_game_logs['GAME_DATE'], right_on = df_all['GAME_DATE'])
    df_boxscore['PASS_ASSIST'] = df_boxscore['PASS'] /  df_boxscore['AST']
    df_boxscore['RESULT'] = df_boxscore['WL'].map(lambda x: 1 if 'W' in x else 0 )

    return df_boxscore

    df_knicks_box_scores = custom_boxscore(knicks_id)

Awesome! Looks like everything came out okay. With a team_id, we can do this with every team. We just need a get a list of team_ids and team names.

From the documentation:

http://nba-py.readthedocs.io/en/0.1a2/nba_py/

import nba_py

teams =  nba_py.Scoreboard()
df_teams = pd.concat([teams.east_conf_standings_by_day()[['TEAM','TEAM_ID']], teams.west_conf_standings_by_day()[['TEAM','TEAM_ID']]])

df_teams.head()

	TEAM	TEAM_ID
0	Boston	1610612738
1	Cleveland	1610612739
2	Toronto	1610612761
3	Washington	1610612764
4	Milwaukee	1610612749

Now we can pass in the team IDs to create custom boxscores for all teams.

from nba_py import team
import time
def game_summary_teams(roster_ids, team_names😞
    print roster_ids, team_names

    for i in range (0, len(roster_ids)):
        time.sleep(35)
        print roster_ids[i]
        print team_names[i]

        info = team.TeamGameLogs(roster_ids[i])
        df = info.info()
        df['GAME_DATE'] = pd.to_datetime(df['GAME_DATE'])
        df['DAYS_REST'] =   df['GAME_DATE'] - df['GAME_DATE'].shift(-1) ##this gives us our days rest column
        df['DAYS_REST']= df['DAYS_REST'].astype('timedelta64[D]')

        vals  = df['GAME_DATE']

        df_passes =  pd.DataFrame()
        for v in vals:
            time.sleep(.5)
            ##these values are  being over-written each time
            print 'GAME FOR' +  team_names[i] + ' for ' + str(v)
            game_info = team.TeamPassTracking(roster_ids[i], date_from =v, date_to = v).passes_made()
            time.sleep(5)

            game_info['EVENT_DATE'] = v
            df_sum = game_info.groupby(['EVENT_DATE']).sum()
            df_sum.reset_index(level = 0, inplace =  True)

            open_info = team.TeamShotTracking(roster_ids[i], date_from =v, date_to = v).closest_defender_shooting()
            time.sleep(5)
            open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda  x: True if 'Open' in x else False)
            df_sum['OPEN_SHOTS'] = open_info.loc[open_info['OPEN']== True, 'FGA'].sum()
            df_sum['COVERED_SHOTS'] = open_info.loc[open_info['OPEN']== False, 'FGA'].sum()

            ##gotta figure out a better divide by 0 fix.
            if (open_info.loc[open_info['OPEN']== True, 'FGA'].sum() > 0😞
                df_sum['OPEN_EFG']= (open_info.loc[open_info['OPEN']== True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== True, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== True, 'FGA'].sum())
            else:
                df_sum['OPEN_EFG'] = 0

            if (open_info.loc[open_info['OPEN']== False, 'FGA'].sum() > 0😞
                df_sum['COVERED_EFG']= (open_info.loc[open_info['OPEN']== False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== False, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== False, 'FGA'].sum())
            else:
                df_sum['COVERED_EFG']=0
            df_passes = df_passes.append(df_sum)

            info = pd.merge(df, df_passes[['PASS', 'FG2M', 'FG2_PCT','OPEN_SHOTS', 'OPEN_EFG','COVERED_SHOTS', 'COVERED_EFG']], how = 'left',left_on = df['GAME_DATE'], right_on = df_passes['EVENT_DATE'])
            info['PASS_AST'] = info['PASS']/info['AST']

            info['RESULT'] = info['WL'].map(lambda x: 1 if 'W' in x else 0 )

            file_name = team_names[i]
            info.to_csv(file_name, index =  False)

teams = df_teams['TEAM']
roster_ids = df_teams['TEAM_ID']

Just feed in these two arrays into the function and we should be good to go.

I went ahead and did this for a few teams. The NBA's website cuts you off if you make too many requests too quickly (hence all the sleep statements above).

After fiddling with it for a while, I was finally able to get the data for each team. You might have to run the code above piece by piece, or just use the CSVs here:

Visualization

Let's see if we can visually represent anything about each team's offense:

These visualizations are going to be done in Plotly because I think it's the best vizualiation library out there for quickly and easily making graphs that are both visually appleaing and interactive, but feel free to use something else (although I can't imagine why you would).

This is going to break my heart a bit...but lets compare some of the stats fetched between the Knicks and teams that aren't the Knicks.

Something about not counting another man's money right?

knicks = pd.read_csv('New York.csv')
spurs =  pd.read_csv('San Antonio.csv')
warriors = pd.read_csv('Golden State.csv')
thunder = pd.read_csv('Oklahoma City.csv')
celtics =  pd.read_csv('Boston.csv')

import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=knicks['PASS'],
    name='Knicks',
    boxmean='sd'
)
trace1 = go.Box(
    y=spurs['PASS'],
    name='Spurs',
    boxmean='sd'
)
trace2 = go.Box(
    y=warriors['PASS'],
    name='Warriors',
    boxmean='sd'
)
trace3 = go.Box(
    y=thunder['PASS'],
    name='Thunder',
    boxmean='sd'
)
trace4 = go.Box(
    y=celtics['PASS'],
    name='Celtics',
    boxmean='sd'
)

layout = go.Layout(
    title='Passing Box Plot',
)
data = [trace0, trace1, trace2, trace3, trace4]

fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Ignoring the outlier from the 4OT Knicks-Hawks game, this graph is pretty telling. Obviously this isn't the full story, but it looks like the Spurs and Thunder play pretty consistent but different offenses. What's really interesting is that despite the Spurs and Warriors having coaches and systems that emphasize ball movement, they throw FEWER passes than a team like the Knicks.

Let's look at if those passes translate to assists:

import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=knicks['PASS_AST'],
    name='Knicks',
    boxmean='sd'
)
trace1 = go.Box(
    y=spurs['PASS_AST'],
    name='Spurs',
    boxmean='sd'
)
trace2 = go.Box(
    y=warriors['PASS_AST'],
    name='Warriors',
    boxmean='sd'
)
trace3 = go.Box(
    y=thunder['PASS_AST'],
    name='Thunder',
    boxmean='sd'
)
trace4 = go.Box(
    y=celtics['PASS_AST'],
    name='Celtics',
    boxmean='sd'
)
data = [trace0, trace1, trace2, trace3, trace4]

layout = go.Layout(
    title='Passes per Assist',
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

These two graphs in conjunction are pretty telling. On average, it takes the Knicks almost 2 more passes than the Spurs, and 5 more than the Warriors to get an assist.

Going by this graph every, ~10th pass the Warriors make results in an assist. From the previous graph, we see that they make an average of 313 passes a game. This almost lines up with their season average of roughly 31 assists/game.

The standard deviation of the above graph can be interpreted in a few ways. On one hand, it's a loose metric of playstyle consistency; teams that play the same way through the entire game are probably going to have a lower standard deviation than teams who pass the ball for 3 quarters and forget to in the 4th (cough cough, New York, cough cough). On the other hand, teams might have different playstyles depending on the lineups they have on the floor, resulting in a higher standard deviation (Spurs).

The Thunder probably fall into this, most likely due to Russell Westbrook averaging over 10 of the team's total 20 assists per game.

Obviously, there's a lot more to the story. How many passes led to FTs? Is there any correlation between passes per assist and wins? If anything, stats like these tell you more about what kind of offense a team runs, not how effectively they run it.

Now let's see if there's any noticeable difference in wins vs losses:

import plotly.plotly as py
import plotly.graph_objs as go

trace0 = go.Box(
    y=knicks.loc[knicks['WL'] == 'W']['PASS_AST'],
    name='Knicks Wins',
    boxmean='sd'
)
trace1 = go.Box(
    y=knicks.loc[knicks['WL'] == 'L']['PASS_AST'],
    name='Knicks Loss',
    boxmean='sd'
)
trace2 = go.Box(
    y=spurs.loc[spurs['WL'] == 'W']['PASS_AST'],
    name='Spurs Wins',
    boxmean='sd'
)
trace3 = go.Box(
    y=spurs.loc[spurs['WL'] == 'L']['PASS_AST'],
    name='Spurs Loss',
    boxmean='sd'

)
trace4 = go.Box(
    y=warriors.loc[warriors['WL'] == 'W']['PASS_AST'],
    name='Warriors Wins',
    boxmean='sd'
)
trace5 = go.Box(
    y=warriors.loc[warriors['WL'] == 'L']['PASS_AST'],
    name='Warriors Losses',
    boxmean='sd'
)
trace6 = go.Box(
    y=thunder.loc[thunder['WL'] == 'W']['PASS_AST'],
    name='Thunder Wins',
    boxmean='sd'
)
trace7 = go.Box(
    y=thunder.loc[thunder['WL'] == 'L']['PASS_AST'],
    name='Thunder Losses',
    boxmean='sd'
)
trace8 = go.Box(
    y=celtics.loc[celtics['WL'] == 'W']['PASS_AST'],
    name='Celtics Wins',
    boxmean='sd'
)
trace9 = go.Box(
    y=celtics.loc[celtics['WL'] == 'L']['PASS_AST'],
    name='Celtics Lossses',
    boxmean='sd'
)
layout = go.Layout(
    title='Passes per Assist in Wins vs Losses',
)
data = [trace0, trace1, trace2, trace3, trace4, trace5, trace6, trace7, trace8, trace9]
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

Apart from the Celtics, every team had to make more at least 1 more pass to get an assist in games they lost compared to games they lost. In one way, it's almost like they have to "work harder" for assists.

From the looks of this graph, the Warriors offense when its firing on all cylinders is a in a league of its own.

Just to reiterate once again, the purpose of the visualizations above is to ask, not answer questions.

But now, let's see if we can get any team specific insights from any of this:

The Clippers have played without Chris Paul and Blake Griffin, two of the best passers at their position in the league.

Do the boxscores show how their offense has had to adjust?

clippers  = pd.read_csv('LA Clippers.csv')
clippers  = clippers.sort(['GAME_DATE'], ascending = True)
clippers.head()

/home/virajparekh/anaconda2/lib/python2.7/site-packages/ipykernel/__main__.py:2: FutureWarning:

sort(columns=....) is deprecated, use sort_values(by=.....)

	Unnamed: 0	Team_ID	Game_ID	GAME_DATE	MATCHUP	WL	W	L	W_PCT	MIN	...	DAYS_REST	PASS	FG2M	FG2_PCT	OPEN_SHOTS	COVERED_SHOTS	OPEN_EFG	COVERED_EFG	PASS_AST	RESULT
77	53.0	1610612746	21600017	2016-10-27	LAC @ POR	W	1	0	1.00	240	...	NaN	301.0	21.0	4.047	41.0	50.0	0.463415	0.440000	25.083333	1.0
76	52.0	1610612746	21600035	2016-10-30	LAC vs. UTA	W	2	0	1.00	240	...	3.0	275.0	22.0	4.757	34.0	48.0	0.558824	0.375000	16.176471	1.0
75	51.0	1610612746	21600045	2016-10-31	LAC vs. PHX	W	3	0	1.00	240	...	1.0	276.0	28.0	4.795	38.0	42.0	0.565789	0.511905	13.142857	1.0
74	50.0	1610612746	21600064	2016-11-02	LAC vs. OKC	L	3	1	0.75	240	...	2.0	302.0	24.0	2.893	33.0	54.0	0.424242	0.435185	13.727273	0.0
73	49.0	1610612746	21600074	2016-11-04	LAC @ MEM	W	4	1	0.80	240	...	2.0	300.0	17.0	2.681	41.0	44.0	0.451220	0.397727	15.789474	1.0

5 rows × 38 columns

clippers_rolling = clippers[['GAME_DATE','PTS','TOV','PASS',  'OPEN_SHOTS', 'OPEN_EFG', 'AST',  'PASS_AST']].rolling(5).mean()

clippers_rolling.head(10)

	GAME_DATE	PTS	TOV	PASS	OPEN_SHOTS	OPEN_EFG	AST	PASS_AST
77	2016-10-27	NaN	NaN	NaN	NaN	NaN	NaN	NaN
76	2016-10-30	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75	2016-10-31	NaN	NaN	NaN	NaN	NaN	NaN	NaN
74	2016-11-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN
73	2016-11-04	100.0	12.8	290.8	37.4	0.492698	18.2	16.783881
72	2016-11-05	100.4	13.0	293.6	37.4	0.514649	20.6	14.392215
71	2016-11-07	105.6	12.4	302.4	39.2	0.544745	22.8	13.435492
70	2016-11-09	104.6	10.6	303.0	38.8	0.537143	23.4	13.131921
69	2016-11-11	110.0	9.2	308.8	41.2	0.563405	22.6	14.064244
68	2016-11-12	114.0	9.8	306.6	40.8	0.591110	23.8	13.218349

If we want to see all of this on the same graph, we need to normalize it. This means we're going to scale each value by subtracting it from the mean, and dividing by standard deviation.

This is a bit of a janky way to do so because it relies on the columns being in the same order.

clippers_norm = clippers[['GAME_DATE', 'PTS','TOV','PASS',  'OPEN_SHOTS', 'OPEN_EFG', 'AST',  'PASS_AST']]
for c in clippers_norm.columns[1:]:
    clippers_rolling[c] = (clippers_rolling[c] - clippers_norm[c].mean())/clippers_norm[c].std()

clippers_rolling.head(10)

	GAME_DATE	PTS	TOV	PASS	OPEN_SHOTS	OPEN_EFG	AST	PASS_AST
77	2016-10-27	NaN	NaN	NaN	NaN	NaN	NaN	NaN
76	2016-10-30	NaN	NaN	NaN	NaN	NaN	NaN	NaN
75	2016-10-31	NaN	NaN	NaN	NaN	NaN	NaN	NaN
74	2016-11-02	NaN	NaN	NaN	NaN	NaN	NaN	NaN
73	2016-11-04	-0.705388	0.033424	-0.403622	-0.504550	-0.714018	-0.878286	0.739290
72	2016-11-05	-0.672092	0.088895	-0.284434	-0.504550	-0.480817	-0.376102	0.083865
71	2016-11-07	-0.239256	-0.077516	0.090155	-0.202852	-0.161093	0.084234	-0.178321
70	2016-11-09	-0.322493	-0.576750	0.115695	-0.269896	-0.241857	0.209780	-0.261513
69	2016-11-11	0.126991	-0.965042	0.362583	0.132369	0.037146	0.042385	-0.006014
68	2016-11-12	0.459943	-0.798631	0.268936	0.065325	0.331470	0.293477	-0.237828

import plotly.plotly as py
import plotly.graph_objs as go

trace = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['PASS'],
    name = 'Pass to Assist'
)
trace1 = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['PTS'],
    name = 'Points'
)
trace2 = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['AST'],
    name = 'Assists'
)
trace3 = go.Scatter(
    x = clippers_rolling['GAME_DATE'],
    y = clippers_rolling['PASS_AST'],
    name = 'PASSES PER ASSIST'
)
data = [trace1,trace2 ,trace3]


layout = go.Layout(
    title  = 'Clippers Offense',
    showlegend=True,
    annotations=[
        dict(
            x='2016-12-20',
            y=-1,
            xref='x',
            yref='y',
            text='Blake Griffin Injury',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-330
        ),
          dict(
            x='2017-01-19',
            y=-1,
            xref='x',
            yref='y',
            text='Chris Paul Injury',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-275
        ),
        dict(
            x='2017-01-24',
            y=-1,
            xref='x',
            yref='y',
            text='Blake Griffin Returns',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-330
        ),
         dict(
            x='2017-02-24',
            y=-1,
            xref='x',
            yref='y',
            text='Chris Paul Returns',
            showarrow=True,
            arrowhead=7,
            ax=0,
            ay=-330
        )
    ]
)
fig = go.Figure(data=data, layout=layout)
py.iplot(fig)

According to this, Blake's abscence definitely had an effect on how the team runs their offense. Passes per assist went up just as total points went down after Blake's injury.

From the looks of it, the Clippers were starting to adjust to Blake being out just as CP3 got hurt, but for the most part, there's too much noise here.

Blake gets injured almost every year, so I wonder how this graph looks for 2015-2016 data.....

This is just the tip of the iceburg. Exploratory analysis like this is about finding interesting questions, not answering them.

Building the Pipe:

Now that we can what we can do with the data, let's get everything piping into a database.

If you see yourself doing some of your own analysis or just want some more experience moving and cleaning data, this section is a high level overview of how to do that:

I stored the output from the previous function in CSVs in the same directory as this notebook so they can be easily imported.

Pandas has a built in to_sql function that works with sqlalchemy:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_sql.html

import glob, os
import time
files = []
os.chdir("YOUR_DIRECTORY_HERE")
for file in glob.glob("*.csv"😞
    files.append(file)

from sqlalchemy import *
for f in files:

    team_name = str(f)
    team_name = team_name.partition('.csv')[0].replace(' ','_').replace('.','').lower()

    df  = pd.read_csv(f)
    eng = create_engine("YOUR_CONNECTION_STRING_HERE")
    df.to_sql(con =  eng,  index = False, name = team_name, schema  =  'nba', if_exists = 'append')
    time.sleep(10)

The python sqlalchemy package supports most databases, so refer to the documentation for each db's connection string:

http://www.sqlalchemy.org/

What would really make analysis easier is if information updated after every game. Let's try using the info we have to update one of the CSVs, and then use that to do it for all the other files:

def update_info(file😞

    file_name =  str(file)
    name  =  file_name.partition('.csv')[0].replace(' ','_').replace('.','').lower()
    old_info = pd.read_csv(file_name)

    team_id = old_info['Team_ID'].max()
    new_logs = team.TeamGameLogs(team_id).info()

    old_info['GAME_DATE'] =  pd.to_datetime(old_info['GAME_DATE'])
    order = old_info.columns
    new_logs['GAME_DATE'] = pd.to_datetime(new_logs['GAME_DATE'])

    ## If there's no new  games for the team, return:
    if max(new_logs['GAME_DATE']) ==  max(old_info['GAME_DATE']):
        return

    new_logs['DAYS_REST']= new_logs['GAME_DATE'] - new_logs['GAME_DATE'].shift(-1) ##this gives us our days rest column
    new_logs['DAYS_REST']= new_logs['DAYS_REST'].astype('timedelta64[D]')

    ##keeping datatypes consistent
    new_logs['Game_ID'] = new_logs['Game_ID'].astype(str).astype(int)

    ##Append the info from the previously saved CSV and append it to the new game logs
    info =  pd.concat([old_info, new_logs], ignore_index = True)

    ##Drop the duplicates
    info = info.drop_duplicates(['Game_ID'], keep = 'first')

    ## Sort by date
    info = info.sort(['GAME_DATE'], ascending = [0])

    ##Reset the axis
    info =  info.reset_index(drop = True)

    ##Find the dates where there's  no values for any of the stats we fetched. We can make requests for only those dates
    updates = info.loc[np.isnan(info['COVERED_EFG'])]

    ##If the team's boxscore is up to date, return.
    if len(updates) == 0:
        return
    dates  =  updates['GAME_DATE']

    df_passes =  pd.DataFrame()
    for d in dates:
        ##All exactly the same as before

        game_info = team.TeamPassTracking(team_id, date_from =d, date_to = d).passes_made()
        game_info['EVENT_DATE'] = d

        df_sum = game_info.groupby(['EVENT_DATE']).sum()
        df_sum.reset_index(level = 0, inplace =  True)

        open_info = team.TeamShotTracking(team_id, date_from =d, date_to = d).closest_defender_shooting()

        open_info['OPEN'] = open_info['CLOSE_DEF_DIST_RANGE'].map(lambda  x: True if 'Open' in x else False)
        df_sum['OPEN_SHOTS'] = open_info.loc[open_info['OPEN']== True, 'FGA'].sum()
        df_sum['COVERED_SHOTS'] = open_info.loc[open_info['OPEN']== False, 'FGA'].sum()

        if (open_info.loc[open_info['OPEN']== True, 'FGA'].sum() > 0😞
            df_sum['OPEN_EFG']= (open_info.loc[open_info['OPEN']== True, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== True, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== True, 'FGA'].sum())
        else:
            df_sum['OPEN_EFG'] = 0

        if (open_info.loc[open_info['OPEN']== False, 'FGA'].sum() > 0😞
            df_sum['COVERED_EFG']= (open_info.loc[open_info['OPEN']== False, 'FGM'].sum() + (.5 * open_info.loc[open_info['OPEN']== False, 'FG3M'].sum()))/(open_info.loc[open_info['OPEN']== False, 'FGA'].sum())
        else:
            df_sum['COVERED_EFG']=0

        df_passes = df_passes.append(df_sum)

    df_passes = df_passes.reset_index(drop = True)

    ##Join the new stats with the old information.
    info.update(df_passes[['PASS', 'FG2M', 'FG2_PCT', 'OPEN_SHOTS', 'OPEN_EFG', 'COVERED_EFG','COVERED_SHOTS']])

    ##Calculate these two post join in case there were any stat corrections
    info['PASS_AST'] = info['PASS'] /  info['AST']
    info['RESULT'] = info['WL'].map(lambda x: 1 if 'W' in x else 0 )

    ##Reorder the columns in the dataframe
    info = info[order]

    ##Save to csv
    info.to_csv(file_name, index = False)

    ##Upload, and replace the information already there. Since we're working with a relatively small volume of data,
    ##upserts isn't worth the time.
    info.to_sql(con =  eng,  index = False, name = name, schema  =  'nba', if_exists = 'replace')

    print name

    return info

You can put the script above in a seperate .py file in the same directory as all of the team CSVs and schedule it via crontab to run automatically so your database and CSVs will always contain the latest information (assuming stats.nba.com doesn't change anything on their end).

Here's a great tutorial on how to do so: https://www.youtube.com/watch?v=hDJ3XQzW8nk

Wrapping it all up

That should be everything you need to get started.

If this was interesting to you and you want to test what you learned, here are a few excercises (in relatively increasing difficulty):

1) The current boxscores only have a team_id in each table. Find a way to insert a column for a team name.

2) Throw in some data about conested/unconested rebounding. This might be interesting when looking at different factors that contribute to wins.

3) There's no player specific data in any of this. Try throwing in a column in each team's boxscores with each game's leading scorer.

4) Compare data across seasons! Is it possible to visualize changes to the Thunder offense after KD left? How different is Tom Thibideau's offense from Sam Mitchell's scheme in Minnesota?

5) Do some data science! I'd love to see what sort of interesting models can be drummed up using this infastructure.

Thanks for reading! Send your suggestions, solutions, or anything else to viraj@astronomer.io

Data Science

Data Wrangling 101: Using Python to Fetch, Manipulate & Visualize NBA Data

Step One: What data do we need?

Next step: Where is the data coming from?

1) Who was the game against? Who won?

2) How many days rest did each team have?

3) How are we going to get all this data together?

Visualization

Building the Pipe:

Wrapping it all up

Need Help in Data manipulation

data manipulation

Alteryx + PowerBI: L'intelligence de la data scien...

Column data manipulation

Data manipulation