It's been somewhat of a sharky summer these past 3 months: there was a smattering of attacks up and down the east coast of the US, professional surfer Mick Fanning had a close call at a competition in South Africa, and several beaches were closed in California after a Great White took a bite out of a surfboard! So now with the end of summer officially here (at least in the northern hemisphere), we thought it would be interesting to dig into some shark attack data. In this post, we'll look through the Global Shark Attack File, check out some of the characteristics of shark attacks, and then dive into some geo-plotting with Matplotlib Basemap.
The Github repo with relevant files for this post is here
Watch out for Landsharks!
Let's get a quick feel for our data. After downloading the Excel file from the GSAF, available here, we can read in the file, clean up a few columns, and start looking around:
import pandas as pd
import numpy as np
import re
# read in xls as a dataframe
df = pd.read_excel('GSAF5.xls')
# clean our columns
df['Activity'] = df['Activity'].str.lower()
df['Activity'] = df['Activity'].str.replace('-','')
df['Species '] = df['Species '].str.lower()
df['Country'] = df['Country'].str.upper()
df.rename(columns={'Fatal (Y/N)':'Fatal','Species ':'Species','Sex ':'Sex'}, inplace=True)
df.drop(['Case Number.1', 'original order', 'Unnamed: 21', 'Unnamed: 22','pdf', 'href formula', 'href'],inplace=True, axis=1)
df.head()
 | Case Number | Date | Year | Type | Country | Area | Location | Activity | Name | Sex | Age | Injury | Fatal | Time | Species | Investigator or Source
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | ND.0001 | 1845-1853 | 0 | Unprovoked | CEYLON (SRI LANKA) | Eastern Province | Below the English fort, Trincomalee | swimming | male | M | 15 | FATAL. "Shark bit him in half... | Y | NaN | NaN | S.W. Baker |
1 | ND.0002 | 1883-1889 | 0 | Unprovoked | PANAMA | NaN | Panama Bay 8ºN, 79ºW | NaN | Jules Patterson | M | NaN | FATAL | Y | NaN | NaN | The Sun, 10/20/1938 |
2 | ND.0003 | 1900-1905 | 0 | Unprovoked | USA | North Carolina | Ocracoke Inlet | swimming | Coast Guard personnel | M | NaN | FATAL | Y | NaN | NaN | F. Schwartz, p.23; C. Creswell, GSAF |
3 | ND.0004 | Before 1903 | 0 | Unprovoked | AUSTRALIA | Western Australia | NaN | pearl diving | Ahmun | M | NaN | FATAL | Y | NaN | NaN | H. Taunton; N. Bartlett, pp. 233-234 |
4 | ND.0005 | Before 1903 | 0 | Unprovoked | AUSTRALIA | Western Australia | Roebuck Bay | diving | male | M | NaN | FATAL | Y | NaN | NaN | H. Taunton; N. Bartlett, p. 234 |
Before we start getting into our analyses, we'll need to do a bit of cleaning...
# clean our Fatal column up
def clean_Fatal(x):
    if x == 'Y':
        return True
    elif x == 'UNKNOWN':
        return ''
    else:
        return False
df.Fatal = df.Fatal.map(clean_Fatal)
# these functions are included in the github repo
df['year'] = df['Date'].map(year)
df['month'] = df['Date'].map(month)
df['day'] = df['Date'].map(day)
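The `year`, `month`, and `day` helpers live in the repo; as a rough sketch (the regexes and names below are my own simplification, not the repo's actual code), pulling the parts out of the free-text Date column might look like:

```python
import re

# NOTE: hypothetical sketch -- the repo's real helpers may differ
def year(date_str):
    """Pull the first four-digit year (1500-2029) out of a free-text
    date like 'Reported 20-Oct-1893' or 'Before 1903'."""
    match = re.search(r'(1[5-9]\d{2}|20[0-2]\d)', str(date_str))
    return int(match.group(1)) if match else None

MONTHS = {m: i for i, m in enumerate(
    ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
     'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'], start=1)}

def month(date_str):
    """Pull a month abbreviation out of a date like '22-Mar-2007'."""
    match = re.search('|'.join(MONTHS), str(date_str))
    return MONTHS[match.group(0)] if match else None
```

Dates like "1845-1853" will resolve to the first year mentioned, and undated rows simply come back as None.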
The Activity column in this dataset is both useful and difficult. Below is a sampling of attack descriptions:
df.Activity.value_counts()
surfing 854
swimming 779
fishing 383
spearfishing 303
bathing 152
wading 134
diving 120
standing 96
snorkeling 74
scuba diving 74
body boarding 54
body surfing 47
swimming 47
boogie boarding 42
treading water 32
pearl diving 32
....
spearfishing 7
kayak fishing 7
kite surfing 6
fishing for mackerel 6
spearfishing on scuba 6
murder 6
Tragically, six people have been killed by sharks after being thrown or forced overboard:
df[(df.Activity=='murder')][['Date','Country','Activity','Injury']]
 | Date | Country | Activity | Injury
---|---|---|---|---
146 | Reported 1776 | GUINEA | murder | FATAL |
546 | Reported 20-Oct-1893 | INDIA | murder | FATAL |
3254 | 17-Mar-1984 | SOMALIA | murder | Forced at gunpoint to jump overboard. Presume... |
4720 | Oct-2006 | GULF OF ADEN | murder | FATAL, beaten & thrown overboard by smugglers... |
4752 | 22-Mar-2007 | YEMEN | murder | FATAL, beaten & thrown overboard by smugglers... |
4780 | Jul-2007 | SENEGAL | murder | NaN |
While the granularity of the Activity column is good to have, it makes it fairly difficult to group attacks by type. For example, let's say we decided that "diving" was a distinct activity type. Below we see that there are 194 different "Activities" that contain the word diving! From here, we'd need to decide whether the various types of diving are different, i.e. is "diving for Bêche-de-mer" vs "diving for trochus" different, or somehow related to the frequency in which sharks attack divers, and should therefore be considered a separate "activity".
len(df[df['Activity'].str.contains('diving')==True]['Activity'].value_counts())
194
diving 120
scuba diving 74
pearl diving 32
free diving 26
freediving 12
...
diving for trochus 9
free diving / modeling 1
pearl diving from lugger whyalla 1
diving from the lugger san, operated by the protector of the aborigines 1
diving for bechedemer 1
To better group the more common activity types, I've created a function to handle specific cases: it's messy and long, so I've included it in the repo.
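As a flavor of what that function does, here's a heavily simplified sketch of the keyword-bucketing idea (the buckets and keywords below are illustrative, not the repo's actual rules):

```python
# NOTE: an illustrative simplification -- the repo's real function is
# much longer and handles many more edge cases
def clean_activity(activity):
    """Bucket a free-text activity into a coarse category; earlier
    keywords take priority (so 'spearfishing' wins over 'fishing')."""
    if not isinstance(activity, str):
        return None
    buckets = [
        ('spearfishing', 'spearfishing'),
        ('diving', 'diving'),       # scuba, pearl, free diving, ...
        ('snorkel', 'snorkeling'),
        ('surfing', 'surfing'),     # kite surfing, body surfing, ...
        ('boarding', 'surfing'),    # body boarding, boogie boarding
        ('fishing', 'fishing'),
        ('swimming', 'swimming'),
        ('bathing', 'swimming'),
        ('wading', 'wading'),
    ]
    for keyword, bucket in buckets:
        if keyword in activity:
            return bucket
    return activity.strip()
```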
But now that we have usable activity types, we can make some plots!
# create a list of top 10 Activities where the attack type is 'Unprovoked'
top_activities = df[df.Type=='Unprovoked'].Activity.value_counts().index.tolist()[:10]
df_top_activities = df[df.Activity.isin(top_activities) & (df.year >1950) & (df.Type=='Unprovoked')].dropna(axis=0,subset=['year'])
df_top_activities.groupby(['Activity','year']).count().reset_index()
# because we have years without attacks, we need to fill them with 0's
years = range(1950, 2016)
activities = df_top_activities.Activity.unique()
import itertools
# http://stackoverflow.com/questions/12130883/r-expand-grid-function-in-python
def expandgrid(*itrs):
    product = list(itertools.product(*itrs))
    return {'Var{}'.format(i+1): [x[i] for x in product] for i in range(len(itrs))}
all_combos = expandgrid(activities, years)
all_combos = pd.DataFrame(all_combos)
all_combos.columns = ["Activity", "year"]
all_years = pd.merge(all_combos, df_top_activities, on=["year", "Activity"], how='left')
all_years = all_years.groupby(['Activity','year']).count().reset_index()
from ggplot import *
ggplot(aes(x='year', y='Case Number', color='Activity'),all_years) + geom_line() + ylab("Number of Attacks") +\
xlab("Year") + ggtitle("Shark Attacks by Activity Type")
The most significant trend we can see here is the increase in the number of attacks on surfers over the past 65 years. This is likely due to the increase in the popularity of surfing in places such as Australia, South Africa, and the USA, where sharks such as Great Whites and Bull Sharks are fairly common.
We can also filter this data to look at just fatal attacks. Comparing the two charts, we see that despite the rate of attacks on surfers and swimmers increasing over the past 65 years, fatal attacks have not increased at the same rate.
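The fatal-attacks chart is the same pipeline with one extra filter on the cleaned Fatal column; sketched here on a made-up toy frame:

```python
import pandas as pd

# toy stand-in for the cleaned dataframe: one row per attack
toy = pd.DataFrame({
    'Activity': ['surfing', 'surfing', 'swimming', 'surfing'],
    'year':     [1995, 1995, 2001, 2001],
    'Fatal':    [True, False, False, True],
})

# keep only fatal attacks, then count attacks per activity per year
fatal_counts = (toy[toy.Fatal]
                .groupby(['Activity', 'year'])
                .size()
                .reset_index(name='attacks'))
```

Swapping `fatal_counts` in for `all_years` in the ggplot call would produce the fatal-only version of the chart.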
The ocean...duh brah!...but let's get a bit more specific. As you'll note from the table above, we have columns indicating the Country, Area, and Location of the attacks, but no numerical or geo-spatial data to plot attacks with. Thankfully, the python package GeoPy has some great features for accessing APIs such as the Google Maps API to pull in coordinates of points. We'll use the text descriptions of the attack location in combination with the API to get the longitude and latitude.
import time
from geopy.geocoders import GoogleV3
from geopy.geocoders import Nominatim
geolocator = GoogleV3()
geolocator2 = Nominatim()
As a quick example, I'll query the coordinates of the Yhat office:
geolocator.geocode('66 w 39th street new york city')
Location(66 West 39th Street, New York, NY 10018, USA, (40.7526707, -73.98517, 0.0))
Bam! Fast, super easy, and accurate! Scaling this concept out, I've created a function to take the text descriptions of the attack locations, query the API for the coordinates, and then write the results back to latitude and longitude columns. I've used two geocoders here in case of a timeout or some other error; this way, we should get coordinates for each event.
# plotting unprovoked attacks in the USA post 1900
df2 = df[(df.Country=='USA') & (df.year>1900) & (df.Type=='Unprovoked')].copy()  # .copy() so we can safely add columns later
def get_location(row):
    time.sleep(.02)  # throttle our requests to the API
    loc = str(row.Location) + ', ' + str(row.Area) + ', ' + str(row.Country)
    try:
        # try our first geocoder
        location = geolocator.geocode(loc)
    except Exception:
        location = None
    if location is None:
        try:
            # fall back to our second geocoder
            location = geolocator2.geocode(loc)
        except Exception:
            return None, None
    if location is None:
        return None, None
    return location.latitude, location.longitude
# get our coordinates
df2['latitude'], df2['longitude'] = zip(*df2.apply(get_location,axis=1))
# write to a csv for future use b/c the API lookups take a longggg time
df2.to_csv('./sharks_coords.csv',index=False)
Taking a look at our results:
 | Area | Location | Country | latitude | longitude
---|---|---|---|---|---
0 | North Carolina | Somewhere between Hatteras and Beaufort | USA | 34.718217 | -76.663819 |
1 | Florida | Gadsden Point, Tampa Bay | USA | 27.822248 | -82.473150 |
2 | Florida | Palm Beach, Palm Beach County | USA | 26.705621 | -80.036430 |
3 | California | Capistrano, Orange County | USA | 33.723150 | -117.776661 |
4 | Florida | Mosquito Inlet (Ponce Inlet), Volusia County | USA | 29.096373 | -80.936998 |
To simplify things, we're only going to look at attacks in the USA. For plotting, we'll use the Matplotlib Basemap library. Matplotlib Basemap isn't as pretty as something like D3 (something to potentially revisit in a future post), but it's really easy to use and quickly lets us start visualizing our data.
# lets plot our attacks in the USA
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
%matplotlib inline
# create a new df with only locations we have coords for
dfc = df2.dropna(axis=0, subset=['latitude'])
# USA Plot
USA_map = Basemap(projection='mill', resolution = 'l', llcrnrlon=-180, llcrnrlat=2, urcrnrlon=-62, urcrnrlat=60)
plt.figure(figsize=(18,18))
x1,y1 = USA_map(dfc['longitude'].tolist(), dfc['latitude'].tolist())
USA_map.plot(x1,y1, 'bo', markersize=7)
USA_map.drawlsmask(lakes=False)
USA_map.drawcoastlines(color='gray')
USA_map.fillcontinents(color = 'lightgrey')
USA_map.drawmapboundary()
USA_map.drawstates()
USA_map.bluemarble(alpha=.2)
USA_map.drawcountries()
Nice!!! Shark attacks seem to be pretty coastal....
Looking at our map above a bit more closely, we might notice some strange behavior....LANDSHARKS!
As comical as the idea of landsharks may be, they actually do exist!
Well kinda.
One particular species, the Bull Shark, is among the only sharks that can swim in brackish water. They're commonly found in bays and canals around Sydney, Australia, have been found over 2,500 miles inland up the Amazon River, have been held responsible for nibbles in the Ganges River in India, and have even been seen in Alton, Illinois, 1,750 miles from the coast.
Frankly, it's terrifying.
But it also complicates our analyses as we can no longer assume with complete confidence that all shark attacks occur on the coast.
Despite this possibility of a "landshark" attack occurring, we can reasonably assume that the vast majority of attacks occur in the ocean. To start detecting our geographic outliers, we'll use a nearest-neighbor approach that builds on this assumption.
Note: There are tons of different ways to do this. Outlier detection is a science in itself and I encourage readers to test out other methods and models to better capture landshark outliers!
For outlier detection, we're going to use a structure called a KDTree. Put simply, a KDTree (k-dimensional tree) builds a binary search tree over k-dimensional points, which we can use to determine neighbors and the distances between them. More detail on the algorithm can be found here.
The plan is to use this algorithm to:

1. Build a tree from all of the attack coordinates.
2. Query each point for its 5 nearest neighbors.
3. Sum each point's distances to those neighbors and flag the points with the largest sums as potential outliers.
The assumption is that the landshark points are furthest from all other points, so the sum of the distances to their nearest neighbors should be largest.
from sklearn import neighbors
coords = np.column_stack((dfc.latitude,dfc.longitude))
# create our tree
tree = neighbors.KDTree(coords,leaf_size=2)
dist, ind = tree.query(coords[7:8], k=6) # look at row 7; query expects a 2-D array
print(ind[0]) # the point itself and its closest neighbors
print(sum(dist[0][1:])) # summed distance to the 5 closest neighbors, excluding the point itself
[ 7 379 251 152 420]
0.130272705615
# sum each point's distances to its 5 nearest neighbors
sums = []
for i in range(len(coords)):
    dist, ind = tree.query(coords[i:i+1], k=6)
    sums.append(sum(dist[0][1:]))
# add the sums back to our dataframe
dfc['dist_sums'] = pd.Series(sums, index=dfc.index)
# sort them so we can find the outlier-est outlier
outliers = dfc.sort_values('dist_sums', ascending=False)
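One caveat on the distances above: the query measures plain Euclidean distance on raw lat/lon degrees, which distorts east-west distances away from the equator. As a variant (not what this post uses), scikit-learn's BallTree supports a true great-circle metric if you first convert the coordinates to radians:

```python
import numpy as np
from sklearn.neighbors import BallTree

EARTH_RADIUS_KM = 6371.0

def neighbor_distance_sums(latlon_deg, k=5):
    """Sum of great-circle distances (in km) from each point to its k
    nearest neighbors, via a haversine BallTree."""
    rad = np.radians(np.asarray(latlon_deg))   # haversine expects [lat, lon] in radians
    tree = BallTree(rad, metric='haversine')
    dist, _ = tree.query(rad, k=k + 1)         # +1 because each point finds itself
    return (dist[:, 1:] * EARTH_RADIUS_KM).sum(axis=1)
```

For ranking outliers the distortion rarely changes the top hits at these latitudes, but the haversine version has the nice property of returning distances in real kilometers.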
Now that we've created our sorted outlier list, we can plot them to see how many we've nabbed.
# USA Plot
USA_map = Basemap(projection='mill', resolution = 'l', llcrnrlon=-180, llcrnrlat=2, urcrnrlon=-62, urcrnrlat=60)
plt.figure(figsize=(18,18))
# we'll plot the top 25 outliers in red and the rest in blue
x1,y1 = USA_map(outliers['longitude'][25:].tolist(), outliers['latitude'][25:].tolist())
x2,y2 = USA_map(outliers['longitude'][:25].tolist(), outliers['latitude'][:25].tolist())
USA_map.plot(x1,y1, 'bo', markersize=7, alpha=.2)
USA_map.plot(x2,y2, 'ro', markersize=7)
USA_map.drawlsmask(lakes=False)
USA_map.drawcoastlines(color='gray')
USA_map.fillcontinents(color = 'lightgrey')
USA_map.drawmapboundary()
USA_map.drawcountries()
USA_map.drawstates()
USA_map.bluemarble(alpha=.2)
Not bad. We clearly have more work to do, but this definitely gets us well on our way.
We can also look at specific areas in more detail simply by changing the longitude and latitude of the map corners.
CA_map = Basemap(projection='mill', resolution='l',
llcrnrlon=-128, llcrnrlat=33,
urcrnrlon=-113, urcrnrlat=44)
plt.figure(figsize=(18,10))
# top 10 California outliers (plotted in red) vs the rest (blue)
x1,y1 = CA_map(outliers[(outliers.Area=='California')]['longitude'][:10].tolist(),outliers[(outliers.Area=='California')]['latitude'][:10].tolist())
x2,y2 = CA_map(outliers[(outliers.Area=='California')]['longitude'][10:].tolist(),outliers[(outliers.Area=='California')]['latitude'][10:].tolist())
CA_map.plot(x2,y2, 'bo', markersize=5,alpha=.5)
CA_map.plot(x1,y1, 'ro', markersize=5)
CA_map.drawlsmask(lakes=False)
CA_map.drawcoastlines(color='gray')
CA_map.fillcontinents(color = 'lightgrey')
CA_map.drawmapboundary()
CA_map.drawstates()
CA_map.bluemarble(alpha=.2)
plt.show()
Again, we'll need to improve our outlier detection algorithm a bit, or perhaps use a different model altogether, but we've definitely made some solid progress.
Sharks are fascinating, incredible animals, and I would be remiss not to mention how dramatically disproportionate our fear of them is. Tens of millions of people go swimming, surfing, and diving every year, yet on average only 7 people are killed by sharks in unprovoked attacks each year.
Geoplotting/mapping is a huge field and there are tons of fantastic resources out there to get started plotting things.
Python: Basemap tutorials
For more advanced plotting:
Python: So You’d Like To Make a Map Using Python by Sensitive Cities
D3: Let's Make a Map, from the legendary Mike Bostock
The Source Guide to Better Mapping from Open News
For all the scripts used to make this post, visit the github repo