Data Science

ColinR · ‎09-14-2015

It's been somewhat of a sharky summer these past 3 months: there were a smattering of attacks up and down the east coast of the US, professional surfer Mick Fanning, had a close call at a competition in South Africa, and several beaches were closed in California after a Great White took a bite out of a surfboard! So now with the end of summer officially here (at least in the northern hemisphere), we thought it would be interesting to dig into some shark attack data. In this post, we'll look through the Global Shark Attack File, checkout some of the characteristics of shark attacks and then dive in to some geo-plotting with Matplotlib Basemap.

The Github repo with relevant files for this post is here

Watch out for Landsharks!

The Data

Let's get a quick feel for our data. After downloading the Excel file from the ISAF, available here we can read in the file and start looking around. Some important things to note:

There are 5759 events in the dataset
Attacks date back to 1845
Attack Types and counts are:
- Unprovoked - 4239
- Provoked - 521
- Invalid - 489
- Boating - 291
- Sea Disaster - 219
1527 Fatal Attacks
The database is actively maintained by researchers at the ISAF

import pandas as pd
import numpy as np
import re

# read in xls as a dataframe
df = pd.read_excel('GSAF5.xls')

# clean our columns
df['Activity'] = df['Activity'].str.lower()
df['Activity'] = df['Activity'].str.replace('-','')
df['Species '] = df['Species '].str.lower()
df['Country'] = df['Country'].str.upper()

df.rename(columns={'Fatal (Y/N)':'Fatal','Species ':'Species','Sex ':'Sex'}, inplace=True)
df.drop(['Case Number.1', 'original order', 'Unnamed: 21', 'Unnamed: 22','pdf', 'href formula', 'href'],inplace=True, axis=1)

df.head()

	Case Number	Date	Type	Country	Area	Location	Activity	Name	Sex	Age	Injury	Fatal	Time	Species	Investigator or Source
0	ND.0001	1845-1853	Unprovoked	CEYLON (SRI LANKA)	Eastern Province	Below the English fort, Trincomalee	swimming	male	M	15	FATAL. "Shark bit him in half...	Y	NaN	NaN	S.W. Baker
1	ND.0002	1883-1889	Unprovoked	PANAMA	NaN	Panama Bay 8ºN, 79ºW	NaN	Jules Patterson	M	NaN	FATAL	Y	NaN	NaN	The Sun, 10/20/1938
2	ND.0003	1900-1905	Unprovoked	USA	North Carolina	Ocracoke Inlet	swimming	Coast Guard personnel	M	NaN	FATAL	Y	NaN	NaN	F. Schwartz, p.23; C. Creswell, GSAF
3	ND.0004	Before 1903	Unprovoked	AUSTRALIA	Western Australia	NaN	pearl diving	Ahmun	M	NaN	FATAL	Y	NaN	NaN	H. Taunton; N. Bartlett, pp. 233-234
4	ND.0005	Before 1903	Unprovoked	AUSTRALIA	Western Australia	Roebuck Bay	diving	male	M	NaN	FATAL	Y	NaN	NaN	H. Taunton; N. Bartlett, p. 234

Before we start getting into our analyses, we'll need to do a bit of cleaning...

# clean our Fatal column up
def clean_Fatal(x😞
    if x == 'Y':
        return True
    elif x == 'UNKNOWN':
        return ''
    else:
        return False

df.Fatal = df.Fatal.map(clean_Fatal)

# these functions are included in the github repo
df['year'] = df['Date'].map(year)
df['month'] = df['Date'].map(month)
df['day'] = df['Date'].map(day)

Attacks by Activity

The activity column in this dataset is both useful but difficult. Below is a smattering of attack descriptions:

    df.Activity.value_counts()

    surfing              854
    swimming             779
    fishing              383
    spearfishing         303
    bathing              152
    wading               134
    diving               120
    standing              96
    snorkeling            74
    scuba diving          74
    body boarding         54
    body surfing          47
    swimming              47
    boogie boarding       42
    treading water        32
    pearl diving          32
    ....
    spearfishing           7
    kayak fishing          7
    kite surfing           6
    fishing for mackerel   6
    spearfishing on scuba  6
    murder                 6

Tragically, six people have been killed by sharks by being thrown or forced overboard:

df[(df.Activity=='murder')][['Date','Country','Activity','Injury']]

	Date	Country	Activity	Injury
146	Reported 1776	GUINEA	murder	FATAL
546	Reported 20-Oct-1893	INDIA	murder	FATAL
3254	17-Mar-1984	SOMALIA	murder	Forced at gunpoint to jump overboard. Presume...
4720	Oct-2006	GULF OF ADEN	murder	FATAL, beaten & thrown overboard by smugglers...
4752	22-Mar-2007	YEMEN	murder	FATAL, beaten & thrown overboard by smugglers...
4780	Jul-2007	SENEGAL	murder	NaN

While the granularity of the Activity column is good to have, it makes it fairly difficult to group attacks by type. For example, let's say we decided that "diving" was a distinct activity type. Below we see that there are 194 different "Activities" that contain the word diving! From here, we'd need to decide whether the various types of diving are different, i.e. is "diving for Bêche-de-mer" vs "diving for trochus" different, or somehow related to the frequency in which sharks attack divers, and should therefore be considered a separate "activity".

Trochus vs. some nice, dried Bêche-de-mer (or Sea Cucumbers for our non-French readers)

    len(df[df['Activity'].str.contains('diving')==True]['Activity'].value_counts())

    194

    diving                                                                 120
    scuba diving                                                            74
    pearl diving                                                            32
    free diving                                                             26
    freediving                                                              12
    ...
    diving for trochus                                                       9
    free diving / modeling                                                   1
    pearl diving from lugger whyalla                                         1
    diving from the lugger san, operated by the protector of the aborigines  1
    diving for bechedemer                                                    1

To better group the more common activity types, I've created a function to handle for specific cases: its messy and long, so I've included it in the repo.

But now that we have usable activity types, we can make some plots!

# create a list of top 10 Activities where the attack type is 'Unprovoked'
top_activities = df[df.Type=='Unprovoked'].Activity.value_counts().index.tolist()[:10]
df_top_activities = df[df.Activity.isin(top_activities) & (df.year >1950) & (df.Type=='Unprovoked')].dropna(axis=0,subset=['year'])
df_top_activities.groupby(['Activity','year']).count().reset_index()

# because we have years without attacks, we need to fill them with 0's
years = range(1950, 2016)
activities = df_top_activities.Activity.unique()

import itertools

# http://stackoverflow.com/questions/12130883/r-expand-grid-function-in-python
def expandgrid(*itrs😞
   product = list(itertools.product(*itrs))
   return {'Var{}'.format(i+1):[x[i] for x in product] for i in range(len(itrs))}

all_combos = expandgrid(activities, years)
all_combos = pd.DataFrame(all_combos)
all_combos.columns = ["Activity", "year"]
all_years = pd.merge(all_combos, df_top_activities, on=["year", "Activity"], how='left')
all_years = all_years.groupby(['Activity','year']).count().reset_index()

from ggplot import *

ggplot(aes(x='year', y='Case Number', color='Activity'),all_years) + geom_line() + ylab("Number of Attacks") +\
    xlab("Year") + ggtitle("Shark Attacks by Activity Type")

The most significant trend we can see here is the increase in the number of attacks on surfers over the past 65 years. This is likely due to the increase in the popularity in surfing in places such as Australia, South Africa, and the USA, where sharks such as Great Whites and Bull Sharks are fairly common.

We can also filter this data to look just at fatal attacks. Based on the two charts, we see that despite rate of attacks on surfers and swimmers increasing over the past 65 years, fatal attacks have not increased at the same rate.

Where do shark attacks occur?

The ocean...duh brah!...but let's get a bit more specific. As you'll note from the table above, we have columns indicating the Country, Area and Location of the attacks, but no numerical or geo-spacial data to plot attacks with. Thankfully, the python package GeoPy has some great features for accessing API's such as the Google Maps API to pull in coordinates of points. We'll use the text descriptions of the attack location in combination with the API to get the longitude and latitude.

import time
from geopy.geocoders import GoogleV3
from geopy.geocoders import Nominatim
geolocator = GoogleV3()
geolocator2 = Nominatim()

As a quick example, I'll query the coordinates of the Yhat office:

geolocator.geocode('66 w 39th street new york city')

Location(66 West 39th Street, New York, NY 10018, USA, (40.7526707, -73.98517, 0.0))

Bam! Fast, super easy and accurate! Scaling this concept out, I've created a function to take the text descriptions of the attack locations, query the API for the coordinates and then write the results back to a latitude and longitude column. I've used two geocoders here in case of a timeout or some sort of error. This way, we should get a coordinates of each event.

# plotting unprovoked attacks in the USA post 1900
df2 = df[(df.Country=='USA') & (df.year>1900) & (df.Type=='Unprovoked')]

def get_location(row😞
    time.sleep(.02)
    loc = str(row.Location) + ', ' + str(row.Area) + ', ' + str(row.Country)
    for _ in range(1😞
        try:
            # try our first geocoder
            location = geolocator.geocode(loc)
            print(location)
            print(row.index)
            if not location==None:
                lat = location[1][0]
                long = location[1][1]
                return lat, long
            else:
                try:
                    # try our second geocoder if first one fails
                    location = geolocator.geocode2(loc)
                    print(location)
                    print(row.index)
                    if not location==None:
                        lat = location[1][0]
                        long = location[1][1]
                        return lat, long
                    else:
                        return None, None
                except:
                    return None, None
                    continue
        except:
            return None, None
            continue
        return row.latitude, row.longitude

# get our coordinates
df2['latitude'], df2['longitude'] = zip(*df2.apply(get_location,axis=1))

# write to a csv for future use b/c the API lookups take a longggg time
df2.to_csv('./sharks_coords.csv',index=False)

Taking a look at our results:

	Area	Location	Country	latitude	longitude
0	North Carolina	Somewhere between Hatteras and Beaufort	USA	34.718217	-76.663819
1	Florida	Gadsden Point, Tampa Bay	USA	27.822248	-82.473150
2	Florida	Palm Beach, Palm Beach County	USA	26.705621	-80.036430
3	California	Capistrano, Orange County	USA	33.723150	-117.776661
4	Florida	Mosquito Inlet (Ponce Inlet), Volusia County	USA	29.096373	-80.936998

Now for some geo-plotting

To simplify things, we're only going to look at attacks in the USA. For plotting, we'll use the Matplotlib Basemap library. Matplotlib Basemap isn't as pretty as something like D3, (something to potentially revisit in a future post), but it's really easy to use and quickly lets us start visualizing our data.

# lets plot our attacks in the USA
from mpl_toolkits.basemap import Basemap
import matplotlib.pyplot as plt
%matplotlib inline

# create a new df with only locations we have coords for 
dfc = df2.dropna(axis=0, subset=['latitude'])

# USA Plot
USA_map = Basemap(projection='mill', resolution = 'l', llcrnrlon=-180, llcrnrlat=2, urcrnrlon=-62, urcrnrlat=60)

plt.figure(figsize=(18,18))

x1,y1 = USA_map(dfc['longitude'].tolist(), dfc['latitude'].tolist())
USA_map.plot(x1,y1, 'bo', markersize=7)

USA_map.drawlsmask(lakes=False)
USA_map.drawcoastlines(color='gray')
USA_map.fillcontinents(color = 'lightgrey')
USA_map.drawmapboundary()
USA_map.drawstates()
USA_map.bluemarble(alpha=.2)
USA_map.drawcountries()

Nice!!! Shark attacks seem to be pretty coastal....

Dealing with Landsharks!

Looking at our map above a bit more closely, we might notice some strange behavior....LANDSHARKS!

Landsharks, as seen in the 1975 SNL sketch

As comical as the idea of landsharks may be, they actually do exist!

Well kinda.

A specific species of shark, the Bull Shark is one of the only species of sharks that can swim in brackish water. They're commonly found in bays and canals around Sydney, Australia, have been found over 2,500 miles inland up the Amazon River, been held responsible for nibbles in the Ganges River in India, and even been seen in Alton, Illinois, 1,750 miles from the coast.

Frankly, its terrifying.

But it also complicates our analyses as we can no longer assume with complete confidence that all shark attacks occur on the coast.

Despite this possibility of a "landshark shark attack" occurring, we can reasonably assume that the vast majority of attacks occur in the ocean. To start detecting our geographic outliers, we'll use a clustering algorithm that builds on this assumption.

Note: There are tons of different ways to do this. Outlier detection is a science in itself and I encourage readers to test out other methods and models to better capture landshark outliers!

For outlier detection, we're going to use an algorithm called KDTree. Put simply, KDTree (k-dimensional tree), builds a k-dimensional binary search tree, that we can use to determine neighbors and distances between neighbors. More detail on the algorithm can be found here.

The plan is to use this algorithm to:

Create a tree of N-nearest neighbors
Create a vector with the distance to N-nearest neighbors for each attack
Sum the distances and sort them to determine the points furthest from all others

The assumption is that the landshark points are furthest from all other points, so the sum of their nearest neighbors should be largest.

from sklearn import neighbors

coords = np.column_stack((dfc.latitude,dfc.longitude))

# create our tree
tree = neighbors.KDTree(coords,leaf_size=2)
dist, ind = tree.query(coords[7], k=6) # look at row 7    
print(ind[0])  # indices of 5 closest neighbors
print(sum(dist[0][1:]))  # distances to 5 closest neighbors, excluding the point itself

[  7 379 251 152 420]
0.130272705615

# to add the sum of distances, we'll loop through our list 
sums = []
for i in range(len(coords)):
    dist, ind = tree.query(coords[i], k=6)
    sums.append(sum(dist[0][1:]))

# add the sums back to our dataframe
dfc['dist_sums'] = pd.Series(sums, index=dfc.index)

#sort them so we can find the outlier-est outlier
outliers = dfc.sort(['dist_sums'], ascending=False)

Now that we've create our sorted outlier list, we can plot them to see how many we've nabbed.

# USA Plot
USA_map = Basemap(projection='mill', resolution = 'l', llcrnrlon=-180, llcrnrlat=2, urcrnrlon=-62, urcrnrlat=60)

plt.figure(figsize=(18,18))

# we'll plot the top 25 outliers in red and the rest in blue
x1,y1 = USA_map(outliers['longitude'][25:].tolist(), outliers['latitude'][25:].tolist())
x2,y2 = USA_map(outliers['longitude'][:25].tolist(), outliers['latitude'][:25].tolist())

USA_map.plot(x1,y1, 'bo', markersize=7, alpha=.2)
USA_map.plot(x2,y2, 'ro', markersize=7)

USA_map.drawlsmask(lakes=False)
USA_map.drawcoastlines(color='gray')
USA_map.fillcontinents(color = 'lightgrey')
USA_map.drawmapboundary()
USA_map.drawcountries()
USA_map.drawstates()
USA_map.bluemarble(alpha=.2)

Not bad. We've clearly more work to do but this definitely gets us well on our way.

We can also look specific areas in more detail simply by changing the longitude and latitude of the map corners.

CA_map = Basemap(projection='mill', resolution='l',
llcrnrlon=-128, llcrnrlat=33,
urcrnrlon=-113, urcrnrlat=44)

plt.figure(figsize=(18,10))

# coords for fatal attacks 
x1,y1 = CA_map(outliers[(outliers.Area=='California')]['longitude'][:10].tolist(),outliers[(outliers.Area=='California')]['latitude'][:10].tolist())
x2,y2 = CA_map(outliers[(outliers.Area=='California')]['longitude'][10:].tolist(),outliers[(outliers.Area=='California')]['latitude'][10:].tolist())


CA_map.plot(x2,y2, 'bo', markersize=5,alpha=.5)
CA_map.plot(x1,y1, 'ro', markersize=5)

CA_map.drawlsmask(lakes=False)
CA_map.drawcoastlines(color='gray')
CA_map.fillcontinents(color = 'lightgrey')
CA_map.drawmapboundary()
CA_map.drawstates()
CA_map.bluemarble(alpha=.2)

plt.show()

Again, we'll need to improve our outlier detection algorithm a bit, or perhaps use a different model altogether, but we've definitely made some solid progress.

Final Thoughts

Sharks are fascinating, incredible animals and I would be remiss to not mention in this post how dramatically disproportionate our fear of them is. Tens of millions of people go swimming, surfing, and diving every year and on average, only 7 people are killed by sharks every year in unprovoked attacks.

Geoplotting/mapping is a huge field and there are tons of fantastic resources out there to get started plotting things.

Python: Basemap tutorials

For more advanced plotting:

Python: So You’d Like To Make a Map Using Python by Sensitive Cities

D3: Let's Make a Map,from the legendary Mike Bostock

The Source Guide to Better Mapping from Open News

For all the scripts used to make this post, visit the github repo