Data Science

meljaafari · ‎08-01-2019

First of all, what is DBSCAN?

DBSCAN (density-based spatial clustering of applications with noise) is a clustering method used in machine learning to group points that are closely packed together. It separates areas with a high density of points from areas with a low density of points.

The algorithm takes two important parameters into account: the distance (ε) between points, and the minimum number of points (MinPts) that need to be within that distance of a point for it to be considered part of a cluster.

The algorithm iterates over the points in the dataset. Starting at an arbitrary unvisited point, it finds all the other points in the dataset that are within the provided distance (ε) of that point. This is the point's "neighborhood". If the neighborhood contains at least MinPts points, the point starts a cluster, and the algorithm then calculates the neighborhood of each point in it. Every neighbor whose own neighborhood also contains at least MinPts points pulls its neighbors into the cluster, and this repeats until the cluster cannot be extended anymore. A point that doesn't have enough neighbors, and isn't within ε of a point that does, forms part of the low-density area (I call them the unclustered).
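To make the expansion step more tangible, here is a minimal pure-Python sketch of the procedure described above. It is only an illustration under simple assumptions (plain Euclidean distance, a brute-force neighbor search, and helper names such as `region_query` that I made up for this example); it is not the implementation scikit-learn uses.

```python
from math import dist

NOISE = -1  # label for "unclustered" points


def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1  # i is dense enough to start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is dense too: keep expanding
    return labels


# Two tight squares of points and one isolated point
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The two squares end up in separate clusters, and the isolated point keeps the label -1, which is the same convention scikit-learn uses for unclustered points.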

K-means is one of the most popular clustering methods, but no single algorithm fits all use cases (there is no free lunch in data science!). This means there are situations where k-means just doesn't answer the question we are asking.

Looking at the example below, k-means does a good job separating points that are clearly far from each other, but as dense zones of points get close to each other, it fails to match what our human intuition tells us. Because it assigns each point to its nearest centroid, it can split up points that, to the naked eye, obviously belong together. DBSCAN, however, comes much closer to how we see things with our own eyes. Not only does it do a good job identifying areas with a high density of observations, but it also builds clusters of varying shapes and sizes.

K-means Vs DBSCAN
Other strengths of DBSCAN are that it doesn't require you to specify the number of clusters beforehand, and that it is good at finding the outliers that don't belong to any cluster.
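If you want to reproduce this kind of comparison yourself, here is a small sketch using scikit-learn on synthetic data (two interleaved half-moons plus two far-away points that I added as outliers). This is not the data behind the picture above, and the `eps` and `min_samples` values are assumptions chosen for this toy dataset only.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons

# Two interleaved half-moon shapes, plus two obvious outliers at the end
X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
X = np.vstack([X, [[3.0, 3.0], [-3.0, -3.0]]])

# k-means must be told how many clusters to look for
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# DBSCAN only needs a distance and a minimum neighborhood size
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Label -1 marks points that DBSCAN left unclustered
n_clusters = len(set(dbscan_labels) - {-1})
n_noise = int(np.sum(dbscan_labels == -1))
print("DBSCAN clusters:", n_clusters, "unclustered points:", n_noise)
print("k-means cluster sizes:", np.bincount(kmeans_labels))
```

K-means has to put every point in one of its two clusters, outliers included, while DBSCAN works out the number of clusters on its own and leaves the far-away points unclustered.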

We are going to see how to use DBSCAN directly from Alteryx and how you can leverage it on your own data.

Density-Based Clustering with Alteryx

The dataset we are going to look at is from the UK police data (publicly available here).

Looking at all incidents reported in London in 2018 and 2019, are we able to find hotspots?

Some clustering algorithms are available in Alteryx with a simple drag and drop. Unfortunately, DBSCAN is not among them. Luckily, Alteryx doesn't limit its users to the out-of-the-box tools, and lets them extend its capabilities with R and Python! I'm going to be using Python for my example, but using R is equally valid.

Why would someone use Alteryx? Why not just run the whole thing in Python?

Simplicity, ease of configuration, and the ability to share easily. Additionally, I have used some Alteryx Spatial & Preparation tools that complement the Python code and make it even stronger. I can easily format my data, I can easily input new data, and I don't need to code the whole polygon creation part.

Let's look at what it looks like:

DBSCAN Flow - pretty simple, right?

- It starts with reading the data... or with any upstream Alteryx workflow in which you would connect, manipulate, clean and transform your data.

- Next, in the Python script, the DBSCAN algorithm goes through all our latitude/longitude combinations and determines whether each one belongs to a hotspot or not (that is, it identifies the unclustered).

- Then, the Alteryx Spatial tools build polygons based on the clusters, making them more generalized and smoother.

- Finally, we combine clusters that touch each other and simplify the whole dataset into one spatial object representing all the hotspots.
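In my workflow the polygons come from the Alteryx Spatial tools, so no code is needed for that step. For readers who are curious what a polygon-building step could look like in Python, here is a rough sketch that wraps each cluster in a convex hull using SciPy. The convex hull is my assumption for illustration only; the Alteryx tools may build and smooth their polygons differently, and `cluster_hulls` is a made-up helper name.

```python
import numpy as np
from scipy.spatial import ConvexHull


def cluster_hulls(coords, labels):
    """Map each cluster label to the convex-hull vertices of its points."""
    coords = np.asarray(coords, dtype=float)
    labels = np.asarray(labels)
    hulls = {}
    for label in np.unique(labels):
        if label == -1:
            continue  # unclustered points do not get a polygon
        members = coords[labels == label]
        if len(members) < 3:
            continue  # a 2-D hull needs at least three points
        # Note: ConvexHull raises an error if all members lie on one line
        hull = ConvexHull(members)
        hulls[int(label)] = members[hull.vertices]  # outline of the cluster
    return hulls


# One square cluster with an interior point, plus one unclustered point
coords = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
labels = [0, 0, 0, 0, 0, -1]
hulls = cluster_hulls(coords, labels)
print(hulls[0])
```

Only the four corners of the square come back as the outline; the interior point and the unclustered point are left out.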

DBSCAN Hotspots.png

Is the Python Script really that simple?

The Python code takes the latitude and longitude from an input dataset, assigns a cluster to each point, and returns it along with a flag saying whether the point is part of a hotspot or not. Feel free to play around!

# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
from ayx import Alteryx
import numpy as np
from sklearn.cluster import DBSCAN

# Read the file and define latitude and longitude as geographical data
Input_data=Alteryx.read("#1")
LOC=np.column_stack([np.radians(Input_data['Latitude']),np.radians(Input_data['Longitude'])])

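# Run the DBSCAN model with min_samples of 100 incidents; we chose the haversine metric so that the great-circle distance is calculated.
# With haversine, eps is an angle in radians: 0.000015 rad x 6,371 km (mean Earth radius) is roughly 96 meters (about 600 meters would need eps of about 0.000094)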
Cluster = DBSCAN(eps=0.000015, min_samples=100, metric='haversine').fit(LOC)

# If the label is -1, then it's an unclustered data point, and therefore part of a safe zone
Hotspot=np.where(Cluster.labels_ == -1, 'No', 'Yes')

# Creating the output file and writing to Anchor 1
Input_data['Cluster']=Cluster.labels_
Input_data['Hotspot']=Hotspot
Alteryx.write(Input_data,1)
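A note on choosing eps: with the haversine metric, scikit-learn expects coordinates in radians and measures distances as angles in radians, so a distance in meters has to be divided by the Earth's radius first. The sketch below shows that conversion; the 6,371 km mean radius is a common approximation that I am assuming here, and the two helper names are made up for this example.

```python
EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)


def meters_to_radians(meters):
    """Convert a ground distance in meters to an angle in radians."""
    return meters / EARTH_RADIUS_M


def radians_to_meters(radians):
    """Convert an angle in radians back to a ground distance in meters."""
    return radians * EARTH_RADIUS_M


# What the eps used in the script corresponds to on the ground
print(radians_to_meters(0.000015))

# The eps you would pass to DBSCAN for a 600 meter radius
print(meters_to_radians(600))
```

You could then write something like `eps=meters_to_radians(600)` instead of hard-coding the number, which keeps the intent readable when you tune the radius.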

Now I can use the result of my Python script to assess whether the next place I might live in falls in a crime hotspot or not. I can filter the data to Bicycle Theft only, and use the workflow to determine which areas I could safely park my bike in.

And that's the strength of Alteryx combined with Python!

I don't need to go back to the code anymore. I can just play around with all the elements around it and make sure it answers the questions my peers are asking. My users don't need to know Python at all. I can publish my workflow and share it with them; as they connect it to their own files or update the input, the workflow will run smoothly and return an answer to the question they are asking.

Alteryx and Python are meant to work together, and DBSCAN is just one example of how smoothly that integration works.

Attached is the full workflow; we'd love to hear about how else you use such models in your work.


Partitioning Spatial Data with DBSCAN