Data Science

Machine learning & data science for beginners and experts alike.

Have you ever wondered if having more trees in your neighborhood would stop the spread of graffiti? What about noise complaints? Or bee nests? Probably not. But we’re going to explore how to create a relationship between them with Alteryx, and see if our hypothesis is accurate.


You may have heard that correlation does NOT equal causation, and that is true. Just check out this site: it takes seemingly unrelated data sets and shows how easy it is to find some type of relationship between the two.




I wanted to see if I could replicate any of these wacky relationships using Alteryx predictive tools.


It’s a snowy day in Toronto, Canada, so what better to do than explore some online open data? Open data is a great resource for learning more about your own backyard, and I recommend trying it yourself. In my case, I searched Toronto Open Data and found two data sets: 311 Calls and Street Trees. 311 is the number you can call with a non-emergency complaint, for example, if the trash hasn’t been picked up, you have a noisy neighbor, road work needs to be done, or you spot some graffiti. Toronto also tracks the street trees planted around the city, recording each tree’s point location, genus, and the neighborhood it sits in.


Creator: Jose San Juan. Copyright: City of Toronto.


I saw an opportunity to run this data through the Linear Regression tool, to see if there was any relationship.


As always, I started by cleaning up my data. I was working with three data sets:


  • Street Trees Toronto.shp
  • Neighbourhood.shp (That will determine which neighborhood each tree falls in)
  • 311 Calls.csv


I brought in all three data sets and used a Spatial Match tool on Neighbourhoods and Street Trees to find out which trees belonged to which Neighbourhood. I kept only the resulting fields I needed, then grouped the records by Neighbourhood to count the trees in each one.
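If you want to see the idea behind the spatial match outside of Alteryx, here is a minimal Python sketch. It is not what the Spatial Match tool does internally; it assumes each neighbourhood is a simple polygon given as a list of (x, y) vertices, and the coordinates and names ("A", "B") are made up for illustration. It uses a basic ray-casting test to decide which polygon each tree point falls in, then counts trees per neighbourhood.

```python
from collections import Counter


def point_in_polygon(x, y, polygon):
    """Ray-casting test: True if (x, y) lies inside the polygon.

    `polygon` is a list of (x, y) vertices in order around the boundary.
    """
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_trees_per_neighbourhood(trees, neighbourhoods):
    """Assign each tree to the first polygon containing it and count per name."""
    counts = Counter()
    for x, y in trees:
        for name, polygon in neighbourhoods.items():
            if point_in_polygon(x, y, polygon):
                counts[name] += 1
                break
    return counts


# Hypothetical toy geometry, not real Toronto data.
neighbourhoods = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(2, 0), (4, 0), (4, 2), (2, 2)],
}
trees = [(0.5, 0.5), (1.5, 1.0), (3.0, 1.0), (5.0, 5.0)]

tree_counts = count_trees_per_neighbourhood(trees, neighbourhoods)
print(dict(tree_counts))
```

The last tree sits outside both polygons, so it is simply left out of the counts, much like an unmatched record in a spatial match.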





Next, I reshaped the 311 calls to match the same Neighbourhood format so the two data sets could be joined. Each Neighbourhood in the 311 data has a ward number attached, so I used a formula to strip it out. Then I grouped the calls by Neighbourhood and reason for the call.
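As a rough Python equivalent of that formula and grouping step, the sketch below assumes the ward number appears as a trailing "(NN)" after the name, for example "Etobicoke North (01)". That format is my assumption for illustration, so adjust the pattern to whatever your file actually contains. The call rows are invented sample values.

```python
import re
from collections import Counter


def strip_ward(label):
    """Drop a trailing '(NN)' ward number (assumed format) from a label."""
    return re.sub(r"\s*\(\d+\)\s*$", "", label).strip()


# Hypothetical sample rows: (area label with ward number, reason for call).
calls = [
    ("Etobicoke North (01)", "Graffiti"),
    ("Etobicoke North (01)", "Graffiti"),
    ("Etobicoke North (01)", "Noise"),
    ("Spadina-Fort York (10)", "Graffiti"),
]

# Group by cleaned neighbourhood name and reason, counting the calls.
call_counts = Counter((strip_ward(area), reason) for area, reason in calls)

for (area, reason), n in sorted(call_counts.items()):
    print(area, reason, n)
```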






Then the two data sets were joined.
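For readers following along in code, a join like this can be sketched as an inner join on the neighbourhood name. The dictionaries, names, and numbers below are hypothetical stand-ins for the grouped tree counts and call counts, not my actual results.

```python
# Hypothetical grouped inputs: trees per neighbourhood, calls per (neighbourhood, reason).
tree_counts = {"A": 120, "B": 45, "C": 80}
call_counts = {
    ("A", "Graffiti"): 3,
    ("A", "Noise"): 5,
    ("B", "Graffiti"): 9,
    ("D", "Noise"): 2,
}


def join_on_neighbourhood(tree_counts, call_counts):
    """Inner join: keep only neighbourhoods present in both inputs.

    Returns one row per neighbourhood with the tree count plus one
    column per call reason.
    """
    rows = {}
    for (area, reason), n in call_counts.items():
        if area in tree_counts:
            row = rows.setdefault(area, {"trees": tree_counts[area]})
            row[reason] = n
    return rows


joined = join_on_neighbourhood(tree_counts, call_counts)
print(joined)
```

Neighbourhood "C" has trees but no calls, and "D" has calls but no trees, so an inner join drops both. Whether that is what you want depends on your data, so it is worth checking what falls out of the join.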




It is always a good idea to see if there’s any relationship between data points before running them through a predictive model. The Association Analysis tool explores the strength of the correlation between two variables. A low correlation (close to 0) implies that there is almost no relationship between the two variables, and a high correlation (close to 1 or -1) implies that there is a strong one. Note: a variable will always have a correlation of 1 with itself.
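To make those numbers concrete, here is a small sketch of the Pearson correlation coefficient, one common correlation measure. I am assuming Pearson purely for illustration; the tool offers its own options. The tree and graffiti values are invented.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: tree counts and graffiti calls for five neighbourhoods.
trees = [10, 20, 30, 40, 50]
graffiti = [9, 7, 6, 3, 1]

print(pearson(trees, graffiti))  # negative: more trees, fewer graffiti calls
print(pearson(trees, trees))     # a variable against itself
```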


We are currently working with ALL 311 call types, so we are going to use the Linear Regression and Stepwise tools to see which service types are the best indicators for predicting the number of trees.


I created an estimation sample and a validation sample to test against later. A 70/30 split will work here.
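The estimation/validation idea can be sketched in a few lines of Python. This is a simplified one-variable version under my own assumptions, not the Alteryx workflow: the rows are made-up (graffiti calls, tree count) pairs that follow an exact straight line, the seed of 42 is arbitrary, and the model is plain ordinary least squares with no stepwise selection.

```python
import random


def split_70_30(rows, seed=42):
    """Shuffle a copy of the rows and split it 70% / 30%."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_line(pairs):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse(pairs, intercept, slope):
    """Root mean squared error of the fitted line on a set of pairs."""
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in pairs]
    return (sum(errors) / len(errors)) ** 0.5


# Hypothetical data: trees = 1000 - 50 * graffiti_calls, with no noise.
rows = [(calls, 1000 - 50 * calls) for calls in range(10)]

estimation, validation = split_70_30(rows)
intercept, slope = fit_line(estimation)      # fit on the 70%
validation_error = rmse(validation, intercept, slope)  # score on the 30%
print(len(estimation), len(validation), slope, intercept, validation_error)
```

Fitting on one sample and scoring on the other is what keeps the model honest: a relationship that only shows up in the estimation sample will give a poor validation error.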




Some relationships made NO sense, so those were eliminated from the analysis. What we were left with were some good indicators.




Based on my score card, trees (X) seem like a pretty decent way to combat graffiti--how about that!




To better understand whether something is a result of correlation or causation, we have to consider what other factors may be at play in creating our result. Let’s break down the definitions: a correlation is a relationship between two variables, while causation is simple cause and effect, where one action was done and, as a result, a second action followed. The key word that differentiates the two is relationship. Although there is no “one size fits all” definition, when we stumble upon a relationship we would generally consider it a correlation; further scientific research can upgrade that relationship to an actual cause.


To solidify this relationship scientifically, we need to control our environment and establish variables: an independent variable and a dependent variable. The independent variable is the piece of data that we control; we can change it or leave it as is. The dependent variable is the variable that is changed by the independent variable, or by some other factor.


In our analysis we took a look at the number of trees in a neighborhood and the number of complaints reported about graffiti. We already have a skew in our data: not everyone calls to report graffiti, so this is not a complete count of the graffiti in the city. In addition, some of these neighborhoods are more residential, which is another factor at play. When taking a look at any data set, we need to think deeper about the factors that surround the data instead of just taking it at face value. Why was this particular data set recorded? Where did it come from? What is filtered out? All in all, it doesn’t look like we have a lot of control over our data.


There are many ways to draw conclusions here. A neighbourhood with more trees would naturally have more complaints about trees blocking the road. But we can’t take this relationship on blind faith alone. Why does the number of trees in a neighbourhood seem to be correlated with graffiti but not with some of the other calls? Well, let’s take a moment to consider a street art canvas: an empty concrete wall. Neighbourhoods with more buildings tend to have less space for trees and fewer things in the way of graffiti, so more buildings and fewer trees may be the real culprit here. Whenever we are creating a relationship between fields, we must consider these outside factors, or else it will be hard to replicate the result in another environment. Some fields will not have relationships, and that’s also a result in its own right.
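The "more buildings" explanation can be illustrated with a small simulation. Everything here is synthetic and assumed for the sake of the example: buildings are generated at random, and both trees and graffiti are made to depend only on buildings plus noise, never on each other. The sketch then compares the raw correlation between trees and graffiti with the partial correlation once buildings are controlled for.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_corr(xs, ys, zs):
    """Correlation between xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic neighbourhoods: buildings drive BOTH other variables.
buildings = [rng.uniform(0, 100) for _ in range(500)]
trees = [500 - 4 * b + rng.gauss(0, 20) for b in buildings]
graffiti = [2 + 0.1 * b + rng.gauss(0, 2) for b in buildings]

raw = pearson(trees, graffiti)
controlled = partial_corr(trees, graffiti, buildings)
print("raw:", raw)
print("controlling for buildings:", controlled)
```

In this made-up world, trees and graffiti look strongly related on their own, yet the relationship largely disappears once buildings are accounted for. That is the pattern to watch for when a hidden third factor is doing the work.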


Let’s take a look at some of the other Service Request Types across Toronto, for example, animals. When we set up the model in exactly the same way as before, this time using animal calls, we see that animal calls across neighborhoods are not really a solid indicator of trees.




No relationship (Where X is the prediction):




Again, we need to consider why calls about animals are not a good indicator. As we can see, there has been only 1 call in Etobicoke North and 1 in Spadina-Fort York, yet the number of trees differs by 19k! This shows that we need to know not only where the data comes from, but also whether there is enough of it, and what underlying factors come together to build a relationship.


Try it for yourself and see what great relationships you can find!