Maddie: [00:00:11] This is Alter Everything, a podcast about data science and analytics culture. I'm Maddie Johannsen, your producer for this episode. Today we will be “dipping a toe” into the pool of causal inference. Alteryx Sr. Community Content Engineer, Sydney Firmin, will be guiding me through her conversations that she had with two experts in causality, Amit Sharma and Victor Veitch. There's a lot to unpack here, but don't panic, we’ll walk you through it.
Sydney: [00:00:45] You may have heard the phrase “correlation does not imply causation” in an introductory statistics or fundamentals of science class. It's a saying frequently repeated by statisticians and it's an important point. If we were to take correlations at face value, we would end up believing that a rooster crowing causes the sun to rise or that ice cream causes sunburns because the two seem to happen together frequently. A lot of the time, correlations can be nonsense when it comes to thinking about cause and effect relationships. However, because statistics is a field often tasked with supporting scientific inquiry, statisticians are often tasked with finding answers to causal questions like does adding a fertilizer to crops cause them to grow better or does smoking cigarettes cause lung cancer?
Maddie: [00:01:35] So Sydney, if correlation does not imply causation, what does? How can we research these causal questions to get meaningful answers?
Sydney: [00:01:44] Historically, randomized controlled trials are how statisticians have approached answering causal questions. A randomized controlled trial is where you assign a treatment or intervention to some individuals or areas in your crop field and leave the other individuals or areas untreated as controls. The hope is that by randomly splitting up your treatments and controls, or areas where you don't do anything, you can kind of trick nature by leveraging the randomness to cancel out any noise or random fluctuations in your data and find a real causal relationship. In situations where you can't set up a randomized controlled trial, the answer is causal inference. Recently causal inference has been worked on in both statistics and computer science. Last year in 2018, Judea Pearl and science journalist Dana Mackenzie published a book called “The Book of Why: The New Science of Cause and Effect.” If you haven't heard of Judea Pearl’s work before, he is a Turing Award winner, which is pretty much the Nobel Prize of computer science. He's pretty much a pioneer in AI research and his recent book is an entry point for his work on causal inference.
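To make the randomization idea concrete, here is a minimal Python sketch of the fertilizer example. Everything in it is an illustrative assumption rather than real data: the function name `simulate_rct`, the hidden `soil_quality` variable, and the made-up true effect of 2.0 yield units. It only aims to show that a coin-flip assignment lets a simple difference in average yield land close to the true effect, even though soil quality is never measured.

```python
import random
import statistics

def simulate_rct(n_plots=10_000, true_effect=2.0, seed=0):
    """Simulate a randomized fertilizer trial on hypothetical crop plots."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n_plots):
        soil_quality = rng.gauss(0, 1)        # unobserved difference between plots
        gets_fertilizer = rng.random() < 0.5  # coin flip, independent of soil quality
        crop_yield = (10 + 3 * soil_quality
                      + (true_effect if gets_fertilizer else 0.0)
                      + rng.gauss(0, 1))      # measurement noise
        (treated if gets_fertilizer else control).append(crop_yield)
    # Randomization balances soil quality across the two groups on average,
    # so the difference in mean yield estimates the fertilizer effect.
    return statistics.mean(treated) - statistics.mean(control)

estimate = simulate_rct()
print(f"estimated effect: {estimate:.2f} (simulated true effect: 2.0)")
```

With fewer plots the estimate gets noisier, which is one reason trial size matters.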
Amit: [00:03:00] So yes, I think I am very familiar with Judea Pearl's work. In fact, his work I believe is the foundation of a lot of the recent advances that we are seeing in computer science because he has been working on the intersection of causality and artificial intelligence for I think about 20 to 25 years before it was even sort of thought about as a problem to be considered in machine learning and AI. So, in fact, when I started asking these questions about recommendation systems, very focused questions, I would always go back to many of his papers which were written and unfortunately forgotten in the 1990s and find insights that applied to these practical problems.
Sydney: [00:03:54] That was Dr. Amit Sharma, a researcher at Microsoft India whose work is centered around causal inference, data science and societal impact.
Amit: [00:04:04] So yeah, at least in my research, his work has been very crucial in trying to make sense of causality and especially in terms of the systems that we are building today.
Maddie: [00:04:15] Sydney, let's take a step back. Can you explain what causal inference is?
Sydney: [00:04:21] Yeah, causal inference is the study of understanding when one thing causes another thing. If it feels really simple or intuitive to think about causal inference, it's because it's kind of how we as humans are predisposed to think. We're also able to more or less automatically filter out correlations from causation.
Maddie: [00:04:45] Gotcha. So, Dr. Judea Pearl, the Turing Award winner that you and Amit referenced, again, is this pioneer in AI research and expert in causality, correct?
Sydney: [00:04:56] Yeah. So, Dr. Pearl lays out this metaphorical ladder of causation where each rung is a different way to see the world or ask and answer questions. Throughout his book, he uses the ladder to explain and contextualize causal inference. There are three rungs on the ladder of causation. There’s seeing, which equates to association or correlation. There's doing, which he calls the intervention step, and then there's imagining, which is where counterfactuals come into play. He describes each step on the ladder with different examples. So the first step is like an owl or AI as we know it today where you can find correlations or associations just through observation and just observation can get pretty effective outcomes. Like an owl can be a master hunter because it knows what time the stuff it likes to eat comes out. So mice come out at night. The owl doesn't know why the mice are out at that time. It just knows that's when they're there and that's enough to catch the prey. AI and machine learning today work that way as well, they identify patterns in a provided data set to deduce relationships from those patterns.
Maddie: [00:06:19] One thing about the ladder - so you talk about counterfactuals later on, which is the third rung...
Sydney: [00:06:30] Yeah, yeah, that's the highest branch of causal inference, when you can imagine things if they had gone another way. So like if I had gone to college for computer science instead of environmental studies, maybe I'd be a software engineer now as opposed to what actually happened. So that's all a counterfactual is: it's like we know what happened in fact, but what if it didn't happen?
Maddie: [00:06:58] Yeah, I feel like I "what if" all the time.
Sydney: [00:07:00] Totally. I think it's totally how people think and it's how we learn and understand causation. It's because we can imagine the what ifs and kind of have an understanding on what those what ifs would have played out as that we're able to understand cause.
Maddie: [00:07:21] So historically, how does this fit in with statistics or data science? How did this come to be a part of the sciences?
Sydney: [00:07:35] Yeah. It's been a part of philosophy for a really long time. Hume, who if people read my writing, they know he's probably my favorite western philosopher of all time because he's just questioning everything all the time, but he makes a point talking about causality. As a condition, if the first object had not been, the second never had existed. And so that's like a clear causal statement and it's spoken through the language of counterfactuals - like if this hadn't happened this wouldn't be here. And so like it's been a part of how we think as humans in philosophy for a long time, but it hasn't really been a part of mathematics or statistics for a long time like historically. In statistics there are statisticians that deal with causality and I think that's becoming more and more prevalent.
Victor: [00:08:38] So people thought about this classically a long time ago.
Sydney: [00:08:42] That was Dr. Victor Veitch. He's a postdoc at Columbia who focuses on researching machine learning and causal inference. He was also featured in episode 43.
Victor: [00:08:53] In some sense, you know causality or causal inference is just like really a more fundamental discipline than statistics or data science. I mean like looking at associations in the world and then inferring causes is sort of like the natural mode of human cognition. And you know the fact that it showed up like this historical quirk that we worked with that we sort of figured out how to mathematize this only like, you know later on in the development of science, it shouldn't be mistaken for like a fundamental fact about it - its ordering or its importance. By which I just mean like, you know causality is maybe the foundation of the whole thing. And you know, it's true we're only scratching the surface now in terms of how to study it. But my guess is that we may eventually see things like data science as outcroppings of causality rather than vice versa.
Sydney: [00:09:52] The way Victor described it in his interview is that it's kind of a historical quirk that we don't have an established mathematical language or scientific tool or scientific process for approaching causal relationships. Something Dr. Pearl kind of offhandedly mentions in his book is that because of how natural it is for us to think about cause and effect just in our day-to-day lives we might not have needed a mathematical or scientific approach for handling causality. He says scientific tools come from scientific need and because we didn't necessarily need a language to answer simple causal questions one wasn't developed. Later on when the need became more apparent, kind of early statistics time, there were some individuals within the field that kind of banished causality and that's why it wasn't really handled by statistics until the 80s, but Dr. Pearl's work - a lot of it is building models like around what you're trying to get out of your data. And so it's really more of a qualitative thought process. It's like sitting down and thinking like, well, what do I know about how this functions in the real world and how can I use what I know to isolate cause and effect? And then using what I know in this like written out diagram of cause and effect, how can I isolate it in my data?
Maddie: [00:11:27] Yeah, I mean like you're saying it's just kind of like a natural thing and with mathematics and the history of it - it would make sense that it would have fit in at some point to you know - I just think of math as being so logical and with the way that we think about, you know, like the what ifs it almost feels kind of illogical to me, you know, it's illogical for me to say - Well if I had done this, if I had majored in computer science like your example earlier, then you know, maybe you'd be a software engineer now, it's illogical to think that way because -
Sydney: [00:12:05] It didn't happen, you can't go back and change it.
Maddie: [00:12:08] Yeah, and maybe I'm thinking of it as like a past thing and instead of you know, like trying to predict things, which maybe that's what the third rung is trying to do.
Sydney: [00:12:19] It's cool because in a way it's less about using patterns to predict things and more about understanding and learning from what's happened in the past so you might be able to understand what will happen in the future if a different decision is made or a different treatment is applied. Causal inference is about being able to work with your data in a way where you can draw causal conclusions without the setup of a randomized controlled trial, something that classical statistics isn't comfortable with. At least that had always been my perception.
Victor: [00:12:55] That was also my impression until recently but it doesn't really seem to be true. So like particularly, you know the work of Don Rubin starting in the 1980s, which was the potential outcomes approach to causal inference has been extraordinarily influential. It has huge ramifications in econometrics and epidemiology and you know it's a thing which is taught regularly at statistics departments. So I think the idea that causal inference per se has been shunned by classical statistics is also not correct. We already had the causal revolution in statistics and it happened in the 1980s. Now, it's true that Pearl's work was slower to gain attention or slower to gain acceptance.
Sydney: [00:13:47] If anyone is interested in learning more about the similarities and differences between the Don Rubin framework for causal inference versus Judea Pearl, check out our show notes. There are a lot of great resources linked.
Victor: [00:14:03] I think of causal inference like the interesting problem of causal inference is of course like you just see a bunch of associations in the world...
Sydney: [00:14:13] In statistics an association is a correlation and it's pretty much just seeing that two things seem to happen in a relationship with each other - that if one thing happens the other thing seems to happen too. Like the sun sets the critters come out. The sun rises the rooster crows. Like these things are associated with one another.
Victor: [00:14:35] ...you just see a bunch of associations in the world and you would like to somehow take those associations and deduce a causal relationship. And of course the difficult thing about this problem is that it's impossible because as everybody learns very early on in life association is not causation.
Sydney: [00:15:01] So for example, the sun goes down and the critters come out. That's probably causation because the sunset for the critters means that it's a good time to come out of your hidey-hole and find food. But we know that because we know something about the ecology of those critters they come out at night maybe because it's safer for them from predators. With the rooster crowing and then the sun rising we know that's an association or correlation - not a causal relationship. At least we know that the rooster crowing does not cause the sun to rise. I think spurious correlations are the major concern behind the phrase correlation does not imply causation. It's dangerous to see that two things are related and then assume that that means one thing causes the other. It could be that they're both caused by a common thing or that they just happen to match up.
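The "both caused by a common thing" case can be illustrated with a small simulation of the ice cream and sunburn example from earlier in the episode. This is only a sketch with invented numbers: sunny weather is assumed to drive both ice cream sales and sunburns, and neither affects the other. The helper `pearson` and all variable names are hypothetical.

```python
import random
import statistics

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

rng = random.Random(1)
days = []
for _ in range(5000):
    sunny = rng.random() < 0.5                          # the common cause
    ice_cream = (20 if sunny else 5) + rng.gauss(0, 3)  # depends only on the weather
    sunburns = (8 if sunny else 1) + rng.gauss(0, 2)    # depends only on the weather
    days.append((sunny, ice_cream, sunburns))

# Across all days the two series move together strongly.
overall = pearson([d[1] for d in days], [d[2] for d in days])

# Among sunny days only, the common cause is held fixed and the link disappears.
sunny_days = [d for d in days if d[0]]
within_sunny = pearson([d[1] for d in sunny_days], [d[2] for d in sunny_days])

print(f"all days: {overall:.2f}, sunny days only: {within_sunny:.2f}")
```

The strong overall correlation comes entirely from the weather, which is why it vanishes once the weather is held constant.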
Victor: [00:16:03] Causality as a field says “well if you're just willing to assume a little bit more...” so put in some information that the data doesn't give you but which reflects your understanding of how the world works or what things might be going on, then indeed, you can pass from association all the way into causation.
Okay, and the various tools for doing this like how to articulate the assumptions that you need to make your inferences causal and having articulated those assumptions, what things you actually compute - this is like what I think of as the province of causal inference.
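One way to picture "assume a little bit more, then compute" is adjustment by stratification. The sketch below revisits the fertilizer example, but without randomization: in this invented scenario the farmer fertilizes good-soil plots more often. The extra assumption, which the data alone cannot verify, is that soil type is the only common cause of treatment and yield. All names and numbers are hypothetical, and this is one simple technique among many, not a method attributed to the speakers.

```python
import random
import statistics

def simulate_observational(n=20_000, true_effect=2.0, seed=2):
    """Hypothetical non-randomized data: good soil gets fertilizer more often."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        good_soil = rng.random() < 0.5
        fertilized = rng.random() < (0.8 if good_soil else 0.2)
        crop_yield = (10 + (5 if good_soil else 0)
                      + (true_effect if fertilized else 0.0)
                      + rng.gauss(0, 1))
        rows.append((good_soil, fertilized, crop_yield))
    return rows

def mean_difference(rows):
    """Average yield of fertilized plots minus average yield of the rest."""
    treated = [y for _, t, y in rows if t]
    control = [y for _, t, y in rows if not t]
    return statistics.mean(treated) - statistics.mean(control)

def adjusted_difference(rows):
    """Compare like with like inside each soil type, then average by stratum size."""
    total = 0.0
    for soil in (True, False):
        stratum = [r for r in rows if r[0] == soil]
        total += mean_difference(stratum) * len(stratum) / len(rows)
    return total

rows = simulate_observational()
naive = mean_difference(rows)        # mixes the soil effect into the answer
adjusted = adjusted_difference(rows) # valid only under the stated assumption
print(f"naive: {naive:.2f}, adjusted: {adjusted:.2f} (simulated true effect: 2.0)")
```

If an unmeasured common cause also existed, the adjusted number would be biased too, which is why the assumptions have to be stated up front.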
Sydney: [00:16:43] That's an important point for performing causal inference. You'll never be able to answer a causal question with data alone. You need to be able to incorporate your own knowledge about the data as it exists in the world to be able to derive anything causal from it. This is maybe the problem in data science where more often efforts are focused on the data than the science part.
Amit: [00:17:07] I think where we differ and again, this is talking about data science as in the broad concept not sort of particular studies, is that we think rarely about where the data comes from. Which actually is also a major part of science if you think about it, right? So a physicist who collects some data, they know exactly what experiment they ran and what control they had in their experiment right? Similarly for biology people spend painstaking amounts of time thinking about their experimental protocol and then doing that experiment to collect the data. I think that's something broadly, I wouldn't say missing, but less thought of when we think of data science especially with big data sets. The traditional feeling or the first start which I also did when I was working on recommendation systems is to collect a data set or obtain a data set from somewhere - let's say these are preferences of users on some website - and then try to make inferences on that data set assuming that you know what happened and how the data was collected, right? And this I think can go horribly wrong in some very simple scenarios.
So talking of recommendation systems, I can give a simple example. Let's say if you wanted to build a recommendation system that suggests interesting stuff that people would want to check out. That's the sort of basic definition of a recommendation system. If you would only look at logs, you would find what is called the “Harry Potter problem,” where a person who may have read Harry Potter 1 and 2, there's a high likelihood in your data set that they would also have read Harry Potter 3, right? So a relevant algorithm or an algorithm that's trained just on the data set, would immediately predict that Harry Potter 3, Harry Potter 4, Harry Potter 5, all are great recommendations, right? But now if you step back and if you think about how this data was generated you would realize that this data was generated based on the preferences of this user on this particular product.
Maddie: [00:19:34] So we have machine learning algorithms fitting obvious patterns in the data. Is this really a problem?
Sydney: [00:19:42] It depends on the goal of the recommendation algorithm.
Amit: [00:19:46] Maybe what we have to think about is since our goal is to actually show recommendations that are different, we might want to now think about items that are slightly different from this item, but are also relevant for this user. Right? And I think this thing becomes even harder to solve when you already have an existing system in place.
Sydney: [00:20:10] So this is kind of rooted in Amit’s perspective of treating deployed machine learning models as interventions. As something that has the potential to change the behavior of the individual interacting with it. Instead of just showing them something that they were already going to buy, you can show them something they might like, but were less likely to have already heard of.
Amit: [00:20:32] So now it's not just when you see a user's preferences, it's not just what the user liked but it also is a combination of what was shown to them by a previous recommendation system that the website has. And I think what I have seen in my research is that these kinds of questions change the answer that you get from the same data set and this is something I think we are still coming to terms with in data science on how to disentangle these two parts of the “science” that we are doing.
So there's one part of the scientific process where given a data set, we are computing statistical estimates of it, but there's another part which is about how the data is generated which actually gives an interpretation to those statistical estimates. As for the sake of the example the interpretation of relevance could be either the user likes this, or the user had no choice but this was the only thing that was shown to them using a recommendation system and so the user clicked on it. And I think this is the thing that causality helps us with and can bring us closer to sort of filling this loop of a scientific process to follow.
So just in the case of for example, this Harry Potter example that I just gave. If you are a book seller, and you see that someone has read Harry Potter 1 and Harry Potter 2, you're likely to show them Harry Potter 3 because you would find that other people like this user who have read the first two books also read the third book, right? And there's nothing wrong in it and it turns out that even if you now show this recommendation to the user, they most likely will click on it and will buy that book and so when you look at the data and look at it from the aspect of deciding whether I made a good recommendation, your answer will be, "yes, sure." You did, right, because you made a recommendation, that user clicked and you had a person purchasing a book right from your store.
Maddie: [00:22:55] So that's a good thing, right? The person clicked on the recommendation and bought the book. Isn't that a good recommendation system?
Amit: [00:23:03] The problem that comes is that what we are missing out is that the user would have purchased this book anyways. If you happen to be a fan of the Harry Potter series with or without recommendation, you're more likely to buy that book. So if you only look at what you can now call as this causal impact of the recommendation system, that happens to be about half of that metric. And I don't think that one metric is more right than the other, I think it's just about what you really want to measure.
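The gap between the two metrics can be shown with a toy simulation. The numbers are invented to mirror the "about half" remark and are not taken from Amit's research: 30% of simulated users would buy the book with or without the recommendation, and another 30% buy only because they saw it. The simulation tracks both outcomes for every user, which real logs never contain. In practice the baseline would have to be estimated, for example from a randomly chosen group that is not shown the recommendation.

```python
import random

rng = random.Random(3)
users = []
for _ in range(10_000):
    u = rng.random()
    buys_without_rec = u < 0.30  # fans who would buy anyway (hypothetical share)
    buys_with_rec = u < 0.60     # fans plus users persuaded by the recommendation
    users.append((buys_without_rec, buys_with_rec))

# Metric 1: how often a shown recommendation is followed by a purchase.
conversion_when_shown = sum(with_rec for _, with_rec in users) / len(users)

# Counterfactual baseline: how often the purchase would have happened anyway.
baseline = sum(without_rec for without_rec, _ in users) / len(users)

# Metric 2: purchases that exist only because of the recommendation.
causal_lift = conversion_when_shown - baseline

print(f"conversion when shown: {conversion_when_shown:.2f}")
print(f"would have bought anyway: {baseline:.2f}")
print(f"causal lift: {causal_lift:.2f}")
```

Both numbers describe the same system. Which one matters depends on whether the goal is relevance or additional purchases.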
Sydney: [00:23:44] What Amit is emphasizing with this story is how intrinsically intertwined the data generation process and the meaning of a dataset are. It's impossible to really fully understand the meaning of a data set in a vacuum. This can lead to a line of inquiry on Kaggle datasets, or other shared data repositories, which are pretty popular commodities in the data science community when you're trying to learn new data science processes and techniques or just get experience taking on an independent project. The hardest part of a project is often getting an interesting data set to work with so having these shared resources makes that easier, but it might also cause people to miss an important component of what doing data science really is.
Amit: [00:24:28] There are the standard data sets that have been collected. So one benefit of that is you can now track and benchmark progress. So for example with ImageNet, we can literally see how well we are doing year over year, but the other problem that it generates is that we are not thinking so much as researchers about the generation process because that's not what is evaluated in the paper.
Sydney: [00:24:53] Context is everything. Not just for data, but also for how to approach a problem.
Amit: [00:24:59] So I think what's happening today is that we are using models that we built for - I would say not so critical tasks like showing people ads or showing people recommendations for books or music - we are using the same technologies and then suddenly bringing them on to domains like healthcare or finance or sometimes even criminal justice.
And one of the things that we're missing - and this came up in our conversation earlier as well - is what are the effects of these algorithms when they are deployed in these settings and if you think about it, if you get a recommendation wrong for an information product, a book or some kind of a job, I think there is still a second chance because the user may come again, the user may refresh the page and they may find a better recommendation. But I think the challenge is that now when you are using the same kinds of algorithms maybe with some constraints on things like giving people a loan or not, or deciding what kinds of policies we want to make in healthcare, things can quickly get very interesting and actually very important to understand on what are the effects of these algorithms.
Maddie: [00:26:30] So Sydney, what do these recommendation systems have to do with causality?
Sydney: [00:26:35] So when you think about what causes a book purchase and then how a recommendation algorithm plays into that event, this is where recommendation systems need to start considering causal inference or causality.
If the goal of your algorithm is to cause someone to buy a book, you don't want to recommend them something they were going to buy anyway, you want to suggest something that they are likely to be interested in but wouldn't have considered without the algorithm's intervention.
So this gets back to philosopher David Hume's definition of causality. It's a quote from “An Enquiry Concerning Human Understanding” and it's, "where, if the first object had not been, the second never had existed." So if the recommendation algorithm had never existed would the customer have bought the book? And in the case of Harry Potter - definitely. You're not going to stop at book 3, that's the best one! You're on a high. But you might not buy another fantasy novel like The Hunger Games or something else, and so that's a more effective recommendation for causing a purchase that wouldn't have come in otherwise.
Maddie: [00:27:47] With all this in mind it feels like causality must be an important consideration for data science projects.
Victor: [00:27:53] So the answer to how I think causality fits into data science today is well, I think you know the tools of data science certainly are obviously immediately applicable to this problem of, well, once you've decided in the infinite data world that you could have gone and answered your causal question, then the tools of data science will tell you in the finite data world how do you do this well? So I mean that's certainly the most obvious answer and I think that's also the area with the most room for growth, you know, at least in industry or in application right? Because like how to do that is already very well understood. I mean, you can right now go pick up an introductory causality book, pick up an introductory machine learning book, smash the two things together and start answering causal questions from observational data. Yeah, but then above that I mean there is a lot of interesting work right now sort of along the flavor of, well, if what I care about really is answering causal questions and not just making predictions then how should I modify the existing data science toolkit in order to do that better?
Maddie: [00:29:15] Sydney, what is currently in the data science toolkit that Victor just referred to and how can it be modified in order for Victor to answer causal questions instead of just making predictions?
Sydney: [00:29:26] Python and R packages are how statistical and machine learning techniques make it into common usage in data science, and the same is true for causal inference. Amit has actually worked on a Python package called DoWhy which implements causal methods in Python.
Amit: [00:29:41] This is a labor of love that Emre Kiciman (my collaborator) and I have done, because we realized that we were working on causal inference problems in the domain of online systems, social networks, the effects of algorithms and what we were finding was that, one it's very hard to translate - especially as computer scientists - the statistical literature and the econometrics literature that talks about causal inference, somehow they have a different language. The second was that it's very hard to then also implement those algorithms and more importantly implement them in a way that you can also test the assumptions that your model is making. And so after having a frustration ourselves that we had to do this over and over again and build our custom estimators, we decided that it would be nice if we can think of creating a library for causal inference that helps anyone who has a causal question in mind to actually go through the steps of causal analysis.
It's available open source. So one of our goals is that more and more people find it, they use it and hopefully also contribute to it so that as a community we can start building tools for causal inference for everyone to use.
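Testing the assumptions behind an estimate can take many forms. One simple form is a placebo check, sketched below in plain Python. It does not use DoWhy and is not a description of DoWhy's API. It swaps the real treatment for a coin flip that cannot affect the outcome and reruns the same estimator: an analysis that still reports a sizable "effect" is picking up something other than the treatment. The data, the helper `stratified_effect`, and the true effect of 2.0 are all hypothetical.

```python
import random
import statistics

def stratified_effect(rows):
    """Treated-minus-control outcome within each stratum, weighted by stratum size.

    rows holds (stratum, treated, outcome) tuples.
    """
    total = 0.0
    for s in {r[0] for r in rows}:
        stratum = [r for r in rows if r[0] == s]
        treated = [y for _, t, y in stratum if t]
        control = [y for _, t, y in stratum if not t]
        diff = statistics.mean(treated) - statistics.mean(control)
        total += diff * len(stratum) / len(rows)
    return total

rng = random.Random(4)
rows = []
for _ in range(20_000):
    good_soil = rng.random() < 0.5
    fertilized = rng.random() < (0.8 if good_soil else 0.2)
    crop_yield = (10 + (5 if good_soil else 0)
                  + (2.0 if fertilized else 0.0)
                  + rng.gauss(0, 1))
    rows.append((good_soil, fertilized, crop_yield))

estimate = stratified_effect(rows)

# Placebo check: replace the treatment with a coin flip unrelated to the outcome.
placebo_rows = [(soil, rng.random() < 0.5, y) for soil, _, y in rows]
placebo_estimate = stratified_effect(placebo_rows)

print(f"estimate: {estimate:.2f}, placebo estimate: {placebo_estimate:.2f}")
```

Passing a placebo check does not prove the causal assumptions hold. It only rules out one way the analysis could be wrong.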
Maddie: [00:31:21] Thanks for tuning in to Alter Everything. Continue the fun and share your thoughts on Twitter using the hashtag #AlterEverythingPodcast, or leave us a review on your favorite podcast app. You can also subscribe on the Alteryx Community at community.alteryx.com/podcast. While you're there, fill out our audience engagement survey. The first 100 people to leave their feedback will be entered to win one of five pairs of Bluetooth headphones. Good luck, and catch ya next time.