In a slightly different format from our regular podcast episodes, we’re featuring the insights of two statisticians, Victor Veitch and Iain Carmichael, as we explore the statistical origins of data science and the divisions and approaches to modeling.
Alteryx Sr. Community Content Engineer, Sydney Firmin, takes our Alter Everything Podcast producer, Maddie Johannsen, through this journey.
Special thanks to Baba Brinkman for our special theme music, and rap track for this episode. You can access Baba’s full rap track for free on SoundCloud.
Also be sure to take our audience engagement survey to share your listening preferences and suggestions to make Alter Everything your dream soundtrack to pair with your analytics lifestyle!
*Bonus: the first 100 survey takers will be entered to win 1 of 5 pairs of Bluetooth headphones. Survey away!
Victor: [00:00:00] I do have to say that there is a cultural value in statistics, which is correctness. So, if you read something in a good statistics journal and the authors claim that something is true, then it's just true. Right like I mean, nobody ever publishes like wrong or misleading results, right? Whereas, you know, machine learning venues right now, I'd say maybe more than half the time you try and reproduce a paper published in a top machine learning venue, it will just turn out to be like fundamentally broken.
Maddie: [00:00:43] This is Alter Everything. A podcast about data science and analytics culture. Today in a slightly different format from our regular podcast episodes, we're featuring the insights of two statistics post-docs Victor Veitch and Iain Carmichael, as we explore the statistical origins of data science and the divisions and approaches to modeling. Alteryx Sr. Community Content Engineer, Sydney Firmin will be walking me through this journey. She's well-versed in data science and stats, so stick with us. She's a great tour guide.
Iain: [00:01:14] Okay, so there's a famous article from the 60s from John Tukey called the future of data analysis. And in that article, I think that's one of the first examples of this like push back against maybe academic statistics focus on certain things sort of particularly certain areas of mathematical statistics and maybe also hypothesis testing.
Sydney: [00:01:40] That was Iain Carmichael, a recent PhD recipient in statistics now working on a National Science Foundation project at the University of Washington. The paper he's referring to was published in 1962. It's 67 pages long and it's pretty dense, but it's an important paper for understanding the history of data science and how it's related to statistics. Maddie, can you read the first line from the paper?
Maddie: [00:02:06] For a long time I have thought I was a statistician, interested in inferences from the particular to the general. But as I have watched mathematical statistics evolve, I have had cause to wonder and to doubt.
Sydney: [00:02:19] That's John Tukey and he is a big deal statistician.
And in this paper, he pretty much comes out and makes this public confession to the field of statistics that he thought the research going on within the field was too narrow, possibly useless, and potentially harmful. And that the scope needed to be redirected and broadened in order for the field to stay relevant. So in a lot of ways, he pretty much prophesied the rise of something he called "data analysis," which is effectively what we call "data science" today. Something else that was interesting is that he felt that this new field would fall under the umbrella of sciences as opposed to mathematics.
Iain: [00:03:03] So really the upshot is he's saying, like, hey statistics, you know, we do all this stuff, we do things like exploratory analysis, but we really focus on, you know, these three areas, and maybe we should put a little bit more focus on these seven other areas that we've been kind of ignoring or, you know, underappreciating. And I think that's really what data science is, or I guess maybe we can call it a movement: it's sort of a reaction to things where, you know, people or a field sort of underappreciate something, and people who are calling for data science at various points will come in and say, hey, you know what, there's this stuff that's really valuable and you're underappreciating it, we should do more of that.
Sydney: [00:03:53] So in this paper, Tukey is effectively identifying four points or driving forces that ultimately define this new science of data analysis. Statistical theory is one, developments in computers is two, handling large bodies of data is three, and a new emphasis on quantification in a wider variety of disciplines is four.
And considering that this paper came out half a century ago, that's pretty on point for what data science is now. It's a lot about processing large datasets. Quantification has become important in industry and in a lot of diverse fields, and we're seeing more disciplines like computational biology or computational geography rising in academia.
And so, papers along this line have continued to be written and published in statistical journals for the past 50 some years, but they're definitely the minority. And Tukey's paper is definitely the one kicking it off. And what's funny about these warnings and calls for action to move statistics away from traditional theory, hypothesis tests, stuff like that, to more applied data analytics is that they were outliers.
For the most part, it felt like the broad tent of statistics - or at least the public facing side of academic statistics - was pretty unresponsive to these calls to action for a long time and more or less stayed on the course it had been on.
Iain: [00:05:34] People have been using data and quantitative methods for probably hundreds or maybe even thousands of years. The discipline of statistics as we know it, I think, probably developed in the early 1900s. You know, it sort of came out of this problem where people were using data to try to solve scientific problems and not getting very reliable answers. And so, people started to think a little bit more carefully, like, you know, how do we design experiments effectively, and what kinds of statistical analyses are reliable.
And, you know, they were focusing probably a lot more on simple problems than we are today, the sort of simple problems that you probably learned in a stats 101 class, you know, things like, "if I add this fertilizer, do I get a higher crop yield," you know, to pick a very simple example.
Maddie: [00:06:33] Oh fertilizer. That's an interesting example.
Sydney: [00:06:36] Yeah, it's not a random one. An early example of statistical analysis comes from Ronald Fisher working on controlling the effects of inorganic fertilizer at the Rothamsted Research Station - which, fun fact, is still in operation. It's about as classical statistics as you can get and it's pretty indicative of where classical statistics is at its heart. Statistics is a mathematical field driven to support scientific inquiry.
Iain: [00:07:07] Over time, I think the problems started to get more and more complex, you know, people started to develop more sophisticated tools and methodology. That's how I've sort of come to view statistics, you know, as a service discipline to people who solve problems with data.
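To make the fertilizer question concrete, here is a minimal sketch of one classical way to ask it: a permutation test on the difference in average yield. The yield numbers are invented for illustration and the function names are hypothetical; this is not a reconstruction of Fisher's actual Rothamsted analysis.

```python
import random
import statistics

# Hypothetical crop yields per plot (made-up numbers, for illustration only).
control = [4.1, 3.9, 4.4, 4.0, 4.2, 3.8]
fertilized = [4.6, 4.3, 4.9, 4.4, 4.7, 4.5]


def mean_difference(treated, untreated):
    """Average yield of treated plots minus average yield of untreated plots."""
    return statistics.mean(treated) - statistics.mean(untreated)


def permutation_p_value(treated, untreated, n_permutations=10_000, seed=0):
    """One-sided p-value: how often does randomly relabeling the plots
    give a difference at least as large as the one actually observed?"""
    rng = random.Random(seed)
    observed_diff = mean_difference(treated, untreated)
    pooled = list(treated) + list(untreated)
    n_treated = len(treated)
    at_least_as_large = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean_difference(pooled[:n_treated], pooled[n_treated:])
        if diff >= observed_diff:
            at_least_as_large += 1
    return at_least_as_large / n_permutations


observed = mean_difference(fertilized, control)
p_value = permutation_p_value(fertilized, control)
print(f"observed difference in mean yield: {observed:.2f}")
print(f"permutation p-value: {p_value:.4f}")
```

A small p-value says that a gap this large would rarely appear if the fertilizer labels were irrelevant, which is the kind of reliability check the early statisticians were after.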
Maddie: [00:07:24] Sydney tell me more about this service discipline that he's referring to.
Sydney: [00:07:30] I think viewing statistics as a service discipline is a really interesting perspective. It kind of transcends the whole "statistics as math or statistics as science" dichotomy and instead describes it as a field used to make scientific analysis meaningful, to support the scientific method using math and science.
Iain: [00:07:52] In academic statistics, the incentive is just to create new methods, you know, for the sake of creating new methods, but that can be somewhat orthogonal to actually providing value to someone.
Maddie: [00:08:05] Sydney... orthogonal?
Sydney: [00:08:07] Yeah orthogonal, I recently learned, is math speak for perpendicular. It's a more precise way of saying perpendicular, so it's not too much to worry about there.
Iain: [00:08:20] And I think if you sort of view yourself as a service discipline providing value to someone, I think that refocuses your incentives and your priorities.
Sydney: [00:08:31] I think this does effectively sum up the problem that Tukey and other statisticians have had with classical statistics.
So, if statistics is a service discipline, it matters entirely who you're serving. And that's how statistics stays relevant or doesn't and opens up space for data science.
Victor: [00:08:54] So I think statistics is of course just the science of learning things from data, right? So that's uncontroversial.
Sydney: [00:09:02] That was Victor Veitch.
He is another recent PhD recipient in statistics and is now a postdoctoral researcher at Columbia University.
Victor: [00:09:13] But I do think there's this interesting distinction between you know, classical statistics, which is mostly taught in undergraduate courses still and you know, like the sort of general principle of learning from data and that distinction is that classical statistics is very model-based.
Right? Like it fundamentally relies on, you know, writing down some story about how I think the data that we saw was generated and then saying "oh, well, if our story is true, like what parameters will like best explain the data," right? So, it's like I write down a model, I commit myself to believing in that model and then I sort of say well if I believe in that commitment that assumption that I've made then what is the best I can do, right?
And certainly when I was learning statistics in undergrad and even at the beginning of grad school, I had the impression that that's really what statistics was about and, you know, my view has certainly shifted since then to say, "oh, yeah, like in fact that's a particular approach that we very frequently take, but it's not at all like a necessary one."
So, in terms of how this relates to how I perceive data science, I think, you know, the difference between statistics and data science is really the difference between like classical statistics and data science. And I think it's really around these parametric assumptions or this idea that, you know, data science doesn't write down a model and say this is how the world actually is, right?
If you fit like a random forest or neural network or whatever else, you are making some kind of assumption, but the assumption is definitely not about how the world actually operates.
Sydney: [00:11:04] Yeah. So what Victor is saying here is related to another paper that gets referenced a lot when talking about the origins of data science. This paper was published in 2001 by Leo Breiman. If the name sounds familiar, it's because Leo Breiman is the statistician who came up with the random forest algorithm.
The paper is called "Statistical Modeling: The Two Cultures." And it's pretty much Breiman calling out what he sees as an important divide in statistics. Two separate cultures that take different approaches to modeling data.
The first culture is a traditional or classical statistics approach where we assume that data are generated by a process with a randomized component, but a process we know, or can learn the general shape of. These models are the linear regressions of the world. We make an assumption about how the variables in the data set are related to each other, or about the population the data was drawn from, and we use those assumptions to understand relationships and make conclusions. This modeling culture is by far the majority in statistics. And in Breiman’s eyes, this disproportionate focus on one type of data modeling has caused the field to suffer. His criticisms are actually really similar to Tukey’s - irrelevant theory, questionable conclusions, and just generally uninteresting work.
The second culture, the one Breiman feels he belongs to, and the one he also feels is the minority in statistics, is algorithmic modeling. These are more of the black box models, like random forest or boosting, or neural networks. These types of models aren’t so concerned with the process that is generating the data. They’re focused on getting the most accurate predictions possible. And although they tend to be more accurate in their estimates of the data set, they also tend to be a lot harder to interpret. These are the models whose primary function is to predict instead of trying to explain. All this is touching on a trade-off that a lot of data scientists have to deal with today, known as interpretability vs. accuracy.
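The contrast between the two cultures can be sketched in a few lines. The example below is an illustration under stated assumptions, not anything from Breiman's paper: the data are synthetic and deliberately curved so that a straight-line model is the wrong story, and a k-nearest-neighbors averager stands in for the algorithmic side (it is far simpler than a random forest, but it likewise assumes no functional form). All function names are hypothetical.

```python
import math
import random


def fit_line(xs, ys):
    """Data-modeling culture: assume y = intercept + slope * x and estimate both."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def knn_predict(train_xs, train_ys, x, k=5):
    """Algorithmic culture: average the outcomes of the k closest training points."""
    nearest = sorted(zip(train_xs, train_ys), key=lambda pair: abs(pair[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(predictions, targets):
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


# Synthetic data (an assumption for illustration): a curved signal plus noise.
rng = random.Random(0)


def simulate(n):
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, 0.2) for x in xs]
    return xs, ys


train_xs, train_ys = simulate(200)
test_xs, test_ys = simulate(200)

slope, intercept = fit_line(train_xs, train_ys)
line_mse = mean_squared_error([intercept + slope * x for x in test_xs], test_ys)
knn_mse = mean_squared_error(
    [knn_predict(train_xs, train_ys, x) for x in test_xs], test_ys
)

print(f"linear model: slope={slope:.3f}, intercept={intercept:.3f}, test MSE={line_mse:.3f}")
print(f"5-nearest neighbors: test MSE={knn_mse:.3f}")
```

The line gives a slope you can read and argue about, while the neighbor average gives better predictions on this curved data and no single number to interpret. On data that really were linear, the comparison could easily go the other way.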
Victor: [00:14:39] I think historically statistics has undervalued prediction. I mean, statistics has of course always done predictive problems, but, for example, the famous Leo Breiman “Two Cultures” paper is really all about “hey statistics, you should care about predictive problems a lot more,” and if you're going to solve predictive problems, there's all sorts of methods like support vector machines and decision trees that are really really good at solving predictive problems that statistics has kind of undervalued. And so that's the big point in that paper.
There's a discussion in the end of that paper and Bradley Efron, who's a very famous and amazing statistician, makes three points, two of which I think are so incredibly fundamental to modern machine learning like, how do you assess machine learning systems?
If you look in the rejoinder, it's like rule 1 and rule 2, and then he makes a third point essentially saying, “yeah prediction is important, but its importance is sort of limited.” And I think that was sort of statistics' maybe fundamental oversight. In the early 2000s, it was kind of undervaluing how important predictive problems are. In the rise of artificial intelligence and machine learning recently, one could summarize that predictive modeling is really powerful.
Maddie: [00:16:11] So Sydney, it sounds like there's this rivalry building between Breiman and Efron.
Sydney: [00:16:18] It’s a fundamental difference of values and opinions on what good or interesting statistical models look like. But this paper is kind of funny because it's published at first with just Leo's original paper and then it gets republished with responses from other famous statisticians in the field, one of which is Brad Efron.
And then Leo Breiman gets an opportunity to respond to their comments. And so, it kind of turns into this academic rap battle where you have Breiman going, “Hey these models that we've been using for a long time are lame. Check out all these other models that are really effective” and Efron is like, “yeah, I don't know about that, like are you sure that this is a good choice?”
And so, they go back and forth for a total of three separate kind of -
Maddie: [00:17:07] Three showdowns.
Sydney: [00:17:08] Yeah. Yeah exactly. Three showdowns. So, there's the original paper, the retort, and then, like, the final comeback, and they're obviously very respectful and cordial and I think they do respect one another's work, but it's just this fundamental, kind of deep-set disagreement on what good statistical models look like.
Iain: [00:17:34] So I think statistics and machine learning sort of talk past each other a little bit, and I think that's because in machine learning people are really focused on these predictive problems and in statistics we're much more focused on these inferential problems.
Sydney: [00:17:54] When Iain talks about statistics and machine learning talking past each other, I think this can be clarified and built upon by reading a section from a paper he published last year called “Statistics and Data Science: Two cultures.” Maddie, would you mind reading the excerpt?
Maddie: [00:18:15] "A computer scientist might pejoratively describe a linear or logistic regression as shallow and quaint. A statistician might express bewilderment at the buzz around deep learning, and question why a more principled and interpretable statistical model doesn’t do the trick. The point here is that these two imaginary academics are thinking about problems with different goals. The computer scientist is trying to build a system to accomplish a given task; the statistician is typically trying to learn something about how the world works."
Sydney: [00:18:48] Right. And so that's that split between an engineering perspective and I think engineering goals where you're trying to make something work and have it give the best possible answers that you can as opposed to a scientist who's maybe trying to understand why.
Maddie: [00:19:09] So with this rise towards a culture that's, you know, getting more and more excited about machine learning, what other factors have played a role in popularizing these practices?
Sydney: [00:19:28] Yeah, so I think just the availability and capabilities of computers have definitely been a contributory factor in the rise of the machine learning or algorithmic modeling culture that Breiman's referring to.
Victor: [00:19:47] A lot of classical statistics was developed in an era where compute was very limited.
And so the set of models that we used were things for which there were, like, very efficient computational tricks. Right. So, I mean, like a really classical example is linear regression, right? A major reason that we’ve used linear regression for a hundred years is just that it's extremely extremely efficient to solve the computational problem that you need there.
But even apparently more advanced things like exponential family models, for example, which I think most people might know under the heading of generalized linear models. Again, these are things which are much more motivated by their computational tractability than they are by their intrinsic statistical importance or how good of a model they actually provide.
Yeah, and so one influence from engineering is basically saying “well, write down the model you actually want and then we'll figure out how to actually fit it. Like we’ll solve the computational problem implied by the statistical problem instead of restricting ourselves to the models that we can solve computationally in practice.”
So of course, data science certainly feels that influence, right? In some sense, like, we've been liberated by just having much more sophisticated computational tools at our disposal.
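As a rough illustration of the computational point, the sketch below fits the same squared-error problem two ways: with the one-line closed-form answer that made linear regression cheap long before modern computers, and with a generic iterative optimizer of the kind that can also be pointed at models with no such shortcut. The data, the through-the-origin simplification, the learning rate, and the function names are all assumptions made for this example.

```python
def closed_form_slope(xs, ys):
    """Least-squares slope for a line through the origin, in a single pass."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def gradient_descent_slope(xs, ys, learning_rate=0.01, steps=2000):
    """Minimize the same mean squared error by repeated small downhill steps."""
    slope = 0.0
    n = len(xs)
    for _ in range(steps):
        # Derivative of mean((slope * x - y)^2) with respect to slope.
        gradient = sum(2 * (slope * x - y) * x for x, y in zip(xs, ys)) / n
        slope -= learning_rate * gradient
    return slope


# Made-up measurements, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

print(f"closed form:      {closed_form_slope(xs, ys):.6f}")
print(f"gradient descent: {gradient_descent_slope(xs, ys):.6f}")
```

Both routes land on the same slope here. The difference is that the closed form exists only for a narrow family of models, whereas the iterative route needs nothing more than a loss whose gradient can be computed.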
Maddie: [00:21:23] So it sounds like Victor has a different idea regarding the differences between data science and statistics.
Sydney: [00:21:32] Yeah, I think that in Victor's eyes the differences between data science and statistics are more stylistic than they are substantial differences that would cause you to consider statistics and data science two different disciplines.
Victor: [00:21:48] So here’s an important aesthetic difference between statistics and data science, which is the aesthetics of statistics have very clearly articulated precise assumptions and then you try and prove something with those assumptions, right? So, you really say, “this is exactly what I believe and if you believe that as well, then here is exactly the result” and the aesthetics of data science are a lot fuzzier, right?
They tend to be things like, well, if I fit a neural network to this, look, it's a pretty good predictor. And so, in that sense, I mean, of course it is a thing where in data science, you know, we kind of habitually sweep whatever assumptions we're making under the rug, right? Like I think it's very rare to see things explicitly articulated, but I'm not sure it's generally problematic.
So, one issue is that the things that you can do formally with like very concretely articulated assumptions, that's a very restrictive setting to work in. My feeling is even if you don't have formal results or you haven't said exactly what assumptions you're making, if you do, you know, a very good empirical study or you have some sort of like empirical argument for why what you're doing works and is appropriate, then that can be like a totally satisfactory substitute.
Yeah, so I think indeed the articulation of assumptions is like routinely neglected, but I'm not really sure that's like a major issue.
Maddie: [00:23:29] What Victor is saying here is that one of the major differences between statistics and data science is how each group has decided to handle assumptions going into modeling.
Victor: [00:23:40] I think statistics has historically been much more strongly rooted in rigor and mathematics. Right. So, you see a lot of results in statistics talks or statistics journals, which involve like, you know, very precise guarantees about what a method will do like very clearly articulated assumptions and maybe some deep theoretical analysis, which is much less common in data science.
It’s just not part of the discipline in the same way. And I mean also particularly kind of a flavor of assumption which is popular in statistics is you write down a generative model for the data. And then you say, “assuming that my data was generated in this fashion, what will happen under my inference procedure?”
Right? So, you make an assumption about where the data actually came from and in data science, I think assumptions like that are very very rare. So that is a remaining aesthetic distinction.
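The question "assuming that my data was generated in this fashion, what will happen under my inference procedure?" can also be checked by simulation. The sketch below assumes a normal generative model with known true mean and standard deviation (values chosen arbitrarily), and uses a textbook normal-approximation 95% confidence interval as the inference procedure. The function names are hypothetical.

```python
import math
import random
import statistics


def confidence_interval(sample, z=1.96):
    """Normal-approximation 95% interval for the population mean."""
    center = statistics.mean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(len(sample))
    return center - half_width, center + half_width


def coverage(true_mean=10.0, true_sd=2.0, sample_size=50, n_trials=2000, seed=0):
    """Fraction of simulated data sets whose interval contains the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # The generative model: independent normal draws with known parameters.
        sample = [rng.gauss(true_mean, true_sd) for _ in range(sample_size)]
        low, high = confidence_interval(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / n_trials


rate = coverage()
print(f"fraction of intervals containing the true mean: {rate:.3f}")
```

If the generative assumption holds, the fraction should sit close to the advertised 95%. Swapping in a different data-generating function is a quick way to see how the guarantee degrades when the assumption is wrong.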
Maddie: [00:24:45] This next part will sound familiar because this is what we started the episode with.
Victor: [00:24:49] I do have to say that there is a cultural value in statistics, which is correctness. So, if you read something in a good statistics journal and the authors claim that something is true, then it's just true. Right like I mean, nobody ever publishes like wrong or misleading results, right?
Whereas, you know, machine learning venues right now, I'd say maybe more than half the time you try and reproduce a paper published in a top machine learning venue, it will just turn out to be like fundamentally broken.
And so, I don't know that mathematical rigor is the right answer to this but certainly, you know, one part of the aesthetics of statistics that I really appreciate is if you say something is true, it has to be true.
Maddie: [00:25:41] I think what's important here is that Victor does see value in both of these approaches. There's value in the more engineering minded approach where you just want to get the problem solved, as well as in the scientific approach where you're trying to account for assumptions completely. With these two differences in values and perspectives between the two modeling cultures that make up data science today, as well as the influences that come from other traditional academic fields, the question might become who should be teaching data science and what exactly needs to be taught.
Maddie: Let's check back in with Iain.
Iain: [00:26:12] As to what should be taught, you know, that's an open question that the whole discipline is working on right now. It’s some combination of core computational tools, core algorithmic skills, you know, some mathematical concepts and of course some statistical concepts.
Sydney: [00:26:37] Yeah, Iain put together a data science curriculum for the University of North Carolina, which is where he did his doctorate, and this is something he had particularly close insight on.
Iain: [00:26:49] If I were to design a six-course core built out of courses that kind of exist right now, I would say something like, you know, an intro to programming course and a data structures and algorithms course from computer science. For math, linear algebra and probability, although maybe that goes under statistics, and, you know, a classical statistical modeling course as well as maybe a more modern, you know, machine learning course, and probably like some kind of capstone course that brings it all together in some kind of interesting project-based way.
Sydney: [00:27:36] So definitely interdisciplinary though?
Iain: [00:27:38] Definitely interdisciplinary. I think probably one of the biggest changes going forward is going to be the interdisciplinary nature. So, you know, it's not enough for us, for statisticians, to just teach undergrads statistics tools.
They also need to understand computer science tools. And then of course that brings the intersection of, you know, the statistics tools and the computer science tools. One of the main challenges going forward is “how do we deal with these intersections,” because if you're solving a problem, it needs to be statistically rigorous and you also need to be able to compute it. Teaching that intersection is something we're working on right now.
Maddie: [00:28:21] So what does Iain think that statistics has to offer to data science as an interdisciplinary field?
Iain: [00:28:29] So like we were talking about earlier, the way statistics historically developed was people were trying to use data to answer questions and people realized, “okay, we need to be a little bit more rigorous about that.” The questions of the 1900s are different than the questions we’re trying to answer today. And if I could phrase, you know, what the discipline of statistics could really contribute to modern data science, it would be, “just because an algorithm gives you an answer, it doesn't mean that answer is correct.”
I think that's maybe the more modern version of what statistics as a field has to uniquely contribute to data science.
Sydney: [00:29:22] And as for the fields contributing to data science going forward…
Iain: [00:29:26] Historically rooted disciplines, like statistics and computer science, are going to stay their own disciplines.
I think probably going forward, maybe the most important thing is how we deal with these interactions and intersections, you know, like creating environments where it's easier for computer scientists and statisticians and other people to collaborate. Educating people in stuff that's not their expertise. So, you know, statisticians probably need to learn a little bit more computer science, things like that. I think we're going to see a lot more of that.
Sydney: [00:30:06] It seems like to a lot of statisticians, including Victor, when we talk about any of these areas of data science that seem maybe new, they’re really just underserved or underappreciated areas of statistics. So maybe the future of data science, or even the present of data science, is statistics.
Victor: [00:30:31] I think statistics and data science, yes, have more or less merged to be a single concept, and maybe there are aesthetic differences between the two areas but really, they're just the same thing under two different names. I think there is some push, actually, among statisticians to rebrand as data science. I mean, I certainly think it is a common perspective among statisticians that data science is statistics. I don't have good predictions for exactly how branding will go in the future. But I would predict that the aesthetic differences between data science and statistics, in as much as these exist as two different areas, the differences are aesthetic, and my feeling is like that won't persist. So, I think the two things will unify basically into like a single area done in a fairly consistent way.
Maddie: [00:31:31] So because data science is growing in popularity, do people need to go back to school to start their career as a data scientist?
Sydney: I think there are a lot of paths to becoming a data scientist. Here’s Iain’s thoughts on the matter.
Iain: [00:31:45] I would guess that data science is one of the best disciplines out there in terms of free quality educational resources. So, when I was doing my PhD in statistics, we weren't really taught, you know, things like R or the core data science computational tools.
So, I taught myself those things using these online resources and I would imagine most people who’ve at least gone through graduate school have similar stories. I think these resources are really amazing. So, they kind of democratize the subject, they’re lowering the barrier to entry so, you know, I think they allow people who don't have a PhD or an undergrad in statistics to become data scientists.
And I think that's really great and they also, you know, just make it a lot cheaper for students and people to learn this stuff. If you don't have to buy, like, several-hundred-dollar textbooks, that's awesome.
Maddie: [00:32:48] Sydney thank you so much for walking us through this.
Sydney: It’s been my pleasure.
This episode of Alter Everything was produced by Maddie Johannsen (@MaddieJ).