In this episode, we talk all things automation and discuss why humans need to stay in the loop. Knowing that data is crucial to the decisions analysts and data scientists make, automating analytics processes allows humans to focus on delivering insights. This combination of automation and human thinking power makes the most of humans' contributions to the data analytics process.
You’ll hear from Alteryx Data Science Journalist, Susan Sivek, and Doris Lee, a PhD student at the School of Information at UC Berkeley, as they emphasize these topics, and the importance of keeping humans in the loop.
[music] Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Maddie Johannsen, and I'll be your host. For this episode, I spoke with my teammate, Susan Sivek.
So hi, Maddie. This is Susan, and I'm the data science journalist for the Alteryx Community.
And Susan walked me through a conversation she had with Doris Lee.
Hi there. I'm Doris Lee, and I'm a fourth-year PhD student at the School of Information at UC Berkeley.
We'll be talking about automation and why humans need to stay in the loop. Let's get started.
So I'm a fourth-year PhD student at the School of Information at Berkeley, and I'm broadly working on designing interactive tools and systems that make it easier for users who might not have data or programming expertise to work with data more effectively. As for my background, during undergrad I was in astronomy and physics, and I was working with a lot of large data sets. And I realized that there was really this need for creating tools to help domain experts, scientists, people who are professionals in the field, work with data, people who might not otherwise be professionally trained in programming or data analysis. And so that really jump-started my graduate work in developing these tools for these users.
This is interesting, not only for people like me who don't have this programming or formal analytics training but also from the perspective that we like to talk about on the podcast often, which is that of digital transformation.
Yeah. There's always just so much to learn in the data world, and there's also a lot of different ways to accomplish the same goals. So it can feel kind of overwhelming, and I think of data literacy for everyone too, right? So Doris and I got into this a bit when she talked about why this work is important. And by the way, much of her work is done in collaboration with researchers from UC Berkeley, Tableau Research, and the University of Illinois.
In the process of exploratory data analysis, often you're faced with this issue of having to make a lot of decisions in terms of where to look for your insights and what models should you use for your machine-learning algorithms. There's all these decisions that you end up making that relates to what you actually get out of these models or these analyses. And a lot of times, it's very challenging to come up with these configurations. And so a lot of my work has been designing tools to more effectively help people within that process.
I think you've been working on this for maybe four or five years at this point?
What kinds of changes have you seen so far as we've moved toward greater automation of that analytical process?
Traditionally, we've been really good at developing tools. Right now we have a whole slew of open-source as well as commercial tools that allow people to do the basic visualizations, machine learning, and analytics. And those tools are great. And there have been several ecosystems, like Python and R and Stata, that allow analysts to do these analyses very flexibly. However, there's kind of a barrier to entry in being able to figure out how to use a given tool for a certain task. And so there's a whole slew of tools out there, and it becomes very hard for someone who is entering the field to figure out, "Oh, I need to use Pandas for wrangling and cleaning my data, but I need to use maybe Scikit-Learn for modeling my data and doing machine learning on it."
And so speaking of barrier to entry, what is Scikit-Learn?
So Scikit-Learn is a really popular set of Python tools for doing different machine-learning tasks. It can do a lot of different things, and just mastering it would be a big project in itself. But what I think Doris is getting at here is that there are a lot of different names and terms and tools to use in the data science process, and figuring out how they all apply to your own project and goals can be a big challenge, especially if you're new to this kind of work.
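To make that tool-juggling concrete, here's a minimal sketch, on entirely made-up data, of the split Doris describes: Pandas for wrangling, scikit-learn for modeling, each with its own API to learn. The columns and values are invented for illustration.

```python
# The "different tools for different steps" problem: pandas wrangles,
# scikit-learn models. The tiny DataFrame is fabricated for this sketch.
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Step 1: wrangle with pandas -- fill a missing value, encode a category.
df = pd.DataFrame({
    "age": [22.0, 35.0, None, 58.0],
    "group": ["a", "b", "a", "b"],
    "label": [0, 1, 0, 1],
})
df["age"] = df["age"].fillna(df["age"].median())
df["group"] = (df["group"] == "b").astype(int)

# Step 2: model with scikit-learn -- a completely different API to learn.
model = LogisticRegression().fit(df[["age", "group"]], df["label"])
print(model.predict(df[["age", "group"]]))
```

Even this toy example crosses two libraries with different conventions, which is exactly the barrier to entry being described.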
So there's sort of this disconnect. And so part of the goal is seeing if there are ways that we can guide users through that process, even in particular scenarios like visualizations or machine learning, by introducing some bits of automation in that process.
And are you seeing that those bits of automation are becoming more widespread, more adopted across the field?
Yeah. So if we take the example of machine learning, traditionally, people have been using tools like Scikit-Learn or building their own neural network models and things like that. Nowadays, we're seeing the rise of a class of systems called AutoML tools.
--Assisted Modeling in Alteryx, these AutoML tools that Doris went into during our conversation offer an automated approach to implementing machine learning. So they help with everything from preprocessing of data, cleaning it up and getting it ready, to developing a model and seeing how it performs, and sometimes even putting it into production. Basically, the idea is that even people who aren't experts in machine learning can use these tools to explore their data and build models.
Oh, this sounds perfect for somebody like me [laughter]. Essentially, you just explained that Alteryx Assisted Modeling breaks down that barrier to entry that Doris mentioned earlier. It guides analysts through the data science process. The idea is to make this kind of advanced analysis more accessible, even if you'd never heard of Scikit-Learn either.
Right. Exactly [laughter].
I guess to take a step back, when you're developing a machine-learning model, there is often a set of configurations that you have to make in order to achieve the learning outcome that you want. So for example, let's say that I'm creating an image classifier for identifying cats in photos. And as a developer, when I'm building this model, there are certain models that I have to pick in order to work well with image data, and there are also hyperparameters for that model that I need to pick in order to make that model work well. And then there are also preprocessing procedures that I need to consider, as well as what metric I want to use to say how good my model is. So this is kind of a combinatorial space that I have to search through. This is usually guided by domain knowledge as well as the developer's past experience working with these models. And so what these AutoML systems essentially do is automate the search through that space, sometimes limited to hyperparameter search or model selection, and tell you which configuration of these choices is effective based on some sort of metric. So it might tell you something like boosted decision trees with a certain hyperparameter setting and this type of preprocessing would lead to a model with a classification accuracy of 0.98. And then given those different alternatives, you can go in, observe these models, and dig deeper to understand which one you actually want to use in production or for deployment.
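A rough way to see the configuration search Doris describes is to run a small grid search by hand. This sketch is not any particular AutoML product; it simply enumerates a few boosted-tree configurations with scikit-learn's GridSearchCV on synthetic data and reports the best one, which is the core idea behind the automated search.

```python
# A toy version of an AutoML-style search: try every combination of a few
# hyperparameters, score each by cross-validated accuracy, keep the best.
# The data is random with a learnable rule, purely for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # the rule the model should learn

# The "combinatorial space" of configurations, here only 2 x 2 = 4 choices.
search = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [1, 3]},
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 2))
```

Real AutoML systems search vastly larger spaces, and often over preprocessing steps and model families too, but the report-the-best-configuration pattern is the same.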
So that's kind of, in a nutshell, what these tools do. And the features that they have really span the different life cycles in machine learning, from preprocessing to modeling to postprocessing. And some of these providers even help with the deployment and monitoring of these models. We are seeing more automation in the data science development life cycle, both in machine learning and in data visualizations and analytics. And the real holy-grail question here is: how do you effectively introduce automation into what is usually done manually by data scientists or analysts, in a way that works collaboratively with those analysts and still allows them a lot of control and flexibility and a way of expressing their domain knowledge as effectively as possible, without bottlenecking them or making automation the complete solution?
[music] Oh, that sounded important. I mean, we get to work with the analysts and data scientists all day in the community, and this sounds huge for them.
It is. It really is super important. I mean, think about it from the perspective even outside of analytics. Maybe a good analogy is someone who loves to cook, but they really hate doing dishes. Having to wash those dishes just totally lowers the fun level of the cooking for someone who really just enjoys the creativity and the exploration of cooking. Assisted Modeling and other automated tools are kind of like getting a dishwasher. They help analysts and data scientists get through the boring and routine stuff more quickly so they can focus on doing the creative and interesting stuff that they enjoy more.
Yeah. It's amazing to think about all of those different decision points that you just described being streamlined, and to some degree automated. You've also written quite a bit about that idea of collaboration with that automated system and this idea of keeping the human in the loop. So could we talk about that a little bit, what that means for you, this idea of collaboration and keeping humans in the loop?
Yeah. The real reason why we even think about keeping humans in the loop is that humans provide a lot of very valuable input when we're doing data science, in particular domain experts [laughter].
I love when people say that.
I know, right [laughter]? That validation is always nice.
So domain experts, whether they're in medicine or finance or manufacturing, are really important in providing this valuable knowledge about what they see in the data. They see connections within the data that an outsider might not be able to see. Let's say we have a machine-learning tool that is doing cancer diagnostics, and the precision score or a certain trade-off should be weighted higher or lower based on the mortality rate of that cancer type. This is one example where the clinician, who is a domain expert, might come in and add additional bits of information to the data science workflow that influence how that pipeline is improved, based on domain knowledge obtained in a clinical setting. And domain experts know their data the best. They might not be the ones who processed it or collected it, but they know the domain very well. So they know what the attributes and the values mean, and why maybe there are missing values in the data sets. And it's very hard to automate this process, in particular what we see in the preprocessing stage: how do you clean and wrangle data into a form that is digestible by these analytics or machine-learning algorithms? These skills can take domain experts years or even decades of experience to acquire. That's why a human-in-the-loop perspective is very important: humans provide a very valuable set of knowledge to guide these machine-learning algorithms.
[music] Yeah. It's really a way to get the best of both worlds, the things that humans can't quite do in terms of the massive computation and then the things that the computers can't do, those intangible insights that domain experts can bring. So it's a magic combination [laughter].
Well, you talked a little bit in that human-in-the-loop paper about different levels of human involvement in the AutoML system. So there's this idea of a user-driven system, the cruise-control system, and then an autopilot. Could you talk a little bit about each of those just briefly and how you see those different levels of collaboration functioning?
Yeah, definitely. So in the mixed-initiative paper about AutoML, we had this analogy to self-driving cars, in the sense that when we think about the cars that we have today, they're largely user-driven. You're driving with a stick and a wheel, and you have control over the steering. And then you think about the next level of automation, which is cruise control. That's a nice-to-have feature that introduces some level of automation. And then, obviously, there's also the autopilot, which is kind of a vision of the future. And so similarly, when we're thinking about the process of machine-learning development, there are user-driven tools that allow users to specify exactly, "This is the model that I want to use. Here are the hyperparameter settings. Here are the ways that I want to compute my metrics," and things like that. And that's the user-driven situation. That is not to say that there's no automation being done at that phase. Even when we think about the actual car scenario, drivers don't have to think about how the gas goes into the piston and drives the combustion engine. So there is some sort of automation even at the lowest level. The analogy in terms of the machine-learning tool is that there are these existing, very popular frameworks that have these models already coded up for you. There are certain parameters that you have to feed in. That's what we mean by the first level. When we talk about AutoML tools, they're sort of already at the third level, which is like autopilot. And we're kind of taking a step back in terms of thinking about a human-in-the-loop perspective. That's the second level, the cruise control, that we're thinking about. So going to the third level--
Actually, hold on.
--of the autopilot--
Let's go over that one more time.
--it's the AutoML--
Yeah. So Doris is explaining three different levels of assistance for people using these automated tools, and her research paper gets into a lot more detail on this. But basically, she's saying that the first level of automated help for data analysts is some of the basic tools we have that already contain models that are sort of precooked for you. You just have to figure out some of the right parameters to use. But even that decision takes quite a bit of knowledge to figure out good parameters that will generate a useful model. What if you don't feel 100% confident with that? So there's a second level then, what she's calling cruise control. You still have to know how to drive and navigate. So that human role is really important, but there's less decision-making along the road. This second level is where Doris is really focused, making it possible for people with a little less experience maybe to still create models and to still get some guidance along the way. There's also then that third level, the fully automated level, which is like total autopilot, just the computer takes over and handles everything. It does the takeoff, the navigation, landing, all of it. And that sounds really cool, but as Doris is going to explore, maybe we kind of lose something if we just rely on autopilot. There's still definitely benefits to including human expertise in this process.
Gotcha. Yep. That makes sense.
So you feed in a data set and the task that you're interested in. So for example, I'm interested in classification or regression. And so you select the target task and the columns of interest. For example, I might want to predict the survival rates of people on the Titanic.
So just real quick here, a fun fact. That might seem like a totally random Titanic reference, but the details on the people who survived or didn't survive the Titanic actually make up a super popular data set for people trying out data science skills. So on Kaggle, a data science competition website, there are over 100,000 entries in the competition where people try to build models that can predict what happened to the Titanic passengers.
This is a very classic machine-learning task. And so that's kind of the level of input that you give it, and you also give it the data set. Some systems allow you to select what models you're interested in, and then the system goes in and spits back a model that you can then run your predictions on. And so our human-in-the-loop perspective, at the second level, the cruise-control level, is essentially: what are some additional ways that we can allow users to communicate to the system about their intents and their usage, in order to expose more of how the search is actually done and the resulting model that comes out of it? And in particular, I think there are two main challenges here. One challenge is going from the user to the system: how do you effectively communicate the user's intent, in terms of their problem specification or their domain knowledge, to the system in a way that the system can take that information and operationalize it? And the other is going the other way, from the system to the user: how do you effectively communicate the result of the model or of that search process, and why certain models were picked over others, or why certain parameters or modeling decisions or preprocessing procedures were used over others? How do you communicate that from the system to the user so that you have this feedback loop and this two-way dialogue between the user and the system?
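As a minimal sketch of that select-a-target-and-fit workflow, here's a miniature, fabricated stand-in for the Titanic task. The rows below are invented, not the real Kaggle data, but the shape of the problem is the same: you pick a target column and the system fits a classifier to the rest.

```python
# Sketch of "feed in a data set, pick the target column, get a model back."
# The passenger rows are made up; the real data set lives on Kaggle.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "pclass":   [1, 3, 3, 1, 2, 3, 1, 2] * 10,
    "sex":      [0, 1, 1, 0, 0, 1, 0, 1] * 10,  # 0 = female, 1 = male
    "age":      [29, 22, 40, 35, 27, 19, 50, 31] * 10,
    "survived": [1, 0, 0, 1, 1, 0, 1, 0] * 10,  # the target column
})

# Split off a held-out set, train, and report accuracy on unseen rows.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns="survived"), df["survived"], random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```

An AutoML system would automate the parts a user still had to choose here, such as which model family and which hyperparameters to use.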
Yeah. That was something I was wondering about, which is what kind of baseline knowledge you still need a user to have in order to use either that cruise control or autopilot level? I mean, they still have to be able to interpret some of those basics that are going to come back from the system, as you're saying.
Yeah. And I think that abstraction level really depends on the design choices of these tools. One of the motivations for these AutoML tools is to democratize the data science process and enable domain experts who might not know the details of specific models to use AutoML. And then, in addition to these three levels of automation, another way to think about it is: how much control, or how much knowledge, do you have about these machine-learning systems or data science? I think the holy grail is some middle ground in between, where users can specify these domain concepts and knowledge at a very high level, and then have the system translate these domain or problem requirements and operationalize them into actual model decisions.
Yeah. I think you used the phrase, "The system could be a personal coach," at one point.
I thought that was a pretty cool concept.
Oh, I love that.
Right [laughter]. I think we could all use a personal coach for some of this stuff. I mean, the idea of having a hands-on guide for your data science projects, that's pretty cool [laughter]. Awesome. So you've got that awesome Medium article about all the different ways that people have approached the problem of automating data viz, and I'm sure we can link to it in the show notes. And you talk a little bit about trust and safety. Can you explain what that means to you in this context, and maybe for AutoML more generally?
Yeah, definitely. I think one of the challenges, when you bring in automation, is that you necessarily have to introduce some abstraction, because the users no longer have full control over the nitty-gritty details of what exactly is going on under the hood. And so when we're thinking about these automated or human-in-the-loop systems: does the user trust the model that is being used to generate the results that they're seeing? Are they able to understand, at a very high level, what is going on under the hood, so that they still feel some sense of control, or are able to control certain aspects of the system? The safety issue is also kind of interesting, and I think it's related to trust and how we perceive and interact with these automated systems. So in some sense, when we have these automated tools, we are enabling users who might not have data expertise or programming expertise to work with data, and that's a great thing. It allows domain experts to contribute to the data science process without having to have formal training in these fields. That enables the democratization of data science to the large majority of people, working professionals. However, that also introduces a danger. In the machine-learning community, there are concerns about misusing machine-learning tools to unintentionally make mistakes or create charts that might be misleading to the user.
And so there is kind of this safety concern in data science. The fear is that by giving people who don't have statistical or data expertise such powerful tools, like automated data-visualization tools or AutoML tools, they might unintentionally create models or visualizations that, for lack of a better word, are not safe, in the sense that there might be unintentional biases or mistakes in the data artifacts that they end up creating. Part of the difficulty in creating these tools is making sure that the artifacts generated by these automated or human-in-the-loop systems are, in some sense, safe, and that responsible design choices come out of these systems.
And tell me more what you mean by that, responsible design choices.
So there was an example. This ProPublica article essentially talked about how there was a criminal risk-assessment algorithm that was being used to predict rates of recidivism in criminals. And they found that there was a systematic bias against African Americans in this algorithm that reflected the systematic bias in the data being fed into the system.
Wow. So how did this happen? What's the backstory here?
Yeah. This is a pretty worrisome example of what can happen when you don't have humans carefully considering every aspect of an algorithm. There's software out there that tries to predict, if someone accused of a crime is set free, whether that person might actually commit another crime while they're out on the street. That risk assessment can be used to set bail amounts or even sometimes to determine someone's sentence. These are pretty big decisions that really affect that person, so the algorithm's scores seriously matter. What the journalists at ProPublica found when they researched these algorithms and scores was that not only were the models not very good at predicting whether someone would commit a new crime, but they also had another huge problem: they gave unfair predictions for black and white defendants, overpredicting future crimes for the black defendants and incorrectly assigning lower risk scores to the white defendants. They basically managed to automate racial bias.
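The disparity the journalists measured can be illustrated with a tiny, fabricated example. The sketch below compares false positive rates, that is, people flagged high-risk who did not actually re-offend, across two made-up groups; every number is invented purely to show the calculation, not drawn from the ProPublica data.

```python
# A toy audit: does the "high risk" flag fire more often, among people who
# did NOT re-offend, for one group than another? All values are fabricated.
import pandas as pd

df = pd.DataFrame({
    "group":      ["a"] * 6 + ["b"] * 6,
    "reoffended": [0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1],
    "predicted":  [1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0],  # 1 = flagged high risk
})

# False positive rate: flagged high-risk among people who did not re-offend.
for name, g in df.groupby("group"):
    negatives = g[g["reoffended"] == 0]
    fpr = (negatives["predicted"] == 1).mean()
    print(f"group {name}: false positive rate = {fpr:.2f}")
```

In this fabricated data, group "a" gets a much higher false positive rate than group "b" even though both have the same re-offense rate, which is the kind of gap the real audit surfaced.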
Yikes. Yeah. That's definitely problematic.
Yeah. Automatically predicting with a model whether someone might re-offend, on the surface, seems like a super-efficient use of resources. It seems like the numbers wouldn't lie, and we'd know who needed a longer sentence or a higher bond or more rehab or whatever. But in reality, you need more nuanced human judgment in this process. Then you can determine which variables should actually be included and how. You need humans thinking deeply about the data, what they really mean, and what their ethical and fair uses actually are.
And so part of the appeal of these automated tools is that you spend a lot less time in the modeling and data science process to obtain that result. But the danger is that because you didn't go through that process, you might have overlooked some aspect that was really important, for example, biases in the data, or errors in your data, or missing values that you weren't aware of. And so when we're developing these human-in-the-loop systems, it's really important that there is some sort of safeguarding feature, or explanations that communicate to the users the underlying assumptions that go into the models, so that the users have a better understanding of what they can do and what they can say about the artifact that comes out of these automated tools.
Right. Yeah. It's interesting to think about all the time we generally spend complaining about data cleaning and wrangling and so forth; that's actually also reflection time to think about some of those other issues that you're mentioning. How do you still incorporate that reflection, and some of those warnings that might come up, into an automated process that can coach the human through dealing with those questions? So yeah, that's a whole other dimension that I hadn't thought about.
[music] So we need tools that keep humans around and that also encourage good judgment about data. But we also want to make it easier for people with different skill levels to do data science, and also help them cut down on the parts of the process that maybe aren't the best use of their time. So this happy medium, this holy grail that Doris mentions, what does that look like?
[music] There are a lot of really cool tools out there that are striking this balance. Alteryx's Assisted Modeling that we mentioned briefly earlier, that helps someone who's never built a model before start with a data set, figure out a modeling strategy, and then choose among different models. There's also this powerful concept now of analytic process automation, which Alteryx is really into, that data tools can streamline the analytics process so that people can spend more time thinking, which is what we have to bring to the table, right? Everyone can focus less on which parameters to plug into a model and more on what the model is for, whether you're using the right analytic approach and how we're going to use the model, these bigger questions. And as we touched on earlier too, this makes it possible to democratize data science even further. So the idea is that modeling and automation can be much more accessible. We can free people's minds up to focus on their domain expertise and those aspects of data science that demand human judgment. There's also some other neat stuff in the works that would make even expert-level data science approaches available for all of us. I asked Doris about what she sees as the next steps beyond what we even have now.
So toward the end of the medium article, you say that you are looking forward to a constructive and sustainable future in human and machine collaboration for data science. And I'm reading that because that says it better than I can [laughter]. So what does that future look like to you at this point? What kinds of things are you hoping to see developed in the near future or distant future that will really enhance that human-machine collaboration in the world of data science?
These human-in-the-loop data science tools that we're developing, we can think about these automated tools as a way of encoding what experts, people who are very good at doing machine learning or data science, are doing in their practices: being able to learn best practices from how people do data science, and then encoding that knowledge into these automated systems so that they're more interpretable and pull together what we know best about machine learning or data science. To help lower the barrier to entry in doing data science, one of the things that we are looking at is the collaborative tools that are out there, for example, GitHub or Jupyter Notebooks or JupyterHub, that allow people to upload their data science workflows onto a collaborative platform and share them with collaborators as well as the public. The data science culture is very open and collaborative in terms of sharing best practices. And so there's this more meta thing, which is: can we crowdsource or extract what people are doing in data science in order to design systems to help them in that process? So for example, if we find that users are really struggling to come up with good preprocessing procedures for a particular type of data, maybe textual data, then we can go in and design our tools in a way that recommends these interventions or suggests, "Hey, maybe you want to take a look at this preprocessing procedure that you could use to improve your data science pipeline." That crowdsourced body of knowledge can also help guide users towards better data science. Because data science is such a new field, it's still kind of a dark art [laughter]. There are a lot of things that people are doing that you don't really understand why, but it kind of works.
And so it would be really interesting to see how we could consolidate some of those practices and come up with a better data-driven approach to recommending some of these practices to the users to guide how they do their data science.
That's really cool. And one of the things that I find interesting about what you're saying is we can learn not only from the positive examples where people are sharing things that worked, but we can also learn from people's mistakes and sort of seeing--
--okay, here are things that we're struggling with, and here's how we can refine and educate around those things that are tricky.
Yeah. And I think, ultimately, both the positive and the negative examples are very useful, because they highlight what works and doesn't work in different scenarios. And we can now bridge the gap between what data science experts are doing in their practices and what less experienced users are doing, and use that to educate or help guide users who might not have that level of expertise. And so now you can bridge that knowledge gap.
Okay. So I think we've established that humans should always be in the loop. But just for the sake of clarity, I'm curious if there's a serious possibility of humans actually being omitted from this data science process completely. In some industries, people talk about automation as leading to losing jobs and so forth. Is that going to happen here?
Right. I know what you mean. When I think of automation, I usually picture robots replacing assembly-line workers and situations like that. But it doesn't seem like Doris thinks that's a likely thing in data science. Here's what she said about that scenario.
These automated tools and these human-in-the-loop systems are not meant to replace the manual approach to data science. The manual approach is still very valuable in high-stakes situations where you care a lot about the privacy of the data, or where you have highly regulated environments like credit-risk scoring or diagnostic models or criminal risk assessment in the justice system. And so automation is obviously not the answer to everything. The goal really is to use automation to help with the manual parts of data analysis or to suggest useful things to the analyst. But at the end of the day, it shouldn't replace human decision-making.
Sure. I think that's a great point. I mean, we've certainly seen these articles floating around on the internet about, "When will data scientists be obsolete because AutoML is coming?" But it sounds like you think that's probably not a realistic future.
Yeah. I think that we're just at the beginning of this dawn of data science and of being able to use data in a way that is effective and helps us better understand our society and the world that we live in. I don't think that data scientists will go away. But they might be more efficient when they are working with these human-in-the-loop or automated tools. That would enable them to go for the meatier questions, or look for a broader set of problems that they can solve, because they have more time. Data scientists wouldn't go away, but they would become more productive in using their skills and the domain knowledge that they have to tackle a wider range of problems.
Yeah. Certainly, being able to focus on those critical thinking skills and human insights would be very welcome for a lot of folks.
[music] This was a lot of fun. Thank you so much for joining me, Susan.
Yeah. Thanks for having me.
Thanks for listening. You can find our show notes and learn more about analytics process automation at community.alteryx.com/podcast. And join us on social media using #AlterEverythingPodcast. Catch you next time. [music]
This episode of Alter Everything was produced by Maddie Johannsen (@MaddieJ). Special thanks to @LordNeilLord for the theme music track for this episode.