Data Science Mixer

Tune in for data science and cocktails.

What’s next for AutoML systems? We’re joined by Kalyan Veeramachaneni, who is a principal research scientist and faculty member in computer science at MIT, as well as an Alteryx fellow working with the Innovation Labs team. Kalyan shares exciting possibilities for AutoML systems, and explains that the future for these systems is right around the corner. 

 

 


Cocktail Conversation

 


 

What do you want to see in the future of AutoML? Are you excited by the fully automated system that Kalyan describes? Are there particular tasks you would really like to have automated or maybe some you want to hang on to for yourself?

 

Join the conversation by commenting below!

 


Episode Transcription

SUSAN: 00:00

Hello and welcome to your fully-automated data science system. Please take a look at the range of questions and potential data science problems you and I could tackle today. You can flip through the images below that describe various tests we can pursue, given the data I have on hand. I have also selected these tests for you based on your past history, your industry, and questions that data scientists like you have explored. Be sure to rate my recommendations, so we can continue to develop a productive and enjoyable automated machine learning relationship.

SUSAN: 00:36

[music] Are you ready to start your data science projects that way, interacting with a fully-automated machine learning system that knows you and your data well enough to know what you might like to work on today? Welcome to Data Science Mixer, a podcast featuring top experts in lively and informative conversations that will change the way you do data science. I'm Susan Currie Sivek, senior data science journalist for the Alteryx community. For today's episode, I talked with Kalyan Veeramachaneni who is helping map out and build the future of automated machine learning systems. We walked through some of the exciting potential and challenges of building AutoML systems that can do a lot more than build and evaluate models. In fact, they can do it all. We're talking about AutoML systems that can also formulate problems, clean and visualize data, build meaningful training and test sets, construct features, communicate results, and even make recommendations. Wow. And you might be surprised to hear, that kind of comprehensive AutoML, it's not that far away. I'm so excited to share this conversation with you and get you wondering and maybe dreaming about these systems too. Let's meet Kalyan.

KALYAN: 01:50

Thank you for having me. I'm Kalyan Veeramachaneni. I'm a research faculty member in the computer science department at MIT, and I lead a group called Data-to-AI at MIT. And I also joined Alteryx as a fellow two years ago, and I work with the Innovation Labs team very closely at Alteryx.

SUSAN: 02:10

Awesome. Fantastic. And would you mind sharing with us as well which pronouns you use?

KALYAN: 02:15

Him, he, his.

SUSAN: 02:17

Okay. Great. Thank you. And as you may know, on Data Science Mixer, we often try to have a special beverage or a snack or something while we're chatting. So do you happen to have anything there with you today?

KALYAN: 02:28

I have coffee. I love coffee, so.

SUSAN: 02:30

Yay. Yes. Any kind of special coffee or just straight up the hard stuff?

KALYAN: 02:35

Straight up regular black from Starbucks, that's all I have all the time.

SUSAN: 02:41

Excellent. Yes, same here, French roast, always, yep, just black, first thing in the morning. So I've actually moved on to my second round of caffeine, and I'm now having some double bergamot Earl Grey. So that's my second dose of caffeine every day. Awesome. Great. Well, there's so many different things that we could talk about. You've worked on so many different projects in your research and your different collaborations. But one thing that I thought was especially interesting that has come up recently is a paper that you recently published with coauthors in ACM Computing Surveys where you talk about the development of AutoML tools, and this is a really neat, deep, comprehensive look at AutoML. And I think it's super interesting because it really lays out a taxonomy for thinking about those tools, and that will really, I think, help people think about the future of how AutoML is going to develop. So as we're thinking about moving toward a completely automated machine learning system, that kind of taxonomy, I could see being very useful. So the paper was super thought-provoking and very readable, which, for any academic paper, is a major accomplishment. So yay. And we'll put a link in the show notes to the published and preprint versions, so folks can check it out as well. But if we could, I would love to just walk through some of the concepts in the paper and talk through what you're seeing in the future of AutoML. And maybe a good place to start is just kind of with a simple question, or maybe it's not a very simple question, which is, for you, what is AutoML exactly as you would define it? And what motivated you to work on this paper, to write about it, and explore it so deeply?

KALYAN: 04:20

Oh, thank you for that question. I think it is hard to define AutoML because that definition has been evolving over time. What has not evolved is the mission, which is, "How can we make more people in society use ML, machine learning, and data to optimize operations or make things more efficient, make things more equitable, make things more accessible or available?" These range from a lot of commercial applications to a lot of societal applications. So making that, making society able to use AutoML, is the sole purpose. But what we started considering - what part of that process we would like to automate, like the machine learning solution development - has evolved over a decade. So in this paper, we started going through our journey in 2010 when we realized that there is an immense need in society to use machine learning, but everybody's stymied from being able to use it because a lot of that was research work or buried in papers; there's a lot of mathematics. And so at that time, we said, "Okay. Could we automate some parts of it? Could we build tools, essentially, to enable people to use a lot of the research and mathematics that's in the labs or research settings?" And then we, over five years - by we I mean a larger research community, including my group and a lot of folks that I work with - tried to automate a lot of things and tried to provide tools and the ability for people to do it. Python became a very popular language among data scientists and machine learning folks, so we started building all the tools in Python.

KALYAN: 06:07

And lo and behold, in 2015, 2016, we had all those tools to automate every part of the machine learning process, machine learning solution development, if you will, a process ranging from how do you prepare data, how do you extract features out of the data - which features are just, historically, as my dad would say, "Variables. Now, you're calling them features." So they're just variables that describe what's going on, a phenomenon. How do you automatically do that without having to write a lot of software? How do you explore different machine learning models? How do you tune them? The machine learning models have a lot of hyperparameters that control their behavior or that control how they interact with data. So how do you tune them to maximize their accuracy or performance? So there were so many tools that we produced by the 2015-16 timeframe, and then so many more were in the pipeline at that time. So the AutoML definition started-- originally, it was just about modeling, and then it expanded to data preparation, feature engineering, and then the whole sort of pipeline, end-to-end pipeline of machine learning solution development for a problem. And then we said, "Okay. So now, we have all the tools, so it should be very easy for anyone to start using machine learning, right, so any data scientist or a domain expert, experts that have very sort of minimal software engineering experience or have software engineering experience but are now focused on different things in their career." And lo and behold, we found that it's still not possible. It's still hard. And I think we would have only figured that out if we did actually make all those tools and say, "Okay. We took care of all the problems that we thought were the bottlenecks." So from 2016 onwards, we started uncovering a whole set of new problems when you interact with domain experts.
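
To make that concrete, here is a minimal, generic sketch of automated model selection and hyperparameter tuning in Python with scikit-learn. It is only an illustration of the kind of automation being described, not the specific tools Kalyan's group built; the dataset, pipeline steps, and search space are all arbitrary choices for the example.

```python
# A generic sketch of AutoML-style automation: wrap data preparation and a
# model in one pipeline, then search over hyperparameters automatically
# instead of tuning them by hand.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),                        # data preparation step
    ("model", RandomForestClassifier(random_state=0)),  # modeling step
])

search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "model__n_estimators": [100, 200, 400],
        "model__max_depth": [None, 5, 10],
        "model__min_samples_leaf": [1, 2, 4],
    },
    n_iter=10,
    cv=5,
    random_state=0,
)
search.fit(X_train, y_train)
print(search.best_params_)
print(search.score(X_test, y_test))  # held-out accuracy of the best configuration
```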

SUSAN: 08:04

Interesting. And I know that's another term that you spent a little bit of time on in the paper as well, this idea of what is exactly a domain expert. So what does that term mean in your system of thinking about AutoML?

KALYAN: 08:15

That's actually a very good question, too, because I was in a meeting at one of the DARPA programs where the program's goal was the same mission, increase usage of machine learning. And all of us were trying to say, "Well, we want to build this tool for the domain expert. We want to build this tool thing for the domain expert." And there was a whole bunch of people ranging from machine learning, applied machine learning, HCI, databases; it's a lot of my peers and colleagues. And then at some point, I popped in and said, "What do you mean by domain expert?" And somebody said, "Oh, I mean the machine learning folks." And then I asked someone else, "What do you mean by domain expert?" They said, "Well, we mean somebody else. We mean the researchers." So everybody had a different definition. So ultimately, we concluded that-- at least I concluded, and I want to sort of make sure that that's what we are talking about, is that a domain expert is somebody that is working in the domain whose problem we are trying to solve using machine learning. So if you are talking about sales and marketing, they are experts in sales, and it's their problem that we are trying to solve; they're trying to solve their problem. So that's what we mean by domain experts. So right now, in my group, we interact with operators that operate satellites and monitor them, or folks that look at time-series signals from wind farms and continuously monitor them and try to optimize operations. So those are the domain experts. They're sort of embedded in that domain, and they care about that domain a lot. And so it's them we are trying to serve by making machine learning concepts easy to use. That's our audience. That's our goal.

SUSAN: 10:01

Yeah, yeah, that makes sense. So in the paper, thinking about these domain experts and putting these tools in their hands, you and your coauthors talk about the seven levels of AutoML and the various progression that we've seen as we've moved towards sort of that highest level, which would be a fully-automated end-to-end solution for handling the entire machine learning process. And so maybe you could give us-- don't feel obligated to go through every single level, but maybe you could give us kind of the overview of why that set of levels came about and how you thought about it as sort of a thinking tool for organizing the tools that are currently available.

KALYAN: 10:39

Yeah, I think when we started looking-- when we started expanding this definition, we started uncovering more problems that were not making machine learning easy to use. Even after having all this tooling, we started to realize, "Well, there's a lot of other things beyond what we originally defined or went after." And then it became clear that there are other tools that need to be developed; there's other automation that needs to be done. And so that led to those seven levels that we defined in the paper. And one of the big things that we started to identify were two bottlenecks. And in the whole process, even after having all the tooling to do the machine learning model development, the first bottleneck was how does one translate a business problem into a machine learning task? That itself was an interaction between an applied machine learning engineer or a data scientist and the domain expert. So it was a lot of back and forth, and it involved figuring out, "Do we have data for this? Do we not have data for this?" But also, the problem at the business level comes up as something like, "Well, we are spending too much on going back and forth, which--" I'm just giving you an example, "Going back and forth to the wind farm, trying to replace these parts." Well, then you would have to translate that into, "Well, should we try to predict in advance which parts are going to fail?" So that translation is coming from a lot of back-and-forth between the data scientists and domain experts. So we thought about, "Is that process even-- can we take that bottleneck away? Is there any way we can create tooling to help with that? Can we automate that?"

KALYAN: 12:27

It turned out that we had a completely incorrect way of looking at automation. We always thought of automation as just a tool that we can provide, and then suddenly, we will be able to replace a process that was done by people. In this case, it turned out that automation can only enable. It's an enabler, not a replacement. So we created a tool called composeml. It's actually open source, and it's with Alteryx Innovation Labs as well. And in that tool, we allow people to specify machine learning tasks and interact with the data and figure out, "Can the data do something on that task," right, "Is there a possibility? Do we have enough training examples? Do we have--" There are a number of questions one can ask. And that sort of takes care of some level of-- reduces some bottleneck in that back-and-forth. It doesn't completely eliminate it, but it reduces some bottleneck in that. Then we thought about the next part where we said, "Hey, if we can look at the data and automatically synthesize prediction problems or machine learning tasks, can we take those synthesized machine learning tasks back to the domain experts and say, 'Could this help in some business problem or not?'"
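
As a rough sketch of the idea behind a task-specification tool like composeml, here is a simplified pandas version. This is not the actual Compose API; the turbine events, window size, and labeling function are all hypothetical. The point is that once a prediction task is specified as a labeling function over time windows, you can immediately check how many training examples the data supports and how balanced they are.

```python
# Simplified illustration (not the composeml API): specify a prediction task
# as a labeling function over time windows, then count the training examples.
import pandas as pd

events = pd.DataFrame({
    "turbine_id": [1, 1, 1, 2, 2],
    "timestamp": pd.to_datetime(
        ["2021-01-01", "2021-01-08", "2021-01-15", "2021-01-01", "2021-01-08"]
    ),
    "part_failed": [0, 0, 1, 0, 0],
})

def will_part_fail(window: pd.DataFrame) -> int:
    """Labeling function: did any part fail within this window?"""
    return int(window["part_failed"].any())

labels = []
for turbine_id, history in events.groupby("turbine_id"):
    start = history["timestamp"].min()
    end = history["timestamp"].max()
    while start <= end:
        window = history[
            (history["timestamp"] >= start)
            & (history["timestamp"] < start + pd.Timedelta(days=14))
        ]
        if not window.empty:
            labels.append((turbine_id, start, will_part_fail(window)))
        start += pd.Timedelta(days=14)

labels = pd.DataFrame(labels, columns=["turbine_id", "cutoff_time", "label"])
print(len(labels))                     # do we have enough training examples?
print(labels["label"].value_counts())  # are the classes badly imbalanced?
```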

KALYAN: 13:43

So it's flipping the question. Instead of them coming to us with a business problem and then us trying to map, can we just show them a lot of prediction tasks or machine learning tasks that we could solve using the data? And can we ask them, "Is this useful? Could you think you can use this particular prediction task if we were to build a model for this?" So that was sort of level six or level seven automation. So as it's sort of going up and up and up towards higher levels, it became a lot more about making that interaction more efficient, so providing tools and creating tools to make those interactions more efficient. If you go lower and lower in the level, you start taking off the tasks that were traditionally done by people manually into something that could be now done automatically with software. So that was the whole sort of rationale coming up with seven levels and then dividing what we have already available into those levels and so on and so forth.

SUSAN: 14:46

Yeah, that's so interesting. And I thought some of the issues that came up in the paper around problem formulation, problem recommendation or task recommendation, and how that would be translated back and forth between the system and the human-- I thought some of those challenges were especially interesting. Could you talk about some of those issues and some of the things that you've already developed to try to address those problems?

KALYAN: 15:08

Yeah, yeah, absolutely. I think one big interesting observation that my group had a couple of years ago was that if you look at the data, coming up with a prediction task or a machine learning task is very mechanical. So that's how we automate it. So we can programmatically describe some constructs and create a language. We even call that prediction task language. And so we can create a lot of prediction problems or machine learning tasks that machine learning modeling can solve. But then it turned out that language itself is very-- it's a very programmatic language or a mathematical language. So then we had to translate that into a natural language expression so that it can interface with the domain experts, who may not understand what we are writing in software, or it's not-- you can't present the coded language to them. So that itself led to a lot of work in terms of, "How do we express these problems? So now, we have a machine learning task. But can we express this problem as a natural language-- in natural language so that it's presentable?" And that led to a little bit of research. And it was a little bit of a problem to solve that, but we were able to get over that with my team. I think the second bigger problem was, and this is true in theory for any automation in AutoML, that as you automate a process, in this case, generation of this prediction problem, you end up with thousands of solutions or thousands of things that could pop up. So we can't overwhelm the domain expert, "Hey, is this useful? Is this useful? Is this useful?"
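
As a purely hypothetical sketch of that translation step, here is one way a machine-generated task specification could be rendered as a natural-language sentence for a domain expert to evaluate. The task fields and the template are illustrative; they are not the actual prediction task language Kalyan's group defined.

```python
# Hypothetical example: turn a structured prediction-task specification into
# a plain-English sentence a domain expert can react to.
task = {
    "entity": "appointment",
    "target": "no_show",
    "window": "before the scheduled time",
}

def describe(task: dict) -> str:
    return (
        f"For each {task['entity']}, predict whether '{task['target']}' "
        f"will occur, {task['window']}."
    )

print(describe(task))
# For each appointment, predict whether 'no_show' will occur, before the scheduled time.
```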

SUSAN: 16:41

Thousands of possibilities.

KALYAN: 16:41

Thousands of possibilities. So then it's like, "How do we make it more interactive?" So we show a problem saying-- we show a machine learning task and say, "Hey, we can solve this using the data. We can predict this. Can you tell us if this is useful or not?" And if they give us feedback, then we have to take that feedback and figure out what to recommend next. If they said, "This is not good," thumbs up, thumbs down, we can go back and figure out what to recommend or show next because we can't just show 1,000 problems to them. Then the other thing we realized is that even within domain experts, with the same dataset, everyone is coming from a different area, if you will, or a different goal, right? So the same data, same domain, one expert wants to optimize the operations; one expert may try to optimize the power generation of the turbine; another expert wants to understand sort of the controlled setting, how they're influencing-- so everybody has slightly different goals. It's the same domain and same data but slightly different goals. So when you build a recommendation system, it's not something that will work for everyone, so it has to be interactive. So it's like your movies. All of them--

SUSAN: 17:55

That's what I was just thinking.

KALYAN: 17:55

--go to Netflix, but we have slightly different preferences to what we are looking for. And that's one of the challenges. So those two challenges we haven't solved. They're still open challenges. We haven't been able to fully solve them yet.

SUSAN: 18:08

That's interesting, though. And it's really exciting to kind of imagine interacting with your data in the form of clicking through task options or having some sort of little tile, like a Netflix movie cover image, where you could see different possibilities and things. I would imagine, too, that that would potentially surface things that people wouldn't think of themselves, possibilities that they wouldn't necessarily have recognized on their own in their data, which is kind of cool to think about.

KALYAN: 18:38

Exactly. I think to give you one concrete example, if you take sort of doctor appointments and scheduling data, the data that comes from the doctor's scheduling, so patients schedule visits, and they do or don't show up at the visit, and so on and so forth, one prediction task could be that we can predict for a particular appointment whether someone will show up or not, right? So that's one task that we can generate. The other task we can generate from the same data is, in a day of 8 appointments or 10 appointments, how many we think will be no-shows. So it's a slightly different problem, and it can lead to different goals and different uses, right? So in the first one, we are telling exactly which appointment will be a no-show. In the second one, we are saying what percentage of the appointments will be a no-show. We are not telling you which one, but we're telling you what percentage of the time-- percentage of the appointments will be a no-show.
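
Here is a small sketch, with made-up data and column names, of how those two tasks can be framed from the same appointment records: one as a per-appointment classification label and one as a per-day no-show rate.

```python
# Two different prediction tasks generated from the same appointment data.
import pandas as pd

appointments = pd.DataFrame({
    "appointment_id": range(6),
    "date": pd.to_datetime(["2021-06-01"] * 3 + ["2021-06-02"] * 3),
    "patient_age": [34, 71, 52, 45, 29, 63],
    "no_show": [0, 1, 0, 0, 0, 1],
})

# Task 1: classification -- each row is one appointment, label = no_show.
task1_features = appointments[["patient_age"]]
task1_labels = appointments["no_show"]

# Task 2: regression -- each row is one day, label = fraction of no-shows.
task2_labels = appointments.groupby("date")["no_show"].mean().rename("no_show_rate")
print(task2_labels)
```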

SUSAN: 19:37

That's so interesting. And I think in the paper as well, there's mention of incorporating business outcomes as a potential factor in that recommendation system. And I imagine those would differ, too, based on the role of the person who's using it. Is that kind of the idea?

KALYAN: 19:52

Exactly. I think you talked to-- based on the role, they may decide to do something different with the prediction they have, as well as based on their objectives. And these also give them a hint as to what action they could take. So in the example I was giving you, if you're predicting exactly which appointment will be a no-show, they can take an action of sending reminders or trying to fix that or trying to double-book it, if you will, right, so that the slot does get used by someone else who's in need. If we give what percentage of the appointments will be a no-show, you can solve different kinds of problems with it. So you can decide maybe there are some other resources that you would like to preorder, or you only want to order sort of 8 things instead of 10 for that day because you know that only 8 will show up. I'll just give you an example, maybe coffee. You only want to order 8 cups of coffee for your patients when they show up, right, instead of ordering 10. So for different prediction tasks, there are different experts and different reasons for them, and they also lead to different actions. So what we imagine, if we've solved this problem of being able to recommend and show prediction tasks and say, "We can solve this. We can solve this," we imagine that it will trigger a lot of thinking in domain experts' brains, and they will start thinking, "Huh, I could use it to solve this other problem." So some problems that they haven't even thought of matching to the data, they will be able to try to address.

SUSAN: 21:28

Yeah, that's really cool. I'm still thinking about having my doctor sign up for this. They can have coffee waiting for me. I'm obsessed with that now. I like this idea. Always, coffee is a great example. So the other end of the process then, and I think we've been alluding to this a little bit already, but the interpretation of the results that are generated, what does that look like in a fully-automated system? Because I'm sure some folks listening to this right now will be like, "Well, we still need the people to understand what to do with all these results." So what form would that take? Or how would that be interpreted?

KALYAN: 22:03

I think at the other end, once the machine learning model does produce an output, a prediction or an insight or something of that sort, we also started to recognize that a lot of those outputs will become part of a workflow or a process that domain experts already have. So it's an augmented input to their decision-making. And as a result, we realized, "Okay. So the first question obviously was how did you come up with this prediction? How did the model come up with this prediction?" So that was solved by a lot of tooling now available in explainable AI, so people started to figure out how to backtrack how the model came to the decision and how to present that visually. But we started to realize there's a deeper question that the experts are asking. They're not necessarily concerned about how exactly the model did it. I think they're asking more about-- so not the mathematical process in which the model did it. I think they have this sort of, "Well, I think about it this way. Why are you different from my thinking?" So instead of telling them, "This is the pathway the model took-- or the decision pathway the model took," you have to understand where their disagreement could be coming from, what concerns they have. And that part is-- we don't have tooling for that yet. So that would be truly an interactive system where a domain expert is sort of asking, "Well, in the past, for this particular patient or for this particular turbine, this is how things had happened, so why is it that you're giving me a different result?" So trying to figure out where their lack of trust, where their questions are coming from, we don't yet have tooling for that. So we have just about started to work on that sort of tooling.
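
As one concrete example of that first generation of explainability tooling, here is a minimal sketch using the open source SHAP library to attribute a single prediction to its input features. SHAP is our illustration here, since the episode doesn't name a specific tool, and the dataset and model are arbitrary.

```python
# Backtracking one prediction: which features pushed it up or down?
import shap
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor

X, y = load_diabetes(return_X_y=True, as_frame=True)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X.iloc[:1])   # contributions for one prediction
print(dict(zip(X.columns, shap_values[0])))
```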

SUSAN: 23:54

Interesting. And I know a number of times in the paper, you and your coauthors mentioned the need for a lot of human-computer-interaction understanding around these issues because it seems like the question that you just raised is really kind of a psychological, paradigm-of-thinking kind of question: people have one way of thinking about something, and the system presents something else. How do you reconcile those two things with the user interface? That seems like a very tricky thing.

KALYAN: 24:22

It is. It is. And it is very tricky, and it changes from domain to domain. So what works in one domain may not work in another domain. And so that's where I think a lot of domain-specific understanding, and being able to configure these explanation tools so that they work with that domain, will be very important. I think the other-- one of my peers and colleagues once brought this up to me, saying that in the domain they were working with, it matters a lot whether you show the prediction before the person has, in their workflow, put in some comments about the case that they're evaluating or doing some work on, versus whether you show the prediction after they write their result, so either before or after. Because if you put it before, it has sort of an anchoring effect, so it can bias their decision-making. If you put it after, it can create a sort of disagreement. So there are a lot of, as you said, psychological factors, and they change from domain to domain. In some domains, they don't show up. In some domains, they do show up. So there's a lot of work that still needs to be done there.

SUSAN: 25:32

Yeah. All kinds of possibilities. So a couple of broader questions, just thinking about the realization of all of these big ideas, what's the timeline here? What are we looking at for potentially achieving a level six AutoML system in your estimation? And I know that's a huge hypothetical question. We won't hold you to it.

KALYAN: 25:54

I think for the level six automation or level seven automation, where we are able to help domain experts formulate machine learning tasks, I would say within two to three years, we should be there. So we are not that far away from it. We should be able to make that more interactive, and we should be able to do that. So to reiterate, it's where we show them machine learning tasks that the system can solve, and they map them to the business problems, or they ask for or give us feedback, thumbs up or thumbs down. So that is two or three years away. I don't think it will take that much time. We'll start seeing even commercial products doing that. They will have different names, ranging from auto KPI generators, so that people can use those KPI generators and be presented with them, and they can say, "Oh, I would like to predict this KPI," or things like that, to auto insights. They'll have different names. We will start seeing commercial products. The other end of that, where we try to help them post-prediction - the model is there; it's deployed; it's producing predictions; and we want to enable them to make decisions and augment those in their workflows - that is a little farther out. I think there's still a little bit more work to be done. We can do it case by case. So you can take sales and marketing or a sales lead-generation problem or a lead-scoring problem. So you can solve for that one. So we can do it case by case, and that will get us pretty far. But I think creating sort of a cohesive, general-purpose way of enabling any sort of augmented decision-making, where the augmentation is coming from machine learning, in my opinion, is a little farther away, maybe four or five years.

SUSAN: 27:39

Oh, still not that far.

KALYAN: 27:41

Still not that far. Still in my lifetime.

SUSAN: 27:44

Yeah. Oh, yeah, absolutely. Very cool. And then the other thing that I wondered about as I was reading about this was we talk a lot about getting people data skills and getting people data literate. So what level of data literacy do you expect people to need to be able to work with a more automated ML system? I mean, obviously, it depends on the level of automation. But what are your hopes or thoughts as far as what people need to know to make the most of this kind of system and to use it effectively?

KALYAN: 28:13

One of the big things that I also learned over time is that, at least sitting in an ivory tower - I mean, the companies, they're fine, but people who are in academia - they have a very binary definition, "Well, we have people who-- they don't know software, or they know software. They don't know machine learning, or they know machine learning. They don't know data, or they know data." And I realized that's completely a fallacy, or I don't know if that's the right word, but it's the wrong way to think about it. I think the more important thing to think about is where their focus is and how much time they have. So those are the two things. So for example, a domain expert working on wind turbines, their focus is turbines. So assuming that they don't know, or won't know, enough about data or software or machine learning is incorrect. They probably can, and they probably can learn even faster. So what I realized is-- to answer your question-- maybe I took a different tangent here.

SUSAN: 29:16

Nope, that's great.

KALYAN: 29:17

So to answer your question, the domain experts who we are trying to serve will be very data literate. What we do is we wrap concepts around data and then try to tell them that's what they should learn. And I don't think that's the right way to do things. So for example, we create these primary keys, foreign keys. Great. They help us in processing the data, storing the data, and so on, but then we should not expect them to learn that. That's not what their job is, and we should not even focus automation tools on enabling them to learn that because that's not their goal. They may find it interesting at this point in time, but at the end of the day, their goal and their focus is, rightfully so, to solve the problem they have at hand, which is maximizing the efficiency of the turbines or whatever they're working on. So in general, I think the real question for others [inaudible] all the concepts that we created for computational purposes, processing purposes, storage purposes, and then take them away, and meet them where they are. And they are extremely data literate, if you ask me. In fact, when it comes to their data, they're more literate than we are. I think the bigger struggle has been that for us, when we go from domain to domain as an applied data scientist or applied machine learning engineer, we have no idea what their dataset is, and half the time, we keep asking them, "So what is the foreign key here? What is the primary key here?" And then they are like, "What does that mean? What are you asking?" And so I think that's where we would have to meet them. We have to meet them where they are.

SUSAN: 30:58

It's interesting because I think taking away the jargon, taking away some of the complex-- well, taking away the jargon at least, not taking away the complexity, but it's a much more egalitarian view of machine learning, it feels like. And I wonder what your kind of vision is of if these tools are made readily available, do you see major change as a result? Or I mean, that's really kind of a grandiose question, but I wonder what you think about sort of the larger potential of making these tools more widely available and the potential impact that could have on people's work in business and in other areas.

KALYAN: 31:36

I think with the right tooling, a lot of people will not be scared of machine learning or scared of using machine learning. I recently saw an example from one of my peers at Alteryx, actually, where they were trying to teach imputation but without using the word imputation. So by giving an example and saying, "Okay. This one piece of the data is missing. And how would you like-- what could be the right answer for it?" And in giving the answer, what the person naturally did was imputation. So if the tools have the right interfaces, and we create the right level of instruction, I do see that a lot of people will be able to do machine learning without having to worry about it or the jargon or-- just understanding it much more directly, as opposed to us trying to teach them a lot of concepts of machine learning, which is good but not necessarily, at the end of the day, their focus.
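
For readers who want to see that idea in code, here is a tiny sketch of the same fill-in-the-missing-value exercise using scikit-learn's SimpleImputer; the numbers are made up.

```python
# Filling a gap the way a person naturally would: with a plausible value.
import numpy as np
from sklearn.impute import SimpleImputer

readings = np.array([[10.0], [12.0], [np.nan], [11.0]])
imputer = SimpleImputer(strategy="mean")
print(imputer.fit_transform(readings))  # the missing value becomes the mean, 11.0
```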

SUSAN: 32:38

Sure. Sure. Well, that makes a lot of sense. So we have a question that we always ask on the podcast, and I'll ask it to you now. This is our alternative hypothesis segment. And the question is what is the thing that people often think is true about data science or being a data scientist but that you, in your experience, have found to be false?

KALYAN: 33:01

Do you want the controversial answer?

SUSAN: 33:04

Of course.

KALYAN: 33:07

To that, I'll give an answer, but maybe it's too late - it's already known, and it's already accepted - which is that deep learning will solve all the problems in data science, and that turned out to be completely false. And to give you a more concrete example, when I talk about time series forecasting or time series prediction, we had high expectations of deep learning, and they haven't turned out that way at all. Deep learning has worked great for computer vision, but for time series, it hasn't worked that way. And my dad is an economist, so I have some fun conversations with him, and I try to avoid these conversations, but we end up someplace because at some point, he asks me, "What are you working on?" And if I say time series, he says, "Well, that's what I was working on before you were born. The economists had developed methods even before you were born." And then I said, "Well, we use neural networks." And he asked, "So how are they working out?" And I haven't had a solid answer for him compared to the age-old statistical models for time series analysis. So I think that's a big fallacy, that deep learning will be able to solve everything, but maybe now everybody realizes that already.
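
For context, here is a minimal sketch of the kind of classical statistical model Kalyan is contrasting with deep learning: a simple ARIMA forecast with statsmodels. The series and the model order are made up for illustration, not drawn from the episode.

```python
# A classical time series baseline: fit an ARIMA model and forecast ahead.
import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
series = pd.Series(
    np.linspace(100, 160, 60) + rng.normal(0, 3, 60),   # toy trend + noise
    index=pd.date_range("2016-01-01", periods=60, freq="MS"),
)

fit = ARIMA(series, order=(1, 1, 1)).fit()
print(fit.forecast(steps=6))   # forecast the next six months
```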

SUSAN: 34:31

Well, if they haven't, then this will bring them some hard truth here. So we'll see what kind of comments we get on the podcast. No, that's great. And I love the family relationship there. Getting to talk about time series at the dinner table, that's awesome. [music] Well, Kalyan, thank you again for joining us on Data Science Mixer. It's been really fascinating to hear about the current state and the potential future of AutoML and really excited to see what you and your team will do next. So thanks again.

KALYAN: 34:58

Thank you, Susan.

SUSAN: 35:01

Thanks for listening to our Data Science Mixer chat with Kalyan Veeramachaneni. Join us on the Alteryx community for this week's cocktail conversation to share your thoughts. What do you want to see in the future of AutoML? Are you excited by the fully-automated level-six kind of system that Kalyan describes? Are there particular tasks you would really, really like to have automated or maybe some you want to hang on to for yourself? Share your thoughts and ideas by leaving a comment directly on the episode page at community.Alteryx.com/podcast or post on social media with the hashtag #datasciencemixer and tag Alteryx. Cheers.

 


This episode of Data Science Mixer was produced by Susan Currie Sivek (@SusanCS) and Maddie Johannsen (@MaddieJ).
Special thanks to Ian Stonehouse for the theme music track, and @TaraM  for our album artwork.