Data Science Mixer

Tune in for data science and cocktails.
MaddieJ
Alteryx Community Team

We go behind the scenes of Alteryx Machine Learning and the Alteryx open-source Python libraries with Alteryx CDAO Alan Jacobson. He shares his experiences leading the data scientists who create tools that empower even the most hardcore data scientists. 

 

 



Cocktail Conversation

 

What are your favorite analogies, metaphors, and explanations for data science concepts? Do you have a favorite way of talking about an idea in data science that you think is especially effective or fun?

 

Join the conversation by commenting below!

 


Episode Transcription

SUSAN: 00:00

Could you just introduce yourself with your name and title?

ALAN: 00:03

Sure. I don't know about titles. I like playing with data. I'm a data geek. [music]

SUSAN: 00:10

Hello, fellow data geeks, and welcome to Data Science Mixer, a podcast featuring top experts in lively and informative conversations that will change the way you do data science. I'm Susan Currie Sivek, the data science journalist for the Alteryx Community. Today, I'm joined by one data geek in particular.

ALAN: 00:28

Alan Jacobson, the Chief Data and Analytic Officer here at Alteryx.

SUSAN: 00:32

Awesome. Thank you. And would you mind sharing with us what pronouns you use?

ALAN: 00:36

He. Him.

SUSAN: 00:37

Perfect. Thank you. I was excited to talk with Alan about his unique career and his approach to leading teams building data science tools like Alteryx Machine Learning, with Alteryx's open-source libraries at their foundation. We get into why AutoML tools empower even the most expert of hardcore data scientists and encourage collaboration. And we'll see what Alan's alternative hypothesis is. Spoiler. It involves abacuses. Abaci? Well, calculators too. And before we get started, let's make sure we're hydrated. And as you know, on Data Science Mixer, one of the things we like to do is have some sort of special happy hour snack or beverage or coffee or tea, anything, as we're recording. So do you have anything special with you there today?

ALAN: 01:25

I don't. I should have grabbed it ahead of time, but that's okay. Normally, it would be a smoothie at this time of day.

SUSAN: 01:31

Oh. That sounds good. What's your favorite smoothie?

ALAN: 01:33

Kiwi strawberry. Anything fruity. Yeah. That's definitely an indulgence I enjoy.

SUSAN: 01:42

Yeah. Awesome. Love it. I'm having a nice chai at this moment. So good little morning pick-me-up. Awesome. So, yeah. Alan, I'm really happy to have you here to talk about some of your interesting experiences in data science. Could you give us kind of the nutshell version of your career? I know you've done a lot of different things and maybe we can get the little condensed version of that.

ALAN: 02:02

Absolutely. So in some ways, I think my career path is maybe a bit untraditional. I spent the first 25 years or so of my career at Ford Motor Company. Had a myriad of jobs there from engineering, marketing, and sales, IT-related jobs. And then came to Alteryx. And I've been at Alteryx for a couple of years. I say non-traditional because I don't think many people stay at one company for that long anymore. So it's becoming the unorthodox career path, maybe to stay at one place for a while.

SUSAN: 02:35

Right. Right. Interesting. And at Alteryx, you work in data science. You're leading teams of data scientists who are building data science tools, which is a little bit different from leading teams of data scientists who are working within a company and other kinds of capacities. So could you talk a little bit about that and maybe how that differs from some of the other data science leadership roles that you've had?

ALAN: 02:58

Yeah. So fundamentally, my roles here, there are three different roles that I play. One role is performing data science within the company, which is very similar to data scientists, I'd say, at most companies, where they're helping their marketing and sales teams, their HR teams, their legal teams answer questions using data, building models, and deploying models into production. That's one piece of what we do, which is very kind of inwardly facing. Being a company that builds a data science product, our data scientists sometimes build product. And so we have a team of engineers that are in Boston building a data science product. We sometimes put data science tools, technology into our Designer product. And some of that work is done by our data scientists in partnership with our engineering teams. And the third role is helping customers go on the journey. We sit in a position where we see a lot of companies going through this digital transformation journey, and we can share both maybe best practices and not best practices of how you might want to go on that journey.

SUSAN: 04:12

Interesting. Yeah. So that really covers a lot of territory. That's a lot of different things to manage every day.

ALAN: 04:18

Yeah. No, it's fun. I think one of the exciting things about the data science field is the diversity of problems that you get to work with and work on. Certainly, working at a technology company where you get to both create data science products as well as answer day-to-day questions for the business is a fun role.

SUSAN: 04:38

Absolutely. So I think it's interesting that you're hiring data scientists and leading data scientists who are doing all of these different functions. Does that change the way that you look for people to be on your teams in terms of the skills that you're looking for?

ALAN: 04:53

Yeah. I mean, I think building a successful team - and this is not only true for data science. It's true for, I'd say, most if not all teams - one of the arts of doing that well is building an extremely diverse group of people. And the science on that's very clear, that that diversity yields better results for teams. And there's no doubt that when you're dealing with the mix of problems that we deal with every day, having people with many different backgrounds certainly helps. And so some of the best data scientists I've ever worked with have had incredibly different backgrounds. A geologist, an engineer, an English major, they all come from different experiences. And I think that's one of the keys to building great teams is having that diversity of talent to draw from.

SUSAN: 05:53

Yeah. Absolutely. So in the people that you were mentioning that you're working with, one of the groups I know that's involved there is folks who are working on open-source software. And I know that Alteryx is definitely committed to building open-source tools and sharing those. Can you tell us a little bit about that? How that came about and what some of those tools are?

ALAN: 06:12

Yeah. So Alteryx has been in the open-source community for quite some time. With our acquisition of Feature Labs about a year and a half ago, we acquired even more open-source properties, if you will, and we've continued to develop and lean into those. And so we're currently at over 1 million downloads per year. So we're really getting some great traction in open source. And the majority of those downloads are coming from a few of our key machine learning libraries. So we've got our Featuretools library; EvalML, which is an AutoML library; Compose, which lets you set up your machine learning process, your prediction engineering; and we have a supporting library, Woodwork, that kind of underpins those libraries. And those four libraries get really a lot of use in the machine learning space. And we're now building a commercial product on top of those. And we really are excited to have that open-source community helping us develop and really harden those techniques as we bring it fully into a commercial product.

SUSAN: 07:27

Yeah, yeah. And I want to come back to that here in just a little bit. I wonder if you have-- it might be like asking you to choose one of your favorite children, but do you have a favorite among the open-source libraries that you are just especially enamored with that you think does especially cool stuff?

ALAN: 07:44

Yeah. So, yeah, that is like picking which one of your children is the favorite. I'll say one of the newer ones that we just recently launched. So Featuretools has been out for a couple of years, but EvalML is one of the newer libraries that does automated machine learning, and it's been really exciting to see the uptake of how people are using it and how they're solving problems with it. So that's one of the newer libraries that we've launched and I'm really excited to see how it's progressing.

SUSAN: 08:12

And so this is a tool for AutoML, correct?

ALAN: 08:15

Correct.

SUSAN: 08:16

Yeah. Can you tell us a little bit about some of the applications that you've seen maybe or are there use cases where you thought it was particularly effective for people?

ALAN: 08:23

Yeah. So I think one of the things that I find, and this is true of both data scientists and analysts, a lot of people are very new to machine learning and are kind of going on that journey. And so a lot of what I see people initially do are learning examples. They're trying to take data and learn from it. So can I take a set of emails, some are spam and some are not, and quickly identify, using AutoML, which messages are the spam messages and which ones aren't. I want to detect fraud. All of these kind of very typical examples. And what gets me excited isn't the actual example, it's actually seeing how many people are going on this journey and learning this new skill so that they can apply it in their businesses. And again, I think that's just as true for the data scientists sometimes as it is for the analysts as they kind of explore and develop the skill to use these techniques.
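
To make the spam example concrete, here is a minimal sketch of what a classifier of this kind does once it is trained. It is plain Python with no AutoML involved: a hand-written naive Bayes model, which is only one of many model types an AutoML tool might try, and it is not a description of how EvalML works internally. The training messages, the labels, and the function names train and predict are all invented for illustration.

```python
import math
from collections import Counter

# Hypothetical toy data, invented for illustration only.
TRAIN = [
    ("win a free prize now", "spam"),
    ("free money claim your prize", "spam"),
    ("limited offer win cash now", "spam"),
    ("meeting moved to monday morning", "ham"),
    ("can you review the quarterly report", "ham"),
    ("lunch on monday to discuss the report", "ham"),
]


def train(examples):
    """Count words per label and messages per label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return word_counts, label_counts, vocab


def predict(text, model):
    """Return the label with the highest naive Bayes log-score."""
    word_counts, label_counts, vocab = model
    total_messages = sum(label_counts.values())
    scores = {}
    for label, n_messages in label_counts.items():
        # Start from the log prior: how common this label is overall.
        score = math.log(n_messages / total_messages)
        n_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing so unseen pairs never give log(0).
            count = word_counts[label][word]
            score += math.log((count + 1) / (n_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


model = train(TRAIN)
print(predict("claim your free cash prize", model))
print(predict("review the report before the meeting", model))
```

With six training messages this is only a learning exercise, which is the point of the passage above: the value is in seeing the idea work end to end before applying it to real business data.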

SUSAN: 09:20

Yeah. Absolutely. And I know one of the questions that people often have about AutoML tools is around interpretability and understanding what's going on behind the scenes when they have a tool that's automatically putting together the pipeline and building the model and so forth. How is EvalML incorporating that kind of interpretability and making sure that whoever's building the model can understand what's going on behind the scenes?

ALAN: 09:42

Yeah. I sometimes think these words of black-box ML and not having transparency is maybe a little overcooked. There's an example of, you're going to get on an airplane, and I can show you all the math of the model by which we've designed the airplane. And I can show you completely transparently all the formulas and all the math. Great. And maybe they're very simple. They're easy to understand. Great. Or I can tell you that we flew the plane 1 million times and we have a model that worked 100% of the time. Would you like to get on the plane and be the one-million-and-first flight? You can either pick the plane that you have the history 1 million times it worked, and it never was wrong, or it's never flown before. You have no history, but I can show you all the math. Which one are you interested in getting on, right? Personally, I would take the one that has done it 1 million times and has worked every time.

ALAN: 10:48

And so machine learning is in some ways more on that path of using lots of historical data and building models that match the historical data versus maybe more of a statistical econometric approach using formulas. So there are different approaches. But when it comes to the actual transparency, once you've built the model, it's very easy with machine learning to understand how the model works and what's in it. You can see the formulas if you want to see the formulas. I don't know that seeing the formula is necessarily making it more understandable, but I really think the art is not the transparency, can I see everything that's in the box, but have I made it understandable enough that you can really understand what's going on? And we've put a lot of work into both EvalML and the AutoML that we're doing to make sure that we make things very understandable. That we allow you to see what's going on in a way that you can get the transparency, and the transparency is really through the understanding.

SUSAN: 11:54

Yeah. Yeah. That's so interesting. And I think in terms of the example that you gave and creating that level of understanding for the user, this reminds me also of how recently you'd provided this training session for Alteryx associates, where you gave that basic introduction to machine learning and modeling and how it works. And it was just a very accessible and honestly fun introduction to modeling for folks who hadn't necessarily heard much about the nitty-gritty details of the process. So how do you do that? How do you come up with the examples that will help communicate what's going on? Is it just years of experience? Is it kind of a level of creativity that you've gotten to and coming up with anecdotes and analogies that work? And what gets you excited about sharing that kind of information with people?

ALAN: 12:43

Thanks, Susan. That's very kind of you to say. So I find that data science, in general - machine learning fits this, but data science, in general - is really not that hard. I mean, this is not-- if you say, "What--" certainly, there are concepts in data science that are harder than others. But the majority of what a practicing data scientist does on a day-in, day-out basis, a lot of what we do is not-- it's not thermodynamics. Thermodynamics was a very hard course. At least for me, that was a really hard course. Multidimensional calculus. That's a pretty abstract, hard-to-picture kind of thing. I find that most data science principles, p-values, these are things that you can explain in pretty plain terms and teach on a level that-- I have two kids. They're a middle schooler and a high schooler, and they can understand these concepts. They're just not that hard.

ALAN: 13:49

I find like with most subjects, sometimes teachers use the jargon and even in a given profession, whether it's acronyms or, again, technical jargon terms that make things less accessible to everybody else. And so frequently I find when I'm training these concepts, it's trying to take the jargon out and trying to use examples that we can all picture in our everyday lives to really make it accessible to everyone. And I really do feel passionately that data science is an area that eventually should be part of what everyone does. It's not just for the PhD data scientist to do data science. It's math. It's for everybody. Again, not to say that there won't be deep end of the pool complex things that you need a data scientist for but I'd really love to see everybody in their day-to-day work be able to leverage and take advantage of this stuff.

SUSAN: 14:49

Yeah. Yeah. So you mentioned earlier the work on the Alteryx product that is using at its foundation some of the open-source libraries, correct? And so can we talk a little bit about what that looks like and what you're excited about there?

ALAN: 15:03

Yeah. So we have an all-new machine learning product, Alteryx Machine Learning, that allows, again, anyone who wants to experience machine learning to be able to jump in the pool. And I think one of the things that differentiates a bit what we're trying to do with the product is that we're allowing people to use machine learning and data science both to get insights as well as produce models to put into production. And I will tell you that most data scientists who model data on a regular basis, most of the time when they create models, the reason they create the models is actually not to deploy it into production, it's to get an answer to a question. And once they have the answer to the question, they do something about it. They take an action. But they don't necessarily need to run that every day and put it into production. And so while we will enable people to create production models and put them into production, like most AutoML, I think equally, if maybe not more so, we want to enable people to get answers to questions using AutoML and be able to share that insight into the business. And so we're really excited to have a machine learning product that does both and does them in a way that hopefully everybody feels comfortable using it.

SUSAN: 16:26

Yeah. Absolutely. Yeah. That's super exciting. I can imagine that this product will be accessible to people at all different levels of skill, as you were saying. Why should somebody who is maybe an expert who could code up a model just like that, why should they actually take the time to use this product instead? What's the advantage of using an AutoML tool for somebody who is very expert in data science and for whom hand-coding is not an issue?

ALAN: 16:54

Yeah. So certainly, an experienced data scientist likely could open up a notebook and start writing some Python code. And in fact, there are some great times when that's probably the right thing to do. But one of the reasons why tools like this are very advantageous is, A, they make it faster and easier to do it. Whether you're using the open-source library and not having to write all of that code, you're using something that's already built, or whether you're using a commercial product. One is that it's faster. The second is that there's a layer of visualization that comes out that most libraries aren't going to give you. So we've built it in. And whether it's you quickly want to see the Shapley values or a partial dependency plot or explore the data in other ways, see the mutual information between the different items, these are all basically built-in.

ALAN: 17:53

And so, again, you could do all of these things, but it would be more lines of Python code. Where this is already kind of pre-written. Bring your data in and it's doing all of this for you. And so there's a fair amount of functionality out of the box that you wouldn't necessarily get otherwise. Now, the other thing is you typically are rapidly iterating initially. So you have a question. You bring in some data. You do some exploration. You get some answers, maybe some more questions. And you iterate. You bring in some more data. You iterate, you iterate, you iterate. And as you're iterating through that process, you want an ability to rapidly be able to explore and see what's going on. And again, you can do this hand-writing code every time, but using pre-built stuff is probably going to be a lot faster and easier even for a data scientist.
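
As an illustration of the "more lines of Python code" Alan describes, here is a rough sketch of computing just one of the quantities he lists, the mutual information between two columns, by hand. It assumes both columns are discrete (categorical or already binned). The function name mutual_information and the sample columns are hypothetical, and this is not the routine that EvalML or Alteryx Machine Learning uses.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information, in bits, between two equal-length discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    x_counts = Counter(xs)        # marginal counts for x
    y_counts = Counter(ys)        # marginal counts for y
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) simplifies to count * n / (count_x * count_y)
        ratio = count * n / (x_counts[x] * y_counts[y])
        mi += p_xy * math.log2(ratio)
    return mi


# Hypothetical columns, invented for illustration.
is_spam = [1, 1, 0, 0]
contains_link = [1, 1, 0, 0]  # moves in lockstep with the label
sent_on_monday = [0, 1, 0, 1]  # unrelated to the label

print(mutual_information(contains_link, is_spam))
print(mutual_information(sent_on_monday, is_spam))
```

A column that perfectly tracks a balanced binary label carries one bit of information about it, and an unrelated column carries none. Repeating this for every pair of columns, and then adding Shapley values and partial dependence plots, is the kind of hand-written work a pre-built tool saves.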

SUSAN: 18:44

And correct me if I'm wrong here, but you can also use the tool as a starting point that then generates code for you as well, right?

ALAN: 18:52

Yeah. So Alteryx Machine Learning allows you to both export your model, let's say, as a tool on your canvas into Designer. And so effectively, you're getting a model out that you can use in a workflow to do other things. Or you could be using it through an API call. So basically, you can use it Pythonically, either through an SDK or through an endpoint. And so at the end of the day, being able to use it in a code-free or code-friendly way matches the same way that we think of our flagship Designer product in that we have code-free and code-friendly tools and you can mix and match those two different methods.

SUSAN: 19:32

And it seems like that whole compatibility and flexibility also would make collaboration a lot easier for folks with different skill levels?

ALAN: 19:42

Yeah. So we want people of different levels of skill coming from different domains and areas to be able to collaborate and join together and have a single tool that they can use and share. And there aren't really many of those if you think about it. Data scientists frequently use tools that are great for data scientists, but your accountant probably isn't going to easily be able to open up and manipulate. Equally, sometimes the accountant is using a tool that the data scientist doesn't view as kind of what they would normally use. And with Alteryx, we're trying to provide a platform-- and we've been very successful to date doing that. Provide a platform that both data scientists and analysts can both work in and can share things with each other. And this is very powerful for really both constituent sets. Data scientists frequently talk about how they build things and they can't get people to consume them. And one of the reasons that sometimes happens is that they built it in a tech that isn't easily consumable by the other audiences. And so I think we want to continue that tradition, obviously, with our new products, that they are accessible by the whole swath of communities that have questions that need answers.

SUSAN: 20:58

Yeah. Awesome. I love that philosophy. So the one thing that we always do on Data Science Mixer is we have a recurring segment called The Alternative Hypothesis. And so we always ask our guests the same question here, which is, what is something that people often think is true about data science or about being a data scientist, but that you have found to be incorrect?

ALAN: 21:21

So one of the things that I, unfortunately - I'll say, unfortunately, because it's not true - I hear frequently, is a belief that we can't arm knowledge workers with data science tools and capabilities. That somehow providing people tools like Alteryx, somehow, they're going to create damage and mayhem to their business. And I sometimes hear this from IT teams. I sometimes hear this even from data science teams. I have found very few examples, if any examples, where this has actually proven out to be true. That by providing someone with a data science tool, they somehow damage the business in a way that wouldn't have happened had you not provided them the tool. And the reality is what we're talking about here is math. This is like thinking about when the calculator first came out. There were probably people who said this. "We can't give people calculators. If we give people calculators, they could make terrible mistakes because they don't understand the math that's going on inside the calculator. We should make them continue to do their math by hand. Accountants, keep using the abacus. [laughter] You're not allowed the calculator because if we give you the calculator, you're going to make a terrible mistake." And yeah, most of us who now use calculators and have for quite some time realize that having a calculator probably has meant fewer mistakes, not more mistakes.

ALAN: 22:50

Now, it's not that you can't make a mistake with a calculator. Clearly, you could make a mistake with a calculator. But you're probably going to make fewer mistakes with the calculator than if you were on an abacus. I would propose. And data science is very similar. And to many that worry about it, it's kind of funny because when you think about kind of how we check our work and how we know we're getting the right answers, the very people who are concerned that you can't give the accountant or the marketing professional or the engineer the data science tool because they might make a mistake. The irony of this is that when that marketing professional or tax professional or accountant goes to the IT department and says, "Can you build me a solution? You're the experts, you go build it," who checks the work, that it's right, in the end? I mean, last I checked, it's the tax person, it's the engineer, it's the domain expert who does the check to determine if the answer's right or not. The IT person actually doesn't even know if it's right.

ALAN: 23:52

Nor does the data scientist. The data scientists, frequently, we're not the domain experts. We have to talk to our domain experts to make sure that we've really done it right. That we've gotten it all right. And so I think when you think about that, it makes you wonder that the people who are the most concerned are actually not the ones that can check the quality of the work. And the people who need this so desperately are just the opposite. It's the domain experts that really need these tools. So I really hope people-- maybe this conversation will open people's minds to the fact that, holding technology back, holding technology away from the users-- which is kind of strange that the technology organization, the IT organization of most companies is frequently where we hear this from. They need to become the enablers and not put up walls around this.

SUSAN: 24:42

Right. Right. And it just reminds me of similar sorts of reactions to new things in culture that people are scared at first, and then they calm down when they see that things aren't going to be as terrible as they think. These worst-case scenarios are not going to happen. And so--

ALAN: 24:58

Yeah. I mean, clearly, change is hard. I get that. And people have a tough time with it. And this is a big change. Enabling people to use these new technologies is certainly a big change. But the businesses who have done this have certainly had outsized results. We've seen that the more mature companies are analytically, it definitely correlates to their profitability and their success as companies. And so it's certainly something that has to happen. It's just a matter of how fast will people get on this bus.

SUSAN: 25:30

Yeah. Yeah. Awesome. Well, Alan, anything that we haven't talked about yet that you want to get in there, that you think is important for folks to know about with any of the issues that we've addressed?

ALAN: 25:41

I mean, I hope that people are incredibly excited about the future in front of us. I know I am. From an Alteryx standpoint, we're launching incredible new content into the product from a data science perspective, whether it's in our Intelligence Suite on Designer or the all-new Alteryx ML product. But watching people go on this journey, seeing people do their first sentiment analysis, or topic modeling, or OCR a document that was an image and turn it back into structured data, or experience AutoML for the first time, it's just really wonderful to watch people going on this journey. And I hope your listeners are excited to explore and try to learn some of these new techniques and these new tools.

SUSAN: 26:25

Absolutely. Well, thank you for being one of our guides on the journey, Alan. It's been great to have you on Data Science Mixer. Thanks so much.

ALAN: 26:32

Thanks for having me, Susan. [music]

SUSAN: 26:36

Thanks for listening to our Data Science Mixer chat with Alan Jacobson. We're glad to be part of your data science journey. And if you want to try some of the tools Alan discussed, check out alteryx.com or find our open-source libraries on GitHub. Also, be sure to join us on the Alteryx Community for this week's Cocktail Conversation and share your thoughts. This week, let's chat about your favorite analogies, metaphors, and explanations for data science concepts. Do you have a favorite way of talking about an idea in data science that you think is especially effective and/or fun? Or have you heard a brilliant one from someone else? Share it with all of us. Leave a comment directly on the episode page at community.alteryx.com/podcast or post on social media with the #DataScienceMixer and tag Alteryx. Cheers.

 

 


 

 

This episode of Data Science Mixer was produced by Susan Currie Sivek (@SusanCS) and Maddie Johannsen (@MaddieJ).
Special thanks to Ian Stonehouse for the theme music track, and @TaraM for our album artwork.