Data Science Mixer

Tune in for data science and cocktails.
MaddieJ
Alteryx Alumni (Retired)

How can data science help cut back on food waste throughout the grocery supply chain? Shawn Ramirez, head of data science at Shelf Engine, explains how minimizing food waste can help maximize profits, benefit communities and protect the environment. 

 

 


Cocktail Conversation

 

Share an example from your own work of a unique data point or pattern that surprised you!

 

Join the conversation by commenting below!

 


 


Transcript

 

Episode Transcription

SUSAN: 00:01

[music] Do you like food? I really like food. I like food so much that when I have to throw some of it away, I find it upsetting and I feel really terrible about the waste. We could psychoanalyze my personal issues, but instead, let's talk about minimizing food waste throughout the grocery supply chain and how data science can be used to not just reduce waste but also maximize profit for everyone involved and even help address hunger and environmental issues. That sounds like a delicious combination. Welcome to Data Science Mixer, a podcast featuring top experts in lively and informative conversations that will change the way you do data science. I'm Susan Currie Sivek, senior data science journalist for the Alteryx Community. My guest today is Shawn Ramirez, head of data science at Shelf Engine. She tells us all about her journey from studying terrorism and teaching political science to doing data science with the mission of making the grocery supply chain more efficient for everyone involved. And be sure to stay tuned to hear about a way to participate in the show and maybe hear your comments in the next episode. Let's not waste any more time and get right into our chat with Shawn. [music]

SUSAN: 01:21

So tell us a little bit about Shelf Engine for our listeners who haven't heard about it.

SHAWN: 01:26

Absolutely. Shelf Engine is a supply chain forecasting company. We, basically, optimize inventory for grocery stores and cafes and other food retailers around the United States. We do all the ordering, and we do all the optimization of what they're doing, and our goal is to reduce food waste through automation.

SUSAN: 01:43

Awesome. Very cool. Yeah. I definitely want to dive into a number of things that you just said in that description. Tell us a little bit about how you ended up at Shelf Engine and working on groceries and cafes and ordering and supply chain. What was it in your background that led you to be interested in that area?

SHAWN: 02:00

I was actually a political science professor for a long time. I spent about a decade teaching political science at both Harvard and Emory. And you can think of that as studying hidden behavior of people, looking for terrorists and trying to design policies around that, understanding influence in terrorist networks and recruiting strategies. So looking for a lot of signals. We don't know who a terrorist is, but we look for signals of their behavior. We try to track their violence, and we try to make models of these things using game theory and statistical methods and machine learning. And that actually has a ton of application in the real world. It sounds a little crazy.

SUSAN: 02:38

That is wild. Wait. I have to pause for just a second. So from identifying terrorists using machine learning, finding signals in the noise, trying to use models to make that happen to groceries. I love it.

SHAWN: 02:53

That's right. Yeah. It seems crazy, but basically, you can think about consumers as we don't get to actually see consumers walking to a store, and we're not taking video of them coming in and out. We're not tracking human behavior in that way. We do see evidence of what they do. And so in many ways, it's actually very similar to thinking about terrorism and policies around that. We are thinking about consumers and policies about inventory and trying to systematically identify patterns in that, model them, create machine learning models that predict that behavior given things that we do, meaning like how much we stock, how many apples we stock or something like that. So it is a little bit of a roundabout way to get into data science, but I made the switch into industry, looked for something especially in the aggregate space. I wanted to do something that had a positive impact on the world. This company, really, its mission and our profits are driven by that waste. So we only make money if a product sells, and we actually pay for all the waste. So anything that goes to waste cuts directly against our profit. We absorb that cost for groceries. And that means our profit-- as our profits go up, we can be really proud of the work that we do, and we can think we're doing the best thing for groceries. We're helping them improve sales and profit. And at the same time, we're cutting back on, dramatically, tons of waste across the United States.

SUSAN: 04:13

Well, that is really interesting. So tell me a little bit more, though, about that transition because it sounds like it wasn't necessarily a straight path directly from political science to Shelf Engine. What came in between there?

SHAWN: 04:26

Oh, that's right. So, of course, like many PhDs or many professors thinking about what they should do next, I spent a long time thinking, "Do my tech skills-- do they have any relevance in the real world? Does it make sense? Do I fit anywhere? If so, where do I fit? How do I even look for a job?" It was all very foreign to me. So I joined a program called Insight Data Science. I was hired out of that program to run the program in Seattle. And with that, I learned a lot about the Pacific Northwest companies, about startups, about Google and Amazon and Microsoft, all the big companies that are here. Facebook, of course, that's also present as well. And just understanding what it takes to grow a data science team there. How do they work with engineering? What kind of data challenges do they really face? And then how do skills really get applied on the ground? And I was just really motivated to think about all of these connections. I always loved understanding the world and how people behave in the world and how they act or what they do and how they optimize their lives. And you can think of this as how do teams optimize? And with that, I started going on the next job market during COVID or after COVID, I would say, maybe, and trying to figure out what I should do next in my life and looking for something that really had positive impact. So here I am.

SUSAN: 05:42

Awesome. So was it primarily the AI for good mission that drew you to Shelf Engine? And looking at all these different companies that you'd become familiar with, were you also kind of intrigued by the idea of a startup? Was that something that was appealing to you as well?

SHAWN: 05:56

Oh, definitely. One of the big differences for people who are looking for jobs, the big difference between a startup and a larger, more established company is that at a startup, you wear many hats. So you get to really experience both production code, triaging things, debugging things, building the code base from the ground up, understanding what modularization looks like, understanding how to write unit tests and integration tests all the way up to I run my machine learning models at scale. I'm using deep learning. I'm involved in the research. I set a vision for what my work is, and I own it, and I drive it. I think in a larger company, you may not have that ownership. There are a lot of political battles there with the way things are already done. Of course, they were established by very intelligent people who are also mission-driven in some way. And so you've got to work around that to figure out where you fit. And your role in a large company may be very carved out. And whatever you do, you're going to continue to do, and you'll continue to grow your expertise, but that expertise is going to look a little different. And so I thought if I joined a startup, I would have more well-rounded understanding of where I fit and really just see the bigger picture. I think of it like, at a startup, you're kind of the symphony director of a small symphony and you get to really own what that chorus looks like or what that harmony looks like. And at a larger company, you're a piece of the puzzle. You actually probably will never understand the full picture of what goes on. And that's okay.

SUSAN: 07:20

Yeah. Yeah. Just different roles with different appeal for different people. Makes sense. Cool. So let's get into some of those elements you were talking about in your daily work and the things that Shelf Engine is doing. What are some of the data sources and techniques that Shelf Engine is using to do that inventory optimization and cutting down on waste that you're describing, to the degree that you feel comfortable discussing any details?

SHAWN: 07:44

Yeah. I'm happy to talk about it. It's super fun. So basically, if you think about how it works, we get data streams on sales and deliveries pretty typically. Sometimes, we don't see the deliveries, right? Nobody is back there counting the cherries that go into a crate that go into a truck that get delivered to four different stores in the middle of Pennsylvania, right? They're not doing that. So I think some of that we have to simulate or we have to estimate. We have ground workers or field workers that go and collect certain data for us to provide some ground truth. So we have those kinds of data streams. And using those things, we also see things like what happens in stores. We talk to managers. We take pictures. There are other aspects that become part of our data stream. And that forms the core data about inventory and sales. And from those core data, we do a lot of extrapolation because we have a lot of privilege. We have full market data. We see a lot of the consumers. We work with Kroger and Target and Walmart and other places down to hospital cafes and how that works differently than some of these big name brands. And then take that data. We merge that with third-party data streams where we think it makes sense. So and what I mean by that is we have different models and we score these models in order to understand, do these data streams make sense? Do they effectively change what we're going to order? And how do they do that? What sort of new model or technology do we need to employ for data science to make this work? We may include things like weather, foot traffic proxies, other things like nearby lag indicators or nearby neighbor indicators of what's happening as well as some of our own domain knowledge about areas or regions of the United States that may be affected differently, so.

SHAWN: 09:25

Shopping looks very different in the south, right? Kentucky Derby time doesn't mean the same thing anywhere else except in the Kentucky Derby, right? So I think there's a lot of nuance to that. And we fill in lots of event data, holiday data. And when I say Kentucky Derby, there's a holiday. Most people don't think of the Kentucky Derby as a holiday, but if you're there, it's a holiday. Super Bowl is a holiday. There's a lot of weird holidays all around the United States, and we have to map that back also to demographic data about what's there. If you have a large Indian population, they're going to celebrate very different holidays. Ramadan is celebrated by certain groups and not others. And we have to account for these things and how that affects food because every holiday, every culture is somewhat associated with patterns in what they eat and what they consume and what they buy.

SHAWN: 10:15

So all of that data comes together, and that forms the core of our data. From there, we then have a series of models. So most people think, oh, we look at the time series, we get a prediction. So we say, "We need to order 5 bananas." It's a lot more complicated than just looking at time series. Of course, the data comes in as time series. But once we get the number, "Oh, you've got to sell 5 bananas. We're going to sell 5 bananas," there's a real question about how much inventory is there. If you know there's 100 bananas there, you're not going to send 5 bananas. You're going to let that product sell through that's on the shelf already. If you know there's 0 bananas, then you're definitely going to send 5 bananas. If you know this model tends to underpredict for this area or for this holiday or for this weekend, maybe it's getting more summery outside, and you think people are going to make picnics and maybe they're going to order bananas more, you might send more than five, and you want some sort of adjustment for that. So there's an inventory adjustment and a sales adjustment that we use to think about these things.
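[Editor's note: the banana example can be sketched in a few lines of Python. This is only an illustration of the idea of an inventory adjustment and a sales adjustment; the function name, the multiplicative adjustment, and the numbers are assumptions, not Shelf Engine's actual system.]

```python
import math


def order_quantity(forecast_units, on_shelf_units, sales_adjustment=1.0):
    """Units to send: adjusted forecast minus stock already on the shelf.

    sales_adjustment is a hypothetical multiplier (> 1.0 when the forecast
    model is known to underpredict, e.g. for a holiday weekend).
    """
    adjusted_demand = forecast_units * sales_adjustment
    shortfall = adjusted_demand - on_shelf_units
    # Never order a negative amount; round partial units up.
    return max(0, math.ceil(shortfall))


full_shelf = order_quantity(5, 100)     # 100 bananas already there: send none
empty_shelf = order_quantity(5, 0)      # nothing on the shelf: send the forecast
picnic_weekend = order_quantity(5, 0, sales_adjustment=1.5)  # send more than 5
```

Keeping the two adjustments as separate inputs makes it easy to see which one moved the final number for a given store and product.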

SHAWN: 11:13

And then finally, there's the question of sales and profit for the store as well as profit for Shelf and waste for Shelf. And so we have a sort of calculation that we do on all of this and, of course, on all the different products that then allows us to ratchet this number up and down. So let's pretend that the forecasting and inventory functions told us we need to sell, say, 7 bananas. We run it through some sort of optimizer that then says, "Given the profits that we're making on this, given our expectations of everything else that's going on." We then say, "Oh, we actually should send 6." And maybe that's the magical number. Or we may say, "We actually should send 12." So what that optimization function does is we tailor that to in part what Shelf needs in terms of profit and waste, to bring down waste, and to make sure we maintain profits and at the same time to what a customer needs. Do they need a certain level of shelf presence? People want to walk into a store that looks beautiful with all the rainbow colors of food. Is that something they're looking for? Are they really looking for sales and driving up sales, which are not always the same as profits, or are they really trying to make more profits right now in certain locations? So we want to make sure the system is fine-tuned for, of course, our mission, but also the specific goals of the retailer. Then ideally, if we're able to do that, then retailers can then spend more time with their customers. They can have more face time. They can do the other things that are really important to bringing about a good grocery experience or cafe or restaurant experience.
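[Editor's note: one textbook way to "ratchet the number up and down" is to pick the order quantity with the highest expected profit when unsold units are paid for as waste. The sketch below uses that approach purely as an illustration; the demand distribution, margin, waste cost, and the `min_presence` parameter are hypothetical, and the transcript does not say which optimizer Shelf Engine uses.]

```python
def expected_profit(order_qty, demand_probs, unit_margin, unit_waste_cost):
    """Average profit over a discrete demand distribution {demand: probability}."""
    total = 0.0
    for demand, prob in demand_probs.items():
        sold = min(order_qty, demand)
        wasted = order_qty - sold
        total += prob * (sold * unit_margin - wasted * unit_waste_cost)
    return total


def best_order(demand_probs, unit_margin, unit_waste_cost, min_presence=0):
    """Order quantity with the highest expected profit.

    min_presence is an assumed retailer requirement for how full the
    shelf must look, regardless of expected sales.
    """
    upper = max(max(demand_probs), min_presence)
    candidates = range(min_presence, upper + 1)
    return max(
        candidates,
        key=lambda q: expected_profit(q, demand_probs, unit_margin, unit_waste_cost),
    )


# Illustrative demand for bananas; the most likely values are 6 and 7 units.
banana_demand = {5: 0.2, 6: 0.3, 7: 0.3, 8: 0.2}

profit_focused = best_order(banana_demand, unit_margin=0.30, unit_waste_cost=0.50)
shelf_presence = best_order(
    banana_demand, unit_margin=0.30, unit_waste_cost=0.50, min_presence=12
)
```

With these made-up numbers, waste costs more than a sale earns, so the profit-focused quantity falls below the upper end of the demand range, while a shelf-presence requirement pushes the order up to the required minimum.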

SUSAN: 12:38

That's really interesting, the idea of what are the priorities of individual stores and thinking about, for example, various grocery stores that I might go into, some which have beautiful arrays of produce and some of which have kind of the bare minimum, and the fact that you are actually building systems to make sure that that individual store has the look and feel and the profit and sales that they're seeking. That's a lot going on. That's a lot to incorporate into your modeling.

SHAWN: 13:07

Well, yeah. And that's why we have many sources of data. Yeah. And certainly, we have customer success teams and the field teams that give us data back as well about what our customers really want, what are their goals, how are they feeling at the moment, are things going well or not? We try to automate as much of this as possible, as you can imagine. We have grown tremendously in the past, I think, since the start of 2020. I'm not sure what the original numbers were in 2020, but when I started, we had about 1,500 retailers, and now we have over 2,500. So in six months, it's grown a lot, and we're onboarding new places all the time, which is only increasing the number of challenges that we have, too, but that's okay. Yeah.

SUSAN: 13:49

Well, and I would imagine that's also increasing the variety of data that you have and potentially enriching your work in that way. [music]

SUSAN: 14:02

Shawn's work to optimize and automate grocery ordering is super interesting, but I'm sure that, like me, you had to think of one thing that could wreak havoc on all those finely-tuned models. Yep. I'm talking about the year 2020. Let's take a quick break before we hear about 2020's effect on Shawn's work so we can prepare ourselves. But stay tuned to learn about how she and Shelf Engine are dealing with the impact on their data and forecasting, things you can take into your own work. Before we do that, I wanted to let you know that this happy hour conversation doesn't have to end with this episode. If you're newer to the show, you might not know that with every episode, we have a Cocktail Conversation, a discussion of a data science question. These conversations are hosted in the Alteryx Community at community.alteryx.com/podcast, where you can click on the Data Science Mixer episodes. You don't have to be an Alteryx user to join the community and come learn from our data science resources and chat with other awesome data-minded people. Our Cocktail Conversations are relevant to anyone and everyone doing data science no matter which tools you use. We may even feature your comment in an upcoming episode. Here's this week's question to think about. I'll remind you again at the end of the show. Shawn talked about the Kentucky Derby as an example of a regional holiday that significantly affects consumer data from that area. Do you have an example from your own work, a unique data point or pattern that surprised you, something that might not seem significant to people in a different place or different industry but that matters a whole lot for your projects? Tell us about it. Again, join that Cocktail Conversation about your unexpectedly important data at community.alteryx.com/podcast. [music]

SUSAN: 15:51

So it sounds like you arrived at Shelf Engine during COVID, or? I mean, we're still in it, really, but at what point did you actually join the company?

SHAWN: 16:00

I joined the company in December of 2020. So pretty late. I'm not sure where in COVID we want to place ourselves for that, but yeah, I left my previous company at the start of COVID-19. Yeah. And I joined this one in December.

SUSAN: 16:13

Interesting. So how has that played into the work that the company does? I mean, I can imagine that could potentially wreak havoc on attempts to forecast demand and the kinds of things that people are seeking or hoarding or whatever the case may be.

SHAWN: 16:27

Yeah. Oh my goodness. The changes that have been noticed from COVID are certainly dramatic. And I think it's important to realize that even though those changes in-- we had a rush on toilet paper. Everyone all of a sudden was buying tons of meat for a while. There was almost no shopping on Memorial Day last year or Mother's Day, for example. Sales dipped down on holidays where people are throwing barbecues or holding brunches. And it was pretty surprising. No one knew what anyone was going to do for Thanksgiving. And I started my job here just after Thanksgiving week. So we were dealing with the aftermath of Thanksgiving, certainly a loss to profit. A lot of waste there for Shelf that Shelf absorbed the cost of all of that. I think people didn't necessarily predict or know what people were going to do for Thanksgiving. Are we going to have family get-togethers? Are they going to make turkey and go all out? Are they going to have something really small? And I think we wanted to be prepared and we wanted stores to be prepared for the potential that consumers would want to have the typical fixings that they would want on a Thanksgiving dinner. No one wants to walk into a store or have their customers walk into a store on Thanksgiving or for Thanksgiving to prepare a wonderful Thanksgiving meal and not be able to find what they need. That would then drive them to another store and, of course, increase their risk during COVID. So I think we wanted to leave stores really prepared, and that ended up meaning pretty big losses in profit for Shelf, at least temporarily.

SHAWN: 17:57

We had to make pretty stark changes when it came to Christmas time and thinking about what those holidays would look like. What would New Year's holidays look like? We did a lot of data mining around this, really rapid data mining, to try to figure out what would happen. We have to stock stores. Some of these stores have lead times of two weeks. And so we need to stock them by about December 10. We needed to figure out what was happening for Christmas and really get that in our code base and make sure those orders were reviewed in time. So we had really only a couple of days to figure this out. And that's a lot. We ended up doing things like thinking about what were leading indicators. Were there differences in stores in terms of what we saw? Do certain stores tend to cater toward local small parties more, or do certain stores tend to be more families because families may behave differently than couples or single individuals who are shopping for food and holidays? And we tried to find patterns in that. We also tried to find patterns in what neighboring states would have done. Were there similarities? So we thought buying actually happens three hours earlier in the East Coast than it does in the West Coast. Can we use that data in some way?

SUSAN: 19:09

Right. Oh my goodness. Yep.

SHAWN: 19:11

It was clear by now, yeah, that patterns across the globe were going to be useful. What happened in Italy was not the same as what happened in China, was not the same as what happened in Iran, nor was that going to be the same as what happened in the US. So we did think about how do we extract from the data the best signals that we can about human behavior during the crazy, unprecedented time that's going to hopefully satisfy our customers and make our stores happy, allow people to have things to eat that they want to eat as well as reduce waste overall? That story ended up going really well for Christmas compared to Thanksgiving. So we're glad that some of that paid off. We did a lot better thinking about that as well as the New Year's holiday, just trying to understand where people are going to throw small parties and where they're not going to. And it was a big challenge. Looking ahead, I think, and thinking about it in a larger context, what's really interesting is this is not the first situation that's going to happen like this. There's hurricanes. There's listeria outbreaks, if anybody remembers that from a few years ago. So there's a lot of scenario modeling that we've thought about now about how do we model these scenarios and think about the worst case as well as better cases and the probabilities for each of those things and then have our models or design our models to be impacted by these scenarios and our expectations of how this is going to play out. So we're definitely working on various aspects around that. And there's nothing else that we can do [inaudible]. [laughter]
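[Editor's note: a minimal sketch of scenario modeling with a probability attached to each case. The scenario names, probabilities, and demand figures are invented for illustration and are not from the episode.]

```python
def expected_demand(scenarios):
    """Probability-weighted demand over {name: (probability, demand_units)}."""
    total_prob = sum(prob for prob, _ in scenarios.values())
    if abs(total_prob - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return sum(prob * demand for prob, demand in scenarios.values())


def worst_case_demand(scenarios):
    """Lowest demand among the listed scenarios."""
    return min(demand for _, demand in scenarios.values())


# Hypothetical holiday-week scenarios for one product at one store.
holiday_scenarios = {
    "worst_case": (0.2, 40),
    "typical": (0.6, 100),
    "best_case": (0.2, 130),
}

planned_units = expected_demand(holiday_scenarios)
floor_units = worst_case_demand(holiday_scenarios)
```

Looking at the expected value next to the worst case shows how much an order is exposed if the pessimistic scenario is the one that plays out.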

SUSAN: 20:37

Right. Right. Yeah. Yeah. It sounds like you're doing everything you can to get that bird's-eye view of the bigger picture and incorporate all these different sources of data. So it sounds like you have a lot of different strategies already in place and at work. But how will you be handling 2020 data as you're trying to set things up for success, for example, in the holiday season of 2021?

SHAWN: 20:57

Yeah. This is a great question. One of the biggest problems with COVID is, of course, all of 2020 data was, essentially, just marred by COVID. None of it was necessarily valid. And I think we had lots of requests about, "What do we do for Easter? Why don't we just look at last year and what happened with Easter?" For one, year-over-year effects are challenging, in general, right? The holiday doesn't necessarily fall on the same day. Easter falls on a Sunday, thankfully. Other holidays, they sometimes fall-- July 4th, for example, does it fall on a Wednesday? That may mean something really different from it falling on a Saturday. And layer into that the fact that all of 2020 data looks very crazy. And it looks crazy in certain ways. For one, sales may dip down on certain products. People may be eating differently. It also sends a signal too. So we have to know how to read the signal from the noise because the other thing that happened in 2020 was a lot of people relocated or changed their lives in certain ways that have become permanent. They may have taken a remote job or made some decisions about where they live or where their children go to school or even if their children do go to school, right? Those have become somewhat permanent decisions. So we need to try to extract that from the data and decide what part of this data is usable, if that's the right signal. And it's not as simple as just throw last year's data into this model. There's no deep learning method magic that's going to get us the right transformation here. And we have to really think hard about what that looks like. And we're trying to do our best to break that apart while also, how do we know if we're right in breaking that apart? But we have to measure our success against the most recent holidays that we see as we emerge from the pandemic. Meaning, Mother's Day sent us a signal of how useful 2020 was for Mother's Day. Father's Day sent us a signal for how useful 2020 was for Father's Day. Memorial Day sent us a signal. July 4th sent us a signal. And we'll just keep going through the holidays to figure this out and improve what that signal is from 2020.
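[Editor's note: the point about a fixed-date holiday landing on different weekdays is easy to check with Python's standard library. The helper name below is an illustrative assumption; the weekday could then be used as a feature when comparing one year's holiday sales with another's.]

```python
from datetime import date

WEEKDAY_NAMES = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def holiday_weekdays(month, day, years):
    """Map each year to the weekday name of a fixed-date holiday."""
    return {
        year: WEEKDAY_NAMES[date(year, month, day).weekday()]
        for year in years
    }


# July 4th moves through the week from year to year.
july_fourth = holiday_weekdays(7, 4, range(2018, 2022))
for year, weekday in july_fourth.items():
    print(year, weekday)
```

A Wednesday holiday in one year and a Saturday holiday in another are not directly comparable, which is one reason a plain year-over-year lookup can mislead.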

SUSAN: 22:53

That's really interesting. So yeah. Every holiday this year will be a learning point, it sounds like. Very cool.

SHAWN: 22:58

That's right.

SUSAN: 23:00

So what are you excited about in the future of the work that you're doing, either specific to Shelf Engine's particular brand of forecasting and optimization or just broadly for supply chain and logistics and using data science in those areas? What are some things that you're really curious about exploring, learning, implementing?

SHAWN: 23:22

A few things, definitely. And this is such a fun question to answer because I'm always excited about so many things. For one, there's really new interesting research in transformers for time series, and it's proving to be pretty useful. So one question is, how do we use that in the best way that we can? So we've seen this really interesting evolution of what time series modeling looks like, from base levels in econometrics for a [inaudible] for time series to vector or vectorized autoregressive models for time series to machine learning and viewing time series as a supervised machine learning approach and building out features for that, which have been very useful for us and, I think, lots of other people around the world. Layering in gradient boosted methods in order to use boosted trees and maybe potentially layer a deep learning model on top of that. So we've seen this really interesting evolution. It looks like the next stage in this evolution is about using something called transformer models. I won't go into the technical details of what that is, but I will say that if you are interested in using the most advanced technologies for time series models, that understanding what LSTMs give you in deep learning, meaning a subset of RNNs that have a good attention window that they can apply to understand how useful is recent data versus how useful is other data from a long time ago to the next level of thinking there's some really, really new research out now that shows how transformer models can work. And there's really interesting details there. So if you're interested in applying the most highly technical levels of research to this question about how do we work with time series, that's the direction to move in.
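[Editor's note: "viewing time series as a supervised machine learning approach and building out features" usually means turning past values into input columns and the current value into the target. The sketch below shows that framing with two lag features; the lag choices and sales numbers are illustrative assumptions.]

```python
def make_lag_features(series, lags=(1, 7)):
    """Turn a sequence into (features, target) rows using lagged values.

    For each time step t, the features are series[t - lag] for each lag,
    and the target is series[t]. Steps without enough history are skipped.
    """
    max_lag = max(lags)
    rows = []
    for t in range(max_lag, len(series)):
        features = [series[t - lag] for lag in lags]
        rows.append((features, series[t]))
    return rows


# Ten days of made-up unit sales for one product.
daily_sales = [5, 6, 4, 7, 9, 12, 11, 5, 6, 5]

# Each row pairs [yesterday's sales, sales one week earlier] with today's sales.
training_rows = make_lag_features(daily_sales)
```

Rows in this shape can be handed to any standard regressor, such as the gradient boosted trees mentioned above, and extra columns like holiday flags or weather can sit alongside the lags.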

SHAWN: 24:55

The other really interesting direction points in almost the opposite direction, which is in the MLOps space or in machine learning explainability, we are really finding-- and some of the examples that I have given to you also today have been about holidays and what do the holidays mean and demographics and what do demographics mean? And can we code in features for families and single people and things like that? That is not necessarily in the world of transformers and deep learning. Those kinds of things may be in very standard, supervised machine learning methods that we can apply to those data and say, "This is actually really useful and generates really highly accurate predictions for us." That world of understanding ML explainability-- what are the right features? What is their impact on the final numbers that we're looking at or the final metrics that we care about? That world is also growing dramatically. And new technologies there in MLOps help you dig in and drill in when you see some sort of performance error so that you can say, "Oh, the validation data looked this way. And my data from this specific window for this specific retailer has a very different distribution," and drilling really fast. That ability to do that means that we can hone our models faster for any specific customer. And that kind of thing actually gives us a tremendous amount of power in terms of what we can do, what kind of benefit we can bring to the world. So it's almost like there are two different paths that the world of time series and supply chain forecasting is going on. One is this super technical path, and the other is this ML explainability. Dig in deep, think about consumer behavior, model it well, and develop that observability or that lens to really quickly have that snapshot and magnifying glass to just see what's really going on.
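[Editor's note: a very simple version of the check described here, comparing validation data with one retailer's recent window, is to measure how far the window's mean has moved in units of the validation data's standard deviation. The function name, threshold, and numbers are assumptions for illustration; dedicated MLOps tools use richer distribution tests.]

```python
from statistics import mean, stdev


def mean_shift_in_std(reference, window):
    """Distance between the window mean and the reference mean,
    measured in reference standard deviations."""
    spread = stdev(reference)
    gap = abs(mean(window) - mean(reference))
    if spread == 0:
        # A constant reference: any gap at all counts as an unbounded shift.
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


# Made-up daily unit sales: validation data vs. one retailer's latest week.
validation_sales = [10, 12, 11, 9, 10, 13, 11]
retailer_window = [18, 20, 19, 21, 17, 22, 20]

shift = mean_shift_in_std(validation_sales, retailer_window)
drifted = shift > 2.0  # hypothetical alert threshold
```

A flag like this only says that the window looks different from the validation data; deciding why it differs still takes the kind of drilling in described above.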

SUSAN: 26:37

Yeah. Yeah. And that's, again, where it sounds like some of your political science experience in trying to observe things that are not easily observable, where that might come into play.

SHAWN: 26:47

Absolutely. Yeah.

SUSAN: 26:48

Awesome. So just to change gears a little bit, you talked a little bit earlier about the importance to you of pursuing something in the AI for good, data science for good category for a career. Can you talk a little bit more about that and why that was important to you as you sought out your next opportunity?

SHAWN: 27:04

Sure. There are a lot of companies, and it's really fascinating to see the trajectories that people go on for their lives and what they want to do. And I think that's a very personal choice for everyone. And you should think about where your skills really fit, the companies that you might work for. They're pursuing different things, and people are at different stages in their lives. They may be looking for different learning experiences and adding to that portfolio. I think that's a really marvelous adventure to see people go on. For me, this has been about I've always wanted to help people. And when I studied terrorism, I really-- I moved into that after September 11th, which seems like a very long time ago right now, but I was really motivated to help understand where terrorism came from, how do we detect it, and what kinds of things can we do to improve the world and make the world safer for people, and also, to help people around the world so that they ideally wouldn't be choosing terrorist paths. They should be choosing political paths that are much more meaningful and won't hurt people, of course. So I wanted to do things that would have a real impact on people's lives. Unfortunately, as a professor, I actually found that I didn't have very much impact on terrorism. And that's okay. I think that's fine. What ultimately emerged for me was this desire to have that daily impact in the real world. There are a lot of companies that are doing amazingly interesting technical work, and I love digging into the research and complexities around the technical details, of course, but I also wanted my work to have an impact on lives around me.

SHAWN: 28:34

And as I look around, I think about climate and what an impact we've had on the climate and how much that's changed and how hard that's becoming for many people to uproot themselves or change the way they live and some of the strains that we've put on our environment. I also think about the cost of food. It costs a lot of money now for people, a family of four. We spend 600 dollars a week to buy food in the United States, and that's a huge amount of food. There are a large number of people who struggle with hunger. One in six Americans, according to a study in 2019 and 2020, struggle with hunger. And that means they don't necessarily know where their next meal comes from. 1 in 6 in a developed country like the United States is shocking. So I think that, ultimately, when I heard of this goal of we want to reduce waste, we're going to have an impact on climate, we're trying to help grocers be able to shore up their profits in ways that can potentially bring down the price of food and potentially end hunger for a lot of people, I feel like that's a meaningful goal that I can get behind and something that I want to drive progress toward. So when I look at those profit numbers for Shelf and the waste numbers for how much we're saving, I am definitely thinking about, are we impacting the world in a positive way? How do we support vendors next? How do we support smaller retailers who we don't want them to be forgotten in this path as well? We don't want the local cafe to be throwing away hundreds of dollars of food and then have to figure it out for themselves, right? We can just do it for them. So for me, I'm really driven by that mission and the many different aspects or facets of it. Hopefully, it works. [laughter]

SUSAN: 30:19

Yeah. Yeah. That's awesome. Well, it sounds like you're coming up with some really innovative ways of dealing with those questions, so that's really exciting. So we have a question that we always ask on Data Science Mixer. We call this the alternative hypothesis. And the question that we always ask is, what is something that people often think is true about data science or about being a data scientist but that you have found to be incorrect?

SHAWN: 30:46

The one thing that I-- [laughter]

SUSAN: 30:49

Yeah. That's usually the first reaction.

SHAWN: 30:51

I think there are definitely many things. And so it's really funny to nail it down to one thing. The one thing that I think people think is true is you learn to write a model and you think, "Oh, I'm going to figure this out. I'm going to solve this puzzle. I'm going to collect the data. It's going to go into my model. I'm going to get an answer. It's going to be solved." So what we learned fast is that it doesn't work that way. I have people who start their job if this is a new job for them and they say, "Okay. So in my first 30 days, what I'm going to do is try out three different models, and I'm going to see if they work, and I'm going to horse-race them, and I'm going to figure it out, and then I'll get my results and put that into production. It's going to be really exciting." I'm like, "Yeah. That sounds great. [laughter] [Excited?] for you. But I think we're not even going to get to square one on model number one because the data's more complicated. Putting it into production is more complicated than that." And the act of putting something into training and validation and doing the cross-validation and setting the right metric, it just requires so much more effort than it did on any data set I have ever worked on in the past. And so it's just not the case that you can go and run seven different models in your first week and just see which one works the best. It's just not going to work that way. There's a lot of extra learning to be done. And it is a great thing because it allows your models to work at scale. And in a system where data is changing all the time-- yesterday's data doesn't necessarily look like today's, and you have to figure out why. Is that because there's some sort of ingestion problem? Is that because there's some sort of query that changed and you don't know why? Or is that because your model's just not working the way you wanted it to work?
So I think the one thing that I've learned is it's just not as simple as the modeling that I did or not as fast as the modeling that I did as a professor. And that's okay. It makes it more interesting.
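
To make the "yesterday's data doesn't necessarily look like today's" point concrete, here is a minimal Python sketch of the kind of sanity check that can help separate an ingestion or query problem from a genuine change in the values. It is an illustration only, not Shelf Engine's pipeline: the function name `drift_report`, the thresholds, and the sample numbers are all hypothetical.

```python
from statistics import mean, pstdev


def drift_report(yesterday, today, max_row_change=0.5, max_z=3.0):
    """Compare two daily batches of one numeric field (e.g. units sold).

    Hypothetical helper: the thresholds are illustrative, not tuned values.
    """
    report = {"row_count_alert": False, "mean_shift_alert": False}

    # An empty batch is treated as a row-count problem outright.
    if not yesterday or not today:
        report["row_count_alert"] = True
        return report

    # A large change in row count often points at ingestion or a changed query.
    change = abs(len(today) - len(yesterday)) / len(yesterday)
    report["row_count_alert"] = change > max_row_change

    # A large shift in the mean, measured in units of yesterday's spread,
    # suggests the values themselves changed (demand shift, unit change, ...).
    spread = pstdev(yesterday)
    if spread > 0:
        z = abs(mean(today) - mean(yesterday)) / spread
        report["mean_shift_alert"] = z > max_z
    else:
        report["mean_shift_alert"] = mean(today) != mean(yesterday)
    return report


# Made-up daily values for a single product.
yesterday = [10, 12, 11, 9, 10, 13, 11, 10]
same_shape = [11, 10, 12, 10, 9, 12, 11, 10]       # looks like yesterday
half_missing = [11, 10, 12]                        # most rows never arrived
scaled = [110, 100, 120, 100, 90, 120, 110, 100]   # values jumped tenfold

print(drift_report(yesterday, same_shape))
print(drift_report(yesterday, half_missing))
print(drift_report(yesterday, scaled))
```

A check like this only says that something changed, not why. A row-count alert is a hint to look at ingestion or queries first, while a mean-shift alert with a normal row count is a hint to look at the values and the model.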

SUSAN: 32:37

Yeah. Definitely. Anything else that you would like to add? Anything I haven't asked about that I should have?

SHAWN: 32:44

No. I feel like I told you all our secrets. [laughter]

SUSAN: 32:48

[music] Well, Shawn, thank you so much for joining me today on Data Science Mixer. It's been great talking to you and learning more about the work that you're doing there. It's really awesome.

SHAWN: 32:55

Thank you, Susan. This has been really exciting. Thanks so much.

SUSAN: 33:01

Thanks for listening to our Data Science Mixer chat with Shawn Ramirez. Join us on the Alteryx Community for this week's Cocktail Conversation to share your thoughts. We might even feature your comment in an upcoming episode, so be sure to jump in. As a refresher, here's this week's question. Shawn talked about the Kentucky Derby as an example of a regional holiday that significantly affects consumer data from that area. Do you have an example from your own work of a unique data point or pattern that surprised you, something that might not seem significant to people in a different place or a different industry but that really matters a lot for your projects? Tell us about it. We hope you'll share your thoughts and ideas by leaving a comment directly on the episode page at community.alteryx.com/podcast or post on social media with the #datasciencemixer and tag Alteryx. Cheers. [music]

This episode of Data Science Mixer was produced by Susan Currie Sivek (@SusanCS) and Maddie Johannsen (@MaddieJ).
Special thanks to Ian Stonehouse for the theme music track, and @TaraM for our album artwork.