Alter Everything Podcast

A podcast about data science and analytics culture.
In this episode of Alter Everything, we chat with Barzan Mozafari, CEO and co-founder of Keebo. We discuss how Keebo reduces data warehouse costs, optimizes data pipelines, and equips data professionals to leverage AI effectively. Barzan shares insights on Keebo's mission, technology, and approach to automating and enhancing data operations. He also offers advice for data professionals on optimizing pipelines, tackling challenges in data management, and embracing AI to stay ahead in the industry.

 

 

 

 



Episode Transcription

Ep 174 Data Pipeline Optimization

[00:00:00] Megan Bowers: Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Megan Bowers, and today I am talking with Barzan Mozafari, CEO and co-founder at Keebo. In this episode, we chat about how his company cuts down data warehouse spend, the challenges of optimizing data pipelines, and how data professionals can equip themselves to better leverage AI.

Let's get started.

Hey Barzan, it's great to have you on our show today. Thanks for joining us. Could you give a quick introduction to yourself for our listeners? 

[00:00:38] Barzan Mozafari: Sure. Thanks for having me. I'm the co-founder of Keebo, a data learning platform. Prior to Keebo, I was in academia. I was a professor specializing in machine learning and database systems at the University of Michigan.

And prior to that I was at a few other universities, MIT and UCLA. Before that I also worked for a number of companies along the way, but I've spent the last two decades of my career at the intersection of AI and database systems. So happy to tell you more about what we do, but in a nutshell, that's my background.

I've been doing databases and ML for quite some time now.

[00:01:10] Megan Bowers: That's great. Yeah. I'm excited to pick your brain and hear about your expertise from all that experience. 

But I'd love to just start off with a little bit more about your company, Keebo. If you could just tell us what you do, what the company's mission is, and how your technology works.

[00:01:27] Barzan Mozafari: Sure. So we're a data learning platform, and our mission is to empower data teams to take control and drive growth through automation. So from a high level, what we do is that our platform learns from how users and applications interact with the data in the cloud, and then uses that to automate and accelerate the tedious aspects of the interaction between the data team and their data in the cloud.

And by doing that, and we can talk about how it does it, but by doing that, you know, we significantly reduce the amount of manual time that has to be spent on optimizing and operating these data pipelines. We reduce the cost of the infrastructure, whether it's the cloud data warehouse or whatnot. But we also boost performance as a result of the optimizations that we actually do on behalf of the customer.

[00:02:09] Megan Bowers: Gotcha. 

So what do you mean when you talk about a data learning platform, like learning how people use their data? What does that really mean?

[00:02:18] Barzan Mozafari: So if you think about it, like, you know, just to give a bit of context for your listeners, the likes of Snowflake and Google's BigQuery and Amazon Redshift, what they've really done is that they've enabled companies to innovate very quickly and tap into their data with no upfront investment and effort, right? You now have a lot more data teams and a lot more users tapping into their data and trying to drive value from it, and that's great, by the way, because that has reduced the CapEx barrier, the upfront barrier.

But what that has done is that now there are more users and applications that are tapping into data, and they're actually combining more data sources. They're creating more complex data pipelines. As a result of this, you have today's data pipelines, which are incredibly complex. You know, they cost a lot to run and operate.

The customer's Snowflake cost goes through the roof. And that means that you need a lot of data engineers to constantly fine-tune these data pipelines and optimize these queries one at a time, and then try to keep the cost within reason. And what Keebo does is that we actually go in there, and then we analyze the metadata and performance telemetry.

Like, hey, how are these users using the database? What are these queries doing? What are the resources they need? And then, you know, for example, one of the models we use is reinforcement learning, where, for each customer, we train a unique agent that basically plays with different levers.

For example, if you have a large data warehouse, do you need a Large 24/7? Maybe there are times when you need an X-Large. There are times when you can get away with a Small or even an X-Small. And it starts making these changes in real time, transparently to the user. So they still manage to get their job done, but at a significantly lower cost, because of all these real-time optimizations that we do, by having learned how these queries interact with the data, what kind of resources they need, what time they come in, what time they finish.

What queries work well together when they're run simultaneously, what needs to run on which kind of warehouse. There are a lot of these complex decisions that are just not humanly possible for someone to make by staring at, like, 1 million queries and saying, you know what, I think tonight between 2:00 AM and 7:00 AM this warehouse should be a Medium instead of a Large. That's the advantage of AI.

I call it an infinitely competent, infinitely patient DBA. It can basically analyze and, you know, sift through tens of millions of transactions and log entries and figure out what's the optimal thing to do for the user based on whatever SLA they have in mind.
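
The warehouse-sizing idea described here can be sketched as a tiny reinforcement-learning loop. Everything below is a hypothetical toy, not Keebo's actual system: the sizes, credit costs, SLA penalty, and simulated environment are all invented for illustration, and a real agent would learn from live telemetry rather than a simulator.

```python
import random

SIZES = ["XS", "S", "M", "L", "XL"]
COST_PER_HOUR = {"XS": 1, "S": 2, "M": 4, "L": 8, "XL": 16}  # invented credits/hour

def run_slot(size, demand):
    """Simulated environment: cost of running one hourly slot, with a large
    penalty when the chosen size is too small for the workload's demand
    (standing in for queued queries and a missed SLA)."""
    cost = COST_PER_HOUR[size]
    if COST_PER_HOUR[size] < demand:
        cost += 100  # SLA-miss penalty
    return cost

def train(demand, episodes=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy bandit: try sizes, track average observed cost per
    size, and mostly exploit the cheapest one that still meets the SLA."""
    rng = random.Random(seed)
    totals = {s: 0.0 for s in SIZES}
    counts = {s: 0 for s in SIZES}
    for _ in range(episodes):
        # Explore until every size has been tried, then mostly exploit.
        if rng.random() < epsilon or not all(counts.values()):
            size = rng.choice(SIZES)
        else:
            size = min(SIZES, key=lambda s: totals[s] / counts[s])
        totals[size] += run_slot(size, demand)
        counts[size] += 1
    return min(SIZES, key=lambda s: totals[s] / max(counts[s], 1))
```

A quiet overnight slot (low demand) converges on a small warehouse, while a busy daytime slot learns it needs a larger one; the real decision space, as described in the conversation, also covers scheduling, query routing, and per-warehouse placement.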

[00:04:44] Megan Bowers: I love that. An infinitely patient database administrator.

I feel like a lot of people probably would love to have that. I know I hate bothering any sort of administrator like that. So all of that AI is basically running in the background. Users are coming in and running their queries, and in the background it's taking all of that chaos and creating order out of it.

Is that what's happening?

[00:05:07] Barzan Mozafari: Exactly. 

Um, having come from academia, honestly, to this day, if you look at all the buzzwords around LLMs and gen AI, and AI in general, right? One of the biggest barriers there is just the adoption barrier. And when you dive deeper into what's preventing enterprises from embracing AI, I think it just comes down to four things.

One of them is the implementation barrier. Like, hey, okay, fine, what are the kinds of skill sets I need to have to implement AI? How many resources do I need to set aside, right? The second barrier is typically the maintenance barrier. Okay, fine, let's say I got this up and running, but what about on an ongoing basis?

How much of my attention as an organization should be invested in making sure that this AI doesn't go wrong, doesn't break anything, and just keeps doing what it's supposed to be doing? There's a security slash privacy barrier, where it's like, hey, what about compliance? Do I want AI to come in and learn from my data and whatnot?

And finally, there's the ROI barrier. So the way we've created Keebo, we've carefully made the right design decisions for each of those barriers. And that's one of the reasons we've actually been growing very rapidly. For example, to answer your question, the implementation part of it is pretty straightforward.

All you need is like half an hour of one engineer's time, where you basically create a Keebo user and a Keebo role, create a view that only has access to those particular metadata columns that we need, grant it to Keebo, and then you're good to go. It's a set-it-and-forget-it kind of solution.

It's not one of those things where you have to log in every single day and keep tweaking it. It's not, you know, a crying child in the room; it's running behind the scenes, transparent to users. And we used to have an inside joke where we tell customers, if we ever need more than 30 minutes of your time to implement Keebo, we'll send you an iPad for free.

And to this day, we've never had to buy one. So the implementation part has been pretty helpful in gaining traction.

[00:07:05] Megan Bowers: Very cool. And of those four challenges that you just mentioned, which do you think was the hardest to overcome?

[00:07:14] Barzan Mozafari: That's a very good question. Which one? Like, you know, when you're in a startup, every challenge feels like a pretty big one.

Right? But to be honest with you, I think the luck we had was this: what happens to a lot of technologists is that they build something and then try to figure out how to get it into the customer's hands, how to sell it, and how to, you know, figure out the go-to-market around it. I think it was a lot easier for us at Keebo because we started with the end goal. You know, we started thinking, hey, we wanna make sure it's a no-brainer ROI, no-brainer security, no-brainer adoption, no-brainer implementation.

And we actually built the product with that end goal in mind from day one. For example, on the security side of it, we have a lot of public companies using our product, which are usually very sensitive about data and whatnot, but we get past their security reviews, like, you know, in a matter of minutes, because they look at the deployment model and they see that not only do we not store any data, we don't even access it.

We actually train on metadata only. It was tempting, I'll be honest, it was tempting to try to incorporate, you know, the database content and the query text and all of that in the training. But then I think what was interesting was that we managed to do what we're doing now and deliver a lot of savings and value to customers without even looking at the data.

So it's hard. If you ask me which one was the hardest, I would say, you know, each of 'em was the hardest. They're all the hardest. But I think it did help that it wasn't like we created the product and then saw how we could overcome these. We were like, how can we build a product that checks these boxes?

And on the ROI part of it, a lot of customers have been burnt out there, right? By, like, investing in AI and never getting an ROI. So the way we overcame that was through a pretty creative pricing structure, where we told customers: we just charge you one third of whatever we save you. Right? So essentially the idea there was that, hey, we should have slightly more confidence in our own product than we expect the customer to have.

So if we're confident that we're gonna be able to deliver that value, we're just gonna put our money where our mouth is. We tell the customers, if we save you $0, we charge you $0. If we save you $3 million, we charge you a percentage of whatever we saved you. So it's success-based pricing. In a way, it guarantees that ROI to the end user.

That, hey, look, these guys are not gonna get paid unless they deliver the value. So, you know, they don't need the budget; we make our own budget. And I think, at least looking back, that played a pretty big role in our traction and the growth that we saw in the market.
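
The success-based pricing described here is simple enough to express directly. This is just an illustration of the "one third of whatever we save you" idea from the conversation, not actual contract terms:

```python
def success_based_fee(savings_usd):
    """Charge one third of realized savings, and nothing if there are no
    savings. (Illustrative only; real terms and percentages will differ.)"""
    return max(savings_usd, 0) / 3
```

So a customer who saves nothing pays nothing, while one who saves $3 million pays $1 million and keeps $2 million net, which is what makes the ROI a "no-brainer" for the buyer.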

[00:09:49] Megan Bowers: That's really creative.

I had not heard of that from any sort of AI company. But what I have heard of is, you know, the statistics on how many companies wanna implement AI, but only this many are seeing value delivered. It's always like there's this big gap. Why is there this gap? So I think it's really interesting how you're dealing with that differently.

Another thing I wanted to chat about, from your experience not only at your current company but previous ones: what is your advice for optimizing data pipelines? We have a lot of data analysts in our audience as well as data engineers. So what is some of your advice for optimizing?

[00:10:31] Barzan Mozafari: That's a very good question.

Like, if I get to put my professor hat back on here. Mm-hmm. I think a lot of it has to do with data literacy and SQL literacy and database literacy. Right? You know, engineers learn, when they're in school, how to write proper SQL, how to model the data properly, what some of the best practices are, and whatnot.

But I think what's happened now is that that's just training that takes place on the engineering and computer science side of things. But the beauty of what's happening out there in the market is that, with the likes of Snowflake and similar solutions out there, you don't need to have a PhD in database systems to be able to tap into your data, right?

You can be a business analyst, you can be a sales leader, a marketing leader, and these tools allow you to actually tap into your data. So it's hard to come up with advice that would be applicable across the board. Like, we can't ask everyone to go back and take a course on data science, or on data engineering, or on database systems.

But one thing I have seen, and this is just a purely empirical observation, is that far too many data engineers are lost in the weeds of, hey, here's a really inefficient query, let me see how I can rewrite this query and optimize it to improve performance. And nine out of 10 times, there's no ROI either. And the reason for it is that, you know, back in the day, if you had a SQL Server, there was maybe a small team of DBAs, one, two, or three DBAs, and there was an entire organization depending on these DBAs. You know, you would optimize one query, and then tomorrow optimize the second query and the third query.

But now we're talking about millions and millions of queries coming in every day. And if you look at the pricing structure out there, there's just not enough ROI for you to manually optimize your queries. So I think my best advice for data engineers out there who are trying to manually optimize their data pipelines is that, to be honest with you, it's just not easy. The analogy I give is a hotel room, right?

Like, let's say that, you know, you and your friends go and book a hotel room. You pay a fixed rate per night. So it doesn't matter if the three of you were out there hanging out, having a meal at the restaurant, or inside that room the entire time; you get charged the same amount. The same thing is true with a lot of these cloud offerings: you are renting this resource, and it doesn't really matter if you're running one query in there or 10 million queries, for the most part.

So even if you have 1 million queries, even if you manage to magically make the 1% slowest queries take zero time, you're not really saving that much money, if you think about it, as long as the other 99% of the queries are still running in that warehouse. So that's why it's not a very viable approach. Obviously, if you have an existing data pipeline, I think it's a futile effort, to be honest with you, to try to hand-optimize it.

But it does, like, you know, help from the get-go, before you get ahead of yourself, before you create something too complicated, to really think about what you're trying to achieve. Like, write proper SQL, model your data, and make sure that, you know, the users who are writing expensive queries know the basics of how database systems work and how to write efficient queries.

And not everything has to be a five-page-long SQL query. I think those are some of the prevention techniques. But honestly, once we are dealing with a mess, we've seen data teams spend sometimes months on end and just get maybe a 10% reduction in overall cost. So I'm not sure there's a lot of ROI in manual work.

[00:13:58] Megan Bowers: That makes sense. When you democratize access like that, it can get out of control pretty quickly.

Um, are there other challenges that you see data teams facing when it comes to data warehouse cost and performance optimization?

[00:14:17] Barzan Mozafari: I do. I think what's happened is that, with the macro economy we're seeing these days, everyone's looking for ways to cut costs. So you see sometimes, like, headcount disappearing. Sometimes companies unfortunately go through downsizing and layoffs, right? And a team that used to be 20 people is now 10 people strong. But the responsibilities and expectations are not going away. So those same companies that try to save money by reducing the size of their data teams are also expecting those data teams to get more done, to support, you know, the growth of their business, but at a lower cost.

So, hey, you used to spend $5 million a year on your Snowflake bill; can you cut it down to half? Right. So that's one of the major challenges these days: the demand has stayed the same, sometimes has actually increased, but the resources have gone down, whether it's infrastructure resources, or the budget, or the headcount.

And I think that's a major challenge: people need to do more with less. So that's a very common theme when you talk to these customers. And it's not always easy to support a large organization with different business units, right? Like the marketing team, the supply chain team, the inventory team, the sales team. Each and every one of those teams has their own use case, and you have to support them.

So the biggest challenge we're seeing right now is just that data teams are spread very thin, and they don't have the resources they need, whether it's in terms of talent, headcount, or just pure infrastructure. Like, you know, back in the day, people used to throw money at the problem, because it was all about growth at whatever cost.

But now, you know, we're back on planet Earth. Companies, whether they're public or private, have to justify their spend. And it's difficult, because this is not what cloud data warehouses were created to do, right? They essentially turn a CapEx problem into an OpEx problem. They lower the upfront cost.

But they definitely increase the operating cost. So that's part of what we're trying to do with our solution: make cloud data warehousing affordable in the long term as well. Not just, you know, lowering the upfront cost, but also the ongoing operating costs, and trying to cut those as well.

[00:16:25] Megan Bowers: Yeah, that makes sense.

And I can see ways that, for that problem of data teams having to do more with less, your company fits into that, as well as Alteryx fitting into that. Once the data pipeline is set up, Alteryx enables data teams to automate and run workflows, things like that, or to get Alteryx into the hands of the supply chain people, the finance people, the people that need the data pulls, and to empower other teams to work with data efficiently while having a smaller, centralized data team. But I know that journey can be very challenging, especially if a company was already set up with their data team having a lot of resources, and now they're finding themselves in a time of much tighter resources.

Is there any advice you'd give to teams that are facing that situation? 

[00:17:17] Barzan Mozafari: I think we have actually seen a lot of Alteryx customers also, like, starting with their Snowflake bill, so I've had that exposure as well. I think most of the time it's just a matter of priority, right? There are times where, you know, I sometimes joke with my own team and say, hey, for every 10 things that are broken, I promise you we only have resources to fix one of them.

So let's decide which nine we're gonna keep broken. Right. But the beauty of the era we are living in is that you don't actually have to leave everything broken. There are certain things that only data teams and data engineers and data analysts can actually do. There are things that require deep domain expertise and manual work to get them right.

Like, you know, if you're trying to clean up your data model, if you're trying to document something, if you wanna make sure that you get alignment amongst your team about what needs to be done and how to drive growth for your business, I would say just focus on those things, rather than things like, hey, how do I make this query run, like, 20% faster? How do I reduce my Snowflake bill by this much? How do I get visibility into these things?

One of the greatest things about engineers is that they have a can-do mindset, because they build things. But that also turns into a weakness sometimes, because there's this tendency to always build everything in-house. We've come across data teams at companies where this happens.

For example, there was a gaming company. Their business is to develop interesting games, and that's how they win, right? But we were seeing that, you know, their entire data team was spending 90% of their time trying to build things that they could just go and buy off the shelf for a fraction of the cost it would take those engineers to build in-house.

So I would say, try to get over this build-versus-buy tendency. Especially now with AI, you can actually leverage solutions like Keebo to just slash your cloud data warehousing spend overnight. And then you can, you know, spend that energy and creativity, and fulfill that need for building things, by building things that are actually beneficial to your business, right?

Like, you're not in the business of selling a solution that optimizes your Snowflake. You're a marketing company, or you're an e-commerce website, or, you know, you're in the auto industry. That's not what's delivering the most value to your stakeholders. So just focus on what's driving your business forward.

There are a lot of times where people just feel like, because they can do something, they should do it. But I think it's just a matter of priorities. So if I were to summarize that advice into one sentence, it's: stop building things that you can buy for a fraction of the cost of building them in-house.

[00:19:49] Megan Bowers: Definitely, and I think some of those things you said were a nice transition into the last question I wanted to ask, which was, how can data engineers and data professionals better equip themselves to leverage ai? 

[00:20:02] Barzan Mozafari: So there's a lot of people these days who are excited about AI, but there's also a lot of people who are terrified by it, by the prospect of, hey, what if AI just comes and steals my job, replaces me, and all of that?

I would say this is gonna happen. AI is taking over a lot of things. This is where the macro trends are moving. So the only thing that you can do is be on the right side of history. You can resist it. You can tell your boss, no, no, this solution doesn't work, that solution doesn't work, let me and my team manually build this in-house.

But you're just delaying the inevitable. Whether it happens three months from now, six months from now, or a year from now, if you're doing something that could be automated, it's gonna get automated. So my best advice would be to embrace AI. Back in the day, if you needed to learn a subject, you'd have to quit your full-time job or take, you know, some classes in the evening.

But these days there are plenty of online resources. You can go and take a bunch of classes on machine learning, on statistics, on artificial intelligence, and just get up to speed and educate yourself. I think education is the best key to long-term success. If you are doing something that nowadays AI can automate, well, yes, in the near term AI is a threat to what you're doing. But the way to defend against it is by making sure you're spearheading that initiative, to actually use AI, leverage it, educate yourself, and figure out how to use it most effectively, and then move on to the things that AI cannot do. If you know how to leverage AI, I think you're gonna be in even bigger demand in the long term, as well as the short term, to be honest with you.

So I would say, just embrace it. Try to go back to school, try to find out what's out there, educate yourself. There are gonna be even more opportunities for people who have that AI background and who are equipped with those kinds of skills.

[00:21:54] Megan Bowers: I think that's great advice. And I'm also thinking back to those challenges you mentioned before, like the challenge of adoption, or of monitoring the AI. You know, maybe data professionals can focus in on those things. If their job was to manually do spreadsheet work and that's getting automated, then it's like, how do we monitor some AI solution or some automated solution that we've put in place? How do we make sure that we're checking the quality? How do we make sure that nothing's getting out of hand?

[00:22:23] Barzan Mozafari: mean, just knowing what's the right type of hammer for the nail that you have. 

[00:22:26] Megan Bowers: Yeah. 

[00:22:27] Barzan Mozafari: That requires skills, that requires expertise that you can pick up, because there are so many different subfields within artificial intelligence, within machine learning. And for you to be able to make the right recommendation and adopt the right tool at the right time for the right problem, that itself is a pretty important skill to pick up in this market.

[00:22:43] Megan Bowers: Awesome. 

Well, it's been great to have you on our show. Thanks for a really insightful chat.

[00:22:48] Barzan Mozafari: Appreciate it. It was great chatting with you, Megan. Have a great day. 

[00:22:51] Megan Bowers: Thanks for listening. To learn more about topics mentioned in this episode, head over to our show notes on alteryx.com/podcast. And if you liked this episode, leave us a review.

See you next time.


This episode was produced by Megan Bowers (@MeganBowers), Mike Cusic (@mikecusic), and Matt Rotundo (@AlteryxMatt). Special thanks to @andyuttley for the theme music track, and @mikecusic for our album artwork.