
Alter Everything Podcast

A podcast about data science and analytics culture.
Podcast Guide

For a full list of episodes, guests, and topics, check out our episode guide.

AlteryxMatt
Moderator

In this episode of Alter Everything, we chat with Alex Patrushev, Head of Product at Nebius. We discuss the gaps organizations face between data and business impact, strategies to bridge these gaps, and the role of AI in these processes. Alex explains Nebius' mission to make AI accessible, the challenges of building data centers and software from scratch, and innovative solutions like their data center in Finland. The conversation also covers key components for effectively bridging data and business impact, such as project selection, stakeholder communication, team skills, data quality, and tech stack.

Panelists

  • Alexander Patrushev, Head of Product for AI/ML @ Nebius
  • Megan Bowers, Sr. Content Manager @ Alteryx - @MeganBowersLinkedIn


Transcript

Episode Transcription

Ep 188 Bridging the Gap Between Data and Impact
===


[00:00:00] Introduction to the Podcast
---

[00:00:00] Megan Bowers: Welcome to Alter Everything, a podcast about data science and analytics culture. I’m Megan Bowers, and today I am talking with Alex Patrushev, Head of Product at Nebius. In this episode, we chat about how organizations face gaps between data and business impact, strategies for bridging these gaps, and how AI fits into these challenges.

Let’s get started.


[00:00:33] Meet Alex Patrushev
---

[00:00:33] Megan Bowers: Welcome to the show, Alex. It’s great to have you on here. Could you give a quick introduction to yourself for our listeners?

[00:00:39] Alexander Patrushev: Yeah, thank you, Megan. First of all, my name is Alexander—you can call me Alex. I’m the kind of person who’s always looking for new challenges. I love sports, especially mountain biking.

I really love to go through the forest and the hills. Yeah. And from a professional perspective—right now I’m working as Head of Product at Nebius, one of the rising neo-clouds.

[00:01:03] Megan Bowers: Gotcha.


[00:01:04] Nebius: Mission and Challenges
---

[00:01:04] Megan Bowers: So can you tell us a little bit more about what Nebius does and just dive a little more into your role?

[00:01:10] Alexander Patrushev: Our main mission is to make AI accessible to anyone, regardless of their level of expertise or the use case.

So ideally, we want to build a platform where any person could build a chatbot, train their own huge model—a new state-of-the-art language model—or just fine-tune something, or maybe just build their personal task planner for the day. We want to build that, and for that, we are building multiple products. What I really love at Nebius is that we control everything.

We’re building everything ourselves, literally. We’re building the data centers, we’re building the servers, the storage. We’re building the software, we’re writing software, and all of that is a cloud—it’s on-demand with self-service. So this is what we are trying to build, and I think it’s already available to a lot of people.

[00:02:09] Megan Bowers: Awesome. Is that challenging? I mean, building everything from the ground up—like when it comes to data centers—it seems like such a different thing than the actual software. How do you guys manage all of that?

[00:02:23] Alexander Patrushev: It’s not a simple thing. Honestly, I personally don’t really know a lot—maybe I know more than the average person—but since we have multiple things in Nebius, including data centers and the service, we actually have a team who’s responsible for that.

And they’ve been doing that for many years, like more than 20 years. So they know exactly what to do, how to do it. And what is also important—they’re not just doing it the same way every time. They’re continuously looking for new innovations.


[00:02:53] Innovative Data Center Solutions
---

[00:02:53] Alexander Patrushev: Like for example, our data center in Finland doesn’t use coolers. It’s in a specific location where you have a lot of cool air all year round. You can just cool with the air, and in the winter, you actually need to heat up the air before you put it in the data center.

There’s a lot of heat waste. So in Finland, that heat is actually recirculated through a special system to the next village, and there are several thousand people heated by the data center during the winter.

[00:03:25] Megan Bowers: Wow, that’s super cool. Really interesting use of that power if you know what to do.

[00:03:31] Alexander Patrushev: That’s not hard—if you know how to do it, that’s not hard. But at Nebius, we know how to do that.


[00:03:37] Bridging the Data-Business Gap
---

[00:03:37] Megan Bowers: Going back to AI and data, I’d love to hear about how you see organizations facing a gap between their data and business impact.

[00:03:48] Alexander Patrushev: Yeah, that’s a super important question. So what you can see right now—first of all, is that data is a super important thing. You cannot do anything without data.

And you need not just data—you need really high-quality data. Because like, back in the day in data science, we had that rule—in AI as well—“garbage in, garbage out.” That means you can’t do anything really high quality if you haven’t worked on the data. So the data is super important. But at the same time…

The business impact is another important thing. If you build something but that something doesn’t solve a specific problem of the business—doesn’t help the business—then no one will use it. So you always need to balance between what you could build easily, for example with existing data, and what’s super important for the business.

At the end of the day, it means that sometimes you need to stop, you need to look around, maybe collect something new, maybe buy some new data. But you always need to start from the business requirements. So there is definitely a gap, and there are a lot of ways you can close that gap.

[00:05:05] Megan Bowers: What are some of the things that contribute to that gap?

I think there’s the information side and also the infrastructure side. So on the information side—could you talk a little bit more about data drift or things that cause companies to face challenges with their data?


[00:05:22] Key Components for Data Success
---

[00:05:22] Alexander Patrushev: So, about the data as information, I think there are three most important things.

The first one will be availability of the data.

And here it’s about—do you have the data today? Maybe you don’t have it today, so maybe you need to start to collect that. Maybe you can just buy some data that you’re missing. Or there’s also a new way—to generate the data. To build a big model, you need a lot of data to fine-tune a model. But if you don’t have it, one of the things Meta released in open source—I really like that—they call it Synthetic Data Kit. So it’s a special framework, Python library, which you can use—you can provide, let’s say, some data that you have, and then you can clarify, you can describe what you want to get. Like for example, you want to get a dataset with questions and answers to train a chatbot, a support bot.

And this library will actually use big models under the hood to build you a new dataset. So it’ll be a synthetic dataset based on the real data, and you will use it.
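For readers who want to try this idea, here is a minimal Python sketch of the same pattern—seed documents in, synthetic Q&A pairs out. It is not the Synthetic Data Kit's own interface; the client setup, model name, and prompt are placeholder assumptions for illustration.

```python
# Illustrative only: generate support-bot Q&A pairs from seed documents with a
# large model, in the spirit of what Alex describes (NOT the Synthetic Data Kit API).
import json
from openai import OpenAI

client = OpenAI()  # assumes an API key / compatible endpoint is already configured

def synthesize_qa(seed_text: str, n_pairs: int = 5) -> list[dict]:
    """Ask a large model to turn one seed document into question/answer pairs."""
    prompt = (
        f"Read the following support document and write {n_pairs} "
        "question/answer pairs as a JSON list of objects with keys "
        f"'question' and 'answer'.\n\n{seed_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    # Naive parsing for the sketch; real pipelines would validate the output.
    return json.loads(response.choices[0].message.content)

# Usage: run over your real documents to build a synthetic fine-tuning set.
dataset = []
for doc in ["Our refund policy lasts 30 days...", "To reset your password..."]:
    dataset.extend(synthesize_qa(doc))
```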

So that’s the first thing—you can collect, you can buy, or you can just generate data. Then, what’s also important about availability—if we think about a big company, an enterprise, that has multiple teams—like 10 teams doing different projects—and especially if it’s more regulated, like FSI, like banking or healthcare—it’s not only about the number of teams, it’s also about the regulation of data access.

So for them, it’s super important not just to collect the data. For them, it’s actually really important to build a data catalog. So if you have several teams and you have some regulations, it actually means that you need to write a data catalog, because that will be the place where all the teams could exchange data. For example, someone already processed credit cards or cleaned the personal data, so they can just publish it in a data catalog and other teams could just use it.

So a data catalog is also super important if you have more than one team. And don’t forget that you need versioning of your data. For any reason, if you need to go back to check where the problem is coming from with the model, you won’t be able to do that if you just recreated your data—you just rewrote your data.

So versioning is super important, and I believe it’s super easy. There are a lot of nice tools like DVC—Data Version Control. There are open-source tools. If you want, you can just go and use any object storage—like AWS S3, Nebius S3 Object Storage, or like Google GCP—and you can enable versioning of the data.

It doesn’t cost a lot, but it gives you a way to go back and check where a problem came from. So that was the first thing—availability.
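As a concrete example of that last point, here is a hedged sketch of turning on object versioning with boto3 so every overwrite keeps the previous copy of a dataset retrievable. The bucket name is a placeholder, and non-AWS S3-compatible stores generally accept the same calls with an extra endpoint_url.

```python
# Enable versioning on a bucket, then list historical versions of one dataset file.
import boto3

s3 = boto3.client("s3")  # add endpoint_url=... for a non-AWS S3-compatible store

s3.put_bucket_versioning(
    Bucket="my-training-data",  # placeholder bucket name
    VersioningConfiguration={"Status": "Enabled"},
)

# Later: inspect the version history of a file to roll an experiment back.
versions = s3.list_object_versions(Bucket="my-training-data", Prefix="train.parquet")
for v in versions.get("Versions", []):
    print(v["VersionId"], v["LastModified"])
```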

Then there’s the second thing, which is quality. So the quality of data is really important. We mentioned at the beginning—it’s about cleaning the data, removing noisy data. But it’s also about what you mentioned—data drift. We’re living in a continuously changing world around us, which means in a couple of months you might have a different population of your service, of your users. You may have different patterns, or just new data coming in—and you need to do monitoring of the data.

I wouldn’t say that a lot of people are doing that. So you need to detect things like data drift, concept drift. That will help you to achieve several things. First of all, your quality will not go down. That’s important. Your users will still be enjoying a good service.

The second, which is also super important—if we’re talking about fine-tuning or building your own models—it costs money. And you need GPUs to do that. It’s not a question of whether you need to retrain the model. It’s like back in the day with other models—you were doing the same. But now, it costs a lot because you need a lot of GPUs. If you do that just because you think, like, “I think we need to do it monthly,” or “maybe weekly,” you’ll actually burn money.

It’s lucky if the quality doesn’t drop—then you just burn money. But if the quality does drop and you don’t see that, you’ll also affect your users. So it becomes a money problem and a user satisfaction problem.

So for that, you need to continuously monitor the data—retrain only when you actually need it. If you do it this way, that’ll make sure your users aren’t churning—they’re happy with you. And second, you actually only retrain when you need to. Maybe you only need to do it once a year.

You’ll save a lot of money if you don’t do it every month just because you “think so.” And again, it’s a repeatable process. So the monitoring should be done continuously. It’s up to you to decide how often—every day, every month, every week. I think you can start more often and then slowly reduce—like, start with weekly data drift detection, do that for three months, then switch to two months. Because again, you’ll save some money—you don’t need to do it continuously.
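A minimal sketch of what such a scheduled check might look like in Python, using a two-sample Kolmogorov–Smirnov test per numeric feature. The threshold, file paths, and retraining hook are illustrative assumptions, not recommendations.

```python
# Compare this week's feature distributions against the training reference and
# flag retraining only when drift actually shows up.
import pandas as pd
from scipy.stats import ks_2samp

def drifted_features(reference: pd.DataFrame, current: pd.DataFrame,
                     p_threshold: float = 0.01) -> list[str]:
    """Return numeric columns whose distribution shifted significantly."""
    flagged = []
    for col in reference.select_dtypes("number").columns:
        stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        if p_value < p_threshold:
            flagged.append(col)
    return flagged

# Usage inside a weekly job (paths and downstream step are hypothetical):
# reference = pd.read_parquet("s3://my-training-data/train.parquet")
# current = pd.read_parquet("s3://my-training-data/last_week.parquet")
# if drifted_features(reference, current):
#     trigger_retraining()
```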

And probably the last important thing is to make sure your data is diversified. So nowadays, we see that multi-modal models are stronger. They’re really good in terms of—you can interact with them directly. For the images. For the text. Now there’s also voice. Soon we’ll see really strong video models.

So modality is also important.

[00:11:05] Megan Bowers: That makes a lot of sense. I think some really good points there.


[00:11:08] Infrastructure Challenges and Solutions
---

[00:11:08] Megan Bowers: And then if we look at it from the infrastructure side—the physical side—where do you see organizations facing challenges?

[00:11:17] Alexander Patrushev: So the first question is—where do you store it? Because let’s say if you have structured data, like order history, you need a database. If you have images, for example, of the items that you have in the store, then you need somewhere to keep millions of images from different angles.

Again, if you look at a store use case, you also have a description of the item—like all the parameters. So you need something like a document DB in this case. So you have different types of data, and there is no one tool for everything. So you need to have multiple tools. You need object storage, you need document DB, you need a true SQL database.

Maybe you need something else. And you need to build a meta-catalog over all of that, so you’re not constantly searching across systems—you catalog the metadata for everything. So that’s the first thing.

And again, we discussed that you need to do data drift detection and you need to do it continuously, which means you also need an orchestrator. You need some place where you execute your scripts to process data, analyze the data. For example, you’ll do that with Spark or with Pandas or something else that you prefer to use—but still, you need infrastructure for that.

And that’s just the software part. Under the hood, it also means you need hardware infrastructure: servers, storage, the connections between them, working load balancing, and all sorts of other things. And I think here’s the question—how do people want to solve that?

So all the things that I just mentioned—and honestly most of the things happening in the current AI space, like with language models, with GenAI—it’s not a big difference from what we had 10 years ago with a random forest model.

The scale is different, yes. The scale is much higher now. But the principles are the same, more or less. So people can always go and build something by themselves. That’s absolutely fine—there are a lot of nice open-source tools.

At the same time, I believe if a company wants to be really effective—especially if we’re talking about digital natives, if we’re talking about startups—I know the feeling when you’re an engineer, you build something from scratch. You took virtual machines and you built everything.

It’s cool from an engineering perspective—absolutely, it’s like playing with Lego. But that’s not what the business needs. I mean, is your online shop or travel agency benefiting from the fact that you know what Linux is and that you were able to deploy object storage on five virtual machines? No,

[00:14:01] Megan Bowers: right?

They’re just like, “How long is it going to take?”

[00:14:04] Alexander Patrushev: Yeah, exactly. So that’s why I think that a lot of companies building AI nowadays could benefit from cloud providers and different cloud platforms. This is the reason why, in Nebius, we decided that we don’t want to build just a simple place where you can get virtual machines.

We decided that we want to build a full stack. So we have virtual machines. We have different kinds of storage—databases, object storage, file storage, network disk storage. We built the software stack for all of that. So if you want, you can go and track experiments with MLflow. If you want, you can process data with Spark.

Maybe you can just run data processing on Slurm—funny enough, it’s becoming really popular nowadays to use Slurm for data processing with a Pandas script. But with that tool and many others, the same idea comes up: you need a big model to run inference. That means, again, you either need a GPU to run it yourself or you need some API where you can get a model.

That’s why, at Nebius, we also built the layer where you can get models. We call it AI Studio, where you can get models through the API—405-billion-parameter models, Llama 70B, Mistral, and other models. So that’s why we built all of that.
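As an illustration of the “models through an API” pattern—not confirmed AI Studio code; the endpoint, environment variable, and model identifier below are assumptions—many hosted-model services expose an OpenAI-compatible interface that looks roughly like this:

```python
# Calling a hosted open model over an OpenAI-compatible API instead of running
# your own GPUs. Endpoint, env var, and model ID are placeholders.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-ai-studio.com/v1/",  # placeholder endpoint
    api_key=os.environ["AI_STUDIO_API_KEY"],           # placeholder env var
)

reply = client.chat.completions.create(
    model="meta-llama/Llama-3.1-70B-Instruct",  # assumed identifier for a hosted 70B model
    messages=[{"role": "user", "content": "Summarize our network setup docs in two sentences."}],
)
print(reply.choices[0].message.content)
```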

And honestly—it’s not very easy to build that. For example, if you’re talking about training—you have thousands of GPUs taking the same data to train a model. We’re talking about thousands of gigabytes per second of data transfer. It’s huge storage, and it’s a huge performance requirement. And you might have multiple such clients. And at the same time, while someone is doing that, someone else is just generating images and putting them on object storage.

So that’s why it’s not really easy. But yeah, we know how to do that.

But probably you want to stay more focused on data, AI, and the connection between data and business impact.

[00:16:08] Megan Bowers: Yeah, yeah, definitely. It is super interesting, and I think listeners can relate to some of the challenges that you’ve mentioned—especially, I feel like, around availability. I know I really could have used a data catalog at my last job trying to sort through all the data there.

So you mentioned some good examples of ways to help solve this challenge, but could you give any more on how organizations can go about bridging this gap between their data and their business impact?


[00:16:36] Effective AI Implementation Strategies
---

[00:16:36] Alexander Patrushev: I think there are five really important, let’s say, components—pillars—that you always need to think about. First of all, you need to select the project. That’s probably where you need to start. I know it’s really cool to build one more chatbot, and it’s super easy, honestly—there are millions of frameworks and samples on GitHub and around the web.

But will your business actually get anything important from it? Do you even have a place to put a chatbot? Maybe you have an offline store, a normal grocery store—why would you build a chatbot? So project selection is the first thing, and I really recommend doing a really simple exercise.

Think about three dimensions. The first is data—availability of the data. The second is the business impact. And the third is how easily you can find a solution—is the problem already solved or not? Then, for all of your projects, all of your ideas, put a number from one to nine against each dimension.

Do you have the data or don’t you? What’s the impact on the business—is it really impactful, or does it do nothing? And can you find a solution—say, five potential solutions you might reuse? You’ll end up with a really small table, depending on how many ideas you have, and you need to select the project with a really good balance across all of them.

You might have a project that looks really good in terms of total score—for example, a nine for data, so you definitely have super good data—but only a one for business impact. Do you need to do that? No. Maybe you’ll do it anyway, but the business will just ask why you’re wasting its time and resources.

On the other hand, you might have a project that’s super important for the business, but you don’t have the data at all. Or maybe it’s a super difficult project that no one has solved—right now you can’t find anything: no algorithms, no frameworks, no blogs. Do you want to start with this project?

Probably also no, just because you’ll spend a lot of time on it. As a result, what I’m suggesting is: take a project where you score in the middle on everything. Ideally a nine for everything, but that probably doesn’t exist.

[00:19:04] Megan Bowers: Or it’s already been solved.

[00:19:06] Alexander Patrushev: Yeah. In this case, you’ll get something where you have the data, so you can actually do it; it’s impactful for the business, so you’ll get support and resources; and there are existing solutions, so you don’t need to do everything from the beginning. You mostly just need to look at what people have already achieved and try to use their solutions. Maybe one of them will actually solve your problem with the quality you’re looking for.
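For readers who want to run this exercise, a few lines of pandas are enough to build the scoring table and surface the balanced candidates; the ideas and scores below are invented for illustration.

```python
# Illustrative scoring table for project selection: rate each idea 1-9 on
# data availability, business impact, and availability of existing solutions.
import pandas as pd

ideas = pd.DataFrame(
    [
        {"idea": "support chatbot",    "data": 9, "impact": 2, "solutions": 8},
        {"idea": "demand forecasting", "data": 6, "impact": 7, "solutions": 7},
        {"idea": "novel fraud model",  "data": 3, "impact": 9, "solutions": 2},
    ]
).set_index("idea")

dims = ["data", "impact", "solutions"]
ideas["weakest"] = ideas[dims].min(axis=1)  # a balanced project has no very weak dimension
ideas["total"] = ideas[dims].sum(axis=1)

# "Median for everything" beats a high total that hides a 1 or 2:
# sort by the weakest dimension first, then by the total score.
print(ideas.sort_values(["weakest", "total"], ascending=False))
```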

So that’s the first one, I think. Then it’s about stakeholder communication. I really believe it’s important, and when I say stakeholder, it doesn’t mean the vice president of the company needs to continuously talk about AI in internal communications. It’s more that when you do a project, you need to bring people onto the team who know something about it. I’ve seen a lot of memes about data scientists doing projects without any understanding of what the data means.

That can honestly become a problem. So you need to collaborate with the stakeholders—with the people whose business relies on it, whose KPIs rely on it—and with the people who understand the data. That will help you create the next important thing: a skilled, collaborative team. You need a team, probably cross-functional, that will help you collaborate with the people around it.

You also need to have the skills in the team. And it’s fine if you don’t have the skills yet—I really believe most hard skills can be learned; it’s just a question of motivation and available resources. So if you’re missing some skills, you can find a person on contract or permanently,

or, if you want, you can teach your current team. Then another important thing is data—everything we just discussed about availability, quality, and diversity. And the last really important thing, I believe, is the tech stack. Again, as we’ve discussed, nowadays there are millions of tools and millions of things around you, from open source to commercial platforms.

You need to select something that, first of all, you actually need. Don’t select something absolutely new, because you’ll spend a lot of time learning it and you’ll have a lot of failures just because you’re doing it for the first time. And second, don’t build everything yourself, because at the end of the day your business is—I don’t know—selling tickets, selling cucumbers, or something else, like generating images.

So don’t waste time getting virtual machines or physical servers, repairing them, maintaining them, deploying the entire stack, and supporting that stack. Maybe you can just take something more or less managed and quickly get up and running. Then, when your business is strong, when that use case is widely used inside or outside of the company,

you can think about optimization. That’s the moment when you can start to think, okay, maybe I don’t need managed Spark—maybe I could run my own Spark on a physical server or a virtual machine and it’ll be cheaper. So I think those are the five most important things: project selection, communication, team, data, and tech stack.

If people think about all of them, that will actually help to really close the gap between the data and the business.

[00:22:22] Megan Bowers: Definitely, those are really good guidelines. I really like that, the way you laid that out. I’m curious though, what you think about, you know, there’s a really big push to implement AI, get AI projects going.


[00:22:34] The Rush for AI Projects: Risks and Recommendations
---

[00:22:34] Megan Bowers: Maybe like top levels of your company is pushing AI this, AI that. Do you think that there’s like any risks around widening this gap between the data and the impact with kind of the rush and the popularity of AI projects just for a AI project’s sake?

[00:22:51] Alexander Patrushev: Great question. I believe that if people just put AI everywhere, that will create problems. It doesn’t mean everything will go bad and something terrible will happen—it just means it might become super inefficient. I really believe people should use the simplest solution. So if there’s something you can solve without implementing AI, there’s no need to use it.

At the end of the day, artificial intelligence is not just deep learning, and it’s not just LLMs—artificial intelligence is bigger. All those LLMs, all that deep learning, sit inside artificial intelligence. Fifteen, maybe even twenty years ago, I remember I had—I think it was an HTC with Windows Mobile—and there was a program to navigate me through the city. At that time there was no LLM. It was actually just a simple route optimization algorithm. That’s it. But it was artificial intelligence. Yes, it was.

So implementing AI doesn’t mean you need to burn thousands of GPUs and use the biggest model in the world. I believe AI could help a lot, everywhere. For example, even with the data, AI could help. Imagine a data catalog—you just mentioned that you were spending a lot of time just searching for data. Why not use AI to annotate the data? Make a really nice annotation for each column, and it’ll make the data catalog shine—everyone will start to use it because it’s super easy to search. AI could help here, yes. But what I want to keep people from doing is putting the most state-of-the-art model everywhere. It’s just not efficient.


[00:24:30] Nebius' AI Journey and Insights
---

[00:24:30] Megan Bowers: I feel a good place to end would be to just hear a little bit about how your company’s implementing AI and, and what you’ve learned along the way from that.

[00:24:38] Alexander Patrushev: Ah, yeah, that’s good. So we’re building the cloud and the platform—the GPUs, the MLOps, and AI behind the API. We have our own AI team, and they’re doing a lot of research. One of the latest things they open-sourced was a flash attention implementation in JAX. Before that, they open-sourced the dataset they used to train their coding assistant agent, which achieved 3% on SWE-bench—at that time, it was state of the art for non-commercial, open-source models.

So they open-sourced that dataset—it’s on Hugging Face—and they’ve written a lot of articles about how they trained it and how they cleaned the data. Back in the day they were also training a 300-billion-parameter mixture-of-experts model. And right now we’ve implemented a lot of chatbots.

We have a chatbot for support—our support is fully automated with it, and it’s really convenient for users. We have an internal HR bot. We have an internal solution architect bot, which has access to the data in Confluence and many other places. You can just ask a question—what’s the speed of the network, how do I configure something—and it’ll give you an answer based on the documentation,

Confluence, some Slack channels. So it’s super convenient. But while we’re doing all of that—training our own 300-billion-parameter models, training all those software agents, preparing data, and building internal chatbots—why are we doing it? Because we learn from it. And that’s super important for us, because, as I said, our main business is to build a cloud for AI.

And here’s a good question: how can you build a cloud specifically specialized for AI if you have no idea what AI is and how to build an AI project?

[00:26:33] Megan Bowers: Right?

[00:26:34] Alexander Patrushev: So the insights that we get from this team are available to us. They help us burn GPUs, they help us optimize networking, the hypervisor, and storage for different patterns.

We see a lot of insights, and what’s super important is that it’s not just insights you observe—you can actually go and ask why something is happening. They’ll tell you: okay, this is what’s happening, this is how the infrastructure is set up, this is how the environment is set up. And you can find a lot of things.

For example, storage: some storage might have a read-ahead cache. If you’re reading small blocks but you have a big read-ahead cache, you’re actually reading a lot more than you need, and your storage starts to underperform just because of all those reads—reads that aren’t necessary and only happen because of how the system is configured.

You can actually find all of these things if you have access, and that team gives us all of that. And it’s the same for all those internal chatbots. While we’re building them, we learn which database works best, and then we can build that database as a managed service for our clients.

What does it mean to build RAG, for example? What are the important parameters? With that kind of knowledge, we can build RAG as a service, because we know which parameters should be optimized, which should be configurable by the user, and which should be hidden from the user—because we still want to build a managed solution.

So in our organization we’re doing a lot of things—research, training, dataset creation, open sourcing—and we’re building internal tools. All of that helps us be efficient and gives us a lot of insights about how to build the cloud for AI.

[00:28:21] Megan Bowers: That’s awesome. Yeah, I can see how going through all of those efforts helps you build your product even better, because you’re using it, you’re using AI, and you’re figuring out the pain points and things like that. Very cool.


[00:28:34] Conclusion and Farewell
---

[00:28:34] Megan Bowers: And thanks so much for joining us on the show today. It’s been super interesting and it’s great to have you.

[00:28:40] Alexander Patrushev: Thank you so much. It was amazing to chat with you.

[00:28:44] Megan Bowers: Thanks for listening. To learn more about Alex and Nebius, head over to our show notes at alteryx.com/podcast. And if you like this episode, leave us a review. See you next time.


This episode was produced by Megan Bowers (@MeganBowers), Mike Cusic (@mikecusic), and Matt Rotundo (@AlteryxMatt). Special thanks to @andyuttley for the theme music track, and @mikecusic for our album artwork.