Alter Everything Podcast

A podcast about data science and analytics culture.
AlteryxMatt
Moderator

Data ops is more than just a cool-sounding name for a video game. Data Engineering has transformed and adapted to the way we use data in everyday life so much over the past decade! We are joined by Nick Schrock, founder of Dagster Labs, as he discusses what Data Engineering means to him and how the current world of data is handling the 328.77 million terabytes of data created every single day. Interested in sharing your feedback with the Alter Everything team? Take our feedback survey here

 

 


Episode Transcription

Ep 148 Present & Future of Data Engineering

[00:00:00] Megan: Hi everyone. We recently launched a short engagement feedback survey for the Alter Everything podcast. Click the link in the episode description wherever you're listening to let us know what you think and help us improve our show.

Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Megan Dibble, and today I'm talking with Nick Schrock, CTO and founder of Dagster Labs. We discuss data engineering trends, challenges in the field, why he started his company, and what makes him excited about the future of data engineering.

Let's get started.

Hi Nick. It's great to have you on our show today. Thanks for being here. Could you start off by giving an introduction to yourself for our listeners?

[00:00:48] Nick: Sure. My name is Nick Schrock. I'm the CTO and founder of Dagster Labs, which is the company behind Dagster, which is a data orchestration framework.

Prior to doing this, I was an engineer at Facebook from 2009 to 2017. While I was there, I founded a team called Product Infrastructure, whose goal was to make our application developers more efficient and productive. A bunch of open source work came out of that, actually. One of those projects is React, which I had nothing to do with, but which the CEO of Dagster Labs actually co-created, and then I personally co-created GraphQL.

So as I like to say, Pete and I were present at the creation of the full hipster stack. I moved on from Facebook in 2017. While I was figuring out what to do next, this data engineering, data orchestration problem really got me hooked, actually quite soon after I left, and the rest is history. I'm sure we'll get into that more.

[00:01:36] Megan: That's awesome. And yeah, I'm definitely excited to hear more about that. A lot of our listeners are in the field of data analytics. That's my background as well. So I'd love if you could explain some of the similarities and differences between data engineering and data analytics.

[00:01:51] Nick: Yeah, well, as you know, in this field, titles change all the time.

And in reality, the world is much messier than that. There's overlapping work, where an analyst in one company might actually look like a data engineer at another company. But someone who describes themselves as working in data analytics is typically getting business problems that are mandated to them by business stakeholders, and they are figuring out the data sets that exist and finding the needles in the haystack to figure out how to answer that question for the business leader.

So there's often a lot of time spent doing ad hoc analysis, figuring it out, and then maybe translating that into a dashboard. Data engineering is frequently one notch below that in the stack. It's more about doing, usually, software engineering, and it's the practice of building and designing software for collecting, storing, and managing that data.

Also, data engineers frequently are themselves enabling other stakeholders to productionize those data pipelines. One kind of in-between title that has emerged in the last few years is the analytics engineer, which is sort of a hybrid data analyst and engineer, typically using dbt. They are responsible for building data pipelines and data transformations that end up being productionized.

But those are kind of the differences, I would say: data engineering is typically more of a software engineering discipline, and someone working in data analytics is more frequently a business user who knows how to use SQL, who can ask questions of the company's data and answer questions directly for the business.

[00:03:32] Megan: That makes a lot of sense to me. I've also seen what you're talking about: titles can be relative. And sometimes if you're at a smaller company, you're just the data person, right? And you could be wearing all of those hats.

[00:03:46] Nick: So yeah, you should think of them more as roles than as humans. In a lot of smaller companies, as the analyst, you have to get the data from somewhere, and there's no one else on the team who can do that.

You have to become a temporary data engineer to figure out how to do that. And then the boss is like, can we get this automated so that it shows up every day without having to talk to you? You're like, okay, well, I guess I'll figure that out. And then all of a sudden you're a data engineer for another day.

[00:04:09] Megan: Definitely. And I think some of the people who use our product, who use Alteryx, were maybe in a business role doing supply chain or finance, and then they need the data. So then they become the data analyst. So there's always that cycle of learning new technologies, learning those new roles, which is exciting and can be challenging.

Another thing I'd love to hear from you is kind of an overview of the current state of data engineering, what you're seeing in the field. 

[00:04:40] Nick: That's a good question. I think it's worth talking about the lineage of the term. You know, I think data engineering really only became a term in, let's call it the mid 2010s.

I think some of the bigger tech companies sort of invented the term. It used to be that ETL engineers or database administrators would be responsible for replicating data from an operational store to an analytical engine. So the way I define data engineering is that it's typically software engineering coming to data. That is the thrust of it: building software, writing code to do that rather than using drag-and-drop tools.

You know, there's been enormous evolution in data engineering in the last 10 years. From my standpoint, and I guess I'm biased, I think that data engineering is the core discipline across ML, analytics, and increasingly production applications. But there was a data science boomlet in the mid 2010s, and then everyone, as they worked through it, realized that 90 percent of the work they were doing was data engineering.

And so, you know, it's almost like people upstream wanted data science, but then in the reality of the world, what they had to do was data engineering. I think that similarly, there's been the rise and fall and re-rise of data engineering. For a while there was a belief, through the so-called modern data stack, which is kind of a new architecture that's been around for the last eight years or so, that with the right off-the-shelf tools and the right cloud infrastructure, you can kind of abstract away the data engineer.

You install a tool like Fivetran, which can replicate data between a SaaS app and a data warehouse. You use Snowflake or BigQuery or something, and you use dbt, and then you use another tool to do charting over that and another tool to jam data into another operational store. And then it was like, well, are data engineers going to be necessary?

And likewise, data engineers are making a comeback. Increasingly, even if you adopt all those tools, you typically need a data engineer who is responsible for keeping them alive and automating them and cohering that experience into a single data platform that the stakeholders can grapple with.

The constant theme through all that stuff is that data engineering is a field which has this lineage, from before it was called data engineering, of not being considered a software engineering discipline. People said, oh, you can outsource this to people using drag-and-drop tools and whatnot.

And tons of the innovations in the last five to 10 years in the space are effectively taking the lessons learned from software engineering and applying them to the data world. That's been called data ops by some; everything is "ops" nowadays.

But there's still a long way to go with data engineering. If you come to data engineering from an adjacent discipline in software engineering, you're like, whoa, I just felt like I went 10 years back in time. The feedback loops are slower. The tools are more clunky. Everyone feels like they're drowning in complexity.

So there's been a ton of progress in data engineering, but I still think we're at the beginning of this. The Series B announcement for the company was entitled, like, the decade of data engineering. I actually think this is one of the most important unsolved problems in the software industry, and it affects everything, from the day-to-day operations of a company all the way to the fancy LLMs that you hear about.

AI and LLMs are all the rage, and the ability of companies to actually get value out of LLMs is going to depend on them being able to access their own structured proprietary data in a sane way.

And the only way that exists is through data engineering. So it's a new field, and there are a ton of unsolved problems. That's why I'm so excited about it.

[00:08:40] Megan: That's great. And that lines up with kind of what I've been seeing on LinkedIn: some shift from all the talk about data science and ML to more talk about, well, we need the foundations.

We need the data engineering. We need high quality data that we can trust before we can even start to think about implementing other data science best practices. So I'm excited about it as well. It's a really interesting field and super necessary, like you said, with all of the talk of LLMs and AI. It's a huge buzzword right now.

We've been talking about it a lot on our podcast, and you're only as good as your data is and as your data engineers are. I think that's a great summary of where the field is at now. Next, I wanted to talk about how a lot of our listeners, like I said, are in analytics roles, but they'll still interact with data engineers, whether that's on a project team, on a data council, or for ad hoc requests.

What does this mean for analysts? 

[00:09:42] Nick: I think it completely depends on the context of the organization. That relationship can be very different. In some organizations, it means that there's a centralized data engineering team that is responsible for maintaining the data assets that are important to the company.

And then as an analyst, you have to talk to them if you want to do anything. You might be able to detect from my tone of voice that I don't think that's a good idea. The best sort of relationship is when the data engineer is viewed as someone whose job is to empower their stakeholders to own as much of the process as possible, to make them the masters of their own fate as much as possible.

That can mean different things in different contexts. 

[00:10:32] Megan: Let's talk about what led you to start your company. What challenges were you looking to solve? 

[00:10:39] Nick: I left Facebook in 2017 and I took some time off, but I wanted to continue working on interesting projects. So I started to talk to lots of companies, both inside and outside the Valley, about what their engineering challenges were and what they felt, on a technical level, was preventing them from making progress in their business.

And I especially wanted to talk to companies outside the Valley, or, another way to put it, outside of the newfangled digital native businesses. Because one of the things I really liked about GraphQL and open sourcing it, which I thought was really cool, is that early in the company's life cycle, it was adopted by so-called legacy companies.

So yes, companies that are based out of San Francisco, like Airbnb, adopted GraphQL, but so did KLM and Walmart, and they started getting value immediately. That got me excited to get into dev tools externally in general, because, wow, all these older companies are adopting new technologies.

There's a much bigger opportunity here, an opportunity to make a bunch of impact. So I was talking to companies inside and outside the Valley, and I was just asking them: what's your biggest technical liability? What are the engineering problems you worry about? And it was remarkable: across industries, across company life cycle, across company age, across domain, data and ML infrastructure kept coming up over and over and over.

Yeah, I remember talking to a healthcare company and I expected to hear about HIPAA compliance and like privacy issues. And they were like, Whoa, no, no, no, we're not even there yet. Like we don't even have these like basic problems solved. Like we don't even have the opportunity to solve a privacy problem because we can't model the data effectively.

And then I remember in the meeting, I said like, wait, you're telling me what you think is preventing you from making serious progress in American healthcare is the ability to do reliable, regularized computation on a CSV file. And they're like, yeah, pretty much. And I'm like, okay, maybe I should be looking into this.

And what I like to say is that I found the biggest mismatch between the criticality and complexity of a problem domain and the tools and processes to support that domain that I've ever seen in my entire career. The only thing that felt similar was web development and full stack application development around, say, 2010, where everyone wanted web apps, but all the developers were drowning under the complexity of building complicated software in the web browser.

There have been so many frameworks and innovations in front end, and you fast forward 10 years and it's a world transformed. Building web apps now is just a completely different universe than it was 10, 15 years ago.

And I was fully convinced that a similar transformation was needed in data and ML. Looking into it, I very quickly gravitated towards this orchestration layer. What orchestration is: it's the piece of software that is responsible for scheduling and ordering computations in production. Let's say every day you want to kick off a process which does a bunch of computation, and the result ends up in a data warehouse table that you can drive a dashboard from.

Typically that has many, many steps. Often it has different computational technologies being invoked, written by different humans, and the orchestrator kind of makes this all work. And orchestration is utterly critical, because all data comes from somewhere and goes somewhere. So if you're a practitioner and you want your data assets in production, meaning updated on a regular basis, you have to interact with the orchestrator.

And then in turn, the orchestrator invokes every computational runtime, and in turn, touches every storage layer. So it's this kind of universal layer. And I thought, one, the existing solutions just had very bad developer ergonomics. That motivates me very deeply, because it's an inefficiency in the world. But more profoundly than that, I thought it was a very unexploited point of leverage in data systems, because it could be this unified control plane that brings order to the chaos.

You know, if this thing is responsible for ordering the computations that produce the underlying assets, why can't it understand the dependencies between the assets? And actually, why doesn't it drive it the other direction? Why don't you start with the dependencies, start with the outcome you're trying to get, and then let the system do a bunch of work for you?
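As a rough illustration of the "start with the dependencies" idea Nick describes here, the following Python sketch declares a handful of data assets along with the assets each one depends on, then lets a small planner work out which computations to run, and in what order, to refresh a chosen outcome. It is a minimal sketch under stated assumptions, not Dagster's actual API: the asset names, `ASSET_DEPS`, `upstream_closure`, and `plan` are all hypothetical, and the ordering comes from the Python standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Hypothetical assets: each name maps to the set of upstream assets it needs.
ASSET_DEPS = {
    "raw_orders": set(),
    "raw_customers": set(),
    "cleaned_orders": {"raw_orders"},
    "orders_by_customer": {"cleaned_orders", "raw_customers"},
    "daily_dashboard_table": {"orders_by_customer"},
}


def upstream_closure(target, deps):
    """Return the target plus every asset it transitively depends on."""
    needed, stack = set(), [target]
    while stack:
        asset = stack.pop()
        if asset not in needed:
            needed.add(asset)
            stack.extend(deps[asset])
    return needed


def plan(target, deps):
    """List the assets to recompute for `target`, upstream assets first."""
    needed = upstream_closure(target, deps)
    subgraph = {asset: deps[asset] for asset in needed}
    return list(TopologicalSorter(subgraph).static_order())


# Start from the outcome you want and let the planner derive the steps.
order = plan("daily_dashboard_table", ASSET_DEPS)
print(order)
```

The caller names only the outcome (`daily_dashboard_table`); the ordering of the upstream steps falls out of the declared dependencies. A real orchestrator would add scheduling, retries, and observability on top of a graph like this.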

And in my view, like data quality, cataloging, observability, scheduling, transformation, all of these different capabilities you need should all come out of the box in the orchestrator and then be able to incorporate third party solutions in a first class way. And it needs to be a very engineering forward approach.

So I gravitated towards that very quickly. The shape of the project has changed, but the underlying thesis has been constant for the last five years.

[00:15:35] Megan: That's really cool. What you said about the UI being outdated, the challenges for existing solutions, and some of the challenges that you've faced, I feel like those are very relatable. It can be chaotic with data coming in and going out constantly from different sources.

I used to be at a company where they had over a hundred different ERP systems that were sending data. It was a global company, with data coming from all the markets, all the places they acquired. And so I'm imagining, for an example like that, when you have that many data sources, how valuable the software would be.

[00:16:11] Nick: Today's world is total madness, right? Typically, you're talking about a company that has a really heterogeneous stack because of acquisitions and legacy infrastructure. But I always start from: let's talk about a new business that's adopting the newest, latest and greatest. And that's crazy too.

Like an early stage startup has more SaaS apps than employees. I believe we do. I think we have like on the order of like 60 apps and like 50 members of the team. In order to operationalize that, you want a data platform that ingests all that information and like puts automation around it. Then you start to get into like, well, we also are producing a piece of software that produces data.

We need to understand the analytics on that. Oh, and now, the way that people construct their systems nowadays, they often have multiple operational systems, and that needs to be merged in the data warehouse. And then you have ML driving the actual behavior in the app. So the complexity sprawl is huge, right?

In the last decade, the problem of data was big data, meaning the ability to actually run computations at the scales of data that companies were producing. There's a whole generation of technologies purpose-built to solve that problem. And the problem of this decade is big complexity.

Meaning that people are trying to get more stuff to be data driven, there are more external services, people need to be more real time, data is more incorporated into their real-time systems, and that is just exploding the complexity. And increasingly, data is also a competitive advantage for a bunch of these apps that are being built.

The underlying data matters more than the application quality. Or the data drives the application quality, I should say, rather than how good the UI is. So there are a ton of different factors here. Anyway, I went off on a tangent there.

[00:17:59] Megan: No, no, that's great. And what you're saying about the complexity, and that being the next challenge of this decade: I think our product, Alteryx, can help with that too, kind of at the end of the analytic cycle, after the data engineering is complete, or even just for smaller scale projects. It's about being able to pull in sources, whether it's a more established data source in Snowflake, or an Excel sheet or a SharePoint list. And when you're using all those SaaS apps, you have a lot of data outputs.

So that's what I enjoy about using Alteryx too, is being able to get it all in one place to work with. We've been talking a lot about some things that you're excited about for the future of data engineering. So I'd love if you could just elaborate on that a little bit more on what you see in store for the future and what direction you're headed.

[00:18:52] Nick: What I'm excited about is that everyone agrees that the current state of the world is suboptimal, which means change is coming. The question is, what sort of change is required? I think one issue that's particularly acute for data leaders at companies is vendor fatigue and technology fatigue. People don't want to bring in 20 vendors to assemble a data platform.

And the question is, how is consolidation going to happen? There's one kind of obvious answer, which is complete vertical consolidation, vertical integration. So you decide that you're a Snowflake shop, or a Databricks shop, or an Amazon shop. You have one vendor relationship. You assemble your entire stack with tools from that vendor.

So you use Amazon's workflow engine, Amazon's data warehouse, Amazon's compute infrastructure, et cetera, et cetera. And rinse and repeat for the different vendors. Microsoft just came out with Fabric, which is kind of their play to do this, for example. I think that will work with some customers, but I don't think it's a world that anyone wants to live in, really.

And we see that with our customers anyway. Almost all of our big customers use both Snowflake and Databricks, if nothing else so that they can play them off each other for leverage during a contract negotiation. They're good at different things. And so if people aren't going to consolidate on one or two vendors, then the question is, how else do you consolidate the world?

If you're not going to do it vertically, betting on one vendor in that way and using them for all your point solutions, the other way is doing a horizontal integration: adopting some sort of layer that all the other pieces can plug into nicely, so that you get a best of both worlds where you can allow engineers to compose a platform but still have this cohesive layer.

And so now I'm definitely pitching Dagster, because I think Dagster is the way forward to that. But there are other possibilities too. I think a lot of people think that a universal cataloging and lineage solution could be the way to bring order to the chaos. And there is still a lot of the story here left to write.

I think it's super exciting. And it's also great that there's all this interest in the field. There are a lot of really, really talented technologists out there who identify this as a really impactful and important problem to solve. It's cool to see all these new companies and projects spring up and see how they interact with each other.

And yeah, I think we're going to get to a much better place over time for practitioners in the field, but there's no certainty in my mind about exactly what that looks like. And so that's sort of the fun here.

[00:21:34] Megan: Yeah, and when you talk about, like, the opportunities for impact, that example you mentioned earlier about the healthcare field, and that one of their biggest challenges was just having the data, like, at the beginning stage of it.

That's exciting to me because healthcare is so important. And so being able to effect change there, being able to optimize the data processes in that field and other fields, there's so much opportunity, I think.

[00:22:01] Nick: Yeah, totally. I mean, when I first was getting my feet wet in this domain in a real way, it was exciting, but it was kind of terrifying in a way, insofar as I felt like all the software built in this domain was rickety and a bit of a house of cards. That bothers me as a craftsman in the discipline, but also these systems are determining how healthcare products are priced, who gets loans and who does not.

That's automated decision making in very critical contexts. This is very important plumbing for our modern society, which is built on software, and getting this all on solid footing, I think, is an incredibly important charge to keep for engineers who want to work on this. So I think this stuff is really important.

[00:22:51] Megan: Yeah, I agree. On our online community, we've been seeing more interest and more conversations around governance and standardization. And yeah, when you're talking about important decisions being made, there are some companies where those decisions are based on crazy formulas in an Excel spreadsheet. How do you govern that?

How do you set up some better processes to make sure you're making the best decisions? It's really important. That kind of brings us to the end of what I wanted to chat about. I'll make sure to include information about Dagster Labs in our show notes in case listeners want to learn more, but thanks so much for joining us.

[00:23:33] Nick: Yeah. Thanks for having me. It was a great conversation. 

[00:23:35] Megan: Thanks for listening. To check out topics mentioned in this episode, head over to our show notes at community.alteryx.com/podcast. See you next time.


This episode was produced by Megan Dibble (@MeganBowers), Mike Cusic (@mikecusic), and Matt Rotundo (@AlteryxMatt). Special thanks to @andyuttley for the theme music track, and @mikecusic for our album artwork.