
Alter Everything Podcast

A podcast about data science and analytics culture.
Podcast Guide

For a full list of episodes, guests, and topics, check out our episode guide.

Go to Guide
AlteryxMatt
Moderator

In this Alter Everything episode, we sit down with Nick Schrock, CTO and founder of Dagster Labs, to discuss the essentials of AI data readiness, the challenges organizations face in context engineering, and the importance of governance in AI-driven data workflows. Tune in to learn how to prepare your data for AI, implement effective data pipeline strategies, and navigate organizational AI mandates.

Transcript

Episode Transcription

Ep 194 AI and Data Pipelines
===


[00:00:00] Introduction and Guest Welcome
---

[00:00:00] Megan Bowers: Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Megan Bowers, and today I am talking with Nick Schrock, CTO and founder of Dagster Labs. In this episode, we chat about what AI data readiness really means, challenges organizations face when building data pipelines, organizational AI mandates, and more. Let's get started.


[00:00:33] Nick Schrock's Background and Dagster Labs
---

[00:00:33] Megan Bowers: Hey Nick, it's great to have you back on our show today. Could you give a quick introduction to yourself for our listeners?

[00:00:39] Nick Schrock: Yeah, sure. Thanks for having me. So I'm Nick Schrock. I'm the CTO and founder of Dagster Labs, which is the company behind Dagster, which is a data orchestration platform. We're in the same category as Airflow, Prefect, Union AI, and other folks who orchestrate data pipelines in production. Prior to that, I cut my teeth at Facebook engineering, working on internal developer tools, which ended up producing open source projects. The group I worked in produced React, a big popular JavaScript framework, and then I personally was one of the co-creators of GraphQL, which also became a relatively well-adopted piece of technology. So I have open source developer tools in my blood at this point.

[00:01:20] Megan Bowers: That's awesome. And I know for folks that have been listening to the podcast for a bit now, we had Nick on back, I think it was end of 2023, to talk about data engineering trends. But now, as we all know, everything is about AI.


[00:01:35] Understanding AI Data Readiness
---

[00:01:35] Megan Bowers: We're talking about AI a lot, and I wanted to have you back on to talk about data readiness, getting your data ready for AI. So I'd love to start off with what does data readiness mean to you when we talk about getting data ready for AI?

[00:01:51] Nick Schrock: I actually don't think the definition of it has changed at all, because what's funny about dealing with these AI systems is that what's good for the goose is good for the gander, so to speak. What I mean by that is that AI is very successful dealing with the data that humans are successful with. If you are doing natural language to SQL, every BI tool has some form of this feature, or will soon. The only way that is successful is if you have a well-structured schema that's documented, with semantics that are accurate and no overlapping definitions. All these words are words that would've been said when only humans were in the loop.

I think one of the interesting things that AI provides, and I think about this in multiple domains, like Python APIs, is that these AI agents provide a good acceptance test for whether your underlying stuff is good enough. Meaning that if you build an API and the agents are hallucinating against it immediately, it probably is not clear enough or well-documented enough, and therefore not good enough for human consumption either. And I think the same is true with data: if the AI can't accurately generate SQL, it's probably not the AI's fault. It's probably that the data is underdocumented, especially given where the models have gotten to. In my opinion, the definition hasn't really changed. The arc of getting value out of AI in the enterprise is, in the majority of cases, known as context engineering. Getting the right context to the right model at the right time, that's an engineering problem. In order to do that, you need good metadata, good documentation, all that meat and potatoes stuff. So I guess what's surprising is how little it's changed.
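
To make the acceptance-test idea above concrete, here is a minimal Python sketch of scoring natural-language-to-SQL readiness against a handful of known question-and-answer cases. Every name in it is hypothetical (it is not Dagster's or any vendor's API); the point is only that a low score usually indicts the schema documentation rather than the model.

```python
# Minimal sketch: using an LLM as an acceptance test for data readiness.
# All names are illustrative; `complete` stands in for whatever model client
# you already use, and the cases are ones your team writes itself.

from typing import Callable


def normalize(sql: str) -> str:
    # Crude comparison: collapse whitespace and case so trivial formatting
    # differences don't count as failures.
    return " ".join(sql.lower().split())


def build_prompt(question: str, schema_doc: str) -> str:
    # The schema documentation *is* the context; if it is vague or
    # contradictory, the generated SQL will be too.
    return (
        "You write SQL for the warehouse described below.\n\n"
        f"Schema documentation:\n{schema_doc}\n\n"
        f"Question: {question}\nSQL:"
    )


def readiness_score(
    cases: list[dict],
    schema_doc: str,
    complete: Callable[[str], str],
) -> float:
    """Fraction of known question -> SQL cases the model gets right.

    A low score usually points at under-documented data, not at the model.
    """
    passed = 0
    for case in cases:
        generated = complete(build_prompt(case["question"], schema_doc))
        if normalize(generated) == normalize(case["expected_sql"]):
            passed += 1
    return passed / len(cases)
```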

[00:03:48] Megan Bowers: That's really interesting and I like what you said about context engineering.


[00:03:53] Challenges in Context Engineering
---

[00:03:53] Megan Bowers: I'm wondering what are some challenges that you see businesses face when it comes to that context engineering point in the process?

[00:04:03] Nick Schrock: It's still very early, right? Like I said, context engineering is a term that has only come of age in the last two months. People used to call it prompt engineering, but to get real AI applications going, the prompts have become much more targeted. You use different models for different steps and you're doing this complex orchestration, and prompt engineering had this kind of weird connotation, like engineering is now a 14-year-old typing prompts into some bot. No, that's not what we're talking about. We're talking about using models and LLMs as primitives in complex software systems and then getting the right context to them at the right time, and it's very important because what context you provide to one of these models totally dominates the quality of its behavior. If you put contradictory information in a model's context, it will get confused and hallucinate, right? So everyone thinks the more context, the better. No, that's not true. The performance of critical parts of inference steps is quadratic with respect to context window size. So that's important. And the more context that you have, the more likely there will be context rot. If you've ever tried to generate images using LLMs, you've actually felt this. As time goes on, the images get crazier and you can't get it to go back to the normal image. And that's because they've kept on adding images to the context window and the LLM goes crazy. So it's often better when dealing with image generation to iterate on the prompt only and keep one-shotting it, so that the context is not polluted with the previous crazy image.

[00:05:48] Megan Bowers: Yeah, that's super interesting. I have seen that happen.

[00:05:50] Nick Schrock: It's like the AI adds a third arm growing out of someone's head and you're like, "How about you get rid of that arm?" And then all of a sudden it's growing two more arms somewhere else. You're like, "What's going on?" It's just confused itself. So context engineering is a super real problem. The only way to do it is to be able to selectively query the right high-quality context. That can actually be a super challenging problem.
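
One way to picture "selectively query the right high-quality context" is a small ranking-and-budgeting step before the model call. The sketch below is illustrative only, with an assumed `score` function standing in for whatever retrieval signal you already trust.

```python
# Rough sketch: assemble a bounded, relevance-ranked context window instead
# of concatenating everything you have.

from typing import Callable


def select_context(
    question: str,
    snippets: list[str],
    score: Callable[[str, str], float],  # e.g. embedding similarity (assumed)
    max_chars: int = 8000,
) -> str:
    # Highest-scoring snippets first; stop at the budget so stale or marginal
    # material never enters the window and starts causing context rot.
    ranked = sorted(snippets, key=lambda s: score(question, s), reverse=True)
    picked: list[str] = []
    used = 0
    for snippet in ranked:
        if used + len(snippet) > max_chars:
            break
        picked.append(snippet)
        used += len(snippet)
    return "\n\n".join(picked)
```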

[00:06:14] Megan Bowers: Yeah. And you mentioned it's a new thing, so I'm like interested to see where that goes.


[00:06:21] Governance in AI Data Pipelines
---

[00:06:21] Megan Bowers: But another thing I wanted to ask about was, when it comes to just building the data pipelines for AI, what other types of challenges, aside from context, do you see folks running into? Something on my mind is challenges around governance that I think we've seen customers have. But I'm curious what your take on that is.

[00:06:43] Nick Schrock: So, define governance. It's a word that everyone says, but I think there's like a hundred different definitions of it in people's minds. So when you're talking about governance in this context, what are you talking about precisely?

[00:06:58] Megan Bowers: That's a great question. I do think different enterprises probably define it differently. I'm thinking of governing the AI process from like data pipeline to model, being able to have visibility into the process and the inputs, the outputs. Maybe visibility into the model. I think there's the risk component as well, at least being able to explain the risk or the key points in the process.

[00:07:21] Nick Schrock: Yeah, so I'll talk to the part of it that I'm thinking about the most. There are a lot of different aspects to AI, but I think the lessons apply across domains. I've been thinking about how to get people to use AI tools responsibly in the context of authoring data pipelines. We just think that having guardrails on AI governance is incredibly important. So we're designing our system around this new abstraction layer we call components, and one of its design principles was designed for AI. We thought a ton about having a higher-level abstraction that's far more constrained. So we have a YAML DSL that people author against, where we tightly define the schema and have very rich information, so the platform engineers can set the rules of the road for all the different stakeholders. We really focused on making it so that if someone is changing a data pipeline in the system, they are focused on one file or one very small subcomponent of the project, and that makes it so that AI doesn't go across an entire code base and do God knows what, right? Mm-hmm. They're very focused. I like to call these AIs technical debt super spreaders. They'll pick up on bad patterns in a code base, replicate them, and then spew all this garbage everywhere, and you completely lose control of the process. So it's important to have these sort of compartmentalized units where you can draw a box around it: "Hey AI, here's your sandbox. Generate your code." We're going to put metadata structures and governance around it, and then if it messes up, we can delete it super easily, because in the age of AI, code is more ephemeral. You can regenerate it very easily. It's also important to construct these systems so the code is more disposable. So governance is very important. We think a lot about context in terms of what you're feeding into your system. Let's say you're using AI to summarize documents in your data pipeline. Mm-hmm. You process tons of unstructured documents and you want to present AI summaries, curated in a specific way, to your users. To do that, you'll have a prompt somewhere that tells the system how to construct the prompt, and then usually what ends up happening is that people add to that to tell the LLM, "When you see this term, it actually means this," or "Prior to 2021 we used this term to describe this, and we renamed it as part of company policy, so don't get confused," and so on. You can do all sorts of stuff. Now, those types of corrections we think are very important to treat as code, because it's basically code. Meaning that if you mess up, you should be able to roll back to the previous version, just like code. Often the more no-code tools have software engineering-esque features in them where you can roll back to a previous version, and we think that's equally important with context. I think model observability is a total undiscovered country. It is very difficult to know what the models are doing, why they're doing it, what changed their behavior. So I think that anyone using models to run mission-critical processes in any sort of way needs to take evals very seriously. Evals is short for evaluations, which means that you can track over time to make sure that the quality of your AI outputs hits a certain bar, even if other things change in the system.
That's actually a very complex problem that requires tools and a lot of infrastructure, so it has to be taken seriously. It's a really exciting time. We're still in the early innings of deploying AI into businesses and getting true value out of it, but the opportunity's enormous for people who can address these issues.
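
A minimal way to take evals seriously, in the sense described above, is a regression gate: a fixed set of cases, a scoring rule, and a bar the pipeline must keep clearing even as prompts, context, or models change. The Python sketch below is a hypothetical illustration, not a specific eval framework.

```python
# Minimal eval-harness sketch: fail the run if summarization quality drops
# below a fixed bar, regardless of what else changed upstream. `summarize`
# stands in for the AI step in your pipeline; the cases and threshold are
# things your team defines.

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    document: str
    must_mention: list[str]  # facts a good summary has to contain


def score_case(summary: str, case: EvalCase) -> float:
    hits = sum(1 for fact in case.must_mention if fact.lower() in summary.lower())
    return hits / len(case.must_mention)


def run_evals(
    cases: list[EvalCase],
    summarize: Callable[[str], str],
    threshold: float = 0.9,
) -> bool:
    # Track this over time; a drop below the bar should block the deploy
    # (or at least page someone), even if "nothing" obviously changed.
    scores = [score_case(summarize(case.document), case) for case in cases]
    return sum(scores) / len(scores) >= threshold
```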

[00:11:32] Megan Bowers: Definitely, and I see that even just in this podcast when we bring on people who have solved something with AI, are starting to get real business value out of it, and have put something into production. Those episodes really take off because people are like, "Okay, I really want the meat and potatoes. How did you actually do this? Is this more than a demo?" So that's my comment on that piece. But also, I thought it was interesting what you said about model observability, because putting myself in the shoes of a business executive thinking about governance, I feel like, "Yeah, I wanna know why is the model saying to do it this way? Or why are these the outputs?" Being able to really explain how a model is behaving sounds great, but I just don't know if that's even a realistic expectation when it comes to projects like this.

[00:12:24] Nick Schrock: I don't think anyone knows right now. Yeah, which is kind of scary. One of the interesting things about what I described, that these models can only accept a limited amount of context and that radically alters their behavior, is that it means you're going to have multiple agents evaluating things from different angles. So when I'm writing code, I'll have an agent vibe code, say, a persistence layer, and then I have it clear the context, and then I'm like, "Hey, new agent, you're a grumpy security engineer. Feel free to speak to me in that way. I want you to analyze this strictly from the standpoint of finding SQL injection and all these other things," and it can go and attack it from that angle. So I think the way we are going to get at least correctness, observability is harder, is by having overlapping probabilistic processes that come together to give you a lot of confidence. If you have something with a 99% success rate, in the context of a mission-critical process that isn't that great actually, but if you have another process which can then catch 99% of those errors, now you're getting somewhere interesting in terms of reliability. In terms of determining the why of why something is making a mistake, I think that is very difficult. That's one of the reasons why I think this governed context is very important. If you have a system like that, you can at least A/B test and be like, "Listen, we know there was a context update a week ago. If we did the same thing with the previous context, do we get different outputs?" That difference can be extremely illuminating in terms of why something happened. But yeah, it's very challenging.
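
The overlapping-checks idea can be sketched very simply: one agent generates, then an independently prompted reviewer with a clean context checks the output, and the back-of-the-envelope arithmetic shows why stacking two 99% processes helps. The code below is a toy illustration with assumed callables, not a real agent framework.

```python
# Toy sketch of overlapping probabilistic checks: one agent generates, a
# second, independently prompted agent reviews before anything ships. Both
# callables are stand-ins for your own model calls.

from typing import Callable


def generate_and_review(
    task: str,
    generate: Callable[[str], str],
    review: Callable[[str, str], list[str]],
) -> tuple[str, list[str]]:
    draft = generate(task)          # e.g. "vibe code a persistence layer"
    findings = review(task, draft)  # e.g. a fresh-context "grumpy security
                                    # engineer" pass hunting for SQL injection
    return draft, findings


# Back-of-the-envelope reliability: a generator that errs 1% of the time,
# checked by a reviewer that misses 1% of those errors, leaves roughly
# 0.01 * 0.01 = 0.0001, i.e. about 0.01% of errors undetected.
```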

[00:14:16] Megan Bowers: Those are some good examples though, hopefully to get people thinking about it. We've been talking about the quality of the outputs and the quality of the models.


[00:14:24] Balancing Quality and Speed in AI Development
---

[00:14:24] Megan Bowers: I'm wondering how you balance quality and speed when you're asked to develop these data pipelines for AI with ever-increasing business demand for AI.

[00:14:37] Nick Schrock: I think one of the things that is really cool about AI tools in general is that they make speculative construction of software far cheaper, and also that people who are less technical can at least participate in the process more, in the natural medium of code. Mm-hmm. For example, PMs can vibe code prototypes, which is very useful. Now what does that do for data pipelines? If you're just figuring out whether AI is going to work, you can speculatively create lower-quality pipelines just to test stuff out, because it's so easy to reconstruct software. You can learn the lessons of what you did, see if the use case is actually going to work, and then restart from scratch. You can build new stuff in phases because the cost of construction has gone down by one or two orders of magnitude. In the end, when you're actually going to deploy this thing for real, you have to lock down the requirements. And even if you use AI to assist the development, you need to have QA processes. Maybe that's slower than the vibe coding, and that's fine, but the business stakeholders can get confused about the nature of reality on this front. They're like, "Oh, it's so easy now, you can just talk to it and build this feature." You can do that within a prototyping tool, but the fundamental rules of software still apply: you're building layers of abstraction, and if you do it the wrong way, it'll all collapse. You can visualize it as a skyscraper. With AI, it's easy to build on an unstable foundation, and it's easy to build a hundred stories in one shot and have it all collapse.

[00:16:31] Megan Bowers: I think if I were to summarize, generative AI is obviously such a powerful tool. It can go pretty far in either direction, whether that's consolidating things with review bots or building a hundred-story building. It's important to have those standards and to be super intentional about how it's used, because it can be really good or really destructive in some ways.

[00:16:56] Nick Schrock: That's right. With great power comes great responsibility.


[00:16:59] Advice for Teams with AI Mandates
---

[00:16:59] Megan Bowers: So then, shifting gears a little bit, do you have any advice for teams facing an AI mandate, where they need to start using AI or even building AI into their processes, who maybe feel like they don't have the right data or enough of it? Do you have any advice for folks in that situation?

[00:17:19] Nick Schrock: I guess first of all, if you ever get a mandate of "We have to use AI now," it's like, "Well, okay, what are we trying to accomplish using AI?" Mm-hmm. AI is a tool, it's not a goal. So I think it's very important to ground whatever mandate you're talking about in terms of actually delivering value, and delivering incremental value, because I think there's going to be a lot of heartache going forward with all these AI mandates where people have spent God knows how much money and haven't gotten anything from it, or are in a worse place than where they started. So that's the first piece of advice. AI is certainly a tool by which you can accomplish goals that you wouldn't be able to otherwise, I'm not denying that. So people aren't crazy when they say there's got to be value from AI in the business. What's critical is to very quickly figure out what that value is and then get alignment with whoever's setting the mandate: "This is what we think we can do. Does that sound good? Is this worth the investment?" Then in terms of getting the right data, I guess you have to get the right data. One of the opportunities here is that if you have a goal in mind, like "I want to use AI to accomplish this task," often in these systems you can really precisely define a concrete goal: "Listen, we want this customer or this internal user to be able to perform this activity using a natural language interface." Define end-to-end acceptance criteria and use that lens, mm-hmm, to figure out what's the right data you need to have. But past that, we're back to basics. Like 15 years ago with data science, everyone figured out that all the value was in the data pipelines. And it's going to be the same thing with AI, where to deliver value in the enterprise, it's all about this context engineering, it's all about structuring your data. So the actual way to get all the value from an AI mandate is good data engineering and getting to good, clean datasets that are understandable by humans and machines alike.

[00:19:26] Megan Bowers: That makes a lot of sense. Thinking back to when data science was the thing everybody was talking about, it turned out data scientists were spending a lot of time cleaning the data, and that work was actually foundational and a huge part of their time. So it makes sense that it comes back around for AI, for sure.


[00:19:44] Future of Data Engineering and AI
---

[00:19:44] Megan Bowers: Where I wanted to wrap up was just talking about the future. In 2023, when we had you on, we talked about what made you excited for the future of data engineering. Has your perspective changed, or what makes you excited now? What opportunities do you see in the field?

[00:20:00] Nick Schrock: I've never been more bullish on the future of data engineering, especially in this age of AI, because of what I just described: the underlying value of these systems can only be exploited with good engineering, good data engineering, and therefore the opportunities to transform businesses using this discipline are huge. I think the domain of data engineering is particularly amenable to a lot of AI tooling. A lot of work in data engineering is doing these extremely broad investigations across different data sources, and there's tons of metadata, tons of different systems, and it's complex, and all that stuff is highly amenable to AI systems. So I think that's exciting. And this is kind of a generalized software engineering problem, but I think it's quite acute in data engineering too: how to apply AI-native techniques in large, real code bases. I think there are incredible amounts of value to deliver there. Someone said you can carbon-date a company by its data infrastructure, because it's so difficult to migrate and move around systems. There's a system from the seventies and the eighties and the nineties, and there are all these horrible interoperability layers. Yeah, and I think that AI has the opportunity to change the calculus there. AI is just extremely good at, "Hey, I had this thing in Python, rewrite it in Rust." And you can imagine once you have a regularized pattern for migration, you can dramatically accelerate it with AI. So I think that's exciting. But also, because of this dynamic, AI allows a stakeholder to communicate with another stakeholder in their medium, which is very powerful. We have this internal Slack bot where our stakeholders can talk to it and generate SQL. They do it in the context of a Slack channel, which means it's collaborative and it's social. So what you see is our data team and our go-to-market people in the same channel, and the go-to-market people are speaking English in the language they know, which is their domain, their business, and it's producing SQL, which is what the data people know, and there's this natural kind of collaboration that occurs that would not occur otherwise. In the old world, right, there's some dashboard somewhere that no one knows about, where the data person has produced it and the stakeholder forgets about it, or the stakeholder gets frustrated. They ask, "I need this and this kind of data in this format," but the data people don't fully understand the domain of the business stakeholders. So there's huge communication overhead. Now this can just happen live, right? You can literally see in the Slack bot the business-language-to-SQL transformation process, and that just facilitates everything. And I think that can happen across many different stakeholder relationships, both in and out of data, but in data in particular that problem is very important. You can also generalize it: usually there's a data platform engineer and a data scientist, and now the data scientist can attempt to vibe code stuff in the data engineering code base to integrate their work, and the data engineer can be like, "Ah, that's close, but we can do this." Being able to speak the language of your stakeholder is a transformative thing, so I'm very excited about that.

[00:23:17] Megan Bowers: Yeah, I really love that use case and that idea. Having been a data analyst before this, I know from so many conversations that the communication overhead, like you said, can really build up. And if you're able to skip past the first 10 steps, where your stakeholders basically come to you with some SQL code and it's 80% of the way there, I could see how that would really speed things up and allow for a lot less frustration.

[00:23:44] Nick Schrock: It's not just delivering the SQL code, it is observing the language they use. You can literally see the process of them speaking in English, in the language of their domain, and it producing the SQL, and then you know the schema and you can be like, "Oh, it is mapping this word to that, and that's where the disconnect is." Yeah. So it's not just getting the SQL, it's observing the process,

[00:24:10] Megan Bowers: seeing that translation happen.

[00:24:13] Nick Schrock: And so rather than there being documentation or a seminar on all the jargon that Marketing Ops uses, you can just learn it in real time and see how it unfolds. It is just really exciting.

[00:24:30] Megan Bowers: Very cool. That makes me excited too, honestly. And for anyone listening, if you have things that make you excited, use cases like that, feel free to leave them in the comments. We would love to hear what excites everyone listening about the future of the data space with all these AI tools and everything.


[00:24:47] Conclusion and Closing Remarks
---

[00:24:47] Megan Bowers: But yeah, it's been really awesome to have you on again, Nick. Thanks so much for joining and for sharing your perspectives.

[00:24:53] Nick Schrock: Yeah, no problem. It was great to be here.

[00:24:56] Megan Bowers: Thanks for listening. To connect with Nick, head over to our show notes on alteryx.com/podcast, and if you liked this episode, leave us a review. See you next time.


This episode was produced by Megan Bowers (@MeganBowers), Mike Cusic (@mikecusic), and Matt Rotundo (@AlteryxMatt). Special thanks to @andyuttley for the theme music track, and @mikecusic for our album artwork.