Alter Everything

MaddieJ · ‎04-23-2019

We're joined by Tobias Macey, host of the Data Engineering podcast, and Podcast.__init__, along with Neil Ryan for a chat about foundations and ethics of data engineering.

Panelists

Brian Oblinger - @BrianO, LinkedIn, Twitter
Tobias Macey - LinkedIn, Twitter, Data Engineering Podcast, Podcast.__init__
Neil Ryan - @NeilR, LinkedIn

Topics

Community Picks

Tobias:

Data Engineering Podcast, Episode 76: Serverless Data Pipelines On DataCoral

Neil:

Alteryx Data Science Blog

Brian:

Malcolm Gladwell, Keynote Speaker announced for Inspire 2019, Nashville

Transcript

Episode Transcription

BRIAN 00:13	[music] Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Brian Oblinger and I'll be your host. Neil Ryan and I chat with Tobias Macey, host of The Data Engineering Podcast and Podcast.__init__, about foundations and ethics of data engineering. Let's get into it. [music] All right. Tobias, Neil, welcome to Alter Everything.
TOBIAS 00:47	Thanks for having me.
NEIL 00:48	Yes. I'm here.
BRIAN 00:49	So, Neil, you've been on the show a couple of times. So, we know who you are for the most part. So we're going to start with Tobias on this one. Tobias, why don't you give us a little bit of your background? How did you get into this field? Tell us about your journey a little bit.
TOBIAS 01:04	Sure. So I started in the data space largely by being a systems administrator. I was the only person running the servers for a small company up in Vermont. And so it was just a lot of trial by fire, learn by doing. And from there, all of my different roles had some measure of backend administration or building databases or interacting with databases. And now, I work as the manager for the DevOps team at Open Learning at MIT. And so I'm dealing a lot with our backend systems and trying to provide data access to our data scientists, to be able to build business intelligence dashboards and just starting to figure out how to try and integrate all of our different data sources to make them useful. I also run a couple of podcasts. I've been running Podcast.__init__ for about four years now? Something on that order. And a couple of years ago, I started The Data Engineering Podcast. And so I've been spending the last couple of years talking to people all across the data management space, as far as how they do things, distributed systems problems, storage issues, data cleaning. It's just an area that's interesting and vast and doesn't get nearly as much attention as it should. And yeah. So that's sort of what I've been up to.
BRIAN 02:22	Cool. Yeah. I've gotten several jobs in my career as well, it was just like literally being like, "Well, he was the only guy who was doing it. So I guess we'll just let him [laughter] do that now, going forward." So yeah. That's really awesome. And, Neil, for our new listeners, I guess, that are tuning in for the first time and maybe don't know all about you, maybe give a quick recap for us.
NEIL 02:44	Sure. I've been doing analytics my whole career. From working at an insurance company, setting prices in the actuarial department, to consulting and doing fraud detection analytics. I've been at Alteryx now for over four years. Right now, I'm in the community team. So writing articles about analytics and Alteryx, as well as using the data we have that backs up the community to help inform decisions to make improvements to the community.
BRIAN 03:18	Very cool. So one of the things I wanted to talk about-- and Tobias mentioned this off the top. And I know, Neil, you've been kind of thinking a lot about data engineering. Let's talk a little bit about that. What is it? Why is it important? And how do we think that that's going to take off? Maybe, Tobias, we'll start with you on that one.
TOBIAS 03:37	Sure. So, I mean, just like with terms like DevOps or data science, data engineering is one that can be open to interpretation. But, in the sort of broadest sense, it means that it's the role for the person who's responsible for ensuring that all of the company's data is available and accessible and stable and usable for advanced analytics capabilities. So it largely evolved from things like business intelligence admins or DBAs, who were responsible for ensuring that the data warehouse was up and that it had all of the data and that the ETL pipelines were running. And, as the velocity and volume and variety of data has been increasing, the requirement for more advanced systems understanding and ability to deal with data at sort of larger velocities, real-time streaming data-- it started to encapsulate all of those responsibilities. But, depending on the organization where you are, the needs for the data engineering team are different, based on the types of data that they're working with. So it could still mean the person who's responsible for the data warehouse or it could mean somebody who is keeping the Kafka pipeline running to their Spark engine, to make sure that they're able to do real-time ML modeling. But, in the broadest sense, it means the person who's responsible for ensuring that the data is usable and properly secured and accessible to the people who need it.
BRIAN 05:05	Yeah. I popped onto our good friend glassdoor.com, just wanted to kind of take a look there. And I'm seeing pretty high salary ranges, looks like they're pretty highly sought after. I mean, what are the skills that are required before someone gets to that level? How would you suggest someone starts building up their career in that direction?
TOBIAS 05:28	I mean, there are any number of paths to that type of career. But some of the common needs for a data engineer are understanding of data formats, so things like JSON or Avro or Parquet or flat files; understanding of storage concepts, so how do I ensure that there's high availability; understanding of how to transform data and some of the business needs around it, so you might be willing to accept lossy transformations if the end result provides the value that it needs or you might need to preserve all of the data at some step of the pipeline. So you can come in from academia having a high level of understanding of data modeling and relational algebra and things like that or you could come in from a systems admin perspective of somebody who knows how to run complex systems and make sure that they're up. There are, sometimes, subspecializations within data engineering where you're focusing on the data infrastructure, where you're responsible for making sure that your database is up, Kafka is up, Spark is running, everything is clustered together properly, or it might be the person who's writing all of the transformation logic to populate their ETL pipeline, whether it's running Airflow or Luigi or just a series of bash scripts. So there isn't really one path into data engineering. But some of the common needs are a decent understanding of software engineering principles, to make sure that your code is reliable; decent understanding of systems and some of the issues that are sort of inherent to distributed systems and network environments; understanding of the business needs around the data to make sure that you're able to provide the data that is necessary and any requirements as far as extraction, making sure that you're not overloading the source systems by just doing a full dump of everything on a nightly basis; understanding the requirements as far as how fresh the data needs to be. So a lot of different things that go into it. And it's highly variable, depending on the organization that you're working with and the needs that they have.
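A minimal sketch of the extraction point Tobias raises above: pulling only the rows that changed since the last run instead of dumping the whole source table every night. The `orders` table, its `updated_at` column, and the `extract_incremental` helper are hypothetical names chosen for illustration, and an in-memory SQLite database stands in for the source system.

```python
import sqlite3


def extract_incremental(conn, last_seen):
    """Return rows changed after `last_seen` plus the new high-water mark."""
    rows = conn.execute(
        "SELECT id, payload, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_seen,),
    ).fetchall()
    # Advance the watermark only if something new was found.
    new_watermark = rows[-1][2] if rows else last_seen
    return rows, new_watermark


# Hypothetical source table, built in memory for the example.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, payload TEXT, updated_at TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        (1, "a", "2019-04-01T00:00:00"),
        (2, "b", "2019-04-02T00:00:00"),
        (3, "c", "2019-04-03T00:00:00"),
    ],
)

# First run: empty watermark, so every row is picked up.
first_batch, watermark = extract_incremental(conn, "")

# Second run: nothing changed, so the source is asked for very little.
second_batch, watermark = extract_incremental(conn, watermark)

# A late-arriving row is the only thing returned on the third run.
conn.execute("INSERT INTO orders VALUES (4, 'd', '2019-04-04T00:00:00')")
third_batch, watermark = extract_incremental(conn, watermark)
```

The sketch relies on ISO-8601 timestamps sorting correctly as text. The strict `>` comparison can miss a row that shares the watermark's exact timestamp, so a real pipeline would likely need a tie-breaker or an overlap window.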
NEIL 07:34	Yeah. I would just add, just in terms of skillsets, something that our data engineers here focus on a lot is automation. So it's something that Tobias alluded to with bash scripts. But making sure that you have processes running on a regular basis to make sure your data is all kept up to date. The other thing that, Tobias, you mentioned to an earlier question about how did data engineering come about-- and you mentioned coming from kind of DBA, SysAdmin world. I think another reason it's exploded in the last few years is just because it's a big part of the data scientist's job as well to make sure the data's in good shape, it's fresh, it's ready for advanced analytics, as you said. But it's still just such a big job that data scientists were spending so much time just on that data engineering aspect that, in the last couple years, I think teams have realized it's worth splitting out the function and dedicating resources just to that exercise so that data scientists can be more efficient with their analysis on the good data.
TOBIAS 08:54	Yeah. And I think that that's a big part of why it started to become its own position and title in organizations. Because of the fact that they are employing these data scientists who are generally highly paid, have an advanced understanding of the analytical needs but end up having to sort of self-learn a lot of the aspects of the data cleaning and collection process. And so having somebody who can specialize on that piece of it because it's not a small portion and it's definitely critical to the final success of the analytics. So the companies are willing to invest in having a separate role for that, to make sure that their data scientists are able to be leveraged to their maximum effect.
BRIAN 09:33	Yeah. And I would just say too, from my perspective, I'm more of the layman in the room, right? And it seems to me as the rise of machine learning and AI and these kinds of things you hear about-- as those become more prevalent in organizations, I think they-- a lot of companies kind of go into that full force, right? "Yeah, we're going to go do AI." And then they get one foot in that door and they realize, "Oh. I better figure out my data stack--" right? "--and figure out how to get that going." And that's probably where that role is going to expand over time and be more important in organizations, because of all these AI and machine learning related initiatives that they're [trotting?] out there. You got to have that data first.
TOBIAS 10:15	Yeah. And that's also where you're starting to see even further specialization of roles, where-- I read in an O'Reilly post a while ago and I've seen it pop up other places with the concept of a machine learning engineer that sits somewhere in between data engineer and data scientist, of somebody who understands the needs of the data collection process and what's required to be able to effectively train and deploy a machine learning model. And the data scientists who are potentially going to be working more on the theoretical or experimental aspects of figuring out what should the model even be doing. And the data engineers who are responsible for managing the infrastructure and the data collection and cleaning process that goes into that modeling process. So a lot of these things, as far as how many degrees of separation there are between the different roles or even if they are different roles, is dependent on the size of the organization and the scale that they're operating at. Where a small company might have a data scientist who's also doing the data engineering and is, essentially, the machine learning engineer at the same time. Or you have places on the scale of Facebook and Google, where you have data infrastructure engineers, you have ETL engineers, you have database management engineers, and then you have machine learning engineers, and then you have data scientists who are trying to push the boundaries of AI research. So the size and sort of resources of the company can help push the degree of separation between those different roles and how many different people are filling those needs.
NEIL 11:40	I find the topic of specialization in data science pretty fascinating, just because it's all being figured out right now. So I think for the really large companies, the Googles and Facebooks, it obviously makes a lot of sense to specialize. For the really small companies where you only have one data scientist on staff, it makes sense that they're going to kind of do everything. But for the midsize companies, I think there's no script for how to build these teams and divvy out roles. So I think it's just interesting how all these companies are just kind of figuring it out, testing different things. I read a blog from Stitch Fix recently where they were really kind of making the argument against specialization that, getting too specialized, you lose kind of the high-level view of everything that's going on end to end. And it kind of results in less innovation. So I found that a pretty interesting article. As the industry tends to specialize here more and more, it's interesting reading those differing viewpoints where maybe some companies are taking it too far.
TOBIAS 13:01	Yeah. And that's a conversation that happens in basically every aspect of the technological industry, where you have people who argue against specialization of frontend versus backend for maybe a web application engineer, where they're arguing for full-stack engineering or specialization in terms of systems management, where you have somebody who is the DBA and then somebody else who is the network engineer versus somebody who is responsible for all of the systems. And all of the different technological paradigms shift and push the needs for specialization back and forth of whether it's better to be a generalist or a specialist. And, yeah. It's an ongoing debate. And I think that you're right. It's useful to have a broad understanding of the needs of any of your tasks. And then there are cases where you need to have deep specialization in a particular field of it. And that's where I think it's interesting to see the conversations that are happening in the area of topics like DataOps or DevOps, where it's important that the way that you structure your teams helps to enforce and encourage collaboration between team members so that you do have that broad scope and so that you don't have somebody going too deep down the rabbit hole and losing the broader context so that other people can understand how the entire system fits together and they can build a system that actually functions, rather than have one piece of it that works really well and then the rest of it falls apart because nobody knows what it needs or what it's supposed to output.
NEIL 14:29	Yeah. I guess you can even take it to kind of the human level, the philosophical level about, if you get too specialized and you're just doing the same tasks over and over, it's not as fun of a job. You won't have your employees being quite as satisfied. But you obviously have to balance that with making things as efficient as possible, thinking about the ROI. So it's an interesting debate. And as you say, Tobias, it's more than just data science. It's pretty much-- every industry probably looks at that issue. But I guess the most interesting part about it from data science is just that it's such a new industry that's figuring it out right now.
TOBIAS 15:10	Yeah. And there are increasingly new areas that you could specialize in that maybe didn't exist six months or a year ago.
BRIAN 15:17	Yes. So before we move on, Neil, I wanted to just pause for a moment and just ask you. So you just dropped Stitch Fix as a name. So what were you doing, man? You're doing some shopping? You're getting some clothes? Can we put links in the show notes to your new threads or what?
NEIL 15:33	I have tried Stitch Fix before. It was an okay experience for me. But I actually just read their blog because I find what they do pretty fascinating. They're doing cool stuff [laughter].
BRIAN 15:45	Sure. Sure you do. Sure. Yeah. Use code "Neil" at checkout for 10% off your-- no [laughter]. Awesome. Okay. So, Tobias, one of the things Neil and I-- in preparation for this show, we obviously went and listened to some of your other podcasts, which are pretty amazing by the way. So congrats on that. We'll make sure we drop some links in the show notes for our audience to go check those out.
TOBIAS 16:10	Thank you.
BRIAN 16:13	Yeah. Really good stuff. So one of the things that Neil and I kind of noticed when we were listening to these is that your guests range pretty widely. So you've had some evangelists on there, like Wes McKinney, kind of talking about some different things. Calvin French-Owen, for example. Sort of this idea of open-source-- and I'm not going to say versus proprietary. But open-source and proprietary kind of working together to make the dream unfold. What's your kind of position on that? And how do you see open-source, either in conjunction with or - in some cases - versus proprietary?
TOBIAS 16:51	All right. Well, before I start talking, I'm going to put on my asbestos suit to avoid the flame wars [laughter]. It's a very nuanced discussion, but - in broad scope - I'd say that both are necessary to be able to push the industry forward. So open-source is valuable because it allows for a lot of experimentation in the open. It allows for innovation of people building on top of stuff that other people have used. It helps to empower new businesses because they don't have to pay out thousands of dollars for the SAP Suite or whatever to be able to get off the ground. They can start with some open-source tools and some experimentation to make sure that they can get things running. But, at the same time, you have companies that say, "I just need to be able to meet my bottom line and get my products out the door. I don't want to have to figure out how all these open-source projects are supposed to fit together," because open-source is only free if your time is worth nothing, as the saying goes. So it's useful to be able to have these proprietary solutions where you can just hand them a check and say, "Do what it is that you do best. And let me get on with my business."
TOBIAS 17:57	And at the same time, by having organizations that produce proprietary software and are able to bring in revenue, it helps them be able to have people on staff who are able to contribute back to open-source. Because - particularly in this day and age - every company that is producing proprietary software is also probably using open-source at some level, even if they're not producing it on their own. So being able to employ engineers to contribute back to the open-source that they use or companies like Stripe, that has a sort of internship program or - what's the word that I'm looking for? - a fellowship where they'll give somebody a stipend to work on some open-source project for a while. There's a lot of figuring out that's happening right now as far as how corporate companies can be responsible stewards of open-source or how to build sustainable open-source funding models, like the folks that are at Tidelift are doing. And then you have these issues that are coming up around sort of the lower-level components and how they fit into open-source, with places like MongoDB and Kafka that are trying to change their licensing to avoid companies like AWS sort of consuming their profits. But I think that's a bit of a false dichotomy. I'm not going to delve too deep into it because that's not my area of expertise. But I have spoken to some people on that area. So it's definitely worth exploring on your own. But I think both open-source and proprietary software are necessary to be able to have a vibrant technical ecosystem, particularly because people who are solely focused on open-source, they make amazing technical contributions. And they make amazing tools. But they don't always have the type of polish that you might need or want from something that you're just consuming as an end-user.
NEIL 19:52	Yeah. I think that's right. You have to have both. And I think they so often work hand in hand. A lot of amazing open-source projects come from proprietary roots. Google developed TensorFlow and then open-sourced it. So they work hand in hand and go together quite nicely I think, usually.
BRIAN 20:16	Anything else we want to piss off the entire internet with [laughter]?
TOBIAS 20:20	Vi versus Emacs. Tabs versus spaces.
BRIAN 20:24	Oh, God. So tabs versus spaces. Let's-- no. I think I actually did that with Neil on our prior podcast [laughter]. And, yeah. I don't think we heard much about it. Maybe we were too small at that point. But yeah. We were trolling, essentially.
NEIL 20:38	Don't forget R versus Python.
TOBIAS 20:40	Oh, yeah. That's a good one. Why not both? That's what Arrow's for.
BRIAN 20:42	All right. Well, let's talk about that. Yeah [laughter]. So yeah. R versus Python, Tobias. Where are you coming down on that one?
TOBIAS 20:51	Whichever one makes the most sense for your needs. I mean, there are definitely packages that exist in R that don't have a useful analog in Python. And so, if that's what you need, by all means, use it. Similarly, if your primary concern is being able to incorporate the broader Python ecosystem and incorporate your machine learning model with your Django application and deploy it to your infrastructure using SaltStack or Ansible, well then it probably makes sense to use Python. Because R is fabulous for statistical analysis because that's what it was built and designed for. It's not necessarily what you want to use for your production environment, running all of your transactions and customer-facing environment. You can use the machine learning model that you built in R and embed it as a microservice within a broader application, but it's not something that I would want to use to build my website with because it's not what it was designed for.
NEIL 21:44	They're both super popular. They're both never going away. And they both have their strengths. Although, what I find interesting is there is so much overlap in functionality from the data science perspective. And, yeah. I listened to the Wes McKinney episode of your podcast, Tobias. And what I like so much about his effort around Arrow is that he's just trying to reduce the duplication of effort across the development on both languages. If it's done really well in one language, why not share that so that you can build off that and have their core competencies shine even more and not waste time just duplicating effort on both languages?
TOBIAS 22:35	Yeah. And that's a trend that I see more broadly as well, is this idea of being able to unify and standardize on certain interfaces so that we can reduce the amount of effort that's needed across communities so that we can just sort of keep the acceleration of technical progress going, rather than having to spend time rewriting everything because the tool that you need isn't written in your favorite language of the day. So projects like Arrow that allow for being able to share data across different runtimes. Projects that are standardizing on the scikit-learn-style APIs in the Python ecosystem. There's work being done to try and standardize the NumPy API so that it can be used as an interface for multiple different projects without necessarily having to have the specific NumPy runtime underneath because that brings in other dependencies, like C++ and Fortran. Projects like Apache Beam that provide an abstraction layer for streaming systems. SQL [laughter]. It's a universal standard. It's used everywhere, so you can-- there are sort of caveats to that. But just being able to have these standard interfaces that you can use everywhere to make it easy to build on top of, rather than having to spend your time rewriting everything from the ground up.
NEIL 24:00	It's hard though, right? To kind of form consensus across these types of things. An interesting analog I was just reading about recently is the history of RSS, the XML based web syndication format that is not as popular as it used to be and, some would say, has kind of died out. But I was just reading about the history of that. I didn't know much about it. And there were a few groups that were trying to work out a common unified format and couldn't come to an agreement and ended up with two separate RSS formats. And the article I was reading was kind of arguing that while they were fighting about that and losing valuable time where they could be improving upon the format, social media rose up. Twitter, Facebook. And basically, that's what people use now to see syndicated feeds of content. So I'll follow Wes McKinney's efforts with Arrow. I'm curious to see how he'll be able to pull that off.
TOBIAS 25:07	Yeah. It's actually being used pretty widely already. And before I go too far down this topic, I will say that - for what it's worth - RSS is still alive and well in the podcast [laughter] ecosystem. Although, there are efforts to try and circumvent that as well because it prevents data collection and sort of personalization. So that's another whole topic that we don't need to get into right now. So, to your point about Arrow, there are projects such as [inaudible] that rely on Arrow for being able to provide an in-memory layer to make it easier to join across multiple different data sources to provide analysis on top-- to be able to build a business intelligence platform there. It's used, optionally, in Pandas so that you can have data frames that can be used for both R and Python. It's able to be used in Spark so that you can reduce some of the serialization and deserialization cost going between things like PySpark and the JVM. So it's definitely being pretty widely used already. But yeah. I agree that it's interesting to see where it's progressing. Because it started as primarily just a means of having a standard data frame layer for being able to have in-memory data sharing. But it's starting to grow to include a lot of aspects of reading and writing data to and from different formats and storage engines. Because, as was said in the podcast, it's a systems-level problem. And, in order to make sure that it is the most useful that it can be, it requires incorporating some of these other layers into it to be as fast and efficient as possible.
NEIL 26:48	Cool. Well, that's good to hear that it's getting some traction out there.
BRIAN 26:51	Yeah. Neil, one of the things you hit on that I wanted to talk about with both of you is-- you were talking about social media and people kind of going there for their syndicated content feeds and learning and things like that. Where are you both going these days? Where should we point people from a best practice, resource perspective? What are the best tools that people should be looking at to either take their career to the next level or maybe they're just starting out in something like this?
NEIL 27:18	I'd say it really depends on your learning style. There's so many great - I'll call them freemium - courses out there now. So you can take free courses on Coursera. I'm calling it freemium just because you need to pay to get kind of certified or a degree. But, if that's kind of the way you like to learn - in a more kind of course, lecture-style atmosphere - there's so many great free courses out there around data science. I love DataCamp for that kind of stuff. But there's tons of other Codecademy, Coursera courses out there as well. If you like to learn differently, like with books, O'Reilly is the authority on that. And then, kind of finally, just YouTube [laughter]. You can learn all that stuff on YouTube if you just kind of like watching the videos.
TOBIAS 28:24	And to your point too about courses, I'll also put a plug out there for edx.org. And the MITx brand has a lot of useful material, as well as a number of other universities and organizations. There's also stuff out there from the Cloud Native Computing Foundation for things like Kubernetes, which is becoming increasingly used in the data and machine learning environment. I'll also say, conferences are a valuable way to boost your career, both from an educational perspective and it's a great way to do a lot of sort of fast networking, meet a lot of people, understand what the problems are in the industry, talk to vendors to understand what types of problems they're trying to solve. It's also a place where a lot of companies would go to try and recruit. Local meetups are useful if you happen to have any in your area. If there isn't one, there might be remote meetups that you can join or you can try starting one if you're motivated. I also second the choice of books. And O'Reilly does have a great suite of books on various topics pertaining to data engineering, data science, infrastructure. Yeah.
TOBIAS 29:31	And just get out and talk to people. If you see a company that's doing interesting work, follow their blog. As somebody who runs a podcast, I've found that people who are doing interesting work really like to talk about it. So send them a message if you can find their contact information and just say, "Hey. I really like what you wrote about in this blog post. I'd be curious to just talk to you for 15-20 minutes to learn more." If they happen to be in sort of your geographic region, invite them for a coffee. Otherwise, just send them an email. Maybe ask a few questions. Try to provide some value. People who are working in technical fields really like to mentor. So you might say, "Hey. I'm new to this field. I'm trying to learn. Would it be possible for me to sort of periodically ask you some questions?" And just make sure that there's some sort of value exchange. Not necessarily money. But make sure that they feel that it's worth their time to answer your questions, whether it's through your own personal progression or contributing back to some of the projects that they work on. It's hard to overestimate the value of networking and just getting out and talking to people, in addition to more sort of personalized learning where you're consuming material.
NEIL 30:46	And I guess just kind of-- if you are new to the industry, data engineering or data science, my advice would be, don't get too caught up on kind of specific technologies. So Tobias probably mentioned a dozen different kind of Apache projects already. And these things are just changing so fast that it's really impossible to keep up with them completely. And so, it can be a little overwhelming. So if you're just getting started, I'd say focus on kind of the more basic techniques and the concepts. And the learning about all the different projects out there-- the Spark, the Kafka, the Hadoop. That'll come. You'll kind of learn that through osmosis. I don't think you have to try to kind of learn all those all at once.
TOBIAS 31:39	Yeah. And I'll also advocate for trying to understand the fundamentals because the specific technologies are going to change over time but the fundamentals are always going to be there. So understanding aspects of storage and some of the issues around networking and distributed systems concepts. I'll also advocate for newsletters. There are a lot of great ones out there. There's Data Engineering Weekly, which does a good job of curating interesting and sort of topical news. Podcasts. I'll put out a plug for The Data Engineering Podcast, in case we haven't done that enough already. Yeah. So there are tons of resources out there. Lots of them are free. It's also worth it to find some paid ones if you find that they're going to sort of provide the acceleration that you need. Because sometimes trying to consume free material can be useful but it can take a bit longer. Whereas, if you can find something that's paid and more curated, it can give you a sort of a faster ramp-up.
BRIAN 32:39	The other thing I want to talk to you, Tobias, about is-- well, and Neil, I'd love to get your thoughts. We've been talking a lot on this show about ethics and privacy and all those hot topics of the day. Most of the folks we've talked about it with are more analysts and data scientists type roles. I'm interested kind of more from the maybe behind the scenes, backend, data engineering piece. How do you think about those topics, the ethics of what you're doing and how you design for privacy? Curious to hear your thoughts on that matter.
TOBIAS 33:12	If you don't absolutely require personally identifiable information for your business, don't store it ever. If you don't have it, then you can't lose it. That's sort of rule number one. If you do need it, make sure that you have strict controls around access to it. Make sure that it's encrypted in transit, at rest, everywhere that you can. Just be very diligent when you're dealing with people's personal information. From an ethical perspective, that's a deep well to get into. But obviously, just try to do what you think is right. Don't be afraid to push back at the business. I mean, I recognize that there is some implied privilege in that statement because some people might be in a position where they're not able to have that sort of leverage. But, whenever you're able to, either advocate for yourself or, if you see somebody on your team who is being told to do something that you think is unethical, that they think is unethical but they're not in a position to sort of push back at the business, try to do it for them. It's the responsibility of the business and the organization who's using the data to be ethical. But it's also everybody else's responsibility, too. So don't abdicate your ethics just because you think that you're being pressured into it. It's everybody's responsibility. I guess that's where I'll leave it at.
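To make "if you don't have it, then you can't lose it" concrete, here is a minimal sketch in Python, not something described on the show. It assumes a hypothetical allow-list of fields and a hypothetical secret key: every field the business does not need is dropped before storage, and the email address is replaced with a keyed hash so records can still be linked. Encryption in transit and at rest is a separate concern that this sketch does not cover.

```python
import hashlib
import hmac

# Hypothetical allow-list: the only fields this imaginary business needs.
ALLOWED_FIELDS = {"course_id", "score", "country"}


def minimize_record(record, allowed_fields=ALLOWED_FIELDS):
    """Return a copy of the record containing only allow-listed fields."""
    return {key: value for key, value in record.items() if key in allowed_fields}


def pseudonymize(value, secret_key):
    """Replace an identifier with a keyed SHA-256 hash (HMAC).

    The same input and key always give the same output, so rows can be
    joined without storing the raw identifier. The key has to be stored
    separately from the data for this to offer any protection.
    """
    return hmac.new(secret_key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# Illustrative input only; none of these values come from the episode.
raw = {
    "email": "learner@example.com",
    "name": "A. Learner",
    "course_id": "6.00x",
    "score": 87,
    "country": "US",
}

stored = minimize_record(raw)
stored["learner_key"] = pseudonymize(raw["email"], b"hypothetical-secret")
print(sorted(stored))
```

The allow-list is the design choice worth noting: new fields are excluded by default and must be justified before they are ever written to storage.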
NEIL 34:37	Yeah. I'll totally second what you said there about-- especially in terms of PII. If you don't need it, if you're not using it for something, don't keep it. Which-- it's so funny. When I was in consulting, say 10 years ago, when Hadoop was just starting to get really popular. And part of Hadoop HDFS is being able to store tons of data on cheap hardware for the first time. It was really the opposite, is what we were telling our clients. "Keep everything. You never know when you might be able to use it to optimize your business processes." But I think we've learned a lot since then. And Brian, in terms of kind of the data science side of the coin rather than data engineering-- instead of thinking about what to store but what to use when building your models, that's a hot topic these days. Just because you could accidentally build a biased or discriminatory system and not even know it. So yeah. Make sure you kind of know exactly what demographic information you're feeding into your models. Lest you build something that is going to bite you later and be unfair in practice.
TOBIAS 36:06	Yeah. And bringing that back to the data engineering layer too, that factors into your data collection strategy. And it's also important to track provenance of the data and useful metadata about the life cycle as far as what transformations were made. Because any of those things can start to introduce bias in terms of how you clean the data, how you normalize the data. Do you only accept the first name and the last name field? Because that's going to exclude huge portions of the global population because there are a lot of places where those ideas don't really make sense in terms of how they refer to themselves or-- there's a great talk I'll refer people to by-- of course I'm going to forget her name. I'll have to send it to you later for putting it in the show notes. But there's a great talk that I've seen that goes into the ideas of how form fields can just sort of implicitly exclude people because of the assumptions that go into building them. So things like name, gender, age, ethnicity. Addresses can-- they're different all over the world. So just trying to take all of that input as freeform as possible. And then do whatever normalization you can after the fact, rather than constricting the ways that people can provide information to you. That's one way that bias can creep in. And then in terms of collection. So polling is an interesting idea as far as how bias gets introduced, because who do you poll? Where do you poll? Are you sure that you're getting a decent cross-section of the sort of demographics of the populace that you're trying to create estimates for? And then as far as data collection from a privacy perspective, are you using tracking systems that are also farming that information out to third parties? Just trying to maintain ownership of the data throughout its entire lifecycle and make sure that you have a good understanding of where it came from, what happened to it, and where it's going.
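One way to picture the provenance tracking Tobias describes is the small Python sketch below. It is an illustration under assumptions, not a tool mentioned in the episode: the `TrackedDataset` class, the `signup_form_v1` source name, and the sample rows are all hypothetical. Each transformation appends an entry recording what ran and how many rows went in and came out, so a cleaning step that silently drops people with a single name shows up in the history.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class TrackedDataset:
    """Rows plus a record of where they came from and what was done to them."""

    rows: list
    source: str
    history: list = field(default_factory=list)

    def apply(self, step_name, transform):
        """Run a transformation and return a new dataset with the step logged."""
        new_rows = transform(self.rows)
        entry = {
            "step": step_name,
            "rows_in": len(self.rows),
            "rows_out": len(new_rows),
            "at": datetime.now(timezone.utc).isoformat(),
        }
        return TrackedDataset(new_rows, self.source, self.history + [entry])


def drop_missing_last_name(rows):
    """A cleaning rule that looks harmless but excludes single-name people."""
    return [row for row in rows if row.get("last_name")]


raw = TrackedDataset(
    rows=[
        {"first_name": "Ada", "last_name": "Lovelace"},
        {"first_name": "Sukarno", "last_name": ""},
    ],
    source="signup_form_v1",
)

cleaned = raw.apply("drop_missing_last_name", drop_missing_last_name)
for entry in cleaned.history:
    print(entry["step"], entry["rows_in"], "->", entry["rows_out"])
```

Because `apply` returns a new dataset instead of mutating the old one, the raw rows stay available for comparison when a step turns out to have removed more than intended.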
BRIAN 38:14	Okay. Anything else we wanted to talk about?
NEIL 38:19	I would just add, just on the last topic, I think people are starting to build ethics into their processes at this point. So just the other day, one of our machine learning engineers posted on Slack this new library they came across called the-- what is it? Ethical ML? I haven't gotten a chance to test it out yet. But basically, it's a toolbox to kind of check for biases in machine learning models. So some cool work being done out there.
TOBIAS 38:58	Yeah. And there's an interview I did a little while ago. And of course, the name is escaping me again because I've talked to so many people. But basically, focused on a concrete implementation of the O'Reilly post of the value of checklists from an ethics perspective, because it's hard to have an automated system that can just run through everything and say, "Yep. You're good." But it's useful to have that checklist process just to make sure that you're thinking about all of the different aspects of how ethics can creep into the system that you're building as you're going through the lifecycle of the project and not just at the outset say, "Yep. This is what we're going to do." But having to go back periodically and check your assumptions and check your drift to make sure that you are adhering to the standards that you set out for yourself and to make sure that everybody's thinking about it throughout the entire lifecycle of the analytical process.
NEIL 39:51	Tobias, I kind of wanted to ask you - going back to a topic we were doing earlier - about kind of how data scientists, data engineers work together and how organizations are specializing even further between those two roles. So you've talked to tons of people, just through your networking and your podcasts. But how does that all work at your organization at MIT? How do you work with data scientists and others?
TOBIAS 40:21	So in the group that I'm with, there's one data scientist that we have on staff. And he actually sits right next to me. So I talk to him fairly regularly about the sort of types of data access that he needs. If he's trying to solve some problem, I'll try and understand. Not just, "What data are you asking for?" But, "What is the end result that you're trying to get to?" Because the way that he's thinking about the problem and trying to gain access to certain data sources isn't necessarily the only way or possibly the best way. And I might be able to come up with a different solution that's easier or better in terms of maintaining a stricter control as to the data for the end-user. So just making sure that there is that alignment as far as what is the end goal and not just being somebody who takes orders and fulfills them. Work together to make sure that everybody is working towards the same ends and trying to find the optimal solution from an end-to-end perspective.
BRIAN 41:22	Great. All right. Let's go into our final segment here, which is the community picks. So we've already name-dropped about a hundred different people - by my rough count - and many, many links that we'll put in the show notes. But what should we focus people on? What are the one or two things from each of you that has been interesting lately that we want to point people to for kind of further delving into the topics here?
TOBIAS 41:47	So one of the sort of top of mind things right now, because it's an episode that I'm editing right now that will go out shortly, is I was speaking with the founder and CEO of a company called Datacoral about the way that he's leveraged serverless technologies to make an abstraction layer over the end-to-end batch processing of data to make it easier to integrate systems without having to worry about all of the nitty-gritty details of building your ETL pipeline and making sure that it's working reliably. And just trying to bring the data engineers up a level to just think about what is the actual business need, where do I need to get data from and to, and not have to worry about all the processing steps in between. So that was really interesting, the way that he's thinking about it, the way that he's approaching it. So that was pretty fascinating. And then, yeah. I guess I'll leave it at that as far as things that are interesting in the community. I mean, there's so many different things to talk about. I could go on ad infinitum for that. So I'll stop myself here [laughter].
BRIAN 42:52	Cool. Neil, how about you?
NEIL 42:54	Yeah. I mentioned it before. I guess I'll give a shoutout to the Stitch Fix blog. Just because we're talking about how most organizations don't have the resources that the Facebooks and the Googles have. I think Stitch Fix might be kind of that in-between size where they're doing a lot around AI and ML and have a lot of great data scientists on staff. So I like what they're doing. And they share a lot of what they're doing on their blog. And then, of course, I'll plug our own blog, the Alteryx Data Science blog. Just because we were talking earlier about how it is important to understand the fundamentals, especially when you are just getting started. And lately, on the Data Science blog - the Alteryx one - we have been talking a lot about those fundamentals. Like Occam's razor, the no free lunch. Things like that. Things that you should know when you're getting started in data science.
BRIAN 43:58	Awesome. And so, my pick-- kind of going in a little bit different direction. But we just announced that we have booked Malcolm Gladwell to come and be the keynote speaker at our upcoming conference in Nashville. And if you don't know Malcolm Gladwell, first of all, shame on you. But second of all, he's got I think five different books that he's put out over the years. They're all incredible. He has a podcast called "Revisionist History" that I think is on its 4th season now, I think. Really, really insightful guy. Really amazing stuff. And I think something that almost anybody can kind of dig into. And he has a pretty cool way of articulating his points and his thoughts and feelings on different matters. So we'll link to a bunch of that stuff in the show notes. But, if you don't know who Malcolm Gladwell is, definitely go check him out. He's a really interesting guy. All right. Well, thanks, gents, for being on. This has been crazy insightful. I think, like I said, we've dropped so many different names and different links and things. The show notes are going to be super packed and plenty of great stuff for people to follow up on. So thanks for being on. It's been great.
TOBIAS 45:09	Thanks for having me.
NEIL 45:10	Yeah. Thanks, Brian. Thanks, Tobias. [music]
BRIAN 45:22	Thanks for listening to Alter Everything. Go to community.alteryx.com/podcast for show notes, information about our guests, episodes, and more. If you've got feedback, tweet us using the hashtag #AlterEverything or drop us an email at podcast@alteryx.com. Catch you next time. [music] So can I just say, Tobias, that one of the things that frustrates me is, you come on the show and you talk about all your credentials and all of this wonderful stuff you're doing and all of the amazing things, and then you subtly drop the fact that you're a carpenter in there and then don't explain it. So we got to talk about this, man. Talk to me about being a carpenter. How did you get into that? What's that like? Why do you have so many skills? [laughter] Explain yourself.
TOBIAS 46:24	Well, my father's a carpenter. He's been self-employed my whole life. And so, I just grew up doing work for him. So I started swinging a hammer when I was about four years old. And, as I was growing up, I realized that it's not really the career path I wanted. Because it's a valuable skill, it teaches a lot of useful lessons, but it's also hard and doesn't pay as well as it should. And it's just a lot of back-breaking work. So I wanted to do something that was a little bit more focused on sort of brainpower and not necessarily trying to wreck my body day after day. So I've acquired all those skills by virtue of growing up with them. But it's not something that I do as part of my sort of primary occupation.
BRIAN 47:07	I see. So I recently had some new doors installed in my house, like interior doors. And I watched the guys come in and shave them down and shim them up and all of that. And they were perfect. And then the painters came and took them off and took them outside and painted them and brought them back in. And now, one of the doors is just hanging weird. Do you have any advice [laughter] for those of us-- if you're still listening to the show, do you have any advice for how I might solve this door hanging problem?
TOBIAS 47:34	Well, I guess the first question is, which way is it hanging weird? Is it tilted up? Is it tilted down? So open up the door. Take a small level. Put it across the top of the door. See which way it's tilted. And then, unscrew the hinge that is most likely to be the culprit. And then, just put a piece of cardboard or a thin sliver of wood behind it to try and shim it out a little bit. You also want to make sure that there's space on the other side of the door where it closes so it's not going to rub against the jamb when you do add that shim. So you might need to take out a shim from the other one. So just sort of tinkering with it, figuring it out, based on whether the door is level and what the spacing is on either side of it. Also, by virtue of growing up as a carpenter, I've gotten used to tinkering with my plumbing. So a lot of the plumbing in my house I do myself [laughter]. I've had to replace the passive water heater in my house. So I've picked up soldering skills. So don't be afraid to tinker. Don't be afraid to learn new things. Everything is a system. Figure out how it works and just get in there and do it.
BRIAN 48:37	So now you're a plumber too? That's what you're telling me [laughter]?
TOBIAS 48:41	Not professionally. I'll never admit it to somebody who wants to pay me money for it [laughter].
BRIAN 48:46	That's a solid plan. All right. Well, I'm going to go try and figure out what's going on with my door. If I need anything, I'll hit you up. Maybe we'll start a new podcast. Carpentry with Tobias.
TOBIAS 48:58	Standard rates start at $300 an hour [laughter].
BRIAN 49:02	Man, dude, I'm in the wrong business. What are we doing?
TOBIAS 49:04	Well, that's how I make sure that nobody asks me for help, is I [laughter] price myself out of the market [laughter].

BRIAN 00:13 [music] Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Brian Oblinger and I'll be your host. Neil Ryan and I chat with Tobias Macey, host of The Data Engineering Podcast and Podcast.init, about foundations and ethics of data engineering. Let's get into it. [music] All right. Tobias, Neil, welcome to Alter Everything. TOBIAS 00:47 Thanks for having me. NEIL 00:48 Yes. I'm here. BRIAN 00:49 So, Neil, you've been on the show a couple of times. So, we know who you are for the most part. So we're going to start with Tobias on this one. Tobias, why don't you give us a little bit of your background? How did you get into this field? Tell us about your journey a little bit. TOBIAS 01:04 Sure. So I started in the data space largely by being a systems administrator. I was the only person running the servers for a small company up in Vermont. And so it was just a lot of trial by fire, learn by doing. And from there, all of my different roles had some measure of backend administration or building databases or interacting with databases. And now, I work as the manager for the DevOps team at Open Learning at MIT. And so I'm dealing a lot with our backend systems and trying to provide data access to our data scientists, to be able to build business intelligence dashboards and just starting to figure out how to try and integrate all of our different data sources to make them useful. I also run a couple of podcasts. I've been running Podcast.init for about for years now? Something on that order. And a couple of years ago, I started The Data Engineering Podcast. And so I've been spending the last couple of years talking to people all across the data management space, as far as how they do things, distributed systems problems, storage issues, data cleaning. It's just an area that's interesting and vast and doesn't get nearly as much attention as it should. And yeah. So that's sort of what I've been up to. BRIAN 02:22 Cool. Yeah. 
I've gotten several jobs in my career as well, it was just like literally being like, "Well, he was the only guy who was doing it. So I guess we'll just let him [laughter] do that now, going forward." So yeah. That's really awesome. And, Neil, for our new listeners, I guess, that are tuning in for the first time and maybe don't know all about you, maybe give a quick recap for us. NEIL 02:44 Sure. I've been doing analytics my whole career. From working at an insurance company, setting prices in the actuarial department, to consulting and doing fraud detection analytics. I've been at Alteryx now for over four years. Right now, I'm in the community team. So writing articles about analytics and Alteryx, as well as using the data we have that backs up the community to help inform decisions to make improvements to the community. BRIAN 03:18 Very cool. So one of the things I wanted to talk about-- and Tobias mentioned this off the top. And I know, Neil, you've been kind of thinking a lot about data engineering. Let's talk a little bit about that. What is it? Why is it important? And how do we think that that's going to take off? Maybe, Tobias, we'll start with you on that one. TOBIAS 03:37 Sure. So, I mean, just like with terms like DevOps or data science, data engineering is one that can be open to interpretation. But, in the sort of broadest sense, it means that it's the role for the person who's responsible for ensuring that all of the company's data is available and accessible and stable and usable for advanced analytics capabilities. So it largely evolved from things like business intelligence admins or DBAs, who were responsible for ensuring that the data warehouse was up and that it had all of the data and that the ETL pipelines were running. 
And, as the velocity and volume and variety of data has been increasing, the requirement for more advanced systems understanding and ability to deal with data at sort of larger velocities, real-time streaming data-- it started to encapsulate all of those responsibilities. But, depending on the organization where you are, the needs for the data engineering team are different, based on the types of data that they're working with. So it could still mean the person who's responsible for the data warehouse or it could mean somebody who is keeping the Kafka pipeline running to their Spark engine, to make sure that they're able to do real-time ML modeling. But, in the broadest sense, it means the person who's responsible for ensuring that the data is usable and properly secured and accessible to the people who need it. BRIAN 05:05 Yeah. I popped onto our good friend glassdoor.com, just wanted to kind of take a look there. And I'm seeing pretty high salary ranges, looks like they're pretty highly sought after. I mean, what are the skills that are required before someone gets to that level? How would you suggest someone starts building up their career in that direction? TOBIAS 05:28 I mean, there are any number paths to that type of career. But some of the common needs for a data engineer are understanding of data formats, so things like JSON or AVRO or Parquet flat files; understanding of storage concepts, so how do I ensure that there's high availability; understanding of how to transform data and some of the business needs around it, so you might be willing to accept lossy transformations if the end result provides the value that it needs or you might need to preserve all of the data at some step of the pipeline. 
So you can come in from Academia having a high level of understanding of data modeling and relational algebra and things like that or you could come in from a systems admin perspective of somebody who knows how to run complex systems and make sure that they're up. There are, sometimes, subspecializations within data engineering where you're focusing on the data infrastructure, where you're responsible for making sure that your database is up, Kafka is up, Spark is running, everything is clustered together properly, or it might be the person who's writing all of the transformation logic to populate their ETL pipeline, whether it's running Airflow or Luigi or just a series of bash scripts. So there isn't really one path into data engineering. But some of the common needs are a decent understanding of software engineering principals, to make sure that your code is reliable; decent understanding of systems and some of the issues that are sort of inherent to distributed systems and network environments; understanding of the business needs around the data to make sure that you're able to provide the data that is necessary and any requirements as far as extraction, making sure that you're not overloading the source systems by just doing a full dump of everything on a nightly basis; understanding the requirements as far as how fresh the data needs to be. So a lot of different things that go into it. And it's highly variable, depending on the organization that you're working with and the needs that they have. NEIL 07:34 Yeah. I would just add, just in terms of skillsets, something that our data engineers here focus on a lot is automation. So it's something that Tobias alluded to with bash scripts. But making sure that you have processes running on a regular basis to make sure your data is all kept up to date. 
The other thing that, Tobias, you mentioned to an earlier question about how did data engineering come about-- and you mentioned coming from kind of DBA, SysAdmin world. I think another reason it's exploded in the last few years is just because it's a big part of the data scientist's job as well to make sure the data's in good shape, it's fresh, it's ready for advanced analytics, as you said. But it's still just such a big job that data scientists were spending so much time just on that data engineering aspect that, in the last couple years, I think teams have realized it's worth splitting out the function and dedicating resources just to that exercise so that data scientists can be more efficient with their analysis on the good data. TOBIAS 08:54 Yeah. And I think that that's a big part of why it started to become its own position and title and organizations. Because of the fact that they are employing these data scientists who are generally highly paid, have an advanced understanding of the analytical needs but end up having to sort of self-learn a lot of the aspects of the data cleaning and collection process. And so having somebody who can specialize on that piece of it because it's not a small portion and it's definitely critical to the final success of the analytics. So the companies are willing to invest in having a separate role for that, to make sure that their data scientists are able to be leveraged to their maximum effect. BRIAN 09:33 Yeah. And I would just say too, from my perspective, I'm more of the layman in the room, right? And it seems to me as the rise of machine learning and AI and these kinds of things you hear about-- as those become more prevalent in organizations, I think they-- a lot of companies kind of go into that full force, right? "Yeah, we're going to go do AI." And then they get one foot in that door and they realize, "Oh. I better figure out my data stack--" right? "--and figure out how to get that going." 
And that's probably where that role is going to expand over time and be more important in organizations, because of all these AI and machine learning related initiatives that they're [trotting?] out there. You got to have that data first. TOBIAS 10:15 Yeah. And that's also where you're starting to see even further specialization of roles, where-- I read in a O'Reilly post a while ago and I've seen it pop up other places with the concept of a machine learning engineer that sits somewhere in between data engineer and data scientist, of somebody who understands their needs of the data collection process and what's required to be able to effectively train and deploy a machine learning model. And the data scientists who are potentially going to be working more on the theoretical or experimental aspects of figuring out what should the model even be doing. And the data engineers who are responsible for managing the infrastructure and the data collection and cleaning process that goes into that modeling process. So a lot of these things, as far as how many degrees of separation there are between the different roles or even if they are different roles, is dependant of the size of the organization and the scale that they're operating at. Where a small company might have a data scientist who's also doing the data engineering and is, essentially, the machine learning engineer at the same time. Or you have places on the scale of Facebook and Google, where you have data infrastructure engineers, you have ETL engineers, you have database management engineers, and then you have machine learning engineers, and then you have data scientists who are trying to push the boundaries of AI research. So the size and sort of resources of the company can help push the degree of separation between those different roles and how many different people are filling those needs. 
NEIL 11:40 I find the topic of specialization in data science pretty fascinating, just because it's all being figured out right now. So I think for the really large companies, the Googles and Facebooks, it obviously makes a lot of sense to specialize. For the really small companies where you only have one data scientist on staff, it makes sense that they're going to kind of do everything. But for the midsize companies, I think there's no script for how to build these teams and divvy out roles. So I think it's just interesting how all these companies are just kind of figuring it out, testing different things. I read a blog from Stitch Fix recently where they were really kind of making the argument against specialization that, getting too specialized, you lose kind of the high-level view of everything that's going on into end. And it kind of results in less innovation. So I found that a pretty interesting article. As the industry tends to specialize here more and more, it's interesting reading those differing viewpoints where maybe some companies are taking it too far. TOBIAS 13:01 Yeah. And that's a conversation that happens in basically every aspect of the technological industry, where you have people who argue against specialization of frontend versus backend for maybe a web application engineer, where they're arguing for full-stack engineering or specialization in terms of systems management, where you have somebody who is the DBA and then somebody else who is the network engineer versus somebody who is responsible for all of the systems. And all of the different technological paradigms shift and push the needs for specialization back and forth of whether it's better to be a generalist or a specialist. And, yeah. It's an ongoing debate. And I think that you're right. It's useful to have a broad understanding of the needs of any of your tasks. And then there are cases where you need to have deep specialization in a particular field of it. 
And that's where I think it's interesting to see the conversations that are happening in the area of topics like DataOps or DevOps, where it's important that the way that you structure your teams helps to enforce and encourage collaboration between team members so that you do have that broad scope and so that you don't have somebody going too deep down the rabbit hole and losing the broader context so that other people can understand how the entire system fits together and they can build a system that actually functions, rather than have one piece of it that works really well and then the rest of it falls apart because nobody knows what it needs or what it's supposed to output. NEIL 14:29 Yeah. I guess you can even take it to kind of the human level, the philosophical level about, if you get too specialized and you're just doing the same tasks over and over, it's not as fun of a job. You won't have your employees being quite as satisfied. But you obviously have to balance that with making things as efficient as possible, thinking about the ROI. So it's an interesting debate. And as you say, Tobias, it's more than just data science. It's pretty much-- every industry probably looks at that issue. But I guess the most interesting part about it from data science is just that it's such a new industry that's figuring it out right now. TOBIAS 15:10 Yeah. And there are increasingly new areas that you could specialize in that maybe didn't exist six months or a year ago. BRIAN 15:17 Yes. So before we move on, Neil, I wanted to just pause for a moment and just ask you. So you just dropped Stitch Fix as a name. So what were you doing, man? You're doing some shopping? You're getting some clothes? Can we put links in the show notes to your new threads or what? NEIL 15:33 I have tried Stitch Fix before. It was an okay experience for me. But I actually just read their blog because I find what they do pretty fascinating. They're doing cool stuff [laughter]. BRIAN 15:45 Sure. 
Sure you do. Sure. Yeah. Use code "Neil" at checkout for 10% off your-- no [laughter]. Awesome. Okay. So, Tobias, one of the things Neil and I-- in preparation for this show, we obviously went and listened to some of your other podcasts, which are pretty amazing by the way. So congrats on that. We'll make sure we drop some links in the show notes for our audience to go check those out. TOBIAS 16:10 Thank you. BRIAN 16:13 Yeah. Really good stuff. So one of the things that Neil and I kind of noticed when we were listening to these is that your guests range pretty widely. So you've had some evangelists on there, like Wes McKinney, kind of talking about some different things. Calvin French-Owen, for example. Sort of this idea of open-source-- and I'm not going to say versus proprietary. But open-source and proprietary kind of working together to make the dream unfold. What's your kind of position on that? And how do you see open-source, either in conjunction with or - in some cases - versus proprietary? TOBIAS 16:51 All right. Well, before I start talking, I'm going to put on my asbestos suit to avoid the flame wars [laughter]. It's a very nuanced discussion, but - in broad scope - I'd say that both are necessary to be able to push the industry forward. So open-source is valuable because it allows for a lot of experimentation in the open. It allows for innovation of people building on top of stuff that other people have used. It helps to empower new businesses because they don't have to pay out thousands of dollars for the SAP Suite or whatever to be able to get off the ground. They can start with some open-source tools and some experimentation to make sure that they can get things running. But, at the same time, you have companies that say, "I just need to be able to meet my bottom line and get my products out the door. 
I don't want to have to figure out how all these open-source projects are supposed to fit together," because open-source is only free if your time is worth nothing, as the saying goes. So it's useful to be able to have these proprietary solutions where you can just hand them a check and say, "Do what it is that you do best. And let me get on with my business." TOBIAS 17:57 And at the same time, by having organizations that produce proprietary software and are able to bring in revenue, it helps them be able to have people on staff who are able to contribute back to open-source. Because - particularly in this day and age - every company that is producing proprietary software is also probably using open-source at some level, even if they're not producing it on their own. So being able to employ engineers to contribute back to the open-source that they use or companies like Stripe, that has a sort of internship program or - what's the word that I'm looking for? - a fellowship where they'll give somebody a stipend to work on some open-source project for a while. There's a lot of figuring out that's happening right now as far as how corporate companies can be responsible stewards of open-source or how to build sustainable open-source funding models, like the folks that are at Tidelift are doing. And then you have these issues that are coming up around sort of the lower-level components and how they fit into open-source, with places like MongoDB and Kafka that are trying to change their licensing to avoid companies like AWS sort of consuming their profits. But I think that's a bit of a false dichotomy. I'm not going to delve too deep into it because that's not my area of expertise. But I have spoken to some people on that area. So it's definitely worth exploring on your own. 
But I think both open-source and proprietary software are necessary to be able to have a vibrant technical ecosystem, particularly because people who are solely focused on open-source, they make amazing technical contributions. And they make amazing tools. But they don't always have the type of polish that you might need or want from something that you're just consuming as an end-user.
NEIL 19:52	Yeah. I think that's right. You have to have both. And I think they so often work hand in hand. A lot of amazing open-source projects come from proprietary roots. Google developed TensorFlow and then open-sourced it. So they work hand in hand and go together quite nicely I think, usually.
BRIAN 20:16	Anything else we want to piss off the entire internet with [laughter]?
TOBIAS 20:20	Vi versus Emacs. Tabs versus spaces.
BRIAN 20:24	Oh, God. So tabs versus spaces. Let's-- no. I think I actually did that with Neil on our prior podcast [laughter]. And, yeah. I don't think we heard much about it. Maybe we were too small at that point. But yeah. We were trolling, essentially.
NEIL 20:38	Don't forget R versus Python.
TOBIAS 20:40	Oh, yeah. That's a good one. Why not both? That's what Arrow's for.
BRIAN 20:42	All right. Well, let's talk about that. Yeah [laughter]. So yeah. R versus Python, Tobias. Where are you coming down on that one?
TOBIAS 20:51	Whichever one makes the most sense for your needs. I mean, there are definitely packages that exist in R that don't have a useful analog in Python. And so, if that's what you need, by all means, use it. Similarly, if your primary concern is being able to incorporate the broader Python ecosystem and incorporate your machine learning model with your Django application and deploy it to your infrastructure using SaltStack or Ansible, well then it probably makes sense to use Python. Because R is fabulous for statistical analysis because that's what it was built and designed for.
It's not necessarily what you want to use for your production environment, running all of your transactions and customer-facing environment. You can use the machine learning model that you built in R and embed it as a microservice within a broader application, but it's not something that I would want to use to build my website with because it's not what it was designed for.
NEIL 21:44	They're both super popular. They're both never going away. And they both have their strengths. Although, what I find interesting is there is so much overlap in functionality from the data science perspective. And, yeah. I listened to the Wes McKinney episode of your podcast, Tobias. And what I like so much about his effort around Arrow is that he's just trying to reduce the duplication of effort across the development on both languages. If it's done really well in one language, why not share that so that you can build off that and have their core competencies shine even more and not waste time just duplicating effort on both languages?
TOBIAS 22:35	Yeah. And that's a trend that I see more broadly as well, is this idea of being able to unify and standardize in certain interfaces so that we can reduce the amount of effort that's needed across communities so that we can just sort of keep the acceleration of technical progress going, rather than having to spend time rewriting everything because the tool that you need isn't written in your favorite language of the day. So projects like Arrow that allow for being able to share data across different runtimes. Projects that are standardizing on the scikit-learn-style APIs in the Python ecosystem. There's work being done to try and standardize the NumPy API so that it can be used as an interface for multiple different projects without necessarily having to have the specific NumPy runtime underneath because that brings in other dependencies, like C++ and Fortran.
Projects like Apache Beam that provide an abstraction layer for streaming systems. SQL [laughter]. It's a universal standard. It's used everywhere, so you can-- there are sort of caveats to that. But just being able to have these standard interfaces that you can use everywhere to make it easy to build on top of, rather than having to spend your time rewriting everything from the ground up.
NEIL 24:00	It's hard though, right? To kind of form consensus across these types of things. An interesting analog I was just reading about recently is the history of RSS, the XML based web syndication format that is not as popular as it used to be and, some would say, has kind of died out. But I was just reading about the history of that. I didn't know much about it. And there were a few groups that were trying to work out a common unified format and couldn't come to an agreement and ended up with two separate RSS formats. And the article I was reading was kind of arguing that while they were fighting about that and losing valuable time where they could be improving upon the format, social media rose up. Twitter, Facebook. And basically, that's what people use now to see syndicated feeds of content. So I'll follow Wes McKinney's efforts with Arrow. I'm curious to see how he'll be able to pull that off.
TOBIAS 25:07	Yeah. It's actually being used pretty widely already. And before I go too far down this topic, I will say that - for what it's worth - RSS is still alive and well on the podcast [laughter] ecosystem. Although, there are efforts to try and circumvent that as well because it prevents data collection and sort of personalization. So that's another whole topic that we don't need to get into right now.
So, to your point about Arrow, there are projects such as [inaudible] that rely on Arrow for being able to provide an in-memory layer to make it easier to join across multiple different data sources to provide analysis on top-- be able to build a business intelligence platform there. It's used, optionally, in Pandas so that you can have data frames that can be used for both R and Python. It's able to be used in Spark so that you can reduce some of the serialization and deserialization cost going between things like PySpark and the JVM. So it's definitely being pretty widely used already. But yeah. I agree that it's interesting to see where it's progressing. Because it started as primarily just a means of having a standard data frame layer for being able to have in-memory data sharing. But it's starting to grow to include a lot of aspects of reading and writing data to and from different formats and storage engines. Because, as was said in the podcast, it's a systems-level problem. And, in order to make sure that it is the most useful that it can be, it requires incorporating some of these other layers into it to be as fast and efficient as possible.
NEIL 26:48	Cool. Well, that's good to hear that it's getting some traction out there.
BRIAN 26:51	Yeah. Neil, one of the things you hit on that I wanted to talk about with both of you is-- you were talking about social media and people kind of going there for their syndicated content feeds and learning and things like that. Where are you both going these days? Where should we point people from a best practice, resource perspective? What are the best tools that people should be looking at to either take their career to the next level or maybe they're just starting out in something like this?
NEIL 27:18	I'd say it really depends on your learning style. There's so many great - I'll call them freemium - courses out there now. So you can take free courses on Coursera.
I'm calling it freemium just because you need to pay to get kind of certified or a degree. But, if that's kind of the way you like to learn - in a more kind of course, lecture-style atmosphere - there's so many great free courses out there around data science. I love DataCamp for that kind of stuff. But there's tons of other Codecademy, Coursera courses out there as well. If you like to learn differently, like with books, O'Reilly is the authority on that. And then, kind of finally, just YouTube [laughter]. You can learn all that stuff on YouTube if you just kind of like watching the videos.
TOBIAS 28:24	And to your point too about courses, I'll also put a plug out there for edx.org. And the MITx brand has a lot of useful material, as well as a number of other universities and organizations. There's also stuff out there from the Cloud Native Computing Foundation for things like Kubernetes, which is becoming increasingly used in the data and machine learning environment. I'll also say, conferences are a valuable way to boost your career, both from an educational perspective and it's a great way to do a lot of sort of fast networking, meet a lot of people, understand what the problems are in the industry, talk to vendors to understand what types of problems they're trying to solve. It's also a place where a lot of companies would go to try and recruit. Local meetups are useful if you happen to have any in your area. If there isn't one, there might be remote meetups that you can join or you can try starting one if you're motivated. I also second the choice of books. And O'Reilly does have a great suite of books on various topics pertaining to data engineering, data science, infrastructure. Yeah.
TOBIAS 29:31	And just get out and talk to people. If you see a company that's doing interesting work, follow their blog. As somebody who runs a podcast, I've found that people who are doing interesting work really like to talk about it.
So send them a message if you can find their contact information and just say, "Hey. I really like what you wrote about in this blog post. I'd be curious to just talk to you for 15-20 minutes to learn more." If they happen to be in sort of your geographic region, invite them for a coffee. Otherwise, just send them an email. Maybe ask a few questions. Try to provide some value. People who are working in technical fields really like to mentor. So you might say, "Hey. I'm new to this field. I'm trying to learn. Would it be possible for me to sort of periodically ask you some questions?" And just make sure that there's some sort of value exchange. Not necessarily money. But make sure that they feel that it's worth their time to answer your questions, whether it's through your own personal progression or contributing back to some of the projects that they work on. It's hard to overestimate the value of networking and just getting out and talking to people, in addition to more sort of personalized learning where you're consuming material.
NEIL 30:46	And I guess just kind of-- if you are new to the industry, data engineering or data science, my advice would be, don't get too caught up on kind of specific technologies. So Tobias probably mentioned a dozen different kind of Apache projects already. And these things are just changing so fast that it's really impossible to keep up with them completely. And so, it can be a little overwhelming. So if you're just getting started, I'd say focus on kind of the more basic techniques and the concepts. And the learning about all the different projects out there-- the Spark, the Kafka, the Hadoop. That'll come. You'll kind of learn that through osmosis. I don't think you have to try to kind of learn all those all at once.
TOBIAS 31:39	Yeah. And I'll also advocate for trying to understand the fundamentals because the specific technologies are going to change over time but the fundamentals are always going to be there.
So understanding aspects of storage and some of the issues around networking and distributed systems concepts. I'll also advocate for newsletters. There are a lot of great ones out there. There's Data Engineering Weekly, which does a good job of curating interesting and sort of topical news. Podcasts. I'll put out a plug for The Data Engineering Podcast, in case we haven't done that enough already. Yeah. So there are tons of resources out there. Lots of them are free. It's also worth it to find some paid ones if you find that they're going to sort of provide the acceleration that you need. Because sometimes trying to consume free material can be useful but it can take a bit longer. Whereas, if you can find something that's paid and more curated, it can give you a sort of a faster ramp-up.
BRIAN 32:39	The other thing I want to talk to you, Tobias, about is-- well, and Neil, I'd love to get your thoughts. We've been talking a lot on this show about ethics and privacy and all those hot topics of the day. Most of the folks we've talked about it with are more analysts and data scientists type roles. I'm interested kind of more from the maybe behind the scenes, backend, data engineering piece. How do you think about those topics, the ethics of what you're doing and how you design for privacy? Curious to hear your thoughts on that matter.
TOBIAS 33:12	If you don't absolutely require personally identifiable information for your business, don't store it ever. If you don't have it, then you can't lose it. That's sort of rule number one. If you do need it, make sure that you have strict controls around access to it. Make sure that it's encrypted in transit, at rest, everywhere that you can. Just be very diligent when you're dealing with people's personal information. From an ethical perspective, that's a deep well to get into. But obviously, just try to do what you think is right. Don't be afraid to push back at the business.
I mean, I recognize that there is some implied privilege in that statement because some people might be in a position where they're not able to have that sort of leverage. But, whenever you're able to, either advocate for yourself or, if you see somebody on your team who is being told to do something that you think is unethical, that they think is unethical but they're not in a position to sort of push back at the business, try to do it for them. It's the responsibility of the business and the organization who's using the data to be ethical. But it's also everybody else's responsibility, too. So don't abdicate your ethics just because you think that you're being pressured into it. It's everybody's responsibility. I guess that's where I'll leave it at.
NEIL 34:37	Yeah. I'll totally second what you said there about-- especially in terms of PII. If you don't need it, if you're not using it for something, don't keep it. Which-- it's so funny. When I was in consulting, say 10 years ago, when Hadoop was just starting to get really popular. And part of Hadoop HDFS is being able to store tons of data on cheap hardware for the first time. It was really the opposite, is what we were telling our clients. "Keep everything. You never know when you might be able to use it to optimize your business processes." But I think we've learned a lot since then. And Brian, in terms of kind of the data science side of the coin rather than data engineering-- instead of thinking about what to store but what to use when building your models, that's a hot topic these days. Just because you could accidentally build a biased or discriminatory system and not even know it. So yeah. Make sure you kind of know exactly what demographic information you're feeding into your models. Lest you build something that is going to bite you later and be unfair in practice.
TOBIAS 36:06	Yeah. And bringing that back to the data engineering layer too, that factors into your data collection strategy.
And it's also important to track provenance of the data and useful metadata about the life cycle as far as what transformations were made. Because any of those things can start to introduce bias in terms of how you clean the data, how you normalize the data. Do you only accept the first name and the last name field? Because that's going to exclude huge portions of the global population because there are a lot of places where those ideas don't really make sense in terms of how they refer to themselves or-- there's a great talk I'll refer people to by-- of course I'm going to forget her name. I'll have to send it to you later for putting it in the show notes. But there's a great talk that I've seen that goes into the ideas of how form fields can just sort of implicitly exclude people because of the assumptions that go into building them. So things like name, gender, age, ethnicity. Addresses can-- they're different all over the world. So just trying to take all of that input as freeform as possible. And then do whatever normalization you can after the fact, rather than constricting the ways that people can provide information to you. That's one way that bias can creep in. And then in terms of collection. So polling is an interesting idea as far as how bias gets introduced, because who do you poll? Where do you poll? Are you sure that you're getting a decent cross-section of the sort of demographics of the populace that you're trying to create estimates for? And then as far as data collection from a privacy perspective, are you using tracking systems that are also farming that information out to third parties? Just trying to maintain ownership of the data throughout its entire lifecycle and make sure that you have a good understanding of where it came from, what happened to it, and where it's going.
BRIAN 38:14	Okay. Anything else we wanted to talk about?
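As a minimal sketch of the provenance tracking Tobias describes above, the following Python keeps the raw input exactly as it was provided and logs every transformation applied to it, so any cleaning or normalization step can be audited later. The `TrackedRecord` class, its field names, and the `signup_form_v2` source label are hypothetical illustrations and were not mentioned in the episode.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable


@dataclass
class TrackedRecord:
    """A value plus where it came from and what was done to it (hypothetical type)."""
    raw: str      # the input exactly as the person provided it, never modified
    source: str   # where the record was collected (hypothetical label)
    value: str = ""   # the current, possibly normalized, value
    history: list = field(default_factory=list)   # one entry per transformation

    def __post_init__(self) -> None:
        # Start from the raw input if no working value was supplied.
        if not self.value:
            self.value = self.raw

    def apply(self, step: str, transform: Callable[[str], str]) -> "TrackedRecord":
        """Run one transformation and record what it changed."""
        before = self.value
        self.value = transform(before)
        self.history.append({
            "step": step,
            "before": before,
            "after": self.value,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        return self


# Accept a free-form name as typed, then normalize after the fact.
record = TrackedRecord(raw="  Sukarno ", source="signup_form_v2")
record.apply("strip_whitespace", str.strip).apply("casefold", str.casefold)

for entry in record.history:
    print(entry["step"], repr(entry["before"]), "->", repr(entry["after"]))
```

Because the raw value is never overwritten, a normalization rule that later turns out to be biased can be corrected and replayed against the original input.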
NEIL 38:19	I would just add, just on the last topic, I think people are starting to build ethics into their processes at this point. So just the other day, one of our machine learning engineers posted on Slack this new library they came across called the-- what is it? Ethical ML? I haven't gotten a chance to test it out yet. But basically, it's a toolbox to kind of check for biases in machine learning models. So some cool work being done out there.
TOBIAS 38:58	Yeah. And there's an interview I did a little while ago. And of course, the name is escaping me again because I've talked to so many people. But basically, focused on a concrete implementation of the O'Reilly post of the value of checklists from an ethics perspective, because it's hard to have an automated system that can just run through everything and say, "Yep. You're good." But it's useful to have that checklist process just to make sure that you're thinking about all of the different aspects of how ethics can creep into the system that you're building as you're going through the lifecycle of the project and not just at the outset say, "Yep. This is what we're going to do." But having to go back periodically and check your assumptions and check your drift to make sure that you are adhering to the standards that you set out for yourself and to make sure that everybody's thinking about it throughout the entire lifecycle of the analytical process.
NEIL 39:51	Tobias, I kind of wanted to ask you - going back to a topic we were doing earlier - about kind of how data scientists, data engineers work together and how organizations are specializing even further between those two roles. So you've talked to tons of people, just through your networking and your podcasts. But how does that all work at your organization at MIT? How do you work with data scientists and others?
TOBIAS 40:21	So in the group that I'm with, there's one data scientist that we have on staff. And he actually sits right next to me.
So I talk to him fairly regularly about the sort of types of data access that he needs. If he's trying to solve some problem, I'll try and understand. Not just, "What data are you asking for?" But, "What is the end result that you're trying to get to?" Because the way that he's thinking about the problem and trying to gain access to certain data sources isn't necessarily the only way or possibly the best way. And I might be able to come up with a different solution that's easier or better in terms of maintaining a stricter control as to the data for the end-user. So just making sure that there is that alignment as far as what is the end goal and not just being somebody who takes orders and fulfills them. Work together to make sure that everybody is working towards the same ends and trying to find the optimal solution from an end-to-end perspective.
BRIAN 41:22	Great. All right. Let's go into our final segment here, which is the community picks. So we've already name-dropped about a hundred different people - by my rough count - and many, many links that we'll put in the show notes. But what should we focus people on? What are the one or two things from each of you that has been interesting lately that we want to point people to for kind of further delving into the topics here?
TOBIAS 41:47	So one of the sort of top of mind things right now, because it's an episode that I'm editing right now that will go out shortly, is I was speaking with the founder and CEO of a company called Datacoral about the way that he's leveraged serverless technologies to make an abstraction layer over the end-to-end batch processing of data to make it easier to integrate systems without having to worry about all of the nitty-gritty details of building your ETL pipeline and making sure that it's working reliably.
And just trying to bring the data engineers up a level to just think about what is the actual business need, where do I need to get data from and to, and not have to worry about all the processing steps in between. So that was really interesting, the way that he's thinking about it, the way that he's approaching it. So that was pretty fascinating. And then, yeah. I guess I'll leave it at that as far as things that are interesting in the community. I mean, there's so many different things to talk about. I could go on ad infinitum for that. So I'll stop myself here [laughter].
BRIAN 42:52	Cool. Neil, how about you?
NEIL 42:54	Yeah. I mentioned it before. I guess I'll give a shoutout to the Stitch Fix blog. Just because we're talking about how most organizations don't have the resources that the Facebooks and the Googles have. I think Stitch Fix might be kind of that in-between size where they're doing a lot around AI and ML and have a lot of great data scientists on staff. So I like what they're doing. And they share a lot of what they're doing on their blog. And then, of course, I'll plug our own blog, the Alteryx Data Science blog. Just because we were talking earlier about how it is important to understand the fundamentals, especially when you are just getting started. And lately, on the Data Science blog - the Alteryx one - we have been talking a lot about those fundamentals. Like Occam's razor, the no free lunch. Things like that. Things that you should know when you're getting started in data science.
BRIAN 43:58	Awesome. And so, my pick-- kind of going in a little bit different direction. But we just announced that we have booked Malcolm Gladwell to come and be the keynote speaker at our upcoming conference in Nashville. And if you don't know Malcolm Gladwell, first of all, shame on you. But second of all, he's got I think five different books that he's put out over the years. They're all incredible.
He has a podcast called "Revisionist History" that I think is on its 4th season now. Really, really insightful guy. Really amazing stuff. And I think something that almost anybody can kind of dig into. And he has a pretty cool way of articulating his points and his thoughts and feelings on different matters. So we'll link to a bunch of that stuff in the show notes. But, if you don't know who Malcolm Gladwell is, definitely go check him out. He's a really interesting guy. All right. Well, thanks, gents, for being on. This has been crazy insightful. I think, like I said, we've dropped so many different names and different links and things. The show notes are going to be super packed and plenty of great stuff for people to follow up on. So thanks for being on. It's been great.
TOBIAS 45:09	Thanks for having me.
NEIL 45:10	Yeah. Thanks, Brian. Thanks, Tobias. [music]
BRIAN 45:22	Thanks for listening to Alter Everything. Go to community.alteryx.com/podcast for show notes, information about our guests, episodes, and more. If you've got feedback, tweet us using the hashtag #AlterEverything or drop us an email at podcast@alteryx.com. Catch you next time. [music] So can I just say, Tobias, that one of the things that frustrates me is, you come on the show and you talk about all your credentials and all of this wonderful stuff you're doing and all of the amazing things, and then you subtly drop the fact that you're a carpenter in there and then don't explain it. So we got to talk about this, man. Talk to me about being a carpenter. How did you get into that? What's that like? Why do you have so many skills? [laughter] Explain yourself.
TOBIAS 46:24	Well, my father's a carpenter. He's been self-employed my whole life. And so, I just grew up doing work for him. So I started swinging a hammer when I was about four years old. And, as I was growing up, I realized that it's not really the career path I wanted.
Because it's a valuable skill, it teaches a lot of useful lessons, but it's also hard and doesn't pay as well as it should. And it's just a lot of back-breaking work. So I wanted to do something that was a little bit more focused on sort of brainpower and not necessarily trying to wreck my body day after day. So I've acquired all those skills by virtue of growing up with them. But it's not something that I do as part of my sort of primary occupation.
BRIAN 47:07	I see. So I recently had some new doors installed in my house, like interior doors. And I watched the guys come in and shave them down and shim them up and all of that. And they were perfect. And then the painters came and took them off and took them outside and painted them and brought them back in. And now, one of the doors is just hanging weird. Do you have any advice [laughter] for those of us-- if you're still listening to the show, do you have any advice for how I might solve this door hanging problem?
TOBIAS 47:34	Well, I guess the first question is, which way is it hanging weird? Is it tilted up? Is it tilted down? So open up the door. Take a small level. Put it across the top of the door. See which way it's tilted. And then, unscrew the hinge that is most likely to be the culprit. And then, just put a piece of cardboard or a thin sliver of wood behind it to try and shim it out a little bit. You also want to make sure that there's space on the other side of the door where it closes so it's not going to rub against the jamb when you do add that shim. So you might need to take out a shim from the other one. So just sort of tinkering with it, figuring it out, based on: is the door level? What's the spacing on either side of it? Also, by virtue of growing up as a carpenter, I've gotten used to tinkering with my plumbing. So a lot of the plumbing in my house I do myself [laughter]. I've had to replace the passive water heater in my house. So I've picked up soldering skills. So don't be afraid to tinker.
Don't be afraid to learn new things. Everything is a system. Figure out how it works and just get in there and do it.
BRIAN 48:37	So now you're a plumber too? That's what you're telling me [laughter]?
TOBIAS 48:41	Not professionally. I'll never admit it to somebody who wants to pay me money for it [laughter].
BRIAN 48:46	That's a solid plan. All right. Well, I'm going to go try and figure out what's going on with my door. If I need anything, I'll hit you up. Maybe we'll start a new podcast. Carpentry with Tobias.
TOBIAS 48:58	Standard rates start at $300 an hour [laughter].
BRIAN 49:02	Man, dude, I'm in the wrong business. What are we doing?
TOBIAS 49:04	Well, that's how I make sure that nobody asks me for help, is I [laughter] price myself out of the market [laughter].

This episode of Alter Everything was produced by Maddie Johannsen (@MaddieJ).

Episode Guide

31: Everything is a system
