Data Science Mixer

Tune in for data science and cocktails.
MaddieJ
Alteryx Community Team

Ria Cheruvu, AI Ethics Lead Architect for Intel’s Internet of Things engineering group, is passionate about making sure IoT and AI systems are created ethically and used fairly. She shares her tips for how to stay in touch with trends in this space, and how data scientists can research and engage with groundbreaking technologies and techniques. 

 

 


Cocktail Conversation

 


 

Ria shared strategies for coming up with new ideas and approaches to data science challenges. Which of those strategies was most intriguing to you? Do you have your own ways of challenging your brain to get outside its familiar comfort zone in data science?

 

Join the conversation by commenting below!

 


Transcript

 

Episode Transcription

RIA: 00:00

I think one of the critical focus points that may not be considered as much but is starting to really come up as a critical trend, is how this might apply for Internet of Things devices where human control is not defined as clearly. And what I mean by this is if we have a digital surveillance endpoint device and we have humans, we have authorized personnel in other words, who are able to kind of go into the device, monitor the AI model and make sure everything is going accurately, there are some interesting concerns that start to play in.

SUSAN: 00:30

[music] Welcome to Data Science Mixer, a podcast featuring top experts in lively and informative conversations that will change the way you do data science. I'm Susan Currie Sivek, senior data science journalist for the Alteryx Community. For today's episode, I talked with Ria Cheruvu. You just heard her talk a bit about her interest in the ethical concerns around AI and the Internet of Things, and that's a huge focus of her work. She's also spent significant time considering how data scientists can brainstorm fresh, new ideas to explore in their everyday work and in research. I think you'll find her frameworks for thinking about both of these topics really compelling and exciting. Let's meet Ria and jump right in.

RIA: 01:17

Hey, everyone. I'm Ria Cheruvu. I'm an AI Ethics Lead Architect at Intel Corporation's Network and Edge Engineering Group within the IOTG Engineering Group, is where I work. And I'm currently based off of AZ, working on some awesome new projects and technologies.

SUSAN: 01:32

Sweet. And I'm looking forward to hearing more about that. Would you mind also sharing with us which pronouns you use?

RIA: 01:38

Yes. I use the she, her pronouns.

SUSAN: 01:40

Awesome. Thanks. And so you may know that on Data Science Mixer, we often try to have a special drink or a snack or something as we're chatting. Do you happen to have something tasty there with you today?

RIA: 01:52

Right now I've got water. I just had a snack right before this podcast. I had a [inaudible].

SUSAN: 01:58

Excellent. Good.

RIA: 01:59

So I'm just going to enjoy this water and have a good time.

SUSAN: 02:02

Sounds [good?]. I, too, am having water. It is actually, though, a cranberry lime seltzer water, so at least that's a little fancier for my afternoon [inaudible], so. Cool. Cool. All right. So I would love to hear a little bit about your journey in data science and how you arrived in your current position at Intel.

RIA: 02:21

Absolutely. So I completed my bachelor's degree in data science a few years ago. I'm sorry. My bachelor's in computer science a few years ago, where at that point in time I had the opportunity to try to cover a bunch of different fields, whether it be philosophy, your more traditional CS topics, and then start to think about which route do you really want to get into, hardware and software. And as part of exploration into software, artificial intelligence has been a constantly emerging technology. I think since the last decade, it's just been a constant influx of different technologies, methods, and perceptions or hype about it as well. So naturally, I did want to kind of gravitate towards it and explore it and understand, is this field right for me? And I think throughout that exploration with connections to philosophy and neuroscience, I did find a passion in AI. I initially wanted to specialize pretty deeply in the intersection of AI and security with a field known as neural cryptography, which is the use of neural networks for cryptographic operations. Later, I discovered each domain has its own disadvantages and weaknesses, and I did want to generalize to the broader field of AI. Neural cryptography is not something that's very popular and has a lot of different drawbacks, so what if we considered the broader space of AI, philosophy, neuroscience, security, all of these concepts? And at that point in time, I was able to join Intel as an intern and start to pursue a mix of these technologies, specifically deep learning with a focus on hardware. And in parallel, I was able to pursue a master's in data science at that point in time, which really solidified my interest in AI and also connecting these different domains together, which I'm very happy to share.

SUSAN: 03:58

Yeah. That's terrific. It's really exciting, I think, to be at that intersection of so many different areas and bringing together so many different insights into your work.

RIA: 04:06

Thank you. Absolutely. And I think one of the interesting points that really comes out of the integration is this idea of an interdisciplinary researcher or an engineer who's really able to connect these different insights together and then put together a plan for pathfinding, leadership, or similar. And this is an insight that I learned from many experienced colleagues and friends in this space as well, where they're really taking a step back from the knowledge they've learned, questioning what are other domains that can be connected or integrated as part of the thought processes, and then pursuing those directions pretty aggressively. Just kind of going for it, defining these goals, identifying, really, is this worth my time? And I think that type of approach to where it's a very interesting way of fast and rapid exploration, but also getting a lot of insights throughout that process, especially for professionals who don't have as much time outside of their day job to explore other domains and hobby projects connected to AI, so.

SUSAN: 05:03

Well, and it's interesting to hear you mention that you've encountered so many people who are kind of questioning those basic fundamentals of the field and what they're doing. I don't know how common that is in other occupations, and yet it sounds like you're encountering a lot of people who are open to that questioning and thinking more deeply about what they're doing.

RIA: 05:22

Absolutely. And I think a core foundation of these technologies is really built around some mathematical concepts which are really critical to implementation, but also a lot of it is around the use case and the application. It's an interesting fact that I discovered as I started to explore data science, which is when you're building your models, gathering your data sets and similar, you'll be using similar techniques across different applications and domains, but the requirements that are coming with each use case can be specific, and they give you the opportunity to frame the problem differently in each scenario as you see fit. And then now it really starts to raise very interesting questions around, is my project data-driven or question-driven, user-driven, etc. So you're really trying to identify who is the focus of that end use case, and what am I able to question or change in this particular framework outside of the critical techniques and components to best personalize the use case and the machine learning model accordingly.

SUSAN: 06:22

Yeah. And so speaking of questions, the way that I actually learned of you and your work was a talk that you gave for Women in Data Science, the Puget Sound conference or virtual gathering, I guess, where you were speaking about some of the challenges that you've encountered in your work in terms of coming up with new ideas and new questions to explore in data science research and development. So I was curious, kind of what was behind you choosing to present on that topic? What was your inspiration?

RIA: 06:50

Right. I think the primary inspiration here is just so many papers, techniques, methods getting published every day in the AI space. It's challenging to be able to navigate between different domains and spaces in the field, especially when one particular exciting opportunity will have so many directions to pursue. I encountered this throughout my exploration in computer science and particularly data science, because right now, even during this conversation, I've really been focusing on content related to machine learning. But if we consider the whole topics of data science from exploratory data analysis, again application-focused development, etc., there's just so much research and engineering that's already out there. And I would say I'm still a novice when it comes to innovating ideas and then moving forward with them. It's something that I'm continuously learning about. And again, some insights that I gathered from experienced colleagues in the space were really around that rapid exploration space where you just keep connecting the dots and moving forward, that I thought was insightful and something worth sharing to the Women in Data Science audience. So that's kind of the motivation behind that at a high level.

SUSAN: 08:00

Yeah. Awesome. And so out of that, what were some of the top tips or suggestions that you've come up with for people who are doing kind of typical everyday data science projects in business, who are trying to be more creative or trying to use new approaches in their day-to-day work?

RIA: 08:17

Right. And I think here-- the core part of this that I was able to figure out is leveraging frameworks that we're already comfortable with as data scientists. So papers are a pretty great venue for academicians to kind of explore and publish their results. Of course, they're equally as popular in industry where you're able to [breach?] your insights or findings and the interesting things that you've uncovered, to the community. So I think first, just narrowing down on that particular component. The way that papers are structured themselves is really interesting and can inspire the design of an entire machine learning use case end-to-end. An example of this might be starting from gathering your data set, training your machine learning model, fine-tuning it, finding the right hyperparameters, and again deploying it. I'm very much interested as well in the application of the use case, as you can tell, so I think that having that end-to-end picture can really be inspired by the way that papers are structured from your literature review where you're really deciding what is the minimal viable project or product that I can make with my data science algorithm or my end-to-end pipeline. And similarly, what are the novel items that I can really bring forth as part of my project as a whole? Structuring it this way, similarly to how you would structure a paper from the ground up or from the beginning of the project itself, I think is an interesting way of looking at it. Now similarly, if we take two other frameworks, we could look at patents and then projects themselves, we can also uncover some similar details. For patents, I see this as having a lot more emphasis on the novelty compared to papers, where you're doing a literature review but you're not doing too much of a detailed literature review, because that's kind of offloaded to the attorneys. 
But what we're trying to look at here is, what are the novel components or ideas that you can really bring out that could be used in the future or currently? And I think that type of framework is really interesting in terms of how can people detect your solution and how would your solution fit into other components? And this type of thinking, I think, is what patents inspire as a framework.

RIA: 10:17

Now if we move on to projects, the open-source code development and deployment model, some other interesting insights there where you're basically able to release code, understand, how are my users going to use this? What are the API calls associated with it? How do I make this easier for a user to install, get ready on their system, and good to go? Do I create a Docker container or use [inaudible] environments or similar? And I think those three frameworks can be used in a combination or separately to figure out, how do I get started with this data science project? How do I go through the processing and all those components, and then how do I get it in the hands of the user?

SUSAN: 10:52

Yeah. That's really interesting. Do you have an example, maybe, that could help us understand how you've applied any of those frameworks in your work?

RIA: 11:01

Absolutely. So I currently specialize in the field of AI ethics as part of my work at Intel. What I've started to do is look at artificial intelligence pipelines as pipelines that need to be analyzed, need to have compliance, as well as technologies that help developers create models. Again, it's not necessarily directly tied to ethical components at the moment, because a lot of ethical implications for AI systems really depends on the way that you're using the model or the system. But a lot of these components are also pretty foundational. For example, am I able to visualize what's going on inside the model, or when I have different types of models, am I keeping track of the accuracy and performance metrics? Am I using metrics other than accuracy, like balanced accuracy, because I know that metrics like accuracy can be biased sometimes because of imbalanced data. These types of considerations. And now when it comes to using one of these types of frameworks, I personally have been really applying the project framework a lot for this use case, which is being able to say, if I'm going to try to create a technology that's making AI development easier, what is it going to look like in the hands of the user? Do I really want the user to have as much control as possible over being able to look into their neural network or their machine learning model, explore their data, etc., or do I want to do a lot of that for them already and then have it ready and prepared for them so that with a click of a button they get the insights that they need. Both of those have their own arguments, I'd say, but taking the project approach has really helped me lean towards the latter, where I'm kind of doing a best of both worlds type of approach. You're providing the insights you think the user might need right away, but you're also giving them the option later on to deep dive if they're more advanced users of the technology. 
And I think this is how the project framework has really helped shape my thinking, but I use a combination of the patent and paper frameworks as well in my daily work.
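[Editor's illustration] Ria's point about accuracy being misleading on imbalanced data can be made concrete with a small sketch. The labels and predictions below are invented for illustration, and the helper names `accuracy` and `balanced_accuracy` are hypothetical (they are not taken from any tool mentioned in the episode). Balanced accuracy is computed here as the mean of per-class recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    recalls = []
    for label in sorted(set(y_true)):
        # Positions where this class is the true label
        idx = [i for i, t in enumerate(y_true) if t == label]
        hits = sum(y_pred[i] == label for i in idx)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)


# Invented example: 95 negatives, 5 positives, and a model that
# always predicts the majority class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100

print(accuracy(y_true, y_pred))           # 0.95
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

The always-negative model looks strong on plain accuracy but scores no better than chance once each class is weighted equally, which is the kind of gap a second metric is meant to expose.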

SUSAN: 12:55

That helps to illustrate the concepts that you're talking about. And it's interesting, too. It makes me think about an upcoming episode that we have with the podcast, where we talk a little bit about AutoML and the various kinds of interfaces and levels of guidance that might be provided for people, particularly in this example that you just gave, seems relevant there, so interesting stuff.

RIA: 13:16

Exactly.

SUSAN: 13:16

Cool. Well, let's get to your AI ethics work a little bit more. We've talked about AI ethics on the podcast before in different contexts. We had Abhishek Gupta of the Montreal AI Ethics Institute on a while back. But you have a particular focus in your AI ethics work, which I think is super interesting, on the Internet of Things. So I would love to hear a little bit about what got you interested in that particular area of AI Ethics?

RIA: 13:41

Sure. So as part of initial exploration in AI ethics, as a data scientist just kind of separate from the work that I've been doing at Intel, I think it's something that's always in the back of the mind for a data scientist. This may also just be my personal opinion, but as I mentioned, even those foundational concepts like being able to track across [runs?], whether or not your accuracy scores are changing, creating confidence intervals, working to raise visualizations. Even exploratory data analysis where the fundamental concept there is that you're getting a sense and an understanding of what's going on in the data so that you can detect problems and then fix them beforehand, whether that be imbalanced data, confounders or similar, you're being able to understand and get a grasp of what's going on with the AI system. But I think outside of this, there's an interesting sort of interest in the space across the community around the implications of these AI algorithms, whether it be carbon footprint consumption, and a lot of very critical concerns targeted around this in addition to the fairness of AI systems and how they are implementing and also implicating unfortunate outcomes for certain populations. A lot of the onus is in the hands of the developer. But I think one of the critical focus points that may not be considered as much but is starting to really come up as a critical trend, is how this might apply for Internet of Things devices where human control is not defined as clearly. And what I mean by this is if we have a digital surveillance endpoint device and we have humans, we have authorized personnel in other words, who are able to kind of go into the device, monitor the AI model and make sure everything is going accurately, there are some interesting concerns that start to play in. First of all, we need these authorized personnel to be trustworthy. This is a critical part, I think, of the societal component of AI ethics. 
But now moving forward, when we get to the technological component, how much control do we give to the personnel in order to influence the AI algorithm's decision-making? Do we want to give them a lot of control, which again really depends on the trustworthiness of the personnel.

RIA: 15:46

In some cases, we actually do want to directly assume this. We make a very strong assumption saying that the human is right, which is correct in a subset of use cases. And then what we'll do here is start to say, how much human labor or effort do I need in order to consistently monitor that AI algorithm? Can I design these techniques that are allowed to help the human or the person that's behind the wheel, be able to identify problems quicker? And then that's when we start to get to a very interesting part of AI ethics for the Internet of Things, which is, can I create these automated mechanisms for detecting certain problems or violations of AI models to help the people who are kind of helping the AI model make decisions or [who are?] behind the deployment of the system. Whether it be algorithms that can kind of test and check for concept drift or data drift that can help visualize what's going on inside the model or even track energy efficiency related metrics. How do we do this autonomously? How do we do this without a lot of compute on the small edge devices, and how do we do this without impacting user experience? Because if I've got a laptop and I'm running some AI ethics checking tools, I don't want it to be taking up a bunch of bandwidth and compute, I really want it to be doing it in the background, making sure everything's going all right, and then letting me know if there's any problems that need to be done instead of taking action all by itself. And the statements I'm making here are pretty sweeping, but each use case, again, I think necessitates different AI ethics considerations. So that's the interesting flavor that the Internet of Things is adding, in my opinion.
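[Editor's illustration] As a rough sketch of the kind of lightweight background check Ria describes, the snippet below flags possible data drift in a single numeric feature by comparing the mean of a recent window against a reference sample. This is a simple mean-shift z-score chosen for illustration, not a method attributed to Ria or Intel; the function names, the threshold of 3.0, and the sample values are all assumptions. It only raises a flag for a human to review and takes no action on its own, and it does not cover concept drift, which would also need labels or model outputs.

```python
import math
import statistics


def mean_shift_score(reference, window):
    """Z-score of the window mean under the reference distribution."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    standard_error = ref_std / math.sqrt(len(window))
    diff = abs(statistics.fmean(window) - ref_mean)
    if standard_error == 0:
        # Constant reference data: any change at all counts as a shift
        return math.inf if diff > 0 else 0.0
    return diff / standard_error


def needs_review(reference, window, threshold=3.0):
    """Flag the window for human review; never acts by itself."""
    return mean_shift_score(reference, window) > threshold


# Invented sensor readings for illustration
reference = [10.0, 11.0, 9.0, 10.5, 9.5, 10.0, 10.2, 9.8]
stable_window = [10.1, 9.9, 10.0, 10.2]
shifted_window = [13.0, 12.5, 13.5, 12.8]

print(needs_review(reference, stable_window))   # False
print(needs_review(reference, shifted_window))  # True
```

A check like this needs only a handful of arithmetic operations per window, which fits the constraint of running on small edge devices without much compute.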

SUSAN: 17:20

Yeah. Absolutely. So you mentioned surveillance devices as a general example, I think, but what are some of the particular things that you're working on in this area right now that you can share?

RIA: 17:31

Right. I think as part of this-- we've been taking a pretty generic approach towards the way that we're looking at this, at least again reflecting on my personal experience. The first area of impact for AI ethics from the data science perspective should be the developer themselves, because we are essentially empowering the creation of tools, the evaluation, and the use cases there. So when it comes to just being able to download a tool kit that's compatible with your framework of choice with your models that's allowing you to analyze these types of insights, for me personally, that should be the primary focus. Because again, developer communities are very strong and are building tools to help each other. Industry is adopting those methods, those techniques, and those insights for their own use cases as well. So when it comes to tools being able to analyze the fairness of a machine learning model, given a data point or a data set, the transparency and the robustness, I'm very interested in furthering the exploration and development of these tools. This doesn't necessarily reflect the view that industry has. Of course, there are far more critical problems when we're starting to put AI systems, or systems enabled along with AI, into hands of consumers and users such as ourselves, which is, what are the monitoring mechanisms being taken place, etc., and this is where regulations, I think, are coming into play. I'm getting a hold of, trying to explore and understand. When we have these regulations, they're really applied to where it's industry related products. How do they apply to the developer or the data scientist, and then how do we develop tools that could potentially help with this connection?

SUSAN: 19:08

Yeah. Those are really big questions. And again, I'm wondering, can we go a little bit more concrete? Are there particular examples of devices that are in the hands of consumers that you are especially concerned with? Not necessarily projects that you're directly working on at the moment, but just generally speaking, things that you're especially curious about or would like to better understand the impact of.

RIA: 19:30

Yes. I definitely think medical devices are at the forefront of the discussions that I'm interested in, and of course, the field is seen as something critical because there are lots of interesting regulations that are coming up for this. At this point in time-- again, one of my sweeping statements, it seems that most of these are at the guideline stage. There's a lot to think about when it comes to AI algorithms being used in smartwatches used for medical monitoring, all the way to software as a medical device, which the FDA has been releasing some guidelines around. So there's a lot of thoughts around this. In addition, I think there are some core principles when it comes to AI ethics that are being emphasized that are easy to [pull out?]. Things like best practices when it comes to machine learning algorithm development. Transparency, as I mentioned. Being able to understand what's going on inside the model, and from different stakeholders' perspectives as well. A regulator would perhaps want a different explanation or a more detailed explanation of what's going on inside a machine learning model compared to the average user who would most likely want to know what's going on with their data and how is the AI algorithm going to help them and similar. A developer would similarly want different types of information as well when it comes to medical devices, and there are a lot of different setups and paradigms that are coming up in the data science space as well, like federated learning and similar, where now you're starting to have this concept of sensitive local data and then aggregation of data points on a collected server. So given all of these technologies, techniques, concerns and similar, I think one of my primary concerns and interests as well to be able to solve is, when we have these medical devices that are using AI or potentially AI as a medical device itself, what are the inputs and the outputs? 
Where are they going, transparency on this, to the user and to the stakeholders that are involved that need to know this information? And then similarly, those other principles of AI ethics like robustness and fairness, how are they getting implemented? It's a pretty high-level concern. There's a lot of deep dives into this further, but I would say that's kind of a summary of my interests there.

SUSAN: 21:34

And you mentioned federated learning. Can you tell us a little bit more about that and how that is playing a role in dealing with some of these issues?

RIA: 21:42

Right. So federated learning is a paradigm where at a very high-level and kind of cutting out a bunch of details, you're able to keep this local sensitive data on your mobile phone, let's say, for training, fine tuning, or similar. And then you're able to send out that data to an aggregated model server with only a summary of those data points. So, for example, in the medical domain, it can be used for hospitals that are sharing data and collaboratively working on machine learning models, to kind of avoid the need to share data beyond local processing. Again, these technologies do have their advantages and disadvantages, and they're constantly evolving. There are multiple different schemes of federated learning that can be implemented. Notions like vertical and horizontal federated learning, different stakeholders, etc., but it's one of the approaches out there that's being used to help protect data. Others include technologies like differential privacy, homomorphic encryption, etc., where you're trying to protect data to a certain extent, each of these offering their own guarantees. And what I've seen is an interesting tidbit picked up from discussions in the developer communities that I'm involved in, is that a lot of these techniques seem to have great traction within the academic and developer communities. When it comes to industry, though, there's a lot of hard evaluation on whether or not it's useful, it's applicable, etc., for example, for differential privacy as well. So I think these conversations are constantly evolving. Lots of great technologies that are coming up, as long as their disadvantages are also accounted for.
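[Editor's illustration] A minimal sketch of one common federated learning scheme, federated averaging, may help here. In this variant the "summary" each participant sends to the server is a set of model parameters rather than the data points themselves. Everything below is an assumption made for illustration: the one-parameter model y ≈ w * x, the two toy clients, the learning rate, and the names `local_update` and `federated_average`. Real deployments add secure aggregation, client sampling, and much more.

```python
def local_update(w, data, lr=0.01, epochs=5):
    """Run a few gradient steps on one client's private data.

    Only the updated weight and the sample count leave the client;
    the (x, y) pairs themselves are never sent anywhere.
    """
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w, len(data)


def federated_average(updates):
    """Server step: average client weights, weighted by sample count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Two hypothetical clients (say, two hospitals) whose private data
# both follow y = 2 * x.
clients = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0), (4.0, 8.0), (5.0, 10.0)],
]

w = 0.0  # shared global model parameter
for _ in range(50):
    updates = [local_update(w, data) for data in clients]
    w = federated_average(updates)

print(round(w, 4))  # 2.0
```

The server only ever sees `(weight, sample_count)` pairs, which is the sense in which raw data stays local. As Ria notes, this is one scheme among several, and on its own it offers no formal privacy guarantee of the kind differential privacy aims for.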

SUSAN: 23:11

Sure. And what do you think it would take, or do you think that it would be desirable, even, to bridge that gap in accepting some of these approaches between developer and academia attitude versus the industry attitude?

RIA: 23:27

I think it may require common practices. And again, given AI ethics is so broad, even the establishment of common practices itself, I would argue would come under the ethical AI domain. Because what we would want to look at is, again, from a technological perspective on mechanisms we can enable for data scientists, are there any tools that we could create to standardize measurements, benchmarking, or evaluations across the different communities? At Intel, I've been working on creating essentially benchmarking-related frameworks for our own internal use, where we're able to identify different types of technologies in the space. And then we're able to say, okay, so given a particular technology, how do we feel this is going to impact the market, or how is this going to impact technological innovation? How does it integrate well with the capabilities for offering, etc.? Gartner's Hype Cycle as well, is a great illustration of this from the industry-wide perspective in terms of the technologies that are really going to be considered up there that are going through the hype phase, etc. And I think as part of this responsible AI for the Internet of Things and similar types of use cases have been identified, where now developer communities are able to understand the way the industry is perceiving these technologies and come to a common best practice framework.

SUSAN: 24:46

Yeah. That was something that actually Abhishek Gupta and I talked about quite a bit, was the challenges of taking sort of these general frameworks or guidelines and so forth and actually implementing them at a practical everyday level, that that's a difficult thing to do, and that a lot of the challenges that people have encountered in doing that aren't necessarily being documented and shared so that they can be a source of communal growth and development. I wonder if you've observed that in your work?

RIA: 25:17

Exactly. I think this is the same problem that we're encountering. An equivalent of doing work as well as development for data science as a personal hobby, it's just very challenging to be able to identify best practices for particular use cases. For example, even in the interpretable AI transparency related domains, when we have these technologies that are being deployed, again, with their own disadvantages and advantages, for example pixel-level attribution methods, versus causal explanation methods or similar. And one of the key things that we've actually found as a potential solution but we're still kind of evaluating, reiterating back and forth and understanding, is this kind of something that is reflected across the industry, is considering the inputs and the outputs of the pipeline. I mentioned this in relation to medical devices as well. But diving into that further, the items that we've been considering to try to solve this as a potential solution would be, what are the associated workloads or the data types that we're really looking at? And again, I think the data science perspective is so beneficial here, because as we continue to approach problems, we really wouldn't look at it in the sense of, okay, I need to solve this particular problem. We would break it down and say, these are the particular data types that I have for the data set I'm given or for the data set that I've searched for. This is the machine learning model and this is the particular use case, whether that be sentiment analysis for natural language processing, object detection for computer vision, and similarly reinforcement learning for a particular use case like Wi-Fi bandwidth prediction, or something similar. We really define all of those elements before we start to dive in. And of course there's flexibility there, but we have some foundation or context there that we start with. 
And I think keeping those in mind may be critical for being able to identify and predict, how do we get these frameworks - again, as you mentioned with your conversation with Abhishek - to a practical day-to-day implementation level.

SUSAN: 27:19

Yeah. Makes sense. So I am curious. Generally speaking, beyond these topics or maybe building on these topics, what are some things that you are most excited about in the future of data science and AI? And this could go anywhere in the field that you were curious about.

RIA: 27:37

Sure. Then I'll take that opportunity to put a pretty wild theory out there with artificial general intelligence, but I have to clarify before saying that. I'm not a firm believer at the moment-- I think perceptions can change at any time, but I'm not a firm believer in the idea of uploading thoughts to devices and similar because it's just really interesting, but also challenging to see how this might be applicable. And nevertheless, I think when it comes to artificial general intelligence, the concepts that I like to tease out a thought that I am personally very interested and excited about seeing, is that very clear intersection between neuroscience and AI, in the sense that nowadays a very cool upcoming trend is having these neuroscience-inspired AI, which is more of a hype-related term, but machine learning or deep neural networks related architectures, where what you could start to do is-- I guess there are two points here. The first is, make parallels or try to measure the similarity between networks in the brain and the networks that are being created today. Popular convolutional neural networks like ResNet and similar. I mean, there's a very interesting research project that was done in a center, by MIT I believe, called Brain-Score, which is essentially a research project to be able to understand the similarity between particular networks in an area of the brain and the comparison to very popular everyday neural networks that we are leveraging. But if we move beyond that as well in terms of actually creating neural networks now that are inspired by brain functions, fields like neuromorphic computing, cognitive computing and similar, are just so exciting to think about. Now, on the flip side, when it comes to data science applied for neuroscience, I think that's equally as interesting. I think I've recently read a research paper as well where you're able to reconstruct images from the brain.

RIA: 29:25

Again, it's pretty shabby at the moment, I would say, because more research needs to be done there. But it's similar to what you would see with a generative adversarial network, where you're able to reconstruct neural activity into an image. Super interesting line of research, I think, that's really lying at the intersection. But, yeah. That's a summary of some of the areas that really excite me, and lots of great developments to come.

SUSAN: 29:46

Yeah. Those are really interesting things. And I can see that they would also completely set off your interest in ethics as well, because it seems like there's lots of interesting questions to explore at the intersection of neuroscience and AI ethics for sure, so lots to discuss there.

RIA: 30:02

Absolutely. I think one of the points I wanted to share is that I have seen a new field of neuroscience called neuroethics that's starting to emerge as an interesting trend, and I think that community is very interested as well in scoping out ethical implications of AI systems and how that might relate back to neuroscience. So I did want to raise that as an interesting topic, potentially, for the listeners of this podcast. I've started to explore it through some trainings by INCF, but essentially, depending on the implications that AI systems have for ethical considerations from both societal and technological perspectives, how might that reflect back to neuroscience? Either just thinking about it psychologically, or also when we are starting to consider MRI data or similar, how would that reflect back? So I just want to put that out there.

SUSAN: 30:48

Yeah. Yeah. It's fascinating. Very thought-provoking stuff. Super cool. So one question that we always ask our guests on the podcast is what we call the alternative hypothesis segment here. And that question is, what is something that people often think is true about data science or being a data scientist or working in AI ethics, that you have found to be incorrect?

RIA: 31:12

It's a great question. I think the common answer that I've seen-- I'll provide my own answer in a moment, but I think the common answer that definitely needs to be stated is that a lot of efforts or perspectives really do not emphasize the data exploration and data collection points of view. Essentially, those really hard, grueling parts of the data science process where you need to find data, you need to clean it. Most of the time, automated techniques might not work if you're filling NA values in a database. Do we know the types of non-applicable values that are present? And it depends so much on the type of data stream. Time series data would have a completely different way of looking at it, potentially. But outside of that, I think the larger concern that's connected to that, that I personally feel is something that may not be initially perceived, is the idea of these end-to-end pipelines. During my academic data science career, and also when I was serving as a teaching fellow for the same curriculum that I graduated from, which was a great experience but also allowed me to reflect on a few components, I think a lot of what we see and what is taught as well in curriculum is really subsets or components of the pipeline. For example, one course will focus exclusively on machine learning, model building, and hyperparameter tuning. In my education personally, I actually had the opportunity, and it was a great idea, to also take courses that were specialized for different algorithms. For example, you have a course specifically dedicated to data mining only. Neural networks were very briefly mentioned there, but it's all-- even decision trees, very briefly mentioned, but everything about regression, how you might detect the variables, how you detect covariance and similar. All of those were really covered there. 
And then you have a completely separate course for machine learning algorithms, where you're just tuning SVMs and decision trees all the time, creating those visualizations and getting them out there.
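
[Editor's sketch, not part of the recording.] Ria's point that the right way to fill missing values depends on the type of data can be illustrated with a small example. This is a minimal sketch in plain Python with no libraries; the function names `forward_fill` and `median_fill` and the sample data are hypothetical and chosen only for illustration. It assumes that a time-ordered series can reasonably carry the last observation forward, while unordered records are better served by a summary statistic such as the median.

```python
from statistics import median


def forward_fill(values):
    """Fill each missing entry (None) with the most recent observed value."""
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)  # stays None until the first observation arrives
    return filled


def median_fill(values):
    """Fill missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        return list(values)  # nothing to estimate from
    m = median(observed)
    return [m if v is None else v for v in values]


# Hypothetical hourly temperature readings: order matters, so carry forward.
temperature = [20.0, None, None, 23.0, None]
# Hypothetical ages from unordered survey responses: order carries no meaning.
ages = [31, None, 45, 27, None]

filled_temperature = forward_fill(temperature)
filled_ages = median_fill(ages)
print(filled_temperature)
print(filled_ages)
```

Applying `median_fill` to the temperature series would run without error, but it would ignore the ordering of the readings, which is one reason a single automated rule may not suit every data stream.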

RIA: 33:03

And then another course specifically for deep learning. And then you can keep going on like that, another course for reinforcement learning and similar. But a lot of these are really segmenting, I'd say, parts of the pipeline. The way that I discovered this is by taking one course during my master's on big data processing, which is really around these now really popular tools that are part of the landscape, like Dask, Redis, Cassandra database technologies, as well as some interesting visualization technologies like Kibana, and then starting to put those together in a pipeline. And that's when I realized that just machine learning model tuning or the exploratory data analysis is not the end of data science; it is really that end-to-end picture. Again, something I'm very passionate about. Really, from the inputs that are coming into the system, where you're deciding on framing the problem, all the way to the outputs. How are you optimizing that pipeline? How are you getting it into the hands of the user? How are you improving user experience for it by building a website with Flask or similar? All of these components come under data science. So to summarize, I think that perception or that emphasis could really be placed more on end-to-end development and deployment of data science pipelines.

SUSAN: 34:14

Yeah. And it's interesting. I mean, obviously, I think you're right to include all of those items. I think what's challenging then is when people-- for example, some listeners of the podcast would be like, "Oh, my gosh. I have to learn every single part of that process in so much detail." And for a lot of folks, even learning different modeling approaches has been a lot of information. And so, what's your advice, then, to folks who are wanting to get into data science, who may be some of our listeners, who are listening to your description and going, "Oh gosh. That's so much."

RIA: 34:50

Definitely. I have been in the same place and continue to be there as well when it comes to, again, the overwhelming amount of techniques there. I think what I'd like to do is provide assurance that there really doesn't need to be any broad expertise regarding this. I would not call myself an expert on any one of these technologies. And something that just really isn't emphasized as well, I think, is you don't necessarily need to know the depths of it to be able to practice it. I know this is a pretty controversial opinion in some cases. Because for example, when it comes to learning the mathematical concepts of data science, I've seen a lot of different excellent approaches that either emphasize learning the math first and then diving into the technique, or dive into the code and the technique first and then learn the mathematics. I personally learned it as a combination of both, but leaning more towards the latter side. But I would really say personalizing it and making it your own journey is the critical part. So for example, when I say these end-to-end pipelines, what I was able to do to tackle this overwhelming amount of technology and similar is to pick maybe three technologies or so and then challenge myself to create an entire pipeline from start to finish just using those three technologies. You don't really need to consider the others at a particular point in time. And coursework really requires you to focus like this and narrow it down, but I think even if you're not part of a curriculum, it is totally possible to do as part of self-learning. As an example, if I were to go about doing this, I have an interest in neuroscience, I may pick up some data from the Human Brain Project, look into some mechanisms and some popular research papers, but also code bases so that I don't confuse myself too much with the paper descriptions but I actually see the code corresponding to it. 
I'd look through that, play around with the code, and then start to apply those mechanisms to my data. Then afterwards, I've got the data ready, now I'm ready for those machine learning algorithms. I'll go read three to five machine learning papers in the space, or if you're feeling really ambitious, a huge literature review, that's fine as well. Just read a survey paper and then get a sense of the machine learning models in the space.

RIA: 36:52

Figure out the one that's most interesting to you. And again, this can be based off of a lot of factors. Maybe the model that's most interesting to you is the one that uses less compute, from an energy efficiency AI ethics perspective. Or maybe the one that's most interesting to you is a deep neural network instead of a traditional machine learning model. Or maybe you really want to pick the machine learning model that's basic, like an SVM or decision tree, because you want to be able to interpret it better, because deep neural networks are kind of perceived as black boxes. So based off of all of that, you pick the model that you feel is right for the use case and then you'll kind of move forward from there. Next, when you get to the output after the hyperparameter tuning and everything, what you might consider there is, all right, I've got this data, I've got the machine learning model, those are the first components of it. Now, how am I going to put this in the hands of the user? And this is where, when you get all of these technologies out there, you start to think, okay, I want to create a website, that's the most popular option, where a user is able to enter something and the machine learning model is working in the back-end, or maybe even very simplistically, I don't even want the machine learning model to give me real-time predictions, I just want to take the outputs of it and put them on a web page. And then every five days it'll refresh itself or something similar. I think that that is a fantastic idea as well. So then you would maybe go and investigate Flask or those types of technologies, and then boom, you're done with your end-to-end pipeline. Now the final point I'll add here, when it comes to optimization, what you'll do is you'll kind of run through your pipeline. And when you're clicking through your website, you may notice it's a little laggy or maybe you have a huge amount of data and you just want to figure out, how do I optimize that? 
Then you might turn to more sophisticated database technologies rather than Pandas or [inaudible] data frames. Maybe you might use Dask or something similar and then optimize that process. And then you'll feel happy because your algorithm or your end-to-end pipeline is really optimized and efficient.
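
[Editor's sketch, not part of the recording.] The simplest version of the pipeline Ria describes, where model outputs are written to a web page that is regenerated on a schedule, can be sketched with the Python standard library alone. Everything here is an illustrative assumption: the sleep data is made up, the nearest-centroid classifier stands in for whichever model you chose, and `render_page` stands in for a web framework such as Flask, which the episode mentions but which this sketch does not use.

```python
from html import escape
from statistics import mean

# Step 1: data. Hypothetical rows of (hours_of_sleep, label), one with a gap.
raw_rows = [(8.0, "rested"), (7.5, "rested"), (None, "rested"),
            (4.0, "tired"), (5.0, "tired")]


def clean(rows):
    """Drop rows whose feature is missing."""
    return [(x, y) for x, y in rows if x is not None]


# Step 2: model. A nearest-centroid classifier on a single feature.
def fit_centroids(rows):
    """Return the mean feature value for each label."""
    by_label = {}
    for x, y in rows:
        by_label.setdefault(y, []).append(x)
    return {label: mean(xs) for label, xs in by_label.items()}


def predict(centroids, x):
    """Pick the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: abs(centroids[label] - x))


# Step 3: delivery. Render precomputed predictions as a static HTML page.
def render_page(predictions):
    items = "\n".join(
        f"<li>{escape(str(x))} hours: {escape(label)}</li>"
        for x, label in predictions
    )
    return f"<html><body><h1>Predictions</h1>\n<ul>\n{items}\n</ul></body></html>"


centroids = fit_centroids(clean(raw_rows))
new_inputs = [6.9, 4.5]
predictions = [(x, predict(centroids, x)) for x in new_inputs]
page = render_page(predictions)
print(page)
```

Writing `page` to a file and rerunning the script on a schedule would roughly match the "refresh every five days" variant; serving live predictions would mean replacing step 3 with a web framework route.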

RIA: 38:44

Again, according to how much you want, you don't have to optimize it 100%. Just like how when we're fine-tuning a machine learning model and finding hyperparameters, we can just spend hours and hours looking for the right parameters. Eventually, we make a best guess in most cases and move on, or we use tools like Weights & Biases to help with automation or similar, so using a combination of these. My final summary, to kind of summarize everything I've just said, is to find the tools that are working for you by doing a quick search. Pick a couple of them, assemble them in your pipeline, and then when you've got that output and you're finished, maybe you write a blog post about it, present at a local conference or a little workshop or similar and get feedback on it. And then add that to your resume, because that is an awesome demonstration of how, in industry as well, we would deploy an end-to-end pipeline and put it in the hands of the user, and maybe even create a startup or something similar if you're invested in it. But I think these are the skills that I am ramping up on as well to be able to excel in data science.
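
[Editor's sketch, not part of the recording.] The idea of making a best guess and moving on, instead of searching for hyperparameters for hours, can be expressed as a random search with a fixed trial budget. This is a minimal sketch under stated assumptions: `validation_loss` is a hypothetical stand-in for training and scoring a real model, and the parameter names and ranges are invented for illustration. It does not use Weights & Biases or any other tracking tool.

```python
import random


def validation_loss(learning_rate, depth):
    """Hypothetical stand-in for training a model and scoring it.

    By construction, the minimum is at learning_rate=0.1, depth=5.
    """
    return (learning_rate - 0.1) ** 2 + 0.01 * (depth - 5) ** 2


def random_search(budget, seed=0):
    """Try `budget` random settings and keep the best one seen."""
    rng = random.Random(seed)
    best_params, best_loss = None, float("inf")
    for _ in range(budget):
        params = {
            "learning_rate": rng.uniform(0.001, 0.5),
            "depth": rng.randint(1, 10),
        }
        loss = validation_loss(**params)
        if loss < best_loss:
            best_params, best_loss = params, loss
    return best_params, best_loss


# A small budget is the "best guess"; a larger one spends more time searching.
small_params, small_loss = random_search(budget=5)
large_params, large_loss = random_search(budget=50)
print(small_loss, large_loss)
```

Because both runs share a seed, the larger budget repeats the first five trials and then continues, so it can never end up with a worse loss. Whether the extra trials are worth the time is the judgment call Ria describes.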

SUSAN: 39:45

Yeah. That's great. And I love that you included the public sharing of the learning process and the product, and I think that's very important. So very good. [music] Awesome. Well, Ria, thank you again for joining us today to chat. I know our listeners are really going to appreciate all of the great information and advice that you've offered, so we sure appreciate it.

RIA: 40:05

Thank you, Susan. I'm so happy to be on the podcast, and thank you for your time.

SUSAN: 40:14

Thanks for listening to our Data Science Mixer chat with Ria Cheruvu. Join us on the Alteryx community for this week's cocktail conversation to share your thoughts. Ria shared some strategies for coming up with new ideas and approaches to data science challenges. Which of those strategies was most intriguing to you? Do you have your own ways of challenging your brain to get outside its familiar comfort zone in data science? Share your thoughts and ideas by leaving a comment directly on the episode page at community.alteryx.com/podcast, or post on social media with the hashtag DataScienceMixer and tag Alteryx. Cheers.


This episode of Data Science Mixer was produced by Susan Currie Sivek (@SusanCS) and Maddie Johannsen (@MaddieJ).
Special thanks to Ian Stonehouse for the theme music track, and @TaraM for our album artwork.