Alter Everything Podcast

A podcast about data science and analytics culture.
Podcast Guide

For a full list of episodes, guests, and topics, check out our episode guide.

AlteryxMatt
Moderator

Susan Walsh, The Classification Guru, joins us to chat about dirty data. Susan sheds light on how unclassified and inaccurate data can lead to significant, often hidden, business expenses and operational disruptions. From supply chain procurement challenges to missed opportunities in data governance, Susan walks us through the common pitfalls and offers actionable tips to help organizations clean up their data efficiently.
Episode Transcription

[00:00:00] Megan Bowers: Welcome to Alter Everything, a podcast about data science and analytics culture. I'm Megan Bowers, and today I'm talking with Susan Walsh, founder of The Classification Guru Ltd. In this episode, we chat about the risks of dirty data in supply chain and procurement, how to mitigate those risks with data cleansing and analytics, and methodologies for data governance.

Let's get started.

Hey, Susan, it's great to have you on our podcast today. Thanks for joining. Thank you so much for having me on. Of course. Could you give a quick introduction to yourself for our listeners? Yeah. Hello everyone. I'm Susan Walsh, known as The Classification Guru, fixer of dirty data, and I've had my business for seven years now.

We've been helping clients in the procurement and supply chain space clean and classify their data. We've also helped sales and marketing with cleaning CRM systems. We've worked with global data sets in multiple languages, and we have seen the worst of the worst. In fact, I am cleaning the worst of the worst right now.

So, yeah, right now. So what kind of data problems do you come across when you're talking about the worst of the worst? Or in general, what kind of data problems do you see with clients in procurement and supply chain? I think from their side, the biggest challenge they have is either they can't get all their data from all the various systems together to get a full picture of what's actually going on with their spend data or their supply chain.

And the other issue that I see is, in a lot of ERP systems, you might only be able to have a one- or two-level taxonomy, but you need more levels of detail than that in procurement specifically. Getting the data so that we can classify it is quite often a challenge from the business. And then from our side, we see either completely unclassified data.

They have no visibility on their spend whatsoever. Or their product data: we've done a lot of material master data cleaning too, where we've literally had to manually clean each individual material master description, because there's missing information, there's typos, there's abbreviations of words.

We've seen misclassification, or we quite often see a lot of things classified under "unclassified," which is so helpful to me. Yeah. Or "unknown." Quite often, if there's a dropdown option, people will automatically just choose the first thing, because we're inherently lazy. We're always looking for the shortcut, and so you find a lot of stuff under the wrong information.

We see things within supplier names where the supplier might have changed their name, been bought out, been sold. I've even seen IBM down as "International Business Machine" as a supplier name. I don't think it's been that since the eighties. Creative. So there's a lot of stuff going on, and most people are experiencing multiple different issues at the same time.

That sounds super challenging and I'm curious what kinds of business impacts that kind of data situation has for these companies. Well, everyone's looking to procurement for cost savings, so that's the first thing. How can you negotiate better rates with your suppliers if you don't even know what you're buying from them?

And on a global scale, you're not getting the best deal. You're not even in a position to negotiate properly 'cause you don't have all the information. It could be that you have too many suppliers in one category. We often see this in office supplies, where you might have 50 to a hundred different companies where people are buying office supplies.

You literally need no more than five, and you negotiate a really good deal with them and everything should go through them, but we see people sneaking things through on credit cards and P-cards and all that good stuff. There are things like supplier risk. If you don't know the name of the supplier, you can't run the right checks to make sure that they're compliant, that they've not breached any local laws in a country somewhere, that nothing is unethical.

Also, fraud. You know, we don't think about that an awful lot, but actually there could be an awful lot of fraud going on if you're not classifying your data and monitoring that spend. And even in supply chain, what are you buying in and what's shipping out? If you don't track all that, how will you know if something's gone wrong?

And then we've got what's called rogue or maverick spend, which is, again, when employees will go and buy something they need without getting permission or checking first. And quite often we'll see people buying software licenses on, say, a credit card. Actually, what they don't realize is the company's already negotiated directly with the company, with Microsoft.

They've gone directly to Microsoft. They actually have spare licenses within the business that could have been given to this person, but they've decided to go and buy it as a consumer for way more on a credit card. Yeah. And then how do you try and claw that back? Probably most of the time you don't.

It's a lot of wasted money. I think also, we maybe don't think about this an awful lot, but in terms of efficiencies too, if you have that information to hand, you're not spending hours or weeks trying to pull this data together just to get an answer for somebody who's asked a question. There's a load of different reasons.

And actually, if I go back to thinking about the material master data cleaning that we did, once we cleaned up these 20,000 descriptions, there were so many duplicates. You might be storing the same product in the warehouse under multiple codes, and you don't even know that you're stocking all this inventory, and you're buying more in because one of the codes is low and you don't realize that you've got more stock over here.

Yeah. There are so many different things. That's so interesting. We think, oh, of course companies know what they have in their warehouse, but when they're operating at such scale, the data is even more crucial. If you don't have the data right, you can't have people scouring a ginormous warehouse, or tens or hundreds of warehouses, so that's really interesting.

A really great way to just check quickly in terms of product codes and things is to get a pivot table, or use an aggregation tool in something like Alteryx, for example, and aggregate by the product code and look at how many different descriptions are assigned to that one code. Also do it the other way too.

Flip it, so aggregate the description and see how many codes you have against one description, because that also happens. So much can happen, and some of the business risks you mentioned for not doing this, like higher costs, more risk, potential fraud: it seems like there are a lot of business problems that can result from this dirty data.
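(Editor's note: for anyone who wants to try this quick check outside of a pivot table or an Alteryx workflow, here is a minimal sketch in Python with pandas. The column names product_code and description, and the sample rows, are hypothetical placeholders; substitute whatever your own extract uses.)

    import pandas as pd

    # Hypothetical sample; in practice, load your export with pd.read_csv(...)
    df = pd.DataFrame({
        "product_code": ["A100", "A100", "A100", "B200", "C300"],
        "description":  ["Widget, steel", "widget steel", "Stl widget", "Bolt M6", "Bolt M6"],
    })

    # How many distinct descriptions sit under each product code?
    desc_per_code = df.groupby("product_code")["description"].nunique()
    print(desc_per_code[desc_per_code > 1])  # codes carrying more than one description

    # Flip it: how many codes are assigned to the same description?
    codes_per_desc = df.groupby("description")["product_code"].nunique()
    print(codes_per_desc[codes_per_desc > 1])  # descriptions shared by more than one code

Codes with several descriptions, or descriptions shared by several codes, are exactly the duplicates Susan describes surfacing in material master data.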

Yeah. What if one of your suppliers, who is supplying, like, 50 percent or more of something that you use or need to make something, is in financial difficulties and you don't realize it 'cause you haven't done a credit check on them? You are at risk of not only having paid for goods that you might not get, but also that's going to interrupt your supply chain, your manufacturing, your customer base, your sales, your profit.

There are so many knock-on effects to these things that we don't always think about. How do we solve this problem of dirty data? What's your strategy? You can't do it all at once. Don't even attempt it. It's too overwhelming. It's too big to manage. Start small and do it in chunks, whether that is your top suppliers if you're in supply chain, or the suppliers that you think are your top suppliers, because quite often that doesn't turn out to be the case.

Or whether that's a business unit, or a region, or a country, or a product group. If you're doing cleaning, start on a little bit. Work on that. Get a method and process in place. And then also, you really do need to make it a habit and a routine. Everybody is overwhelmed with work. Make it a habit, like do it first thing in the morning for an hour while you're having a coffee.

Like, get it out of the way while you've got a clear head. It's the worst task of the day that you're going to do. If you're lucky enough to have analysts in your team, then give it to them. But really, starting small and focusing in on a specific area or something like that is a good, manageable way.

And then what it also demonstrates to the business is the value of the activity. The biggest challenge that we have at The Classification Guru is getting budget for people to pay for our services. I speak to so many people who would love to work with us, but they just can't get the budget. So if you do something small like that, it might help justify those costs: those time savings, those worms that you find once you start digging into the data.

So, yeah, there are a few things you can do. That kind of leads into another question I had about how, as a business, you justify the investment in self-service analytics or software for things like this. So it sounds like doing a proof of concept could be one way. It's very hard, because particularly with the cleaning of data, if you do it properly, it does involve a lot of manual work.

You are getting it ready for use by technology and automation. You can't get round the fact that you're gonna have to manually clean quite a chunk of it and it's gonna cost some money, and it's very hard to justify it. There's no tangible benefit because you don't know the benefits until it's been done.

So it can be really tricky. However, that shouldn't put us off. The one thing that you can do is track your time fixing errors in data: how long it's taken you to pull data together, run reports. Go to the business with that, and talk about the things like potential fraud that you might not be able to spot.

Try and get them to think about the money you're losing right now that you don't even know about, because you're not tracking it. But something that normally piques the decision makers' interest is talking about increasing profitability for the business. Because if you're being more efficient with your data, it's so clean and you're not wasting time coming to clean it every month, you know, and do these certain things every time.

You can get so much more done per person, per head. And if you're also working in something like manufacturing, productivity goes up. Or if you're charging out to the client at a specific price, but your team are being more efficient and spending half the time you thought, then that's increasing your profitability.

So there are other ways to look at this. There's not one answer that fits all. You know, look at what the business KPIs and objectives are. How can you fit data into those to try and sell it to the business, to say, we're gonna hit these faster, or we're gonna blow them out of the water, if we had this?

Something that I like to say to clients on the procurement side is, the reality is, once you have clean data, you're gonna spot opportunities, even if it's 1% of your total spend. Now, normally we work with million- and billion-dollar companies; at 1% of that, our costs are just a drop in the ocean compared to what could be saved.

But again, you can't show the savings until the work's been done. But if we do a sample of their data and show them the clean versus the dirty, then that also kind of helps them see where the benefits are. Because most people can look at a spreadsheet or a table and not know clean data from dirty data.

They don't know the difference, so you have to show them so that they actually go, oh wow, that's like really different. So that's always a good thing as well. Yeah, and I think maybe our listeners are thinking about automating these things, but it sounds like this is definitely the step before automation.

Let me tell you now, I would not have a business, and would not have had the business for seven years, if we were not investing in tools to help us do this. Of course, there is no way that we could do what we do with spreadsheets. We would've been blown out of the water years ago by someone else.

You know, you have to have tools, but there's a lot of "oh, AI this, AI that." It's not about that. It's about the right tool for the right job, and knowing how to use that tool, and knowing that AI is not gonna fix all your problems. You know, you can't just buy something in thinking it's gonna clean all your data and fix it.

You need a bit of knowledge and background behind it. Especially with some of those examples you gave earlier, you have to know the processes at the company. It sounds like people are always putting "other," or selecting the first thing, or the person who used to write detailed descriptions left and the new person doesn't, or abbreviates everything.

Like, there's so much variability. It sounds like you could be picking different tools for different use cases, I guess. Yeah, and using tools to help clean that up will really help. There are so many different blocks and functions that you can use to help you, but again, it's not a magic wand.

You have to use your smarts as well. Definitely. So when I was looking at your LinkedIn, which is very fun by the way (listeners, go check it out). Thanks. We'll link it in the show notes. But I saw you wrote a book called Between the Spreadsheets: Classifying and Fixing Dirty Data, and in that book you introduced the COAT methodology.

Can you explain that framework to our listeners for cleaning data? Yeah, I always say data problems are people problems, and a lot of the data issues come at the point of the inputting of the data. Most of the people who input data are not necessarily data professionals, and they don't really want to hear the words data governance.

Then it's like, oh, it's all data tech. It scares them. They're like, I'm just doing my job. They don't always understand the importance of getting it right the first time, and what the knock-on effect of them putting in wrong or missing information is. So I wanted to come up with something that was fun and memorable, but actually still got a serious message across.

So I came up with the data COAT. So I say, make sure your data has its COAT on. Mm-hmm. First of all, it's got to be Consistent. So that means the same terminology, the same wording. We increasingly work in a global business community where there are different words for the same thing. I had to work with a client to build a global taxonomy.

And we had issues around, is it crisps or chips? Is it chips or fries? Is it biscuits or cookies? And we had to define all that, because globally that taxonomy is being used. So get those standards set out. Also things like units of measure. Is it kilometers or miles? Is it pints or is it liters? What are you going to use?

Because that could cause so much confusion; even things like currencies. And then once you get all those consistent things in place, it then has to be Organized. So think of a messy closet. You've got all your clothes in there, you clean them, and then you just kind of throw them in there once they're clean.

If you wanna go back and get that nice top that you've thrown in there next week, you're gonna have to rake around in there for it. It might be messy, it might be crumpled when you find it. You might have to iron it, and it's taken you time; it's just added time to your day. But if you had cleaned it and ironed it and put it away neatly in your wardrobe or your closet, either by color or by style (tops, trousers, et cetera), you could have just gone in, pulled it out, and off you go.

And data is exactly the same as that. If you categorize and organize it when you're working on it, when someone comes to you and goes, oh, how much did we sell in the UK last year, or the US, or, how's this business unit doing? If you've got it organized and categorized, then you can go in and get that information really quickly, so you're saving time, and it makes you look good in front of your boss.

So you've got your Consistent and your Organized data. Of course, it's got to be Accurate, but what department you're in really determines what accuracy means. So we know that in legal and finance it's got to be a hundred percent accurate, no deviation. But in sales and marketing, we could round up or round down.

We maybe only need a few data points per customer. We don't need their life story. So agree those things. And then once you have that Consistent, Organized, Accurate data, guess what? It's gonna be Trustworthy. And how many times do we hear, "we aren't using the data, we don't trust it"? That is the worst thing to hear.

I'm gonna go back in that circle of: we need to clean the data, but nobody wants to pay for it, and nobody has time to do it, and then we end up not using the data. But I think more important than all of that (and actually, at some point I need to do a second edition of the book) is this: it's not good enough to just put your data COAT on.

It needs to stay on all year round. You can't take it off, 'cause all standards will slip. And so the real secret sauce to all of this is regular maintenance: going in and spot checking, doing regular cleaning and classification. Like I said, do it in the morning with a coffee; get it out of the way before the day starts.

But if you do that, think about it like cleaning your house. If you do it once a week, you're super on top of it. It doesn't take that long. If you left it for a month or more, can you imagine how long that's gonna take you to clean? Or maybe you're like me and you just outsource it to someone else, and then you don't need to worry about it so much.

But yeah, it's like that: the more small and often you keep it, the easier it is to stay on top of things. Definitely. I love that analogy and the COAT framework. Building trust in the data is something everybody wants, something people talk about at a high level, but how do you actually get there, and get past those difficult conversations where you're showing data and people don't even wanna listen, or they don't want to build a dashboard because it's like, oh well, there are five numbers that we got for this one product, we know it's not right? That's a very common struggle, I think.

We know it's not Right. That's like very common struggle, I think. Yeah. And we need to get into the mentality of like, let's fix it then Not, or we know it's wrong, so we just want you it. Mm-Hmm. Yeah. Because of all those benefits, like you said, understanding risk better of the suppliers. Saving costs.

There's a lot. Yeah. And then, for example, something that we do for our clients, and anyone using Alteryx could do this too, is you build a workflow model and you have a master list. So whether that is a description for a product code, or a supplier name and a clean name. For example, we get lots of PwC and PricewaterhouseCoopers, so we normalize them all in a new column.

Or it could even be things that have to be classified as a certain thing. Have a master list, and monthly or quarterly, when your new data set comes in, drop that into your workflow. You've got your master model, your master list; put the data through the workflow, and the first time round you might get 60% coverage.

The more you do it, the higher that is, so when we do refreshes for clients now, we can get like 80, 90% coverage of normalization and classification, because we've seen a lot of the same thing before. And so how amazing is that? It's super efficient. You don't have to build the workflow every time. You literally just copy it or use the same one.
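(Editor's note: as a rough illustration of the master-list refresh described above, here is a sketch of the same pattern in Python with pandas rather than an Alteryx workflow. All column names and values shown, such as raw_name, clean_name, and the suppliers, are hypothetical.)

    import pandas as pd

    # Master list curated over earlier cleaning passes: raw spellings -> normalized names
    master = pd.DataFrame({
        "raw_name":   ["PWC", "PricewaterhouseCoopers", "I.B.M.", "International Business Machine"],
        "clean_name": ["PwC", "PwC", "IBM", "IBM"],
    })

    # This month's (or quarter's) new data drop
    new_data = pd.DataFrame({"supplier": ["PWC", "IBM Corp", "PricewaterhouseCoopers", "Acme Ltd"]})

    # Join the new data against the master list to normalize the names it already knows
    merged = new_data.merge(master, left_on="supplier", right_on="raw_name", how="left")

    # Coverage: the share of rows the master list could normalize on this pass
    coverage = merged["clean_name"].notna().mean()
    print(f"coverage: {coverage:.0%}")

    # Unmatched rows go to manual review and then into the master list,
    # which is why coverage climbs toward 80-90% on later refreshes.
    print(merged.loc[merged["clean_name"].isna(), "supplier"].tolist())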

Yeah, I love that repeatability that you can get using Alteryx for things like that. Complex workflows might take a few hours, maybe days to build, depending on what you're doing, but once it's built, you might need to tweak it, but you don't have to go through all of that every single time, as you would when you're trying to stick a few Excel spreadsheets together.

If you know that you're pulling data from different systems, and you know it's called "supplier" in this one, "supplier name" in that one, and "vendor" in another, in your workflow you set it up to rename them all to the same thing, and so every time it just does it. You don't even have to touch it.
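(Editor's note: the renaming step is just as easy to sketch. Assuming three hypothetical source extracts where the same field arrives as "supplier", "supplier name", and "vendor", a fixed rename map lands every refresh on one canonical column without anyone touching it.)

    import pandas as pd

    # One rename map per source system (all names here are hypothetical)
    RENAMES = {
        "system_a": {"supplier": "supplier_name"},
        "system_b": {"supplier name": "supplier_name"},
        "system_c": {"vendor": "supplier_name"},
    }

    extracts = {
        "system_a": pd.DataFrame({"supplier": ["PwC"]}),
        "system_b": pd.DataFrame({"supplier name": ["IBM"]}),
        "system_c": pd.DataFrame({"vendor": ["Acme Ltd"]}),
    }

    # Rename each extract to the canonical schema, then stack them into one table
    combined = pd.concat(
        [df.rename(columns=RENAMES[source]) for source, df in extracts.items()],
        ignore_index=True,
    )
    print(combined)  # one supplier_name column, every time the refresh runs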

It's amazing. And also, you don't have to be massively technically competent to do it. I can do it, and if I can do it, anyone can do it. That's so true. And that's what we love to hear on the podcast about Alteryx: it opens things up for people who don't know SQL or coding. I can do the basics.

And then I've got a data analyst who can do all the crazy stuff. Exactly. And that makes me feel cool, like, look what I made! That's awesome. So I'd love to end on just, how can we increase data quality in supply chain? We've talked about a few ways, like the COAT methodology, but how do we tackle it at large?

It literally is from the ground up, from the warehouse person who is inputting a new product that comes in and types something in, to the finance people, to the procurement people. We all have to get on board and be a team to champion data quality, because we wanna get to the point where you don't need people like us to come and clean your data, because you're inputting it right at the start.

And we're so far from that. There's a lot of education needed. People do feel intimidated, or they're rushed. They've got a lot of work to do. They feel under time pressure. But what they don't realize is, if that's in the warehouse and they're like, oh, we'll just put in whatever, they're gonna get loads of calls asking questions about it later, which has taken, what, an hour overall of their time?

If they'd just done it right in the first place, they wouldn't be getting harassed as much, and maybe they would have more time. So it's definitely a cultural mindset and routine habit change more than anything. Until we get to that point, we need to get in the habit of continually cleaning and maintaining our data to make sure that it is good enough and as accurate as possible.

Well, thanks so much for sharing your experiences, methodologies, everything. I think this will be a really fun listen. Well, the thing is, Alteryx, and in all impartiality, other tools too: it's each tool for what you need and what suits your business. So for some people that will be Excel, because maybe you've got a small amount of data.

But if you do have the ability to get a tool that will help with your maintenance, it will save you so much time and make it less intimidating, and you're less likely to put off that task, because you can just drop it in a workflow and it'll do most of the work for you. We have to try and make it a little bit easier to want to maintain and do it regularly. Right, create a routine.

Create a routine. I think that's a huge part of establishing that overall culture, like you said. Prioritizing the data quality and everything. If we have tools that make it easy, I wanna say fun, but may, it's not fun for everybody, I'll acknowledge that, but it's fun for me. AI does not work without clean data.

That is the foundation. Clean data is the foundation of everything, and most people don't even realize where all this new GenAI is pulling information from. If it's not clean, then it's gonna give you the wrong answers. Very true. Clean data is the foundation. I think that's a great way to summarize what we've chatted about today.

But yeah, thanks so much for joining me, and I'll link all of the resources we mentioned in our show notes. Thanks for coming on. It was my pleasure, in fact; I love talking about this stuff. Thanks for listening. To get started using Alteryx for your supply chain use cases, you can download a free 30-day trial at alteryx.com/altereverything.

We'll also link a supply-chain-specific starter kit with prebuilt analytic workflows for popular use cases, like optimizing inventory, in our show notes on alteryx.com/podcast. See you next time.


This episode was produced by Megan Bowers (@MeganBowers), Mike Cusic (@mikecusic), and Matt Rotundo (@AlteryxMatt). Special thanks to @andyuttley for the theme music track, and @mikecusic for our album artwork.