Recently, I provided the opening keynote for the Great Lakes Data & Analytics Summit where we adapted the experience to encourage both in-person and live virtual participation because there's no social distancing when it comes to analytics.
While my talk focused on the importance of becoming an insight-driven organization, the questions I received afterward sparked a number of conversations. Here, I take a deeper dive into that Q&A, covering everything from bad data to buzzwords and more, because together we solve.
A: I don’t necessarily believe there is such a thing as “bad” data. But it is true that data sources frequently have erroneous and/or missing data. I have met very few real-world data sets that were perfectly clean. That said, most data can still be used effectively, as it is good enough to provide value, insight, and a solution to a problem.
Let’s start with an example of how data can mislead people — or, more accurately, how people’s biases can cause them to use data in ways that might provide the wrong answer.
We’ve had moments throughout history in which biased historical data has led people to draw the wrong conclusions. Let’s take a famous story about Abraham Wald from World War II. The legend goes that to make planes safer and more likely to survive battle, a team of researchers evaluated every plane that came back from missions with bullet holes to assess where to place more armor.
This is a great data science problem. Adding more armor to the wrong places makes a plane heavier, slower, and more likely to be shot down; adding armor to the right places makes the plane more likely to survive battle, a great optimization problem that would have meaningful consequences — exactly the kind of stuff that data scientists love to work on!
Normalizing the number of bullets by square footage, the team concluded that the best place to add armor would be where they saw the highest number of bullets per square foot, which in this case was the fuselage.
As the legend goes, when they reviewed this data with Wald, he responded that the armor doesn’t go where the bullet holes are; it goes where they aren’t (the engine).
Why did he come to this conclusion? He kept asking questions about the data. Where are all the missing holes, the ones that aren’t in the data set? Why weren’t the holes spaced evenly all over the airplane? Wald’s conclusion was that the planes with holes on the engines weren’t coming back, whereas planes with holes in the fuselage were making it back. Therefore, adding armor to the engine would be more important.
This method was used throughout World War II as well as the Korean and Vietnam wars. It’s a great example of how the initial analysis would have been fatally incorrect had it not been examined more carefully.
There are many other types of cognitive biases beyond this one (which is called “survivorship bias”), and each can affect an outcome. It’s important to consider these as you’re looking at data to ensure you get the best possible result. And keep in mind, the data wasn’t “bad”; it just wasn’t used in the right way.
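The Wald story can be sketched numerically. Below is a toy simulation of survivorship bias; all probabilities and counts are invented for illustration and are not from Wald’s actual analysis:

```python
import random

# Toy simulation of survivorship bias. All numbers here are invented for
# illustration; they are not from Wald's actual analysis.
random.seed(0)

# Probability that a single hit to each section brings the plane down.
DOWN_PROBABILITY = {"engine": 0.8, "fuselage": 0.1}

observed = {"engine": 0, "fuselage": 0}  # holes counted on returning planes
actual = {"engine": 0, "fuselage": 0}    # holes across the whole fleet

for _ in range(10_000):
    # In reality, hits are spread roughly evenly across the airframe.
    section = random.choice(["engine", "fuselage"])
    actual[section] += 1
    if random.random() > DOWN_PROBABILITY[section]:
        observed[section] += 1  # only surviving planes can be inspected

print("holes on returning planes:", observed)  # fuselage dominates
print("holes across all planes:  ", actual)    # roughly even
```

Counting only the holes on returning planes would point the armor at the fuselage; counting holes across the whole fleet shows hits were spread evenly, which is exactly Wald’s insight about the planes that never came back.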
We’ve got to be careful to avoid biases, to not be blindsided by our assumptions and the conclusions we draw from data, and to never forget to include clever humans (like Wald) as part of the process. Avoiding cognitive bias starts with being aware it exists and then actively combating it. The internet is full of tips, including checking your ego, not making decisions under time pressure, and avoiding multitasking.
But one of the most powerful tools in your arsenal is computer augmentation: using data science tools that free people to think more deeply about the analysis and create better answers.
A: The narrative that computers will “take over” is certainly one that plays well in the movie theaters and has been in the news for decades. However, most people who are experts in this space do not share the same worries. While there are many repetitive tasks that machines are quite good at, I’m not concerned that computers will replace the data scientists. To give you some perspective, let’s start with areas where computers excel and how this relates to people.
Let’s say I hand you all the invoices that came in today and need them added together to understand our total amount due. A computer is great at the mundane task of adding these numbers together. This frees me up to do higher-order functions, like talk to customers and think about what new products I might want to design and sell.
There are many tasks humans do that would be hard to envision a computer fully taking over autonomously.
Think about some of the most advanced computerized tasks today, like teaching a computer to drive a vehicle. Again, I would argue that driving a car is a mundane task for humans, even boring to most, and if we were asked to drive a car every day for eight hours, we would struggle to stay interested. Yet teaching a computer to do this task is one of the most challenging efforts underway in computer and data science today.
Computers can augment our skills and amplify what we do, but at this point, I don’t see Artificial Intelligence (AI) taking over most of these types of tasks. Data science, it turns out, is a very creative and complex field, and in this profession specifically, I see many examples of amplification with AI techniques but very few cases in which the process is completely taken over by machines.
The most significant challenge for most data scientists is in the first step of the process: problem formulation. In this step, we are trying to understand the business problem, the objective, the unintended consequences that could arise, and a methodological approach that might be used. This is certainly not a domain in which computers can help much. That said, data cleansing, finding correlations between values, and monitoring a model that is in production are all areas in which computers can help augment the process.
There have been many others who have weighed in on this debate, with Elon Musk and Mark Zuckerberg famously taking different views on the discussion. We’ll see where this one ends up, but for now, I’m squarely with Mark. What are your thoughts?
A: It worries me when people ask this question. The reality is that the focus of our work should be on achieving amazing outcomes. Does it really matter whether you solved the problem using AI or ML? Does it matter how big or small the data is? I would suggest that solving a problem well with very simple math deserves more celebration than using AI or ML to produce a poor solution. It’s all about the outcomes. What are you asking the data to solve? What is the challenge? We need to refocus our thinking on the outcomes we’re looking to achieve.
As far as definitions go, you could ask different people and likely get varied responses, as AI and ML continue to evolve. For AI, it’s about computers acting like humans: as the name implies, the artificial (the machine) is acting intelligent. For example, many would say that tasks like natural language processing, understanding the meaning of spoken words, would be a skill reserved for humans. When machines do this task, it is considered artificial intelligence. The problem is that what we believe is reserved for humans changes over time, and so opinions of what counts as AI change with it.
ML is more of a subcategory of AI in which machines learn from data and predict an outcome, like a regression analysis in which you have the data, you perform some math, and can predict a future value or perform a what-if scenario. There is a subset of the overall ML field called deep learning, in which we use techniques that mimic the human brain, with neural networks, to predict an outcome.
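As a minimal sketch of what “machines that learn from data and predict an outcome” means, here is a simple linear regression fit with plain arithmetic; the data points are invented for illustration:

```python
# Ordinary least-squares fit of y = slope * x + intercept.
# xs/ys are made-up monthly sales figures for illustration only.
xs = [1, 2, 3, 4, 5]       # month number
ys = [10, 12, 15, 19, 20]  # units sold

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# slope = covariance(x, y) / variance(x)
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

# "What-if" scenario: predict sales for month 6.
prediction = slope * 6 + intercept
print(f"fit: y = {slope:.2f}x + {intercept:.2f}; month-6 forecast = {prediction:.1f}")
```

The “learning” here is nothing more than fitting two numbers to past data and using them to answer a question about the future, which is why the AI-versus-ML label matters far less than whether the prediction solves the business problem.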
But when people ask this question, I am more concerned with why they are asking it — as it shouldn’t matter if I performed AI or ML. Instead, the question should be, can you solve my problem? Do you really care if I did it with addition or multiplication? Would that really make a difference in the value of what we did?
A: There will always be messy data, as there are very few examples of perfect data sets in large-scale production systems today. But I believe that many messy data sets can still be highly useful. And it’s this second part that is where we should be focusing nearly all our time and effort.
Let me first rewind to a problem I’ve seen many teams experience in the early days of the analytic journey. Teams became very focused on their data; they thought of it as the “new oil” and believed that if they could just clean it all up and “refine” it, then like magic, value would come out of the process. Unfortunately, there are a few issues with this. The first is that people tend not to want to expend energy on a task unless they know it will deliver value for them. Asking people to go clean data when they can’t connect it with an outcome is just not a sustainable endeavor. Instead, if we intend to solve a problem and we find that cleaning the data can help, then we are more likely to work on cleaning the data. A second, related issue is that simply cleaning data doesn’t provide any ROI by itself, so all of that effort doesn’t directly yield an outcome.
Here’s what I would not recommend:
So what should we do?
Clean data clearly provides value in an enterprise, but prioritizing where to focus efforts and ensuring there is a natural feedback loop to those who create the data is key to successfully improving your data pipeline. By focusing on business solutions while making the data issues visible, there are natural incentives that will ultimately drive better data quality.
A: What a great place for analytics and democratization: the ultimate knowledge industry, higher education. There are so many places where analytics plays a role. I’ll provide a top 10 list here, but I could easily have generated a top 100 list and would be happy to do so if anyone is interested! I also see analytics and specifically Alteryx being used in K-12 as well, with many of the same drivers.
If you are interested in additional use cases, you can see more detail here.
A: This is one of the most common questions I hear from IT organizations. How are we going to prevent people from creating “bad analytics”? How are we going to govern and control this? Or more bluntly, I’ve heard some say, “We can’t allow people outside IT or the data science team to perform analytics; they might make a mistake.” And while I totally understand where the question is coming from, I think frequently we miss the reality of what is happening.
Are we afraid of people all over our business using scientific methods to solve problems? Are we really worried about giving businesses better technology to perform math?
Governance is important, and good governance is like good government: an enabler that helps you achieve your goals. Implementing successful governance programs is all about focusing on how to enable users to perform analytics using best practices and helping them achieve better outcomes.
The great news is that yes, new technologies are providing even better ways to govern and help people avoid many of the pitfalls they ran into before. That said, much of this takes work and is not as simple as buying a technology solution. Your organization will need to put processes in place to make it all happen and invest human capital to make it work seamlessly. But with a few key actions and modern analytics, we have seen many companies put great governance processes in place and have incredible people flourish across their businesses delivering amazing outcomes and ROI.
Here are a few of the best practices you’ll want to implement as part of a governance approach:
A: First, you can deploy solutions to a server as well as a desktop, rather than to a desktop only. This allows visibility and an ability to support solutions that a desktop-only tool like most spreadsheets can’t match.
Second, solutions like Alteryx provide self-documentation of a process — so the need to create desktop procedures is reduced, as it’s automatic. This makes processes more sustainable and understandable, allowing very direct reviews of the “code.”
Third, with Alteryx, each process step is made transparent where inputs and outputs are seen by default. This allows full traceability.
Fourth, Alteryx creates repeatable processes that can be run on a schedule without human intervention. This automation reduces the risk of a copy-and-paste blunder. (Search for “largest spreadsheet blunders” and you will find quite a few multi-million and even multi-billion dollar problems.)
Fifth, using solutions like Alteryx Connect, you can monitor lineage and provide guidance to users on which workflows and data sets are of high quality, which are certified, and who to go to for help.
All of these are great examples of how Alteryx helps companies improve their control and quality of analytic outcomes. No technology can eliminate human error. The real question in my mind is whether you’re making it better than it was before.
But again, if we go back to the original question and my original angst, a lot of this comes back to the question of whom we trust to use math. Is it really any better to have Accounting ask IT to build a solution? And when IT builds it, is it right? Who checks it to ensure it’s working as expected and signs off? Wait, it’s Accounting? If that’s the case, why do we think IT or a data science group will know better than Accounting whether an accounting process, forecast, or model is accurate? While having IT and data science professionals available to provide support makes a ton of sense, telling Accounting that we won’t allow them to use a more modern solution because we’re concerned they’ll make a mistake seems completely wrong.
Knowledge industries are composed of incredibly knowledgeable workers. Data science thrives where we provide these domain experts with the tools and education needed to succeed. Is your organization providing your most valuable assets, your people, with the best tools available and the training and support to leverage modern capabilities? Do you have a Center of Enablement made up of data scientists who can help take them on this journey? If you do, you are likely well on your way toward digital transformation.
Alan Jacobson is the chief data and analytics officer (CDAO) of Alteryx, driving key data initiatives and accelerating digital business transformation for the Alteryx global customer base.