Observations on the Citizen Data Scientist

Question

Citizen Data Scientist (CDS) is currently a very popular term in the analytics community. I’m sure that you’ve heard it before, perhaps in a previous blog post, at a recent conference, or from a supervisor asking you to become one. Yes indeed, there is no denying that the Citizen Data Scientist is picking up speed and will have a role to play in the data wars to come.

But what does it all mean?

I don’t want to get into the details of why CDSs are needed or even the journey of becoming one. Rather, I’d like to talk about some of my takeaways while exploring this burgeoning area.

Many of the terms are not standardized.

The role of Citizen Data Scientist is based on that of a Data Scientist. What do they do? Well, that answer depends on who you ask and the area in which the Data Scientist plies their trade. This is not necessarily unique to Data Scientists. Engineers may have differing skillsets based on the industry they work in, so what’s the big deal? Well, I haven’t yet come across the term Citizen Civil Engineer. Becoming a Data Scientist takes years of education and practice with a mentor. That education is usually focused on a particular field of specialization. Specialization is a great thing, but not when that specialist’s distinction isn’t recognized. What if all engineers were just called engineers? “I was trained as an electrical engineer but I’ll be working on this bridge for the next month.” The functions of a Citizen Data Scientist are even more vague, to accommodate the wide range of backgrounds and skillsets contained therein.

Data Scientists know a lot of stuff and do a lot of work.

Like, a LOT of stuff. I was recently asked to dip my toes into the Data Science puddle known as Predictive Analytics. During my research phase, I have had the pleasure of meeting and speaking with several bona fide Data Scientists. They have all been creative problem solvers who happen to rely on an immense amount of knowledge to do their job. As I try to get a handle on the Predictive Analytics area, I frequently find reminders that there are several other areas of data science that I haven’t yet touched. R, Python, Cluster Analysis, Prescriptive Analytics, A/B Testing, Time Series Analysis, Machine Learning, and Artificial Intelligence, oh my!

Data Scientists also rely on interviewing domain experts to better understand the business problem at hand. This definitely counts as a different skillset. They are also responsible for presenting findings to non-experts, which requires a delicate balance between relatability of what they are saying and technical accuracy to prevent misinformation. Yet another skillset required of the data scientist.

To add to the level of difficulty, the next problem you solve may be totally different from the last one you worked on. Data Scientists are required to have highly specialized knowledge in several domains, then apply that knowledge to a broad range of business problems. Each problem requires extensive research, preparation, and testing before arriving at a satisfactory solution. As I said, they know a lot of stuff and do a lot of work.

There isn’t always a right way to accomplish a task, but there are wrong ways.

Gradations are not easy for everyone. Many people want truths and facts, not probabilities and ranges. I have found that the problems solved in Data Science quickly become complex, and as a result certainty is at a premium. Designer users will relate to the notion that there are several ways to solve a problem correctly. But what if your answers couldn’t be known or checked? In this domain, you may never get the satisfaction of being correct in the traditional sense. Models are always subject to scrutiny, but they are useful because they can provide additional information to decision makers. Those decision makers must have confidence in the information provided by the model. If you make a fundamental mistake when creating a solution (such as applying a technique improperly), it will undermine the model’s performance and any confidence in that solution. You do not need a perfect strategy to gain value from your solutions, but it is vital that you apply your techniques correctly.

Data Science is about making judgment calls and recognizing the impact of those decisions.

While researching the various predictive models, my first question was: “How will I know which one to use?” Response: “It depends on the question and the dataset.” Neat. My follow-up: “So they don’t all do different things?” Response: “Some do similar things but they are all different. You try a few and compare results, then choose the best performer.” “Makes sense to me.” (rubs hands and thinks, this will be easier than I thought...)
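
The "try a few and compare results" advice can be sketched in a few lines of plain Python. This is only an illustration under loud assumptions: the data is made up, the two candidate "models" (a mean baseline and a least-squares line) and every function name here are hypothetical stand-ins, and a real project would use proper modeling tools and more careful validation.

```python
from statistics import mean

# Made-up toy data: each row is (x, y), where y is the value to predict.
train = [(1, 2.1), (2, 3.9), (3, 6.2), (4, 7.8), (5, 10.1), (6, 12.0)]
holdout = [(7, 14.1), (8, 15.9)]  # rows the models never see while fitting


def fit_mean_baseline(rows):
    """Candidate 1: always predict the average y seen in training."""
    avg = mean(y for _, y in rows)
    return lambda x: avg


def fit_line(rows):
    """Candidate 2: ordinary least-squares line through the training rows."""
    x_bar = mean(x for x, _ in rows)
    y_bar = mean(y for _, y in rows)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in rows)
             / sum((x - x_bar) ** 2 for x, _ in rows))
    intercept = y_bar - slope * x_bar
    return lambda x: slope * x + intercept


def mean_absolute_error(model, rows):
    """Average size of the prediction error on the given rows."""
    return mean(abs(model(x) - y) for x, y in rows)


def pick_best(candidates, train_rows, holdout_rows):
    """Fit every candidate on the training rows, score it on the holdout rows,
    and return the name with the lowest error plus all the scores."""
    scores = {name: mean_absolute_error(fit(train_rows), holdout_rows)
              for name, fit in candidates.items()}
    return min(scores, key=scores.get), scores


CANDIDATES = {"mean_baseline": fit_mean_baseline, "least_squares_line": fit_line}
best_name, scores = pick_best(CANDIDATES, train, holdout)
print(best_name, scores)
```

Scoring on rows held back from training is the key design choice: it measures how each candidate handles data it has not seen, which is the comparison that matters when choosing the "best performer."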

Many words later, I developed a profound respect for the Data Investigation toolset. How did this happen, you ask? It turns out that data scientists also like idioms, such as “Garbage In, Garbage Out.” Except they are referring to garbage data in, garbage model out. Consequently, Data Scientists spend a lot of time preparing data to make it optimal for the modeling algorithm(s) they plan to use. Again, sounds fair to me. Until I found out that the techniques used will differ depending on the model you plan on creating and the incoming data. Again I asked, “How do you know what to do to the data?” And wouldn’t you know, they said to me, “It depends.”
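
One way to see why preparation "depends" on the model is feature scaling. The sketch below is an assumption chosen for illustration, not something the data scientists quoted here prescribed: it uses made-up customers, a hypothetical nearest-neighbor lookup, and min-max scaling to show that a distance-based method can give a different answer before and after the data is rescaled.

```python
import math

# Made-up customers: (age in years, income in dollars).
customers = {"A": (25, 50_000), "B": (60, 50_500), "C": (40, 90_000)}
query = (27, 51_000)  # a new customer we want to match to the most similar one


def nearest(points, target):
    """Return the name of the point closest to target by Euclidean distance."""
    return min(points, key=lambda name: math.dist(points[name], target))


def min_max_scaler(points):
    """Build a function that rescales each feature to the 0..1 range,
    using the minimum and maximum seen in the given points."""
    columns = list(zip(*points.values()))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return lambda row: tuple((v - lo) / span
                             for v, lo, span in zip(row, lows, spans))


scale = min_max_scaler(customers)
scaled_customers = {name: scale(row) for name, row in customers.items()}

# On raw values, income differences (in dollars) swamp age differences (in years).
raw_match = nearest(customers, query)
# After scaling, both features contribute on a comparable footing.
scaled_match = nearest(scaled_customers, scale(query))
print(raw_match, scaled_match)
```

With this toy data the raw comparison matches the 27-year-old to customer B (age 60), while the scaled comparison matches them to customer A (age 25). A model that does not rely on distances would not necessarily need this step, which is one reason the honest answer is "it depends."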

My conclusion: the butterfly effect is real. Also, there is an artistic quality to recognizing the most appropriate action in any given scenario. Acquiring a sense of what actions are appropriate requires exposure and practice, but that will only get you so far. Trial and error will get you the rest of the way. After preparing data, you must recognize the impact of the changes you made and decide if you want to continue or go back and adjust.

Data Scientists respect the hustle.

As I mentioned, I’ve had the opportunity to ask some very “pedestrian” questions when speaking with extremely educated/capable people. Every one of my questions was met without judgment. I have found the Data Science community to be a very welcoming one (both online and in-person). This community is very generous with its hard-earned knowledge and I really appreciate that. Many professionals keep their trade secrets under lock and key for fear that the next qualified applicant may take their job. Perhaps these data scientists know I wouldn’t be able to do their jobs anywhere near as well as they can, but I choose to believe that they enjoy spreading knowledge and the unique perspective that knowledge provides.

You don’t have to know everything to be useful.

When you embark on the citizen data science journey, it can be overwhelming. The jargon used is often specific to the context of a particular problem. Then comes a laundry list of unfamiliar techniques, each containing a large, ambiguous function or three. This is followed by the thought experiment known as exploratory data analysis, which can feel more like exploratory data paralysis. Like I said, overwhelming.

But I have found that this community is eager to share knowledge. I have also found that there is a tremendous number of resources available to help anyone get started. And most importantly, I have found that persistence leads to literacy, which leads to capability. I may not have full confidence in my own ability to make a great predictive model (yet), but I can appreciate what it takes to make a good one and my “data literacy” has improved immensely. I don’t intend to stop here but wanted to take time to reflect on my own journey so far and share what I have found interesting.

JohnJPS · Answer

Great discussion.  I confess to wasting more than a few CPU cycles on definitions... here's my take:

* Data Analyst... prepares and analyzes data, often significantly contributing to business decisions.
* Data Engineer... understands how to automate the process of data prep and analysis.
* Data Modeler... defines data models (in particular predictive and/or machine learning models) that help the computer form its own description of the data. The modeler is already a top-flight analyst, and can interpret the model output and contribute to both business decisions and automation demands.
* Machine Learning Engineer... understands how to automate the process of machine learning, including data ingestion, model training, deployment and scoring.
* Data Scientist... discovers new techniques for doing analysis, engineering and modeling, and is able to show that the techniques work either via mathematical proof (a theorist) or empirically (an experimentalist). The few in existence work at Google, AWS, etc... (haha).

For me a Citizen Data Scientist is basically someone with none of these (or similar) job titles but who is still doing one or more of the given functions.