When you begin with analytics, whether as a new practice within your firm or as a new team in a new area, much of your work may initially be what I call "single-pass analytics," where the data goes from source, to prep, straight to the end-point.
This is natural and normal in the beginning. The question for this article is, "Does this change as you scale and, if so, how?" From my experience, our work changes significantly with scale across many different axes, and this series of articles will tackle each of these one by one.
For this first article, we will cover how we think about data, and how this grows as your analytics efforts scale.
Broadly speaking, data analysis has a predictable set of steps. On different projects, these may be done in parallel, in another order, or grouped together, and some may already have been done for you by a previous project or a central team.
What are type 1 and type 2 analytics (or Mode 1 and Mode 2)?
Why is this distinction useful? Well, you will think differently about type 1 analytics than type 2, and you will manage the project differently. The classic type 2 project in our world is when the boss says, "please can you figure out why X happened?", whereas a classic type 1 request would be "please can you give us P&L by department every month?"
Where do we spend our time?
Our experience is that at least 80-90% of the time spent in analytics is spent on sourcing, cleaning, enriching, and preparing data, with the remainder spent on insight. So, once you've done your first project, you may already be starting to think about how to store this data so that you don't have to re-do all this work for the next project.
How does this change as you scale? These will never be the exact stages that you and your team/organisation go through, but they may be useful to help us follow the story. For this article, we will focus on how your treatment of data changes, and later we will cover culture, skills, and reusable components/widgets, etc.
Stage 1: The Beginning
You may start with a network drive full of spreadsheets (try to create a good folder structure with clear names, for example, client data in one folder and product data in another). The limitation is that this doesn't scale well (more than a million rows becomes slow) and it's not great for a team of 2 or more.
Every team is doing single-pass analytics: the data starts in raw sources, is processed in Alteryx, and is then output into a report, alert, or another file.
Stage 2: Starting to Store Your Results
From there you can graduate to a file-based database like SQLite or MS Access to store some work-in-progress.
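As a rough illustration of storing work-in-progress in a file-based database, here is a minimal sketch using Python's built-in sqlite3 module. The table name `sales_prepared`, its columns, and the sample rows are assumptions made up for this example, not part of any real project.

```python
import sqlite3

# Hypothetical rows produced by a prep step (illustrative values only).
prepared_rows = [
    ("2024-01-01", "North", 1200.0),
    ("2024-01-01", "South", 950.5),
]


def save_work_in_progress(conn, rows):
    """Store prepared rows so the next project can reuse them."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_prepared ("
        "sale_date TEXT, region TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales_prepared VALUES (?, ?, ?)", rows)
    conn.commit()


def load_region_totals(conn):
    """Read the stored rows back as a total per region."""
    cursor = conn.execute(
        "SELECT region, SUM(amount) FROM sales_prepared "
        "GROUP BY region ORDER BY region"
    )
    return cursor.fetchall()


# ":memory:" keeps the example self-contained; passing a file path such as
# "wip.db" instead would persist the data between runs.
conn = sqlite3.connect(":memory:")
save_work_in_progress(conn, prepared_rows)
totals = load_region_totals(conn)
print(totals)
```

The point of the sketch is simply that the prepared data outlives the workflow that produced it, so the next project can query it rather than repeat the prep.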
Stage 3: Bigger Data = Bigger Database
If you continue to be successful, you may need a dedicated database so that you can store your data and process larger volumes. Note: as soon as someone says, "how does this measure compare to last week?", you probably need a database to start keeping history.
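To make the "compare to last week" point concrete, here is a small sketch of keeping dated snapshots of a measure and comparing two of them. It uses SQLite only to stay self-contained; the table `measure_history`, the measure name `open_tickets`, and the values are hypothetical.

```python
import sqlite3
from datetime import date, timedelta


def record_snapshot(conn, snapshot_date, measure, value):
    """Keep one row per measure per day instead of overwriting the latest value."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS measure_history ("
        "snapshot_date TEXT, measure TEXT, value REAL, "
        "PRIMARY KEY (snapshot_date, measure))"
    )
    conn.execute(
        "INSERT OR REPLACE INTO measure_history VALUES (?, ?, ?)",
        (snapshot_date.isoformat(), measure, value),
    )
    conn.commit()


def change_since_last_week(conn, measure, today):
    """Return today's value minus the value 7 days ago, or None if either is missing."""
    last_week = today - timedelta(days=7)
    rows = dict(
        conn.execute(
            "SELECT snapshot_date, value FROM measure_history "
            "WHERE measure = ? AND snapshot_date IN (?, ?)",
            (measure, today.isoformat(), last_week.isoformat()),
        ).fetchall()
    )
    if today.isoformat() not in rows or last_week.isoformat() not in rows:
        return None
    return rows[today.isoformat()] - rows[last_week.isoformat()]


conn = sqlite3.connect(":memory:")
today = date(2024, 1, 8)
record_snapshot(conn, today - timedelta(days=7), "open_tickets", 40.0)
record_snapshot(conn, today, "open_tickets", 52.0)
delta = change_since_last_week(conn, "open_tickets", today)
print(delta)
```

Without the stored snapshot from seven days earlier there is nothing to compare against, which is why the question tends to force the move to a database.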
At this stage, your data picture above has changed:
Stage 4: Do we all need the same ingredients?
Once you have 3 or 4 people working on related datasets, it's important to talk about data and how you can reuse it. There are tools to help, like Alteryx Connect or even a simple data dictionary. Don't overlook the power of a chat room, too. ("Does anyone have data about giraffes?") You begin to have multiple different Alteryx jobs running from the same data, and you need to think about notifying folks before changing datasets that may be used by other people. This requires data lineage (again, Alteryx Connect can help).
You will also need to think about data governance. California, for example, passed the California Consumer Privacy Act (CCPA), which gives consumers rights over their data. You will need a specific focus on data retention, accuracy, and access control.
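A simple data dictionary with lineage can be sketched in a few lines. The example below is not how Alteryx Connect works; it is a hypothetical in-memory catalog (the dataset names, owners, and the `downstream_of` helper are all invented for illustration) showing how recording each dataset's upstream sources lets you work out who to notify before a change.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """One row of a simple data dictionary."""
    name: str
    description: str
    owner: str
    upstream: list = field(default_factory=list)  # datasets this one is built from


def downstream_of(catalog, dataset_name):
    """Return every dataset that depends, directly or indirectly, on dataset_name."""
    affected = []
    to_visit = [dataset_name]
    while to_visit:
        current = to_visit.pop()
        for entry in catalog.values():
            if current in entry.upstream and entry.name not in affected:
                affected.append(entry.name)
                to_visit.append(entry.name)
    return sorted(affected)


entries = [
    DatasetEntry("sales_raw", "Raw extract from the sales system", "ana"),
    DatasetEntry("sales_clean", "De-duplicated sales", "ben", ["sales_raw"]),
    DatasetEntry("monthly_sales_report", "Sales by month", "cho", ["sales_clean"]),
    DatasetEntry("giraffe_counts", "Giraffe census data", "dee"),
]
catalog = {entry.name: entry for entry in entries}

# Before changing sales_raw, find what depends on it and who owns those datasets.
impacted = downstream_of(catalog, "sales_raw")
owners_to_notify = sorted({catalog[name].owner for name in impacted})
print(impacted, owners_to_notify)
```

Even a spreadsheet with the same four columns gives most of this benefit; the code only shows that the lineage question becomes answerable once upstream sources are written down.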
Stage 5: Data Engineering becomes a role
If you are successful in stage 4 and your team continues to grow, you may find that you are starting to need several different versions of a piece of data. One person needs sales summarised by month, another person needs only sales for stores in western states where a promotion was underway. Rather than having five different people connecting to your sales system and all of you pulling subsets of the sales data into your reporting tables, your data picture now changes again:
Sources → Raw/Staging → Clean, Enrich, Conforming, Prep → Publication to end metrics env
Source: Microsoft Enterprise Data
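The staged flow above can be sketched as a chain of small functions, each feeding the next, so that the source is read once and every published view is derived from the same cleaned data. This is a minimal sketch in plain Python under stated assumptions: the field names, the sample rows, and the `WESTERN_STATES` set are illustrative, and a real implementation would more likely be database tables or scheduled workflows.

```python
from collections import defaultdict

WESTERN_STATES = {"CA", "OR", "WA"}  # illustrative subset, an assumption


def stage_raw(source_rows):
    """Raw/staging: copy source rows unchanged so the source is read only once."""
    return [dict(row) for row in source_rows]


def clean_and_conform(raw_rows):
    """Clean/conform: drop rows with no amount and standardise types and codes."""
    cleaned = []
    for row in raw_rows:
        if row.get("amount") in (None, ""):
            continue
        cleaned.append({
            "month": row["date"][:7],  # "YYYY-MM"
            "state": row["state"].strip().upper(),
            "amount": float(row["amount"]),
            "on_promotion": bool(row.get("on_promotion", False)),
        })
    return cleaned


def publish_monthly_sales(clean_rows):
    """Published view 1: sales summarised by month."""
    totals = defaultdict(float)
    for row in clean_rows:
        totals[row["month"]] += row["amount"]
    return dict(totals)


def publish_western_promo_sales(clean_rows):
    """Published view 2: sales in western states while a promotion was underway."""
    return [
        row for row in clean_rows
        if row["state"] in WESTERN_STATES and row["on_promotion"]
    ]


source_rows = [
    {"date": "2024-01-05", "state": " ca", "amount": "100.0", "on_promotion": True},
    {"date": "2024-01-20", "state": "NY", "amount": "250.0"},
    {"date": "2024-02-02", "state": "WA", "amount": "", "on_promotion": True},
    {"date": "2024-02-10", "state": "or", "amount": "75.5", "on_promotion": True},
]
raw = stage_raw(source_rows)
clean = clean_and_conform(raw)
monthly = publish_monthly_sales(clean)
western_promo = publish_western_promo_sales(clean)
print(monthly, western_promo)
```

Both published views come from the same `clean` rows, so a fix to the cleaning step reaches every consumer at once rather than being repeated in five separate extracts.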
3 key takeaways
NOTE: we’re not advocating that every project needs the full force of the Stage 5 model – there will be many type 2 analytics projects that need a quick turnaround and are done in a straight-through analytics pipe. However, as your data assets become cleaner and more managed, your entire organization will feel the benefit of faster and cleaner analytics.