In a recent blog post, announcing the release of Alteryx Analytics 11.0, @LauraS explains that “Many organizations understand that governance is not only needed, but actually useful in empowering data workers and analysts to discover new insights; however for many organizations it’s difficult to achieve data governance within a self-service model.”
For this week’s Thursday Thought, and in honor of the 11.0 release, I wanted to take the opportunity to dive a little bit deeper into the concept of “Balanced Governance”. As always, please feel free to include as much or as little information as you like. All “thoughts” are welcome & encouraged!
Question:
What challenges do you face in your organization with regard to Data Governance?
I think it might be useful to nail down Data Governance vs. Information Governance. There's a great distinction article here: https://www.viewpointe.com/pointe-of-view-blog/the-difference-between-information-governance-and-dat...
From there, I think we're probably not too interested (from an analytics perspective) in things like security and general reliability or access. We're probably more into lineage and master data management (e.g. making sure a column in one dataset "means the same thing as" a column in another dataset).
I think from a functional perspective, there's a very rough assumption that "data governance" is for IT, and "information governance" is for data consumers. But as a data architect bridging that gap, I find that far too simplistic. We do want consumers to be the information owners: we would love for them to be their own "metadata managers." However, IT tends to be a little more ruthless in its adherence to (indeed, automation of) standards, which can cause friction since changes to the (often automated) system are difficult.
But if we want cool stuff like automatic impact and dependency analysis or automatic lineage, then the system does require strict adherence. You can't "wing it": generate flat files, share them "outside the system", then re-introduce that data back into the system somewhere else. If you do that, you cannot generate reliable dependency and impact analysis from within the system.
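To make that concrete, here is a rough Python sketch of impact analysis as a walk over recorded lineage edges. The dataset names and the `downstream` helper are made up for illustration and are not part of Alteryx or any particular governance tool. The second edge list models a flat file that left the system and came back under a new name, so the edge connecting it to its source was never recorded.

```python
from collections import defaultdict, deque


def downstream(edges, start):
    """Return every dataset reachable from `start` via recorded lineage edges."""
    graph = defaultdict(list)
    for src, dst in edges:
        graph[src].append(dst)
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


# Hypothetical lineage where every step happens inside the governed system.
governed = [
    ("crm_raw", "customers_clean"),
    ("customers_clean", "revenue_report"),
]

# Same flow, but customers_clean was exported to a flat file and re-imported
# as "customers_upload", so the system has no edge linking the two.
with_gap = [
    ("crm_raw", "customers_clean"),
    ("customers_upload", "revenue_report"),
]

impact_governed = downstream(governed, "crm_raw")
impact_with_gap = downstream(with_gap, "crm_raw")

print(sorted(impact_governed))  # the report shows up as impacted
print(sorted(impact_with_gap))  # the report is silently missing
```

In the second case a change to `crm_raw` still affects the report in reality, but nothing in the recorded lineage says so.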
Just some random thoughts; that's the main challenge from my angle.
The biggest challenge that we have is heterogeneous systems which all use different terms for the same thing.
For example:
- Trading system 1 uses the exchange's own ticker symbol for the product ID; and a proprietary 2-digit exchange code
- Trading system 2 uses the firm's internal product master ID; and the firm's internal master exchange ID
- Trading system 3 is a vendor package and uses vendor proprietary IDs
and then the invoices from the exchange all arrive with different coding strategies.
As a result, we end up building a large number of one-way translation tables to translate from each of these different nomenclatures into a single normalized voice in order to do analysis; validation; etc.
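A minimal Python sketch of what one of these one-way translation layers looks like is below. Every system name, native ID, and normalized ID in it is invented for illustration; in practice the tables would be loaded from wherever the mappings are maintained.

```python
# Hypothetical one-way translation tables: native ID -> normalized product ID,
# one table per source system. All IDs here are made up.
TO_NORMALIZED = {
    "trading_system_1": {"ESH7": "PROD-001", "CLJ7": "PROD-002"},
    "trading_system_2": {"10045": "PROD-001", "10078": "PROD-002"},
    "trading_system_3": {"VND-A9": "PROD-001"},
}


def normalize(system: str, native_id: str) -> str:
    """Translate a system-specific ID to the normalized ID, failing loudly on gaps."""
    try:
        return TO_NORMALIZED[system][native_id]
    except KeyError:
        raise KeyError(f"no mapping for {native_id!r} in {system!r}") from None


# Hypothetical records: (source system, native product ID, quantity).
records = [
    ("trading_system_1", "ESH7", 100),
    ("trading_system_2", "10045", -40),
    ("trading_system_3", "VND-A9", 15),
]

# Once everything speaks the normalized ID, cross-system aggregation is trivial.
totals = {}
for system, native_id, qty in records:
    key = normalize(system, native_id)
    totals[key] = totals.get(key, 0) + qty

print(totals)
```

Raising on an unmapped ID, rather than passing it through, is a deliberate choice in this sketch: a silent pass-through would leave un-normalized IDs mixed into the analysis.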
I expect that this problem is common across ANY large IT environment.
Unfortunately - this is an area where Alteryx is not yet fully mature - especially when compared to platforms like Watson that allow you to build a semantic understanding of your particular domain, and do visual data-transformation in-situ - so much of this has to be hand-rolled one-by-one.
What's needed is the ability to define synonyms in a robust way so that within any given Alteryx flow I could pull in a synonym translator for my Product synonyms or Client synonyms. This would all be company / team specific - but this would build up huge momentum once the first few are done.
Our main challenge is dealing with data novices (dozens and dozens of lawyers). We have one main database that IT is just now thinking about letting us connect to, but most of our business is conducted via Excel spreadsheets formatted in every unintelligible way possible. We've even received screenshots of spreadsheets.