Engine Works

NickJ · ‎06-27-2018

Across the Alteryx platform, we see data management as a core component of successful analytic business models. From the self-service work of a modern analyst through to Enterprise-scale orchestrated analytic pipelines, companies strive to unlock the value of data through business insight.

However, these efforts can fall short if organisations don’t evolve their practices around the management of data, especially if a company culture has a legacy of inattention to data governance and curation.

As a critical foundation, Reference Data is often called the backbone of enterprise applications. Its role within most departments is to transform other operational data (such as transactions) into meaningful business information – providing context and continuity across lines of business questions.

In the analytics world, reference data has formed the core of decision support environments like the data warehouse for decades. In the traditional data warehouse concept of a star schema, centralised ‘facts’ (such as business events, or transactions) are surrounded by reference data ‘dimensions’ that are used for slicing and dicing, and for consistent reporting.

The costs associated with poor reference data management are also abundantly clear. Enterprises that rely on large-scale data entry (such as Healthcare and Financial Services) will often encounter data quality problems due to ‘miscoding’ of reference data, leading to unsafe data (and models) that cannot be used for insight.

Individually, these reference data errors may seem insignificant, but in aggregate they can end up being extremely costly – even requiring entire ‘back office’ teams to scrub and correct reference data across applications: full-time jobs that only exist because of a lack of data governance.

This lack of standardisation can be truly frightening – every single enterprise application may use its own ‘mapping tables’ to coordinate universal concepts like ISO Currency Code or Country Names. Every mapping table represents a chance for mapping errors to be introduced into the data and poor decisions made, to say little about the thousands of wasted hours for your knowledge worker teams.

In this post, we’re going to ‘alteryx’ this problem, and show how we can introduce several lightweight patterns to help your teams gain confidence and ‘spell-check’ their reference data.

However, first a caveat: it’s critical not to view reference data management as simply a technology problem, looking for a software-based solution. It’s as much a people and process problem, meaning that it will involve negotiation and discussion with many departments and stakeholders. It may also involve multiple technologies, in which case we’re looking for software in the technical solution that will play and integrate nicely together. This integration is often best achieved with open standards and with easy-to-use APIs.

The solutions discussed below are also aimed at the core Alteryx audience – anyone with a data job to do, who needs a simple and flexible way to get that job done and get their lives back! This isn’t industrial IT – this is merely problem solving by reusing patterns and components you’ll already have to hand.

For this example, let’s start with the reference data itself, and we’ll use a simple example of currency codes.

Our requirement is to maintain a list of valid ISO codes along with Currency Names, and to ensure that any incorrect currency codes have been mapped to the correct values instead. This is hugely important for those ‘fat finger’ mistakes, such as UDS for US Dollar rather than USD.

If you’re not familiar with currency ISO codes, here’s a helpful Wikipedia article to ISO 4217.

Of course – we could take this a lot further. We could hold additional reference data for currencies; we can hold historically-valid data (such as a currency that was active between certain historical dates, such as the Yugoslav dinar); we could offer a fuzzy-matching ‘best guess’ for your miscoded reference data, and much more. Please feel free to take these ideas as far as you and your company need to manage your reference data. I’ll be deliberately keeping things simple.

Reference Data Storage

We probably want to store the data in an environment that allows for easy CRUD (‘Create’ – ‘Read’ – ‘Update’ – ‘Delete’) operations and that offers failover, redundancy and recovery. To this end, one of the more popular formats is likely to be a relational database table. In my example, I’ve used MS SQL Server, but there are plenty of other options.

We can perform CRUD operations on this table either by directly using SQL commands on the database, or via an Alteryx Analytic App (likely to be exposed to administrators only) that could create, update or delete currency records as needed.

It’s also worth considering that we may only want to ‘soft-delete’ records to preserve the history of our changes over time. As opposed to permanently removing a record, we could simply tag the record with a yes/no flag to say whether the record is currently active.

A simple database table for our currency data might look like:

CCY_ID	Unique/Primary Key for the table (important for updates)
Country	The country name (e.g., Australia)
Currency	The currency’s official title (e.g., Australian dollar)
ISO Code	The standard 3-character code for the currency (e.g., AUD)
Status	Your internal status for this code. You might use ‘Certified,’ ‘Retired,’ ‘Error,’ and more. Codes in this column will be defined by your internal policies around data retention.
Replace With	A simple column that points to a preferred ISO Code value. For example, the code ‘AUX’ is a common mistyping of ‘AUD’, so we set a Replace With value of ‘AUD’ for the AUX mistype.

Note: We could use the Alteryx YXDB format, but this would give us less flexibility over the update process. In effect, we’d be overwriting the file with each change.

Reference Data Service

Once the reference data has been maintained, we’d be looking for techniques to bring this data into workflows for automated cleansing and highlighting of errors.

Several techniques available:

Bring the entire reference table for this class of data into Alteryx Designer and compare set-wise against a user’s data.
Bring single records into Alteryx Designer using a Batch Macro to control the data filtering.
Bring single records into Alteryx Designer using the Gallery API and a Batch Macro
Bring the entire reference table into Designer using the Gallery API
and many others…

There are pros and cons to each method, with the primary decision likely to be made around data volumes for the reference data. If we’re dealing with millions of data points, it might be inefficient to bring all this data into memory, so we might instead consider filtering the query to the underlying data source, or streaming the local data into the database using the In-DB functions and performing the match there instead.

Likewise, by exposing the Analytic App’s functionality via the Gallery API, we could develop a reference data ‘utility’ that could be called outside of pure Alteryx workflows and see wider adoption in an enterprise. However, there are small latencies in interacting with this API that may cause performance to be considerably slower than working directly with a database connection – for small numbers of records, it’s unlikely to be a problem but possibly would be unacceptable for thousands/millions of potential reference data checks on a record-by-record basis.

Design Decisions

For our currency example, we’ll bring the entire data set into memory via a single database query, and match our data set-wise using Alteryx’s Join tools. Remember: there is unlikely to be a single answer that fits all scenarios, so upfront thought about performance and usability is critical.

We might also make the design decision to build multiple reference data checks into a single tool for the end user, so they can reuse it for currency checks, sectors, regions – and potentially many other flavours of reference data: a one-stop shop for reference data cleansing.

You could make the decision that you want each reference check in a separate tool depending on its functional type. Again, there is no right answer: ask your users. “Performance, Feedback, Revision” as we say!

Let’s revisit our design decisions here:

We’ll be using a SQL Server table (or tables) to hold the reference data.
We’ll have an analytic app to perform CRUD actions on the reference data. This app will be accessible to data stewards only (so that we maintain control of the integrity/quality of the underlying data).
We’ll build a macro that queries this reference data table and matches a Currency against known reference data. We can either extend this single tool for other reference data, or build more tools – it’s outside the scope of the post to go further.
The macro will give us several possible outcomes:
1. The reference data maps perfectly (and we get on with our day).
2. The reference data can’t map the codes used (these records are piped into a separate data stream for further analysis)
3. The reference data identifies replacements (managed by the data stewards) for current incorrect data in the data stream. For this first iteration, no auto-correction has been applied, although this could be possible.

Implementation

I’ve provided a bare-bones solution to the first couple of points. For Step 1, you can use a simple Alteryx workflow to load the attached Excel sheet containing the sample reference data, and simply use a Record ID tool and an Output Data tool to create an initial reference data table in your database.

@fadib has also written a great piece on the Community about making sure that you can set a Primary Key on the CCY_ID column, which is critical if you want to make updates to the table later on.

Alternatively, here’s some basic SQL to achieve the same result:

CREATE TABLE [dbo].[ccy] (
  [CCY_ID] [int] NOT NULL,
  [Country] [varchar](255) NULL,
  [Currency] [varchar](255) NULL,
  [ISO Code] [varchar](255) NULL,
  [Status] [varchar](255) NULL,
  [Replace With] [varchar](255) NULL,
  PRIMARY KEY CLUSTERED (
    [CCY_ID] ASC
  ) WITH (
    PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON
  ) ON [PRIMARY]
) ON [PRIMARY]

For Step 2, I’m attaching a basic analytic app that accepts updated values for the columns in this table and then commits them back to the database. Remember – this is not an application for all users, just the data stewards. It could be published to the Alteryx Server and added to a ‘Data Steward’ Collection so that it wasn’t generally available.

(Eagle-eyed users will spot that I’ve really only covered the ‘C’ and ‘U’ of CRUD operations here… more work needed, especially around error-handling!)

For Steps 3-4, I’m also attaching a macro that reads in a user’s local workflow data and compares this to the reference data table in its entirety. (Remember: for very large reference data sets – it may not be practical to load millions of records from a database in this way. For my 200 records of currency data – it’s perfect. Your mileage may vary).

The macro uses a join tool to check the incoming user currency against the reference ISO code. For good matches, there are two possible outcomes. The record passes straight out to the user workflow (for non-replaced codes) or the user is informed of a recommendation to change their code (if the ‘Replace With’ field is populated).

For non-matches, we filter these records into a separate macro output for further investigation. In these cases, we may have missing reference data or we have miscoded in a way that’s not handled by our replacement rules. Again, we could introduce more complexity/sophistication here using our Fuzzy Matching/Predictive Grouping tools – keep your tool as simple as you can, but not simpler (as Einstein would say!)

As a side note, I did consider automatically ‘fixing’ the reference data, but I think we’d also need to record which CCY_ID we need to replace the offending data with. The currency code itself isn’t sufficient, as it may appear in multiple country lines (e.g. ‘Euro’ or ‘AUD’ are used in multiple geographies). We’ll keep this simple for now, but the fix isn’t difficult and would be easy to implement within the macro.

Testing

Attached is some sample currency reference data (around 200 rows) that we can use as the basis of our refdata table.

We’ll test the macro tool with the following data:

Currency	Date	Strike Type	TXN ID	Property1
ABC	2018-06-11	PUT	91209432	1.917941
GBP	2018-06-11	CALL	3294021349	0.38936
TRI	2018-06-11	PUT	4312903214	8.0293

So – we have some arbitrary ‘financial’ data with currency codes. We’ll attach this data to our refdata ‘spell checker’ make sure we define the column that we’d like to test against (‘Currency’). The results are as follows:

From the macro’s ‘M’ output (‘matched’) – we see that GBP has come through without issue.

From the macro’s ‘U’ output (‘unmatched’) – we can see that ABC doesn’t currently map to any of our stored reference data.

From the macro’s ‘R’ output (‘recommended fix’) – we can see that TRI has a recommendation to be changed to ‘TRY.’

In each of the cases, we’ve also included a new field REF_CHECK that gives descriptive detail of this output, and a date/time stamp of when the check took place (which could be useful for auditing).

I can also use my admin-only analytic app, and add new mappings to the dataset, such as that AUX should be treated as AUD:

I can then test with an additional element of data using a different data schema/format:

Currency Type	Date	Units
AUX	2018-06-11	5000

This new data is passed through the reference data check and returns the following recommendation to remap to ‘AUD’ – success!

Next Steps

As part of this examination of reference data, I thought about the following extensions:

Adding multi-domain reference data checks within the single tool (e.g., Currency, Sector, Country, Product, ISIN, etc.)
Adding a new analytical tool on top of the reference data that tried to identify the type of reference data in a user-supplied column. For example, the output of this tool might be to say to the analyst ‘I think you have currency data in this field,’ or ‘it looks like you have sector information.’ This could be the first step to guide the user to the correct refdata checks.
Adding fuzzy-matching or nearest neighbour analysis to make a best guess in cases where no refdata match was found. The tool could provide a similarity score to estimate confidence.
Adding proper CRUD functionality to the Admin app, and publishing to the Alteryx Gallery for browser-based access.
Adding mechanisms to bulk-reload certain reference data, but retain user-mappings.
We could set the macro up to automatically email the data steward team with aggregated statistics from each run of the macro.
Registering the reference data source as metadata in Alteryx Connect so that the data source can be simply dragged/dropped into a workflow for ad-hoc use by analysts.
For more complex imputation/matching of reference data, deploying code-friendly options in either R or Python to Alteryx Promote and ‘score’ the reference data using the generated API endpoint.

The beauty of using Alteryx for this kind of solution is that any of these steps could be implemented in our amazing platform – ask your users what they need. Don’t over-engineer. Performance, Feedback, Revision.

Conclusion

We’ve taken an Alteryx worldview to the challenge of Reference Data – we’ve focused our lens on the problems of reference data for the analyst and the citizen data scientist and how it’s vital to be using the correct context and meaning in order to develop the best insights.

We’ve used an analytic app to develop a trivial reference data mapping utility (and of course, you may want to purchase, use or build your own). We showed how this data source could be consumed inside a code-free Alteryx macro to identify gaps or recommendations for reference data, and we highlighted some of the ways that you could extend this functionality further.

With a growing trust in reference data, organisations can ensure quality throughout the information supply chain and confidence that their analytic models and decisions have the best reference data foundation to succeed.