2018 Excellence Awards Entry: Data Check Between Two Sources
Overview of Use Case:
There are various instances where we do a lot to data comparison between two different sources – from a front-end result exported in excel to checking if the results match up from the MSSQL server database to checking between output files generated (2+ million rows). Manually comparing the data from the sources was extremely tedious and taking a toll on the product development time. But having built this Data compare workflow with Alteryx has been a blessing and it has helped the team catch data mismatches very early in the process and effectively improving the quality check efficiency 10-fold using this repeatable workflow.
Describe the business challenge or problem you needed to solve:
Initial business problem was validating the data from two different data sources. These sources can be anything from SQL Database, front-end Excel results or .CSV flat file. The main goal was to compare the datasets and easily be able to find errors early in the release process. This data comparison was previously done manually which would take more than 3 days of product development time. Since the file format is huge, sometimes 300 MB, Excel wouldn’t even open in most cases. Using an Alteryx workflow as an interface Application helped the team raise the bar and make effective checks early in the process.
Describe your working solution
Alteryx products used: mainly Alteryx Designer x64(2018.2). For this particular scenario, data format for the input files is Excel and a database table from Microsoft SQL Server Management Studio. SQL Table is constant and will change every year based on forecasting. Currently, this workflow is deployed in the Alteryx Gallery and can be run by the product team.
1.Input the Excel from the Alteryx workflow or the Gallery.
2. Select Tool and recordID Tool is used to identify each data from the source (since our data has more columns, For example If a patientID has 47 columns then it gets 47 rows with RecordID=1 using Transpose Tool).
3. Join is done based on the RecordID and Name (Rename of column names is done here as Expected and Actual).
4. Message Box displays any warning messages if there is a mismatch.
5. Union Tool combines the data from the 2 sources and aligns it.
6. Filter Tool- spits the results into two. If Actual=Expected then true path, otherwise, the False Path.
7. Results are exports in Excel.
Figure 1: Data compared between 2 Excel files – 10,000+ rows (for this instance)
Figure 2: Data compared between SQL and Excel
Figure 3: Deployed in Alteryx Gallery for scheduled run checks
Describe the benefits you have achieved:
Product Development time saved is more than 3+ days after this test was completed during each sprint. This workflow can be used across multiple sources check and has helped across the teams as well to compare between data sources and validate the changes. This “Data Check” workflow is repeatable and is being used in various other scenarios as well just having a direct impact on increased revenue and cost savings for the company. Customers are delighted with the quality of analytic products from the company.