Showcase your achievements in the Maveryx Community by submitting a Success Story now!
SUBMISSION INSTRUCTIONSThere are various instances where we do a lot to data comparison between two different sources – from a front-end result exported in excel to checking if the results match up from the MSSQL server database to checking between output files generated (2+ million rows). Manually comparing the data from the sources was extremely tedious and taking a toll on the product development time. But having built this Data compare workflow with Alteryx has been a blessing and it has helped the team catch data mismatches very early in the process and effectively improving the quality check efficiency 10-fold using this repeatable workflow.
Describe the business challenge or problem you needed to solve:
Initial business problem was validating the data from two different data sources. These sources can be anything from SQL Database, front-end Excel results or .CSV flat file. The main goal was to compare the datasets and easily be able to find errors early in the release process. This data comparison was previously done manually which would take more than 3 days of product development time. Since the file format is huge, sometimes 300 MB, Excel wouldn’t even open in most cases. Using an Alteryx workflow as an interface Application helped the team raise the bar and make effective checks early in the process.
Alteryx products used: mainly Alteryx Designer x64(2018.2). For this particular scenario, data format for the input files is Excel and a database table from Microsoft SQL Server Management Studio. SQL Table is constant and will change every year based on forecasting. Currently, this workflow is deployed in the Alteryx Gallery and can be run by the product team.
1.Input the Excel from the Alteryx workflow or the Gallery.
2. Select Tool and recordID Tool is used to identify each data from the source (since our data has more columns, For example If a patientID has 47 columns then it gets 47 rows with RecordID=1 using Transpose Tool).
3. Join is done based on the RecordID and Name (Rename of column names is done here as Expected and Actual).
4. Message Box displays any warning messages if there is a mismatch.
5. Union Tool combines the data from the 2 sources and aligns it.
6. Filter Tool- spits the results into two. If Actual=Expected then true path, otherwise, the False Path.
7. Results are exports in Excel.
Describe the benefits you have achieved:
Product Development time saved is more than 3+ days after this test was completed during each sprint. This workflow can be used across multiple sources check and has helped across the teams as well to compare between data sources and validate the changes. This “Data Check” workflow is repeatable and is being used in various other scenarios as well just having a direct impact on increased revenue and cost savings for the company. Customers are delighted with the quality of analytic products from the company.
Hi!
Could you please clarify what's the importance of the union tool right after join tool? Why the results can be different without the union tool?
Thanks in advance!
@fernandoscventura: Thanks for the question.
Main reason for using Union after the join is to use the Manually Configure Fields option to sort the column values accordingly.
This is essential to display the column values in order.
Hope it clarifies.
this look great and something Id find extremely useful I cant seem to download the workflow?
I am trying to build something similar to this, would you be able to provide the workflow please as I don't seem to be able to download it?
Many thanks
Sam
@samstar85 @abs1892 : I am not able to copy the workflow/Attach a workflow in the comments - there seems to be no option to add attachment or copy.
@kkumar if you send your workflow to us at advocacy@alteryx.com I can add it to your use case. Thanks
@AlteryxAdvocacy Thanks for the uploads. @samstar85 @abs1892 please find the attachment in the workflow.
Thank you
Thank you @kkumar
Have you tried this with large amounts of data ie million rows per spreadsheet would it take Alteryx long to run would it fall over?
@abs1892 : Yes, I have tried it. If sorting of the 2 data sources are in order and row counts are matching should be around 10 - 15 minutes since this usecase checks column by column.
I.e - If a row has 8 columns between the 2 data sources then totally 8 checks would be done for each row generating a recordID for each row to identify.
For this usecase I found its better to remove the Browse/Output file from the True Filter and concentrate only on the False part. Message icon is useful to provide details of mismatch on the flow.
Undoubtedly a very good approach. Thanks a lot for sharing the details!!!