
Alteryx Success Stories

Learn how Alteryx customers transform their organizations using data and analytics.

Matching 24 Million Records of Transaction Data

jacob_kahn
12 - Quasar

Overview of Use Case

In the accounting and greater financial industries, we often deal with big transactional data. In my very lengthy, extensive, very experienced, whole 3 years in the workforce, I might conclude that many industries utilize and work with transactional data on a weekly, if not daily, basis (please tell me you caught the sarcasm).

 

Transactional data, in the context of data analytics and data management, is the information recorded from transactions. It comprises purchases, returns, invoices, payments, credits, debits, trades, dividends, asset sales, contracts, interest, payroll, lending, reservations, signups, subscriptions, donations - and the list goes on and on. Transactional data touches almost every industry.

 

Thanks to the leaders of the Alteryx User Groups, who have brought all of these different industries together in one location - otherwise I would have had to do a lot of copy-and-pasting (or many searches on the Alteryx Community).

 
Describe the business challenge or problem you needed to solve

I was tasked with matching credit and debit transactions in a single dataset made up of 24 million records of data. The dataset did not include any identifier that could match debit transactions (sales) to credit transactions (returns). Each debit transaction with a corresponding credit transaction had to be removed from the data and isolated into a separate stream of records - without that single identifier, which, as mentioned, was missing! Even though one transaction is a sale and the second transaction is a return, both transactions are posted with “sales” values.

 

The legacy process instructed us to filter for high-count values and find their exact credit value. For example, if we saw that the sales value of $24 appeared 30 times, we would filter for records with a sales value of -$24, copy and paste the debit row with sales of $24 and the credit row of -$24 into a separate Excel spreadsheet, and remove the “matched” rows from the larger dataset. In some scenarios, multiple debit transactions can be “matched” with multiple credit transactions.

 

[GIF: the_jake_tool_7-1590528426082.gif]

 

 

Describe your working solution

Firstly, we unioned all 30 of our Excel workbooks of transactional data into a single .yxdb file (Alteryx Database File). This alone took my colleagues' breath away - 24 million records of data in a single file that can be opened on a basic corporate Lenovo machine and analyzed (using a single Browse tool) - GASP!  We also added a RecordID field numbering each record.
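For anyone who wants to see the shape of this step outside Designer, here is a rough Python/pandas sketch of the same idea. The file pattern, column names, and output file are hypothetical stand-ins - the real work was done with Alteryx tools, not code.

```python
import glob

import pandas as pd

# Union all of the transaction workbooks into one table.
# "transactions/*.xlsx" is a hypothetical path for the 30 source files.
files = sorted(glob.glob("transactions/*.xlsx"))
data = pd.concat((pd.read_excel(f) for f in files), ignore_index=True)

# Add a RecordID field numbering each record.
data["RecordID"] = range(1, len(data) + 1)

# Persist everything to one fast-loading file (a stand-in for the .yxdb).
data.to_parquet("transactions_all.parquet")
```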

 

Our workflow was actually pretty simple. Firstly, we split the data between sales (debits) and returns (credits) (e.g. we used a Filter tool isolating records with a sales value less than $0).

 

[Workflow screenshot: the_jake_tool_0-1590527843877.png]
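In pandas terms (continuing the sketch above, with a hypothetical "Sales" column for the transaction amount), that Filter step is just a boolean split:

```python
# Filter tool equivalent: credits are the records with a negative sales value.
credits = data[data["Sales"] < 0].copy()    # returns (credit transactions)
debits  = data[data["Sales"] >= 0].copy()   # sales (debit transactions)
```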

 

 

Afterwards, we simply took the absolute value of the negative sales value of the credit transactions (with a Formula tool) and used a Join tool to join the debit transactions with the absolute value of the credit transactions.

 

[Workflow screenshot: the_jake_tool_1-1590527843777.png]
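Continuing the same sketch, the Formula + Join combination boils down to an absolute value and a merge on it:

```python
# Formula tool equivalent: absolute value of the negative credit amounts.
credits["AbsSales"] = credits["Sales"].abs()

# Join tool equivalent: debit sales value joined to the credit absolute value.
matched = debits.merge(
    credits,
    left_on="Sales",
    right_on="AbsSales",
    suffixes=("", "_Right"),   # overlapping right-hand columns get a _Right suffix
)
```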

 

 

Side note (if you are already bored, skip this)

In my opinion, it’s not crucial to understand this paragraph to get the point of this article; it only covers how we isolated the records that “match”, which matters because the Join tool appends data side by side rather than stacking it.

[GIF: the_jake_tool_6-1590528223480.gif]

 

Because the Join tool places one dataset next to the other, we simply deselected all fields except for the RecordID and Right_RecordID, so that the only fields coming out of the Join output were the RecordID and Right_RecordID. We then used two other Join tools to isolate, from the large dataset, the records that matched based on their RecordIDs. This is a complex concept and needs a worked example to make it clear.
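Sketched in the same hypothetical pandas terms as above (the "_Right" naming mirrors Alteryx's Right_ prefix, but the exact field names are just illustrative), the isolation step looks like this:

```python
# Keep only the two ID fields coming out of the join, like deselecting
# every other field in the Join tool configuration.
pairs = matched[["RecordID", "RecordID_Right"]]

# Every RecordID (debit or credit) that took part in a match...
matched_ids = pd.concat([pairs["RecordID"], pairs["RecordID_Right"]]).unique()

# ...then two more "joins" against the full dataset: matched records on one
# side, everything that never matched on the other.
matched_rows   = data[data["RecordID"].isin(matched_ids)]
unmatched_rows = data[~data["RecordID"].isin(matched_ids)]
```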

 


 

We were very excited about our solution, but then we hit our biggest challenge.

 

We had solved the problem of “matching” debit transactions to credit transactions, but due to the functionality of the Join tool, we were duplicating data in instances where we had two debit transactions with the same sales value but only one corresponding credit transaction. The single corresponding credit transaction would join with both debit transactions, multiplying the credit transaction by two - even though it exists only once, not twice.

 

[Workflow screenshot: the_jake_tool_2-1590527843780.png]
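A tiny, self-contained pandas example (all values made up) shows the duplication:

```python
import pandas as pd

# Two $24 debits, but only one -$24 credit.
debits  = pd.DataFrame({"RecordID": [1, 2], "Sales": [24, 24]})
credits = pd.DataFrame({"RecordID": [3], "Sales": [-24]})
credits["AbsSales"] = credits["Sales"].abs()

# A plain join on the value pairs the single credit with BOTH debits.
dup = debits.merge(credits, left_on="Sales", right_on="AbsSales",
                   suffixes=("", "_Right"))
print(dup)   # two rows - credit RecordID 3 now appears twice
```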

 

 

After a night sleeping on the challenge, we came up with the following solution. Simply, we used a Multi-Row Formula tool to add a RecordID field, grouped by the sales value, in both the debit transaction stream and the credit transaction stream.
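In the pandas sketch, that grouped RecordID is just a cumulative count within each sales value (I'm calling the field GroupSeq here; the name is mine, not the workflow's):

```python
# Multi-Row Formula equivalent: number each row within its sales-value group,
# so the first $24 debit is 1, the second $24 debit is 2, and so on.
debits["GroupSeq"]  = debits.groupby("Sales").cumcount() + 1
credits["GroupSeq"] = credits.groupby("AbsSales").cumcount() + 1
```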

 

We then used the Join tool again; however, we changed the configuration of the Join tool:

Left Input (debit transactions): Sales Value, Grouped_by RecordID

Right Input (credit transactions): Absolute value of Sales Value, Grouped_by RecordID

 

Therefore, the records not only have to join on the sales value, but also on the number of times that specific sales value actually appears in the data.

 

[Workflow screenshot: the_jake_tool_3-1590527843878.png]
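Sketched the same way as before, the reconfigured join simply uses both fields as keys, so each credit can pair with at most one debit:

```python
# Join on the sales value AND the within-group sequence number, so the
# nth $24 debit can only match the nth -$24 credit - no more duplication.
matched = debits.merge(
    credits,
    left_on=["Sales", "GroupSeq"],
    right_on=["AbsSales", "GroupSeq"],
    suffixes=("", "_Right"),
)
```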

 


 

 
Describe the benefits you have achieved

What was there to gain?

Firstly, we were lacking accuracy and completeness. Due to the data limits in Excel, we were matching transactions within single workbooks, each holding only a piece of the entire dataset, rather than against the dataset in its entirety of 24 million rows of data.

 

Secondly, analyzing 24 million rows of data could take multiple employees weeks - and hours just to open a single file on our computers - ain’t nobody got enough memory for that!

 

Lastly, human error sometimes resulted in the duplication of transactions (e.g. forgetting to delete a transaction from one dataset; accidentally pasting a transaction multiple times in a separate dataset).

 

Oh did I mention... I wasn’t going to last long at my job matching 24 million records with each other...HA!

 

[GIF: the_jake_tool_5-1590528133650.gif]

 

 

 

Conclusion

With our 24 million records of data, our workflow ran in less than 6 minutes - and that is with only 8 GB of memory on my 2019 Lenovo X1.

 

What was even more surprising was realizing that my colleagues were still performing the same process described above for smaller datasets without a matching identifier. Since then, we've expanded our development into an analytic application that can be used with data in several formats and shared across our corporate environment.

 

We were able to isolate all debit transactions and matched credit transactions with the click of a button, without any single identifier within our dataset. 

 

Lastly, to avoid making this article more complex than it already is, please ignore what our process does with transactions that have a $0 sales value. Likewise, we can ignore the possibility of credit transactions existing without any corresponding debit transactions.

 


 

Please Like and Comment if you enjoyed or have feedback to share.

Also, please feel comfortable reaching out through the Alteryx Community, my Instagram (attached to my profile if you hover over my icon), or my LinkedIn (linked similarly).

 

Please enjoy and be safe,

 

Comments
kevin025
8 - Asteroid

24 million records under a 6-minute run-time is crazy.

BalOates
7 - Meteor

Really nicely written article - has given me some ideas on approaching similar problems. Love the use of the Multi-Row Formula tool.

abby123jpmc
5 - Atom

This was extremely useful. Thank you for this post

DanTh
Alteryx

Great post, thanks!