This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
A dedicated space to share resources, connect with like-minded data workers, and discuss industry specific analytic challenges + solutions.
Right now, the power to solve new global challenges across industries, is at your fingertips, no matter where you're working from. Create a new topic or reply to an existing thread to share your perspective.
In the accounting and greater financial industries, we often deal with big transactional data. In my very lengthy, extensive, very experienced, whole 3-years in the workforce, I might conclude that many industries utilize and work with transactional data on a weekly, if not daily basis (please have gotten my sarcasm).
Transactional data, in the context of data analytics and data management, is the information recorded from transactions (P.S. Is transactional even a word?).
Transactional data can be created by and comprises purchases, returns, invoices, payments, credits, debits, trades, dividends, asset sales, contracts, interest, payroll, lending, reservations, signups, subscriptions, donations, and the list goes on-and-on. Transactional data touches almost every industry. Thanks to the leaders of Alteryx User Groups who have converged all of the different industries into one location - I would have had to do a lot of copy-and-pasting (or many searches on the Alteryx community to have it posted itself).
Introduction - the challenge
I was tasked with matching credit and debit transactions in a single dataset made up of 24 million records of data. The dataset did not include any identifier that might match debit transactions (sales), and credit transactions (returns). Each debit transaction with a corresponding credit transaction has to be removed from the data, and isolated into a separate stream of records without that single identifier already mentioned missing! Even though one transaction is a sale, and the second transaction is a return, both transactions are posted with “sales” values.
The legacy process instructed us to filter for high count values, and find their exact credit value. For example, if we saw that the sales value of $24 appeared 30 times, we would filter for records with a sales value of -$24, copy and paste the debit row with sales of $24 and the credit row of -$24 into a separate excel spreadsheet, and remove the “matched” rows from the larger dataset. In some scenarios, multiple debit transactions can be “matched” with multiple credit transactions.
What was to gain?
Firstly, we were lacking accuracy and completeness of the task. Due to the data limits in excel, we were matching transactions within single workbooks consisting of a piece of the entire dataset, rather than against the dataset in its entirety of 24 million rows of data.
Secondly, the analysis of 24 million rows of data could take multiple employees weeks to analyze - let alone hours to even open up a single file on our computer - ain’t nobody got enough memory for that!
Lastly, human error sometimes resulted in the duplication of transactions (e.g. forgetting to delete a transaction from one dataset; accidentally pasting a transaction multiple times in a separate dataset).
Oh did I mention…?...I wasn’t going to last long at my job matching 24 million records with each other...HA!
Firstly, we unioned all of our 30 excel workbooks with transactional data into a single .yxdb file (Alteryx Database File). This alone took the breath away from my colleagues - 24 million records of data in a single file that can be opened on a basic corporate Lenovo machine, and analyzed (using a single Browse tool) - GASP! We also added a RecordID field numbering each record.
Our workflow was actually pretty simple. Firstly, we split the data between sales (debits) and returns (credits) (e.g. we used a Filter tool isolating records with a sales value less then $0).
Afterwards, we simply took the absolute value of the negative sales value of the credit transactions (with a Formula tool) and used a Join tool to join the debit transactions with the absolute value of the credit transactions.
Side note (if you are already bored, skip this)
In my opinion, it’s not necessarily crucial to understand this paragraph to get the point of this article as it relates to just isolating the data of the records that “match” due to the fact that the Join tool appends data rather than data stacked.
Because the Join tool places one dataset next to the other. We simply deselected all fields except for the RecordID and Right_RecordID so that the only fields coming out of the Join Output would be the RecordID and Right_RecordID. We use two other Join tools to isolate from the large dataset the records that matched based on their RecordID’s. This is a complex concept and requires an extensive example to make the subject clear.
The solution - a roadblock - THE MEAT AND POTATOES OF THIS ARTICLE
We were very excited about our solution, but we faced our biggest challenge.
We solved the ability to “match” debit transactions and credit transactions, but due to the functionality of the Join tool, we were duplicating data in instances where we might have had two debit transactions with the same sales value, but only one corresponding credit transaction. The single corresponding credit transaction would join with both debit transactions, multiplying the credit transaction by two - it only exists once, not twice.
After a night sleeping on the challenge we came up with the following solution.
Simply, we used a Multi-Row Formula tool to add a RecordID field grouped by the sales value in both the debit transaction stream, and credit transaction stream.
We then used the Join tool again, however, we changed our configuration of our join tool:
Debit transactions (Left Input)
Credit Transactions (Right Input)
Absolute value of Sales Value
Therefore, not only do the records have to join on the sales value, but also on the amount of times that the specific sales value actually exists in the data.
With our 24 million records of data, our workflow ran in less than 6 minutes - and that is with only 8GB of memory on my Lenovo X1 - 2019.
What was even more surprising was when I realized that my colleagues were still performing the same above mentioned process for smaller datasets without a matching identifier. Since, we've expanded our development into an analytical application that can be used with data in several formats and shared across our corporate environment.
We were able to isolate all debit transactions and matched credit transactions with the click of a button, without any single identifier within our dataset.
Lastly, for purposes of not making this article more complex than it already is, please ignore what was done in our process with transactions with a $0 sales value. As well, we can ignore the possibility of existing credit transactions without any corresponding debit transactions.
Please Like and Comment if you enjoyed or have feedback to share.
Also, please feel comfortable reaching out through Alteryx Community, My Instagram (attached to my profile if you hover over my icon), or my LinkedIn (linked similarly).