In accounting and the greater financial industry, we often deal with big transactional data. In my very lengthy, extensive, very experienced, whole three years in the workforce, I might conclude that many industries work with transactional data on a weekly, if not daily, basis (please tell me you caught my sarcasm).
Transactional data, in the context of data analytics and data management, is the information recorded from transactions (P.S. Is transactional even a word?).
Transactional data can be created by, and comprises, purchases, returns, invoices, payments, credits, debits, trades, dividends, asset sales, contracts, interest, payroll, lending, reservations, signups, subscriptions, donations - and the list goes on and on. Transactional data touches almost every industry. Thanks to the leaders of the Alteryx User Groups, who have converged all of these different industries into one location - otherwise I would have had to do a lot of copy-and-pasting (or many searches on the Alteryx Community).
Introduction - the challenge
I was tasked with matching credit and debit transactions in a single dataset of 24 million records. The dataset did not include any identifier that might match debit transactions (sales) with credit transactions (returns). Each debit transaction with a corresponding credit transaction had to be removed from the data and isolated into a separate stream of records - all without that single identifier, which, as mentioned, was missing! Even though one transaction is a sale and the second is a return, both transactions are posted with “sales” values.
The legacy process instructed us to filter for high-count values and find their exact credit value. For example, if we saw that the sales value of $24 appeared 30 times, we would filter for records with a sales value of -$24, copy and paste the debit row with sales of $24 and the credit row of -$24 into a separate Excel spreadsheet, and remove the “matched” rows from the larger dataset. In some scenarios, multiple debit transactions can be “matched” with multiple credit transactions.
What was there to gain?
Firstly, we were lacking accuracy and completeness. Due to the row limits in Excel, we were matching transactions within single workbooks, each containing only a piece of the data, rather than against the dataset in its entirety of 24 million rows.
Secondly, the analysis of 24 million rows of data could take multiple employees weeks to analyze - let alone hours to even open up a single file on our computer - ain’t nobody got enough memory for that!
Lastly, human error sometimes resulted in the duplication of transactions (e.g. forgetting to delete a transaction from one dataset; accidentally pasting a transaction multiple times in a separate dataset).
Oh, did I mention...? I wasn't going to last long at my job matching 24 million records with each other... HA!
Firstly, we unioned all 30 of our Excel workbooks with transactional data into a single .yxdb file (Alteryx Database File). This alone took the breath away from my colleagues - 24 million records of data in a single file that can be opened on a basic corporate Lenovo machine, and analyzed (using a single Browse tool) - GASP! We also added a RecordID field numbering each record.
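To make the union step concrete, here is a minimal pandas sketch of the same idea. The two small DataFrames are hypothetical stand-ins for the 30 workbooks; `pd.concat` plays the role of the Union tool, and the inserted column plays the role of the Record ID tool.

```python
import pandas as pd

# Hypothetical stand-ins for the 30 Excel workbooks (two shown here).
wb1 = pd.DataFrame({"Sales": [24.0, -24.0]})
wb2 = pd.DataFrame({"Sales": [50.0]})

# Union the workbooks into one dataset (Union tool equivalent).
combined = pd.concat([wb1, wb2], ignore_index=True)

# Number every record, like the Record ID tool.
combined.insert(0, "RecordID", range(1, len(combined) + 1))
```

The sequential `RecordID` is what later lets us pull matched rows back out of the full dataset.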
Our workflow was actually pretty simple. First, we split the data between sales (debits) and returns (credits) - e.g., we used a Filter tool isolating records with a sales value less than $0.
Afterwards, we simply took the absolute value of the negative sales value on the credit transactions (with a Formula tool) and used a Join tool to join the debit transactions with the absolute value of the credit transactions.
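The filter-abs-join sequence above can be sketched in pandas. This is not the workflow itself, just an illustration on hypothetical data: the boolean filters stand in for the Filter tool, `.abs()` for the Formula tool, and `merge` for the Join tool's "J" output.

```python
import pandas as pd

# Hypothetical sample data; the real input had 24 million rows.
df = pd.DataFrame({
    "RecordID": [1, 2, 3, 4],
    "Sales": [24.0, -24.0, 50.0, 10.0],
})

# Split into debits (sales > 0) and credits (sales < 0) - Filter tool.
debits = df[df["Sales"] > 0]
credits = df[df["Sales"] < 0].copy()

# Take the absolute value of the credit amounts - Formula tool.
credits["AbsSales"] = credits["Sales"].abs()

# Join debits to credits on the amount - Join tool.
matched = debits.merge(
    credits, left_on="Sales", right_on="AbsSales",
    suffixes=("", "_Right"),
)
```

Here the $24 debit (RecordID 1) pairs with the -$24 credit (RecordID 2), while the $50 and $10 debits fall out of the join unmatched.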
Side note (if you are already bored, skip this)
In my opinion, this paragraph isn't crucial to the point of the article; it only relates to isolating the records that “match,” given that the Join tool appends data side by side rather than stacking it.
Because the Join tool places one dataset next to the other, we simply deselected all fields except for RecordID and Right_RecordID, so that the only fields coming out of the Join output would be those two. We then used two other Join tools to isolate, from the large dataset, the records that matched based on their RecordIDs. This is a complex concept, and a fuller example would be needed to make the subject completely clear.
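A rough pandas analogue of that isolation step, on hypothetical data: given the pairs of matched RecordIDs, split the full dataset into matched and unmatched streams (the role the two extra Join tools play in the workflow).

```python
import pandas as pd

# Hypothetical full dataset and matched ID pairs from the join step.
full = pd.DataFrame({
    "RecordID": [1, 2, 3, 4],
    "Sales": [24.0, -24.0, 50.0, 10.0],
})
# Output of the amount join: only the two RecordID fields were kept.
pairs = pd.DataFrame({"RecordID": [1], "Right_RecordID": [2]})

# Every RecordID that took part in a match, from either side.
matched_ids = set(pairs["RecordID"]) | set(pairs["Right_RecordID"])

# Split the full dataset: matched records vs. everything else.
matched = full[full["RecordID"].isin(matched_ids)]
unmatched = full[~full["RecordID"].isin(matched_ids)]
```

The matched stream carries both sides of each pair, with all their original fields intact.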
The solution - a roadblock - THE MEAT AND POTATOES OF THIS ARTICLE
We were very excited about our solution, but we faced our biggest challenge.
We had solved how to “match” debit and credit transactions, but due to the functionality of the Join tool, we were duplicating data in instances where we might have had two debit transactions with the same sales value but only one corresponding credit transaction. The single corresponding credit transaction would join with both debit transactions, multiplying the credit transaction by two - even though it exists only once, not twice.
After a night of sleeping on the challenge, we came up with the following solution.
Simply put, we used a Multi-Row Formula tool to add a RecordID field, grouped by the sales value, in both the debit transaction stream and the credit transaction stream.
We then used the Join tool again; however, we changed the configuration of the Join tool:
Left input: debit transactions, joined on the Sales Value and the grouped RecordID
Right input: credit transactions, joined on the Absolute Value of the Sales Value and the grouped RecordID
Therefore, not only do the records have to join on the sales value, but also on the amount of times that the specific sales value actually exists in the data.
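The same two-key join can be sketched in pandas on hypothetical data. `groupby(...).cumcount()` stands in for the Multi-Row Formula tool's grouped sequence number, so the lone -$24 credit pairs with only the *first* $24 debit instead of both.

```python
import pandas as pd

# Hypothetical data: two $24 debits but only one $24 credit.
df = pd.DataFrame({
    "RecordID": [1, 2, 3, 4],
    "Sales": [24.0, 24.0, -24.0, 50.0],
})

debits = df[df["Sales"] > 0].copy()
credits = df[df["Sales"] < 0].copy()
credits["AbsSales"] = credits["Sales"].abs()

# Multi-Row Formula equivalent: a 1, 2, 3... sequence per amount.
debits["Seq"] = debits.groupby("Sales").cumcount() + 1
credits["SeqR"] = credits.groupby("AbsSales").cumcount() + 1

# Join on the amount AND its occurrence number.
matched = debits.merge(
    credits,
    left_on=["Sales", "Seq"], right_on=["AbsSales", "SeqR"],
    suffixes=("", "_Right"),
)
```

Only debit RecordID 1 ($24, occurrence 1) joins with credit RecordID 3 ($24, occurrence 1); the second $24 debit has no second credit to pair with, so no duplication occurs.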
With our 24 million records of data, our workflow ran in less than 6 minutes - and that is with only 8GB of memory on my Lenovo X1 - 2019.
What was even more surprising was realizing that my colleagues were still performing the same process described above for smaller datasets without a matching identifier. Since then, we've expanded our development into an analytic application that can be used with data in several formats and shared across our corporate environment.
We were able to isolate all debit transactions and matched credit transactions with the click of a button, without any single identifier within our dataset.
Lastly, to avoid making this article more complex than it already is, please ignore how our process handles transactions with a $0 sales value. Likewise, we can ignore the possibility of credit transactions without any corresponding debit transactions.
Please Like and Comment if you enjoyed or have feedback to share.
Also, please feel comfortable reaching out through Alteryx Community, My Instagram (attached to my profile if you hover over my icon), or my LinkedIn (linked similarly).
Is this similar to an approach of grouping the transactions into their absolute values via a summarize tool? So we have 15, 10, 5 transactions for $30, $40, $50 respectively on the credit side. We have 12, 9, 3 transactions for $30, $40, $50 on the debit side. You then subtract the counts of the like amounts for a (15-12), (10-9), (5-3) total counts. So there would be a net credit balance of $30*3, $40*1, $50*2 or $230 total.
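The commenter's netting arithmetic can be verified with a few lines of plain Python (the counts and amounts are the hypothetical figures from the comment, not data from the article):

```python
# Counts of transactions per absolute amount, per the comment.
credit_counts = {30: 15, 40: 10, 50: 5}
debit_counts = {30: 12, 40: 9, 50: 3}

# Net count per amount, then the net credit balance.
net_counts = {amt: credit_counts[amt] - debit_counts[amt]
              for amt in credit_counts}
net_balance = sum(amt * n for amt, n in net_counts.items())
```

This reproduces the comment's totals: net counts of 3, 1, and 2 for $30, $40, and $50, giving a $230 net credit balance.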
Actually - and I can probably make this clearer in the post - this process was done for a complex tax compliance process that requires us to perform the analysis transaction by transaction.
It definitely would have been easier to just perform the math on those transactions that don't net out, but that would defeat the purpose of this entire process - ultimately the positives and negatives do negate each other to zero.
Rather, we had to retain the specific information for each transaction and analyze the dataset record by record.
Hope this clarifies why we took this route rather than just isolating what might not net to zero.
I see - so in theory this is a list of purchases and returns, so the debits are going to match the credits (assuming the listing is complete and error-free). To your point, there is no identification number. Therefore, would the amounts be the only thing allowing you to determine whether the transactions are related (unless there is a product dimension or something)? If so, you could still group/join by amounts and then assign each joined debit and credit a record ID.
Ex-auditor here and I can tell you I never enjoyed auditing returns/allowances.
If so, you could still group/join by amounts and then assign each joined debit and credit a record ID.
To this point, yes - that is what we did initially; but, as I described in the post, when there may be three debit transactions with one value and only two corresponding credit transactions with that same value, we end up multiplying the data many, many times.
Therefore would the only thing allowing you to determine if the transactions are the related amounts
To this point - YES! I've actually seen a few posts in the designer discussion section with issues relating to matching data without a common identifier, or with values that repeat.
In the actual workflow that we built, we match on two values - we match on Sales and on Cost. However, we initially started with matching only on Sales, and therefore kept it that simple in this article.