Identifying and removing duplicates / contras where there is a close match but not exact
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi peeps
As per title, my process at the moment is convoluted to try and remove obvious reversals like when it contains reverses and then summarise but then i lose some useful detail. Any suggestions on how I might be able to do this more efficiently, perhaps with a fuzzy match combined? At present, it leaves c6000 lines of possible contras that need to be manually reviewed so if i could find a better way of doing this it would really help...
Here is a sample of contras that I have pulled together which was identified by using the abs, sort and multirow function and identifying them as possible contras. The problem is that it still leaves several false positives which requires the manual intervention referenced above. If they genuinely do need to be contra'd off then the description would be very similar, perhaps with the addition of some words (e.g. Reverses) - I have tried to manually manipulate those instances by summarising on amounts and line description when the amounts do equal zero bu it loses some detail so just wondered if there is a different way of achieving this...guidance and help appreciated.
Thanks in advance...
COMPANY | COUNTRY | GL | IC | LINE_DESCRIPTION | ENTERED_CURRENCY | FUNCTIONAL_CURRENCY | net balance |
3150 | ES | 50003330 | 3151 | Banco DA_ES_425 -February 19 | EUR | EUR | 25000.00 |
3150 | ES | 50003330 | 3151 | Banco DA_ES_425 -February 19 - correction | EUR | EUR | -25000.00 |
8002 | 15 | 82004500 | 4100 | 131896 - 01-Nov-2018 to 30-Nov-2018 | USD | USD | 8.57 |
8002 | 15 | 82004500 | 4100 | 131896 - 01-Nov-2018 to 30-Nov-2018 - Back dated Bnk Acc | USD | USD | -8.57 |
8002 | 15 | 82007300 | 1500 | 10042 - Translation: Entry 819888 | USD | USD | -2.65 |
8002 | 15 | 82007300 | 1500 | 10042 - Translation: Entry 819888 - DEAL - 828549 | USD | USD | 2.65 |
8002 | 15 | 82004100 | 3350 | 123623 - 01-Apr-2012 to 30-Apr-2012 | USD | USD | -0.05 |
8002 | 15 | 82004100 | 3350 | 123623 - 01-Apr-2012 to 30-Apr-2012 - Back dated Bnk Acc t | USD | USD | 0.05 |
8000 | 15 | 82007350 | 8002 | 133355 - Reval 31-Jan-2019 | USD | USD | -119767.87 |
8000 | 15 | 82007350 | 8002 | 133355 - Reval Maturity Reversal | USD | USD | 119767.87 |
2000 | GB | 79009260 | 8537 | RechargeIN: GPL May (370) | GBP | GBP | 109166.78 |
2000 | GB | 79009260 | 8537 | RechargeIN: GPL May (370) acc. Correction | GBP | GBP | -109166.78 |
2000 | GB | 79009260 | 8537 | RechargeIN: GPL May (370) | GBP | GBP | 109166.78 |
1000 | US | 80001100 | 8000 | Less: Recharge of management activities costs for quarter three of the financial year ended 31 March 2019 in accordance with previously agreed arrangements | USD | USD | 1000.00 |
1000 | US | 80001100 | 8000 | Recharge of management activities costs for quarter three of the financial year ended 31 March 2019 in accordance with previously agreed arrangements | USD | USD | -1000.00 |
Solved! Go to Solution.
- Labels:
- Fuzzy Match
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @ck2024
It would be helpful if you could also share what the correct output would look like based on your sample data - I am unclear which of these examples are supposed to be contras and which are the false positives you refer to.
Would you also be able to explain where detail is being lost? I am assuming that you are referring to this happening because transactions are being incorrectly removed as false positives. Or is there something else which is causing loss of detail?
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Thanks for your response, So what might happen is that there will be three transaction lines with similar details two of which would cancel each one out, leaving just one behind that I need to include. There is then plenty of other lines of data that were not identified as possible contras and also taken into account. For the purposes of this quesiton I haven't included those here. I've added a few more examples below too,
COMPANY | COUNTRY | GL | IC | LINE_DESCRIPTION | ENTERED_CURRENCY | FUNCTIONAL_CURRENCY | net balance |
3150 | ES | 50003330 | 3151 | Banco DA_ES_425 -February 19 | EUR | EUR | 25000.00 |
3150 | ES | 50003330 | 3151 | Banco DA_ES_425 -February 19 - correction | EUR | EUR | -25000.00 |
8002 | 15 | 82004500 | 4100 | 131896 - 01-Nov-2018 to 30-Nov-2018 | USD | USD | 8.57 |
8002 | 15 | 82004500 | 4100 | 131896 - 01-Nov-2018 to 30-Nov-2018 - Back dated Bnk Acc | USD | USD | -8.57 |
8002 | 15 | 82007300 | 1500 | 10042 - Translation: Entry 819888 | USD | USD | -2.65 |
8002 | 15 | 82007300 | 1500 | 10042 - Translation: Entry 819888 - DEAL - 828549 | USD | USD | 2.65 |
8002 | 15 | 82004100 | 3350 | 123623 - 01-Apr-2012 to 30-Apr-2012 | USD | USD | -0.05 |
8002 | 15 | 82004100 | 3350 | 123623 - 01-Apr-2012 to 30-Apr-2012 - Back dated Bnk Acc t | USD | USD | 0.05 |
8000 | 15 | 82007350 | 8002 | 133355 - Reval 31-Jan-2019 | USD | USD | -119767.87 |
8000 | 15 | 82007350 | 8002 | 133355 - Reval Maturity Reversal | USD | USD | 119767.87 |
8000 | 15 | 82007350 | 8002 | 133355 - Reval Reposted | USD | USD | 119767.87 |
2000 | GB | 79009260 | 8537 | RechargeIN: GPL May (370) | GBP | GBP | 109166.78 |
2000 | GB | 79009260 | 8537 | RechargeIN: GPL May (370) acc. Correction | GBP | GBP | -109166.78 |
2000 | GB | 79009260 | 8537 | RechargeIN: GPL May (370) redone | GBP | GBP | 109166.78 |
1000 | US | 80001100 | 8000 | Less: Recharge of management activities costs for quarter three of the financial year ended 31 March 2019 in accordance with previously agreed arrangements | USD | USD | 1000.00 |
1000 | US | 80001100 | 8000 | Recharge of management activities costs for quarter three of the financial year ended 31 March 2019 in accordance with previously agreed arrangements | USD | USD | -1000.00 |
1000 | US | 82004400 | 8548 | Reverse JV#3194365 recurring Interests of April | USD | USD | 36946090.72 |
1000 | US | 82004400 | 8548 | Reverse Recurring JV#3158965 Interests of March 2018 | USD | USD | -36946090.72 |
1000 | US | 82004400 | 8548 | JV#3158965 Interests of March 2018 | USD | USD | 36946090.72 |
When "Alteryxed", the output would just leave these that would need to be fed into the other data that wasn't identified as contra'd data (or perhaps I'd just run all the data through the solution).
COMPANY | COUNTRY | GL | IC | LINE_DESCRIPTION | ENTERED_CURRENCY | FUNCTIONAL_CURRENCY | net balance |
8000 | 15 | 82007350 | 8002 | 133355 - Reval Reposted | USD | USD | 119767.87 |
2000 | GB | 79009260 | 8537 | RechargeIN: GPL May (370) redone | GBP | GBP | 109166.78 |
1000 | US | 82004400 | 8548 | JV#3158965 Interests of March 2018 | USD | USD | 36946090.72 |
As all the others cancel one another out. I thought about doing a row count and if it is an equal number and the sum of the two(four/six etc) values equals zero then its a strong indicator that they contra one another and can be removed from the data set, just unsure how to really get that output when the descriptions vary in some way.
Thanks!
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @ck2024
Thanks for the extra info - this is much clearer now.
I have built a workflow for you, sample attached:
As you suggested, I used Fuzzy Match to find similar Line Descriptions. It is an interesting tool - the configuration options I chose work on your sample data, but you may need to play around with the different options for best performance on your full dataset.
The one notable difference is that for the row "Reverse JV#3194365 recurring Interests of April" and its two corresponding records, the Fuzzy Match tool paired the latter two (which both mention March 2018), and did not pair with the one which mentions April. This looks like the correct interpretation to me, however if you wanted all of these three to be a match, you could exclude dates and month names etc. from your data to avoid this causing mismatches.
I then used the Make Group tool to simplify these into groups of matching records.
The final step was to decide on some rules for how to identify and exclude contras within those groups. You may want to adjust this approach, with your greater understanding of your data.
I used a two-step process. First I summarised per group to find ones which summed to 0, and assumed that these were contras with no 'real' records. Then I used a Multi-Row Formula to find the first two rows of each group which cancel each other out. I sorted by the original order of the data you sent me, but you may want to sort by date.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi
Thanks so much for the detailed explanation. I did indeed have to modify the characteristics of your suggestions but it get me to where I wanted to be in the end which i would have struggled to do otherwise, so that you for your help.