ALTERYX INSPIRE | Join us this May for for a multi-day virtual analytics + data science experience like no other! Register Now

Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Need help with fuzzy match to try and match "duplicate" invoices

8 - Asteroid

Hi all!


I am new to the world of Alteryx and am constantly facing challenges!


My current challenge is in respect to using the fuzzy match, i am trying to use it to find matching (or similar) invoice numbers. I am matching on the RecordID, 1-10 to 51-60.


i seem to be able to get the exact matches just fine, but some of my data has truncated numbers, and maybe some transposed numbers in there as well. Basically, if the numbers look kind of similar at all, i want Alteryx to flag them. Here is kind of what my data looks like:


RecordIDReference (Inv# + Store#)Assignment (Bottler Inv#)Reference Key 1 (Not Sure??)Reference Key 2 (Store#)Amount in local currencyReason codeDocument NumberSource 
19469061010003 6396841.41 1816642261800
2497170300009 176881,290.24 1816642671800
399904866103 3423257.15 1816528631800
4950002130 342373.02 1816578301800
59503572131 11494751.28 1816606561800
63180071213 18007102.00 1814383191800
7179413811794 13801626.16 1816570651800
89531061011795 136891,211.98 1816571711800
91804081231804 8123343.20 1816570671800
101859138011859 13801128.70 1816589261800
519496061019496 61011,061.68 1816521441400
529497170309497 170301,679.44 1816522541400
539498129789498 129781,679.44 1816517961400
549499048669499 48661,679.44 1816528121400
559500023769500 23761,610.80 1816532091400
569501104319501 104311,576.48 1816539911400
579502135729502 135721,679.44 1816526691400
589503065589503 65581,576.48 1816559911400
599504128849504 128841,568.94 1816551301400
609531061019531 6101411.84 1816522511400


My only technique is to remove the first digit from from the "Reference (Inv# + Store#)" with the 1400 source after running a fuzzy match and doing it again and deleting another digit. 




Hopefully this makes sense, any help would be greatly appreciated!!

5 - Atom



40% is a very low threshold to run on a first pass, I think. What is your key length? Can you post the Match Options configuration screen? It's not looking like your data is actually complete enough to make any meaningful matches. Can you also explain why you're looking for such a nebulous match?




8 - Asteroid

Hi John,


Thanks for the reply! 


I started at the default threshold and i got 4 matches from my original data. these were all exact matches but i was hoping for Alteryx to be able to look at the numbers and at least give me a number/percentage to say that it could, maybe be a match.


I have different data from customers than what is in our work systems and some of it is hand keyed so there are always some errors. Other times the system may take off some of the first digits.


These are all invoice numbers, many of them have a consistent length. The largest ones are 15 characters (mostly digits, sometimes some letters in there), down to about 8 characters. the data has about 100,000 lines of data or so, so the idea is to find some that are a close match and then manually go over them to determine if they actually are.


Yesterday was my first attempt using the fuzzy match, so I am really unsure how to use it, it seems better suited for more consistent data like names, addresses, and phone numbers.



Thanks! --




5 - Atom



A couple things jump out to me. First of all you're not using a matching algorithm at all, so you're only able to get exact key matches. Second, you're not generating any keys, and since you're using a length longer than your longest number, it becomes a standard join. Third, Double Metaphone with digits tries to "pronounce" the "word" and then matches the numbers. Since all you have is numbers, it also won't help you.


I took your sample and ran a Character Jaro Distance and only got one match with a key length of 4, but because of the nature of this dataset I doubt fuzzy will get you a great result.  Because the numbers are in a block, I think that Fuzzy Match may be less powerful than other methods. It's a bit more art than science, so there's always some tuning, but your use case probably has a better way to associate the data.


My first thought is to look for corroborating data that could be used to associate the numbers, and find a join method appropriate for whatever that data is. Perhaps names, products, service codes, city/state/zip or a combination of those are available. If it were me, I'd look for more metadata on each of these to try to join on.


If you can, for example, just get the location code isolated and convert it to VString then you could do a Jaro fuzzy to see if there were transposed characters ('41223' typed in as '42123') or substitution characters typed in erroneously ('41223' is typed in as '41222'). It may also help to read a bit about how the fuzzy algorithms work, because that can inform your decision a ton.


Good luck!