Hi there,
I am fuzzy matching two data sources with city and country information. Source one having clean information and source two four columns with a lot of other garbage in it.
In a first step I indentified the country information out of four different columns of source two.
In a second step I want to fuzzy match the city information out of the very same four columns from source two but leverage the country information of the first step.
Therefore I setup the fuzzy match tool to conduct firstly a "Key Match Only" (Generate Keys: "Whole Filed (Case Insensitive)" and Match Function: "None - Key Match Only")
and secondly a fuzzy match on the city name.
My expectation was that fuzzy matches on the city name are only conducted if the country name is perfectly the same.
To my surprise this was only true for most of the cases but not ALL the cases.
How is that possible?
Cheers Tom
P.S. I tried to re create this in a simple example flow, but there everything works as expected :(
Solved! Go to Solution.
Hi Tom,
I think your confusion stems from the wording of the ZIP code type fuzzy matching included in the Alteryx sample workflows and demos. In those fuzzy examples, the ZIP code is used in "key match only" mode, and it is explicitly stated that it will math the whole ZIP exactly, then go on to the other fields and fuzzy.
The difference here is NOT because you're using for country the key match type of "Whole Field (Case Insensitive)" while ZIP code is using "Digits Only". The difference is in the maximum key length. If yours is left as default, then it will be 3 or 4 characters long. In some cases, as in the screenshot I've attached below, two countries may share the same first 3 or 4 phonetic key chars. Such examples of matching pairs are Austria and Australia, having a shared 3-char key of "AUS", as well as South Korea and Dem. Pppls. Rep of Korea , both of which share the word Korea, which would have the same key (If you check the option of Generate Keys for Each Word).
The "Whole Field" match type is misleadingly named. It kinda means that it will create keys from any characters in that field, be they letters, digits, special characters etc, but it won't necessarily use the entire length of the field. To do that, you must increase the "Maximum Key Length" to 10 or so. You might also want to uncheck the "Generate Keys for Each Word" option, only in this particular case.
Hope that is clear enough! Let me know if that solves it.
Hi hilton,
thank you for your reply!
Actually I wasn't aware of the ZIP code sample ;)
Since I selected the key match type "Whole Field (Case Insensitive)" my expectation was, that the key resamples the letters of the country name (which seems to be true in my simple test flow I made) therefore your suggested issue (having the same phonetic key chars) should not be the case. (Just for information: I am using the 2 digit country ISO codes not the full name).
Any other ideas?
Cheers Tom
"Applicant Line" being the source for a potential city informatin, "City Names" is a clean list of citiy names and the Country ISO codes from the two sources ....
Can you open the "Edit..." options dialog for your Country ISO Code field, and take a screenshot? Or otherwise, attach your workflow to this discussion if you can.
I suspect the key to answering this question lies in those settings.
Sure, I can insert the screenshotl I can post the flow as well, but only without data, which would than probably not that helpfull anymore ...
Acctually I found the issue: If the ID of source A and source B is both just numeric Alteryx mixes them up and it is impossible to extract them vom the "ID" and "ID2" which will be outputted by the Fuzzy Match. Now I understand why in the templates / examples they combine the ID number with the source name ;)
Ah, that would explain it. I never considered the ID fields, which have to have universally unique values. Thanks for closing this one out.
If you leave the country match away it works perfectly without universally unique values ...
Another interesting observation: If you define the ID from one stream between 1 and 1000 and for the other stream between 10000 and 11000 (being universally unique) the fuzzy match mixes them between the newly generated ID and ID2 fields. If you add the source like A1 to A1000 and B10000 to B11000 this does not happen ;)
That is interesting to hear, to say the least! I would not have expected it to behave that way.
In other words, what you're saying is that Alteryx doesn't maintain the integrity of the ID keys that are in the format of numbers only, in that it switches them around between ID and ID2? Thats not good news.
When you do that, with numbers only (1 to 1000, and 10000 to 11000), is your ID field a numerical type field, eg. int32, double etc? Or is it a String type? I always make sure to use String ID fields, which is maybe why I've never seen this happen before. Maybe Alteryx cannot handle ID fields in numerical types properly?
I configured the ID as string.
But since you raised that point I tried it with int32, and once of a sudden the ID's don't get mixed up anymore!
-> Using an ID consisting only out of numerics and setting it to string is obviously not a thing one should do!