Today's teaser is a good one. I've got two data sets which I'm matching by address. It works well, but 1/3 of the addresses don't match due to some inconsistencies in the address format. Simple things are throwing the match off.
I don't think fuzzy match will work in this case, since the matches are not one-to-one. E.g. John Smith and his address appear once in one database, but John Smith might appear 3 times in the other because he has three different products. (if I'm wrong, let me know!)
The main issue is that I have an address field which isn't broken down into street number, street prefix, street name, etc, - it's all one field. I'm trying to parse out the address into individual fields, which is the first part of my problem.
224 Belmont Street APT 220
225 N Belmont St 220
4 Saffron Hill Road
In the example, the Belmont Street addresses might actually be next door neighbors, but the address format is different. One contains a street prefix, the other doesn't. One includes the word Street, the other abbreviated. One includes "APT", the other drops "APT" altogether.
The other issue is that some street names are more than one word. I*think* RegEx is the tool after watching a really neat live class on it.
My thoughts are I need to do this in a few phases:
Use RegEx to identify different address formats - e.g. if an address is a number, followed by one or two letters, followed by a space, followed by, etc, then identify this as Format One. I'd then have a similar formula to identify several different address formats.
Once each address has been identified as a certain format, I'd then be able to filter and write a parsing entry to start pushing the components of the addresses into individual fields. In the example of Belmont:
|St Number||St Prefix||St Name||St Suffix||Apt/Suite||Apt Number|
|224|| ||Belmont||Street ||APT||220|
|4|| ||Saffron Hill||Road|| || |
Am I overthinking this, or is this on the right lines? We don't have the spatial analytics package so we can't do a CASS match (which I think would solve a lot of these issues).
I must say, I really enjoy these problems - especially solving them! I know I could simply use RegEx to replace "APT" with nothing, therefore removing it from the equation, but then where would the fun in that be. Plus, that only solves 5% of my problems!