This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
Today's teaser is a good one. I've got two data sets which I'm matching by address. It works well, but 1/3 of the addresses don't match due to some inconsistencies in the address format. Simple things are throwing the match off.
I don't think fuzzy match will work in this case, since the matches are not one-to-one. E.g. John Smith and his address appear once in one database, but John Smith might appear 3 times in the other because he has three different products. (if I'm wrong, let me know!)
The main issue is that I have an address field which isn't broken down into street number, street prefix, street name, etc, - it's all one field. I'm trying to parse out the address into individual fields, which is the first part of my problem.
224 Belmont Street APT 220
225 N Belmont St 220
4 Saffron Hill Road
In the example, the Belmont Street addresses might actually be next door neighbors, but the address format is different. One contains a street prefix, the other doesn't. One includes the word Street, the other abbreviated. One includes "APT", the other drops "APT" altogether.
The other issue is that some street names are more than one word. I*think* RegEx is the tool after watching a really neat live class on it.
My thoughts are I need to do this in a few phases:
Use RegEx to identify different address formats - e.g. if an address is a number, followed by one or two letters, followed by a space, followed by, etc, then identify this as Format One. I'd then have a similar formula to identify several different address formats.
Once each address has been identified as a certain format, I'd then be able to filter and write a parsing entry to start pushing the components of the addresses into individual fields. In the example of Belmont:
Am I overthinking this, or is this on the right lines? We don't have the spatial analytics package so we can't do a CASS match (which I think would solve a lot of these issues).
I must say, I really enjoy these problems - especially solving them! I know I could simply use RegEx to replace "APT" with nothing, therefore removing it from the equation, but then where would the fun in that be. Plus, that only solves 5% of my problems!
Ooooooof. This is epic - I appreciate this very much. I will dissect this (thanks for the piece-by-piece explanation, by the way) and attempt to get my head around it. Regex is a very different looking language from anything I've encountered in the past - it looks so alien!
Oh, thanks for posting that link - I'll go take a look at it now!
Hey @neilgallen - funny you should mention that - we have a demo of that exact thing today! Right now we don't have that in our license, but if it can substantially increase our address matching then I think we'll be in a position to enthusiastically recommend it to our client!
Hey @neilgallen - we're currently demo'ing the spatial/CASS package. This is certainly the fastest and most accurate way to go - the level of matching is uncanny. We're currently evaluating if we should spring for it or not.
That being said, it makes very light work of address matching.