Hey guys!
Today's teaser is a good one. I've got two data sets which I'm matching by address. It works well, but 1/3 of the addresses don't match due to some inconsistencies in the address format. Simple things are throwing the match off.
I don't think fuzzy match will work in this case, since the matches are not one-to-one. E.g. John Smith and his address appear once in one database, but John Smith might appear 3 times in the other because he has three different products. (if I'm wrong, let me know!)
The main issue is that I have an address field which isn't broken down into street number, street prefix, street name, etc, - it's all one field. I'm trying to parse out the address into individual fields, which is the first part of my problem.
For example:
224 Belmont Street APT 220
225 N Belmont St 220
4 Saffron Hill Road
In the example, the Belmont Street addresses might actually be next door neighbors, but the address format is different. One contains a street prefix, the other doesn't. One includes the word Street, the other abbreviated. One includes "APT", the other drops "APT" altogether.
The other issue is that some street names are more than one word. I*think* RegEx is the tool after watching a really neat live class on it.
My thoughts are I need to do this in a few phases:
Use RegEx to identify different address formats - e.g. if an address is a number, followed by one or two letters, followed by a space, followed by, etc, then identify this as Format One. I'd then have a similar formula to identify several different address formats.
Once each address has been identified as a certain format, I'd then be able to filter and write a parsing entry to start pushing the components of the addresses into individual fields. In the example of Belmont:
St Number | St Prefix | St Name | St Suffix | Apt/Suite | Apt Number |
224 | Belmont | Street | APT | 220 | |
224 | N | Belmont | St | 220 | |
4 | Saffron Hill | Road |
Am I overthinking this, or is this on the right lines? We don't have the spatial analytics package so we can't do a CASS match (which I think would solve a lot of these issues).
I must say, I really enjoy these problems - especially solving them! I know I could simply use RegEx to replace "APT" with nothing, therefore removing it from the equation, but then where would the fun in that be. Plus, that only solves 5% of my problems!
Any ideas?
Solved! Go to Solution.
So this one is a little fiddly...
^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$
Lots of optional parts...
^(\d+) - gets the number at beginning of text and stores in St Number
([A-Za-z](?= ))? - matches a single letter after the number and uses a positive look ahead for a space
(.*?) ([^ ]+?) - matches the street name and separates the last word as a street suffice
((?<= )APT)? - matches optionally and APT (you could add a list of expected words here as needed separated by |
((?<= )\d*)? - matches the last number if present looking behind for a space
regexr does a decent explaining job: https://regexr.com/47306
Hey @jdunkerley79!
Ooooooof. This is epic - I appreciate this very much. I will dissect this (thanks for the piece-by-piece explanation, by the way) and attempt to get my head around it. Regex is a very different looking language from anything I've encountered in the past - it looks so alien!
Oh, thanks for posting that link - I'll go take a look at it now!
Cheers again!
@EvolveKev cheers to @jdunkerley79 for a mighty impressive RegEx, but it also seems like this might be a candidate for a CASS standardization. I know that's dependent on licensing, but you might be able to get a trial from your account rep if you do not have access to it.
Hey @neilgallen - funny you should mention that - we have a demo of that exact thing today! Right now we don't have that in our license, but if it can substantially increase our address matching then I think we'll be in a position to enthusiastically recommend it to our client!
OMG. Are you a wizard?
This works like an absolute charm. It's beautiful! Thank you so much, @jdunkerley79 !
Hey @neilgallen - we're currently demo'ing the spatial/CASS package. This is certainly the fastest and most accurate way to go - the level of matching is uncanny. We're currently evaluating if we should spring for it or not.
That being said, it makes very light work of address matching.
@EvolveKev Glad that's working for you. I can honestly say we quickly found many uses for the spatial package, and the value has been worth it for us.
Very new to RegEx. What would you do if you have some addresses that are out of order? I tried moving portions of your formula around but couldn't get it to work.
Ex:
Unit 4 200 Main St