RegEx - Addresses, different formats, and headaches

Hey guys!

Today's teaser is a good one. I've got two data sets which I'm matching by address. It works well, but 1/3 of the addresses don't match due to some inconsistencies in the address format. Simple things are throwing the match off.

I don't think fuzzy match will work in this case, since the matches are not one-to-one. E.g. John Smith and his address appear once in one database, but John Smith might appear 3 times in the other because he has three different products. (if I'm wrong, let me know!)

The main issue is that I have an address field which isn't broken down into street number, street prefix, street name, etc, - it's all one field. I'm trying to parse out the address into individual fields, which is the first part of my problem.

For example:

224 Belmont Street APT 220

225 N Belmont St 220

4 Saffron Hill Road

In the example, the Belmont Street addresses might actually be next door neighbors, but the address format is different. One contains a street prefix, the other doesn't. One includes the word Street, the other abbreviated. One includes "APT", the other drops "APT" altogether.

The other issue is that some street names are more than one word. I*think* RegEx is the tool after watching a really neat live class on it.

My thoughts are I need to do this in a few phases:

Use RegEx to identify different address formats - e.g. if an address is a number, followed by one or two letters, followed by a space, followed by, etc, then identify this as Format One. I'd then have a similar formula to identify several different address formats.

Once each address has been identified as a certain format, I'd then be able to filter and write a parsing entry to start pushing the components of the addresses into individual fields. In the example of Belmont:

St Number	St Prefix	St Name	St Suffix	Apt/Suite	Apt Number
224		Belmont	Street	APT	220
224	N	Belmont	St		220
4		Saffron Hill	Road

Am I overthinking this, or is this on the right lines? We don't have the spatial analytics package so we can't do a CASS match (which I think would solve a lot of these issues).

I must say, I really enjoy these problems - especially solving them! I know I could simply use RegEx to replace "APT" with nothing, therefore removing it from the equation, but then where would the fun in that be. Plus, that only solves 5% of my problems!

Any ideas?

Regex

Fuzzy Match

Expression

Accepted answers

jdunkerley79

So this one is a little fiddly...

^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$

Lots of optional parts...

^(\d+) - gets the number at beginning of text and stores in St Number

([A-Za-z](?= ))? - matches a single letter after the number and uses a positive look ahead for a space

(.*?) ([^ ]+?) - matches the street name and separates the last word as a street suffice

((?<= )APT)? - matches optionally and APT (you could add a list of expected words here as needed separated by |

((?<= )\d*)? - matches the last number if present looking behind for a space

regexr does a decent explaining job: https://regexr.com/47306

Regex Address.yxmd

All comments

jdunkerley79

So this one is a little fiddly...

^(\d+) ?([A-Za-z](?= ))? (.*?) ([^ ]+?) ?((?<= )APT)? ?((?<= )\d*)?$

Lots of optional parts...

^(\d+) - gets the number at beginning of text and stores in St Number

([A-Za-z](?= ))? - matches a single letter after the number and uses a positive look ahead for a space

(.*?) ([^ ]+?) - matches the street name and separates the last word as a street suffice

((?<= )APT)? - matches optionally and APT (you could add a list of expected words here as needed separated by |

((?<= )\d*)? - matches the last number if present looking behind for a space

regexr does a decent explaining job: https://regexr.com/47306

Regex Address.yxmd

EvolveKev

Hey @jdunkerley79!

Ooooooof. This is epic - I appreciate this very much. I will dissect this (thanks for the piece-by-piece explanation, by the way) and attempt to get my head around it. Regex is a very different looking language from anything I've encountered in the past - it looks so alien!

Oh, thanks for posting that link - I'll go take a look at it now!

Cheers again!

EvolveKev

OMG. Are you a wizard?

This works like an absolute charm. It's beautiful! Thank you so much, @jdunkerley79 !

RyanB24

Very new to RegEx. What would you do if you have some addresses that are out of order? I tried moving portions of your formula around but couldn't get it to work.

Ex:

Unit 4 200 Main St

Quick Links

This months top contributors

atcodedog05 19458

Qiu 15865

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703