Alteryx Designer Desktop Discussions

Vanderleck16 · ‎07-28-2021

Hello everyone,

I am working on a data standardization project.

The goal is to create a model that recognizes the different pieces of information in a field.

So, I have a field containing several pieces of information, such as the first name, last name, address, city, postal code...

The idea is to identify each part and isolate the information in a new field.

For example, I will have a column with the first name, one with the last name, a column with the street number, one with the name of the street, one with the city, one with the ZIP code...

The problem is that these pieces of information are not always in the same order. I was thinking about using regular expressions, but it seems difficult to find a pattern which works every time.

I would like to be able to use machine learning techniques, for example by creating an algorithm that could identify each piece of information, based on data that has already been cleaned. Perhaps with enough data, the algorithm will be able to identify the name, the city...

Unfortunately I don't know how the machine learning algorithms work in this case, but it's something I'd like to learn how to use.

So if you can help me move forward on this project, I would be very grateful to you.

I am attaching an example file to show you the expected result.

Thank you

KaneG · ‎07-29-2021

Hi @Vanderleck16,

That data is dirty... having the name in there as well makes it really hard to deal with. I'll propose some options here to point you in the right direction, but I'm by no means an expert in this.

The issues with the data:

Name mixed with address
Different countries, meaning very different formats
Missing Data
Random punctuation
Different Orders

Let's assume you could pull the names out and were left with just dirty addresses. You could then possibly:

Separate all the data to individual terms and use Fuzzy Match/Make Groups to tag the words
Learn about Hidden Markov Models and use them to identify the terms
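
As a rough illustration of the first option outside of Alteryx, here is a minimal Python sketch that splits a field into terms and tags each term by fuzzy lookup against small reference lists. The `REFERENCE` lists, the `tag_terms` name and the `cutoff` value are all assumptions made for this example; it uses `difflib` from the standard library rather than the Fuzzy Match tool, and it only handles single-word terms.

```python
import re
from difflib import get_close_matches

# Hypothetical reference lists; in practice these would come from cleaned data.
REFERENCE = {
    "city": ["london", "paris", "sydney"],
    "street_type": ["street", "avenue", "avenida", "road", "rue"],
}

def tag_terms(text, cutoff=0.8):
    """Split a raw field into terms and tag each one by fuzzy lookup."""
    # Split on whitespace, commas and semicolons, dropping empty pieces.
    terms = [t for t in re.split(r"[\s,;]+", text.lower()) if t]
    tagged = []
    for term in terms:
        tag = "unknown"
        if term.isdigit():
            tag = "number"
        else:
            # Take the first reference list that holds a close enough match.
            for label, vocab in REFERENCE.items():
                if get_close_matches(term, vocab, n=1, cutoff=cutoff):
                    tag = label
                    break
        tagged.append((term, tag))
    return tagged

print(tag_terms("10 Downing Stret, Londn"))
```

Terms that stay tagged as "unknown" are the ones that would still need another approach, such as a larger master list or a sequence model.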

However, the easier method would be to use an on-line geocoder such as Google and let their algorithms clean the data and return an address.

In order to start on this, I would probably try and identify some of the common formats such as

2 words at the start with no numbers
UK Postcodes
4 digit postcodes
Data separated by commas

And then try to parse each of those formats separately
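
To make that first step concrete, here is a minimal Python sketch that flags which of those formats a raw value appears to match. The patterns are simplified assumptions for illustration (real UK postcodes, for instance, have more edge cases), and `detect_formats` is a name invented for this example.

```python
import re

# Simplified patterns, assumed for illustration only.
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s*\d[A-Z]{2}\b", re.IGNORECASE)
# Exactly four digits, not part of a longer run of digits.
FOUR_DIGIT_POSTCODE = re.compile(r"(?<!\d)\d{4}(?!\d)")
# Two words made only of letters at the start of the field.
TWO_LEADING_WORDS = re.compile(r"^\s*[^\W\d_]+\s+[^\W\d_]+\b")

def detect_formats(text):
    """Return the set of format labels that the raw field appears to match."""
    labels = set()
    if UK_POSTCODE.search(text):
        labels.add("uk_postcode")
    if FOUR_DIGIT_POSTCODE.search(text):
        labels.add("four_digit_postcode")
    if TWO_LEADING_WORDS.match(text):
        labels.add("two_leading_words")
    if "," in text:
        labels.add("comma_separated")
    return labels

print(detect_formats("John Smith, 10 Downing Street, London SW1A 2AA"))
print(detect_formats("Sydney NSW 2000"))
```

Each record could then be routed to a parser written for its set of labels, with records that match nothing set aside for manual review.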

apathetichell · ‎07-29-2021

I've hypothesized on this before here, but I'll reiterate: I do not think addresses can be described by regular expressions, and I do not believe they follow a standard logical arrangement. Even within one state, one city or one province, working out the logical structure of what goes where and building a system to recognize it can be very hard without some master list to match against. This is one of the reasons why spatial data/addresses cost $$$$ and companies pay for it...

I'll mention that even in your sample data showing how it should be broken down there are mistakes - see Ben Ferrer, Mexico City, Adolfo Perez... You split it up as Adolfo Perez on Adolfo Perez Avenida, which isn't correct.

Identify informations from a field - data standardization