Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.
Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Name cleansing and Parsing

spencer046
8 - Asteroid

I have the below dataset:

 

AMD CO DIST CENTER

Bi Tech Industries

CZ INDUSTRIES

D I S COMPANY LTD

A AND B HOMES CO

Hi Tech CO DIST Center

 

I want to get rid of the 2/3/4 letter words from the names, either from the end or in the middle, leaving the first part of the names out of the filter/formula. 

 

The output I am expecting:

 

AMD CENTER

Bi Industries

CZ INDUSTRIES

D I S COMPANY

A AND B HOMES CO

Hi CENTER

 

Can anyone help with a workflow? Thanks!

10 REPLIES 10
MarqueeCrew
20 - Arcturus
20 - Arcturus

@spencer046 ,

 

You've got lots of name cleaning ahead of you.  BTW, AND violates your 2/3/4 letter words.

 

Replace(Trim(GetWord([Field1], 0) + 
RegEx_Replace(RegEx_Replace(Substring(Replace([Field1],"AND",'&'),FindString(Replace([Field1],"AND",'&'), " ")),"\b\w{2,4}\b",''),"\s+",' ')),"&",'AND')

 

This one nested expression will get you the output that you want, but you drop CO.  If you want to keep CO or other words, you should use a FIND Replace tool and make each KEY word a unique symbol (like I did with & for AND).  Use that replacement before the expression and then use another FIND REPLACE tool to convert back.

 

If you don't need to worry about these types of words, you can use:

Trim(GetWord([Field1], 0) + 
RegEx_Replace(RegEx_Replace(Substring([Field1]),FindString([Field1], " ")),"\b\w{2,4}\b",''),"\s+",' '))

 

Cheers,

 

Mark

 

 

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
spencer046
8 - Asteroid

@MarqueeCrew  Awesome that's exactly what I am looking for, thanks a ton for your help. 

 

I am not too good at alteryx, so can you help me understand the RegEx formula by breaking it down? So I know what adjustments need to be made in future if I come across similar situation.

MarqueeCrew
20 - Arcturus
20 - Arcturus

@spencer046 ,

 

\b is a word boundary

 \w is a word character

{2,4} qualifies the preceding attribute being 2 to 4 occurrences. 

any of your rules will have exceptions. You will throw good words away and keep bad words. Depending upon your data, you'll get a variety of results. One size does not fit all data. 

you need to develop an approach to cleaning. Easy too hard. Biggest bang first.  Once you've cleaned data, save your results. Next time you can look at a saved file and convert directly to the know good results. 

Business names are very tricky to work with. You need to find tools like regex and learn them. 

cheers,

 

 mark

 

 

 

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
spencer046
8 - Asteroid

@MarqueeCrew what's the utility of the below?

"\s+",' '

 

MarqueeCrew
20 - Arcturus
20 - Arcturus

Replace 1 or more spaces with a single space

 

Alteryx RegEx Beginner Tutorial
https://youtu.be/pTZj2U2SDFA

cheers

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
spencer046
8 - Asteroid

@MarqueeCrew  Thanks a lot. With your formula, its working properly, but at the same time, its truncating some words as shown below in the output. Any thoughts? I am using the formula that does not take AND into consideration. 

 

APPLIANCE BUILDER WHOLESALE --> APPLIANCE BUILDER

APPLIANCE CONNECTION INC --> APPLIANCE CONNECTI

APPLIANCE DISCOUNTERS --> APPLIANCE DISCOUNT

CAPITAL DISTRIBUTING --> CAPITAL DISTRIBUTI

ADP CO DEPENDABLE DISTRIBUTION CENTERS --> ADP DEPENDABLE D

MarqueeCrew
20 - Arcturus
20 - Arcturus
Trim(GetWord([Field1], 0) + 
RegEx_Replace(RegEx_Replace(Substring([Field1],FindString([Field1], " ")),"\b\w{2,4}\b",''),"\s+",' '))
Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
spencer046
8 - Asteroid

@MarqueeCrew I am using that formula but still getting those truncates.

MarqueeCrew
20 - Arcturus
20 - Arcturus

I wonder if you're defining the field with a short length?

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
Labels
Top Solution Authors