I have the below dataset:
AMD CO DIST CENTER
Bi Tech Industries
CZ INDUSTRIES
D I S COMPANY LTD
A AND B HOMES CO
Hi Tech CO DIST Center
I want to get rid of the 2/3/4 letter words from the names, either from the end or in the middle, leaving the first part of the names out of the filter/formula.
The output I am expecting:
AMD CENTER
Bi Industries
CZ INDUSTRIES
D I S COMPANY
A AND B HOMES CO
Hi CENTER
Can anyone help with a workflow? Thanks!
Solved! Go to Solution.
You've got lots of name cleaning ahead of you. BTW, AND violates your 2/3/4 letter words.
Replace(Trim(GetWord([Field1], 0) +
RegEx_Replace(RegEx_Replace(Substring(Replace([Field1],"AND",'&'),FindString(Replace([Field1],"AND",'&'), " ")),"\b\w{2,4}\b",''),"\s+",' ')),"&",'AND')
This one nested expression will get you the output that you want, but you drop CO. If you want to keep CO or other words, you should use a FIND Replace tool and make each KEY word a unique symbol (like I did with & for AND). Use that replacement before the expression and then use another FIND REPLACE tool to convert back.
If you don't need to worry about these types of words, you can use:
Trim(GetWord([Field1], 0) +
RegEx_Replace(RegEx_Replace(Substring([Field1]),FindString([Field1], " ")),"\b\w{2,4}\b",''),"\s+",' '))
Cheers,
Mark
@MarqueeCrew Awesome that's exactly what I am looking for, thanks a ton for your help.
I am not too good at alteryx, so can you help me understand the RegEx formula by breaking it down? So I know what adjustments need to be made in future if I come across similar situation.
\b is a word boundary
\w is a word character
{2,4} qualifies the preceding attribute being 2 to 4 occurrences.
any of your rules will have exceptions. You will throw good words away and keep bad words. Depending upon your data, you'll get a variety of results. One size does not fit all data.
you need to develop an approach to cleaning. Easy too hard. Biggest bang first. Once you've cleaned data, save your results. Next time you can look at a saved file and convert directly to the know good results.
Business names are very tricky to work with. You need to find tools like regex and learn them.
cheers,
mark
Replace 1 or more spaces with a single space
Alteryx RegEx Beginner Tutorial
https://youtu.be/pTZj2U2SDFA
cheers
@MarqueeCrew Thanks a lot. With your formula, its working properly, but at the same time, its truncating some words as shown below in the output. Any thoughts? I am using the formula that does not take AND into consideration.
APPLIANCE BUILDER WHOLESALE --> APPLIANCE BUILDER
APPLIANCE CONNECTION INC --> APPLIANCE CONNECTI
APPLIANCE DISCOUNTERS --> APPLIANCE DISCOUNT
CAPITAL DISTRIBUTING --> CAPITAL DISTRIBUTI
ADP CO DEPENDABLE DISTRIBUTION CENTERS --> ADP DEPENDABLE D
Trim(GetWord([Field1], 0) +
RegEx_Replace(RegEx_Replace(Substring([Field1],FindString([Field1], " ")),"\b\w{2,4}\b",''),"\s+",' '))
@MarqueeCrew I am using that formula but still getting those truncates.
I wonder if you're defining the field with a short length?