Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Repeating letters

Esmeralda_In_Paris
8 - Asteroid

Hi, 

I'm struggling with a file of email addresses to clean up. 

 

I come across fake emails where a letter can be repeated many times such as : 

xxx@gmail.com
yyyidmcho@naver.com
xx.xxx@gmail.com
XXX@XXX
avaxxxava@gmail.com
xxxxx.yyyyy@hotmail.fr
xxxprettyboy@naver.com
lucaslucasxxx94@gmail.com

 

could you please help me with a regular expression to search all the emails with any letter repeated more than twice? 

thank you 

2 REPLIES 2
NicoleJohnson
ACE Emeritus
ACE Emeritus

So there might very well be a more elegant way to do this, but RegEx is a tricky beast, and while ugly, this at least should work...

 

The general idea is to create a very long RegEx pattern that includes all the possible combinations of a single letter repeated 3 times, i.e. "aaa|bbb|ccc..." etc. The | pipe delimiter acting as the OR in this scenario. So I basically just listed all the letters a-z, then used a formula to create a field that repeated each letter 3 times, then concatenated using Summarize tool and a pipe delimiter (OR operator in the RegEx expression) with double quotes on each end.

 

NicoleJohnson_0-1633983585492.png

This can then be used in the "pattern" section of a RegEx CountMatches expression, so it will count how many of those combinations it finds in your string. Using REGEX_CountMatches([Email],[Concat_LetterCombo],1as the expression will allow for it to be case insensitive, so will work for AAA as well as aaa. Kinda ugly, but it works!


Cheers,

NJ

Luke_C
17 - Castor

Hi @Esmeralda_In_Paris 

 

Here's an alternative:

 

  1. Add unique record IDs
  2. tokenize to rows
  3. Multirow formula to count consecutive characters for a given record ID
  4. Filter to instances where consecutive = 2 (count starts at 0 so 2 indicates at least 3 consecutive characters)
  5. Group by Record ID
  6. Join back to the original data to get the population of emails with an issue.

 

Luke_C_0-1633984672117.png

 

Labels