Alert: There is a planned Community maintenance outage October 16th from approximately 10 - 11 PM PST. During this time the Alteryx Community will be inaccessible. Thank you for your understanding!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Repeating letters

Esmeralda_In_Paris
8 - Asteroid

Hi, 

I'm struggling with a file of email addresses to clean up. 

 

I come across fake emails where a letter can be repeated many times such as : 

xxx@gmail.com
yyyidmcho@naver.com
xx.xxx@gmail.com
XXX@XXX
avaxxxava@gmail.com
xxxxx.yyyyy@hotmail.fr
xxxprettyboy@naver.com
lucaslucasxxx94@gmail.com

 

could you please help me with a regular expression to search all the emails with any letter repeated more than twice? 

thank you 

2 REPLIES 2
NicoleJohnson
ACE Emeritus
ACE Emeritus

So there might very well be a more elegant way to do this, but RegEx is a tricky beast, and while ugly, this at least should work...

 

The general idea is to create a very long RegEx pattern that includes all the possible combinations of a single letter repeated 3 times, i.e. "aaa|bbb|ccc..." etc. The | pipe delimiter acting as the OR in this scenario. So I basically just listed all the letters a-z, then used a formula to create a field that repeated each letter 3 times, then concatenated using Summarize tool and a pipe delimiter (OR operator in the RegEx expression) with double quotes on each end.

 

NicoleJohnson_0-1633983585492.png

This can then be used in the "pattern" section of a RegEx CountMatches expression, so it will count how many of those combinations it finds in your string. Using REGEX_CountMatches([Email],[Concat_LetterCombo],1as the expression will allow for it to be case insensitive, so will work for AAA as well as aaa. Kinda ugly, but it works!


Cheers,

NJ

Luke_C
17 - Castor

Hi @Esmeralda_In_Paris 

 

Here's an alternative:

 

  1. Add unique record IDs
  2. tokenize to rows
  3. Multirow formula to count consecutive characters for a given record ID
  4. Filter to instances where consecutive = 2 (count starts at 0 so 2 indicates at least 3 consecutive characters)
  5. Group by Record ID
  6. Join back to the original data to get the population of emails with an issue.

 

Luke_C_0-1633984672117.png

 

Labels