Repeating letters
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi,
I'm struggling with a file of email addresses to clean up.
I come across fake emails where a letter can be repeated many times such as :
xxx@gmail.com
yyyidmcho@naver.com
xx.xxx@gmail.com
XXX@XXX
avaxxxava@gmail.com
xxxxx.yyyyy@hotmail.fr
xxxprettyboy@naver.com
lucaslucasxxx94@gmail.com
could you please help me with a regular expression to search all the emails with any letter repeated more than twice?
thank you
Solved! Go to Solution.
- Labels:
- Preparation
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
So there might very well be a more elegant way to do this, but RegEx is a tricky beast, and while ugly, this at least should work...
The general idea is to create a very long RegEx pattern that includes all the possible combinations of a single letter repeated 3 times, i.e. "aaa|bbb|ccc..." etc. The | pipe delimiter acting as the OR in this scenario. So I basically just listed all the letters a-z, then used a formula to create a field that repeated each letter 3 times, then concatenated using Summarize tool and a pipe delimiter (OR operator in the RegEx expression) with double quotes on each end.
This can then be used in the "pattern" section of a RegEx CountMatches expression, so it will count how many of those combinations it finds in your string. Using REGEX_CountMatches([Email],[Concat_LetterCombo],1) as the expression will allow for it to be case insensitive, so will work for AAA as well as aaa. Kinda ugly, but it works!
Cheers,
NJ
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Here's an alternative:
- Add unique record IDs
- tokenize to rows
- Multirow formula to count consecutive characters for a given record ID
- Filter to instances where consecutive = 2 (count starts at 0 so 2 indicates at least 3 consecutive characters)
- Group by Record ID
- Join back to the original data to get the population of emails with an issue.
