Hi All,
I have a challenge that seems easy but has me stumped.
Aim
I want to guess a customer's email address based on the most common email address patterns for customers at that company.
After analyzing email addresses, I've found that the most common formats are:
| Email structure | Example |
| First | John@nike.com |
| First [1 letter] + Last | JSmith@nike.com |
| First+.+Last | John.Smith@nike.com |
| Last | Smith@nike.com |
| First+Last | JohnSmith@nike.com |
| Last + First [1 letter] | SmithJ@nike.com |
| First [1 letter]+.+Last | J.Smith@nike.com |
| First + Last [1 letter] | JohnS@nike.com |
| Last+First | SmithJohn@nike.com |
| Last+.+First | Smith.John@nike.com |
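To make the patterns concrete, here is a minimal Python sketch that builds each structure from a first and last name. The pattern keys (`first_dot_last` and so on) and the `build_email` helper are hypothetical names chosen for illustration, and the sketch assumes names contain no spaces or special characters.

```python
# Hypothetical pattern keys mapped to functions that build the part before "@".
# f = first name, l = last name; f[:1] is the first letter (empty-safe).
PATTERNS = {
    "first": lambda f, l: f,
    "first_initial_last": lambda f, l: f[:1] + l,
    "first_dot_last": lambda f, l: f"{f}.{l}",
    "last": lambda f, l: l,
    "first_last": lambda f, l: f + l,
    "last_first_initial": lambda f, l: l + f[:1],
    "first_initial_dot_last": lambda f, l: f"{f[:1]}.{l}",
    "first_last_initial": lambda f, l: f + l[:1],
    "last_first": lambda f, l: l + f,
    "last_dot_first": lambda f, l: f"{l}.{f}",
}


def build_email(pattern, first, last, domain):
    """Build an email address for the given pattern key and domain."""
    local_part = PATTERNS[pattern](first, last)
    return f"{local_part}@{domain}"


# Print one example per pattern, mirroring the table above.
for name in PATTERNS:
    print(name, build_email(name, "John", "Smith", "nike.com"))
```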
Assumptions
Since the majority of companies use a standard format for their email addresses, I am hoping to find the most common email structure for a given company and then extrapolate from it.
Example
Company: Nike.com
| Number of Contacts | Percent | Email Structure | Example |
| 240 | 60% | First+.+Last | John.Smith@nike.com |
| 100 | 25% | First [1 letter] + Last | JSmith@nike.com |
| 60 | 15% | First | John@nike.com |
Given that 60% of the contacts at Nike have the email structure "First+.+Last", I would like to apply the same format to the 100 contacts that have no email address.
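One way to find the dominant structure could be sketched as follows in Python: compare the part of each known email before the "@" against what each pattern would produce, then count the matches. The pattern keys and the `detect_pattern` helper are hypothetical, only three patterns are shown for brevity, and matching is assumed to be case-insensitive. Emails that fit no pattern are counted separately rather than guessed.

```python
from collections import Counter

# Hypothetical subset of patterns: key -> builder of the part before "@".
PATTERNS = {
    "first_dot_last": lambda f, l: f"{f}.{l}",
    "first_initial_last": lambda f, l: f[:1] + l,
    "first": lambda f, l: f,
}


def detect_pattern(first, last, email):
    """Return the first pattern key that reproduces the email's local part, else None."""
    local_part = email.split("@", 1)[0].lower()
    for name, build in PATTERNS.items():
        if build(first, last).lower() == local_part:
            return name
    return None


# Contacts with known emails for one company (first name, last name, email).
contacts = [
    ("Richard", "Piper", "Richard.Piper@Nike.com"),
    ("Danielle", "Collins", "Danielle.Collins@Nike.com"),
    ("Dane", "Smith", "Dane.Smith@Nike.com"),
    # A surname spelled differently from the email matches no pattern.
    ("Robert", "Atleryx", "RAlteryx@Nike.com"),
]

counts = Counter(detect_pattern(f, l, e) for f, l, e in contacts)
unmatched = counts.pop(None, 0)  # emails that fit none of the patterns
top_pattern, top_count = counts.most_common(1)[0]
print(top_pattern, top_count, "unmatched:", unmatched)
```

With several companies in the data, the same count would be kept per Companyid (or per email domain) instead of globally.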
Current Input
| Contactid | First Name | LastName | Email | Company Name | Companyid |
| 111111 | Richard | Piper | Richard.Piper@Nike.com | Nike Inc | 001f100001InnV5AAJ |
| 222222 | Danielle | Collins | Danielle.Collins@Nike.com | Nike Inc | 001f100001InnV5AAJ |
| 333333 | Dane | Smith | Dane.Smith@Nike.com | Nike Inc | 001f100001InnV5AAJ |
| 44444 | Robert | Atleryx | RAlteryx@Nike.com | Nike Inc | 001f100001InnV5AAJ |
| 55555 | John | King | | Nike Inc | 001f100001InnV5AAJ |
| 666666 | Chris | Dannher | | Nike Inc | 001f100001InnV5AAJ |
Expected Outcome
| Contactid | First Name | LastName | Predicted Email | Rationale | Company Name | Companyid |
| 111111 | Richard | Piper | Richard.Piper@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
| 222222 | Danielle | Collins | Danielle.Collins@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
| 333333 | Dane | Smith | Dane.Smith@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
| 44444 | Robert | Atleryx | RAlteryx@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
| 55555 | John | King | John.King@Nike.com | Common Email Pattern "First+.+Last" | Nike Inc | 001f100001InnV5AAJ |
| 666666 | Chris | Dannaher | Chris.Dannaher@Nike.com | Common Email Pattern "First+.+Last" | Nike Inc | 001f100001InnV5AAJ |
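The fill-in step for this outcome could look roughly like the Python sketch below. It assumes the dominant pattern ("First+.+Last") and the domain have already been determined for the company; the `fill_missing` helper and the dictionary keys are hypothetical names for illustration. Existing emails are carried over unchanged and only blank ones are predicted.

```python
# Assumed to be known already for this company (see the example above).
DOMINANT_LABEL = "First+.+Last"
DOMAIN = "Nike.com"


def first_dot_last(first, last):
    """Build the part before "@" for the First+.+Last structure."""
    return f"{first}.{last}"


def fill_missing(rows, build, label, domain):
    """Copy each row, keeping known emails and predicting blank ones."""
    result = []
    for row in rows:
        row = dict(row)  # copy so the input rows are not modified
        if row["Email"]:
            row["Predicted Email"] = row["Email"]
            row["Rationale"] = ""
        else:
            local_part = build(row["First Name"], row["LastName"])
            row["Predicted Email"] = f"{local_part}@{domain}"
            row["Rationale"] = f'Common Email Pattern "{label}"'
        result.append(row)
    return result


contacts = [
    {"Contactid": "111111", "First Name": "Richard", "LastName": "Piper",
     "Email": "Richard.Piper@Nike.com"},
    {"Contactid": "55555", "First Name": "John", "LastName": "King",
     "Email": ""},
]

predicted = fill_missing(contacts, first_dot_last, DOMINANT_LABEL, DOMAIN)
for row in predicted:
    print(row["Contactid"], row["Predicted Email"], row["Rationale"])
```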
Looking forward to your help & advice
Many thanks
Masond3
I've been validating today and have documented some use cases, which I need to vet properly tomorrow to understand the logic and where I think it's going wrong.
At the moment I should be getting more rows in the output than in the input (due to the split in email domains); however, I am getting significantly fewer than the input.
So I just need to do some validating. I think the core of it is there; it just needs tweaking, changing, amending, etc.
