Hi All,
I have a challenge that seems easy but has me stumped.
Aim
I want to guess a customer's email address based on the most common email address patterns for customers at that company.
After analyzing email addresses, I've found out that the most common formats are:
Email structure | Example |
First | John@nike.com |
First [1 letter] + Last | JSmith@nike.com |
First+.+Last | John.Smith@nike.com |
Last | Smith@nike.com |
FirstLast | JohnSmith@nike.com |
Last + First [1 letter] | SmithJ@nike.com |
First [1 letter]+.+Last | J.Smith@nike.com |
First+Last [1 letter] | JohnS@nike.com |
Last+First | SmithJohn@nike.com |
Last.First | Smith.John@nike.com |
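To make the list concrete, here is a minimal Python sketch of how an existing address could be labelled with one of these structures. The names `PATTERNS` and `classify_email` are hypothetical, and the sketch assumes matching is case-insensitive and that only the part before the "@" matters. It is one possible approach, not a definitive implementation.

```python
from typing import Optional

# Structure label -> function that builds the part before "@" from
# the (lowercased) first and last name. Labels mirror the list above.
PATTERNS = {
    "First": lambda f, l: f,
    "First [1 letter] + Last": lambda f, l: f[:1] + l,
    "First+.+Last": lambda f, l: f + "." + l,
    "Last": lambda f, l: l,
    "FirstLast": lambda f, l: f + l,
    "Last + First [1 letter]": lambda f, l: l + f[:1],
    "First [1 letter]+.+Last": lambda f, l: f[:1] + "." + l,
    "First+Last [1 letter]": lambda f, l: f + l[:1],
    "Last+First": lambda f, l: l + f,
    "Last.First": lambda f, l: l + "." + f,
}


def classify_email(first: str, last: str, email: str) -> Optional[str]:
    """Return the label of the first structure that reproduces the address, else None."""
    local = email.split("@", 1)[0].lower()
    f, l = first.strip().lower(), last.strip().lower()
    for label, build in PATTERNS.items():
        if build(f, l) == local:
            return label
    return None  # address does not follow any of the listed structures
```

For example, `classify_email("John", "Smith", "John.Smith@nike.com")` returns `"First+.+Last"`. If two structures produce the same address (for instance a one-letter first name), the first match in the dictionary wins.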
Assumptions
Since the majority of companies use a standard format for their email addresses, I am hoping to find the most common email structure for a given company and then extrapolate from it.
Example
Company : Nike.com
Number of Contacts | Percent | Email Structure | Example |
240 | 60% | First+.+Last | John.Smith@nike.com |
100 | 25% | First [1 letter] + Last | JSmith@nike.com |
60 | 15% | First | John@nike.com |
Given that 60% of the contacts at Nike have the email structure "First+.+Last", I would like to apply the same format to the 100 contacts that have no email address.
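The counting step in this example can be sketched in Python as follows. It assumes each known address has already been labelled with its structure; the function name `pattern_shares` is hypothetical, and the label list simply reproduces the Nike figures above.

```python
from collections import Counter


def pattern_shares(labels):
    """Return (structure, count, share) tuples, most common structure first."""
    counts = Counter(labels)
    total = sum(counts.values())
    return [(label, n, n / total) for label, n in counts.most_common()]


# Labels for the 400 Nike contacts that already have an email address.
labels = (
    ["First+.+Last"] * 240
    + ["First [1 letter] + Last"] * 100
    + ["First"] * 60
)

shares = pattern_shares(labels)
dominant = shares[0][0]  # the structure to extrapolate from
```

Here `dominant` is `"First+.+Last"` with a share of 0.6. In practice it may be worth requiring a minimum share before extrapolating, since a company with no clear majority structure would give unreliable guesses.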
Current Input
Contactid | First Name | LastName | Email | Company Name | Companyid |
111111 | Richard | Piper | Richard.Piper@Nike.com | Nike Inc | 001f100001InnV5AAJ |
222222 | Danielle | Collins | Danielle.Collins@Nike.com | Nike Inc | 001f100001InnV5AAJ |
333333 | Dane | Smith | Dane.Smith@Nike.com | Nike Inc | 001f100001InnV5AAJ |
44444 | Robert | Atleryx | RAlteryx@Nike.com | Nike Inc | 001f100001InnV5AAJ |
55555 | John | King | | Nike Inc | 001f100001InnV5AAJ |
666666 | Chris | Dannaher | | Nike Inc | 001f100001InnV5AAJ |
Expected Outcome
Contactid | First Name | LastName | Predicted Email | Rationale | Company Name | Companyid |
111111 | Richard | Piper | Richard.Piper@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
222222 | Danielle | Collins | Danielle.Collins@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
333333 | Dane | Smith | Dane.Smith@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
44444 | Robert | Atleryx | RAlteryx@Nike.com | | Nike Inc | 001f100001InnV5AAJ |
55555 | John | King | John.King@Nike.com | Common Email Pattern "First+.+Last" | Nike Inc | 001f100001InnV5AAJ |
666666 | Chris | Dannaher | Chris.Dannaher@Nike.com | Common Email Pattern "First+.+Last" | Nike Inc | 001f100001InnV5AAJ |
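As a rough illustration of how the expected outcome could be produced for one company, here is a Python sketch. Everything named here (`TEMPLATES`, `predict_missing`, the dictionary keys) is hypothetical. It covers only the three structures from the Nike example, assumes names are non-empty, and takes the domain from the most common domain among the known addresses.

```python
from collections import Counter

# Only the three structures from the Nike example; extend as needed.
TEMPLATES = {
    "First+.+Last": "{first}.{last}",
    "First [1 letter] + Last": "{first[0]}{last}",
    "First": "{first}",
}


def predict_missing(contacts):
    """Fill blank emails for one company's contacts using its most common structure.

    contacts: list of dicts with 'first', 'last' and 'email' ('' when unknown).
    Returns new dicts with an added 'rationale' key.
    """
    patterns, domains = Counter(), Counter()
    for c in contacts:
        if not c["email"]:
            continue
        local, _, domain = c["email"].partition("@")
        domains[domain] += 1
        for label, tpl in TEMPLATES.items():
            guess = tpl.format(first=c["first"], last=c["last"])
            if guess.lower() == local.lower():
                patterns[label] += 1
                break
    # None when no known address matched any structure.
    best = patterns.most_common(1)[0][0] if patterns else None

    out = []
    for c in contacts:
        row = dict(c, rationale="")
        if not c["email"] and best is not None:
            local = TEMPLATES[best].format(first=c["first"], last=c["last"])
            row["email"] = f"{local}@{domains.most_common(1)[0][0]}"
            row["rationale"] = f'Common Email Pattern "{best}"'
        out.append(row)
    return out


contacts = [
    {"first": "Richard", "last": "Piper", "email": "Richard.Piper@Nike.com"},
    {"first": "Danielle", "last": "Collins", "email": "Danielle.Collins@Nike.com"},
    {"first": "Dane", "last": "Smith", "email": "Dane.Smith@Nike.com"},
    {"first": "Robert", "last": "Atleryx", "email": "RAlteryx@Nike.com"},
    {"first": "John", "last": "King", "email": ""},
    {"first": "Chris", "last": "Dannaher", "email": ""},
]
result = predict_missing(contacts)
```

With the sample rows above, John King gets `John.King@Nike.com` and Chris Dannaher gets `Chris.Dannaher@Nike.com`, while contacts that already have an address are left unchanged with an empty rationale.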
Looking forward to your help & advice
Many thanks
Masond3
I've been validating today and documented some use cases, which I need to vet properly tomorrow to understand the logic and where I think it's going wrong.
At the moment I should be getting more data in the output than in the input (due to the split in email domains); however, I am getting less than the input, by a significant amount.
So I just need to do some validating. I think the core of it is there; it just needs tweaking, changing, amending, etc.