Hi All,
I am relatively new to predictive modeling using Alteryx. I have a data set that contains:
My goal is to fill in the email address for every record that is missing one. I have noticed that companies typically use the same formula, combining the first and last name in some way, to generate their employees' emails (e.g. Jane.Doe@company.com or JDoe@company.com). My task is to create a machine learning model that learns the formula for each company and then fills in the missing emails. Any pointers on how to accomplish this would be greatly appreciated!
This is a bit different from a traditional machine learning problem in that you are looking for patterns in a field rather than trying to predict a specific value.
Setting the machine learning model aside for a moment, you could easily accomplish the fill of a known pattern using a formula tool. For example, if the pattern was Jane.Doe@company.com and you had another record with John Smith as the first and last name, you could use a formula tool that says:
[First Name]+"."+[Last Name]+"@company.com" and it would generate the email addresses from the information that you have.
Now back to the patterns...
You could create new columns that have flags for different pattern matches and then see which column of patterns has the most flags after checking all records. For example, you could create a new column called FirstInitialLastName where you say
IF Left([First Name], 1)+[Last Name]+"@company.com" = [Email]
THEN 1
ELSE 0
ENDIF
This would create a flag with a value of 1 in a new column for all situations where this is true. Then you could have another formula that creates a column called FirstNamePeriodLastName that says
IF [First Name]+"."+[Last Name]+"@company.com" = [Email]
THEN 1
ELSE 0
ENDIF
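If it helps to see the flag logic outside of Alteryx, here is a rough Python sketch of the same two checks. The sample records, the dictionary keys and the function names are made up for illustration, and the lowercase comparison is my own assumption (email addresses are usually treated as case-insensitive), so adjust it if you want an exact match.

```python
# Hypothetical sample records; the keys stand in for [First Name], [Last Name], [Email].
records = [
    {"first": "Jane", "last": "Doe", "email": "Jane.Doe@company.com"},
    {"first": "John", "last": "Smith", "email": "John.Smith@company.com"},
    {"first": "Ann", "last": "Lee", "email": "ALee@company.com"},
]

DOMAIN = "@company.com"  # assumed single company, as in the formulas above


def first_initial_last_name(rec):
    """Return 1 if the email equals first initial + last name + domain, else 0."""
    candidate = rec["first"][:1] + rec["last"] + DOMAIN
    # Assumption: compare case-insensitively.
    return int(candidate.lower() == rec["email"].lower())


def first_name_period_last_name(rec):
    """Return 1 if the email equals first name + "." + last name + domain, else 0."""
    candidate = rec["first"] + "." + rec["last"] + DOMAIN
    return int(candidate.lower() == rec["email"].lower())


# One dictionary of flags per record, like adding two new columns.
flags = [
    {
        "FirstInitialLastName": first_initial_last_name(r),
        "FirstNamePeriodLastName": first_name_period_last_name(r),
    }
    for r in records
]

for rec, flag in zip(records, flags):
    print(rec["email"], flag)
```

Each record ends up with one 0/1 value per candidate pattern, which is what the two formula expressions produce as new columns.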
Rather than using a machine learning model, you could just sum each flag column across all records (separately for each company, if your data covers more than one). Then you could use another formula tool to fill in the missing emails using whichever pattern had the greatest sum.
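To make the count-and-fill idea concrete, here is a minimal Python sketch of it, not an Alteryx workflow. It assumes each record carries a `domain` field identifying the company, which may not match your data, and the pattern names, the `fill_missing_emails` helper and the sample rows are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical candidate patterns: name -> function that builds the part before "@".
PATTERNS = {
    "FirstInitialLastName": lambda first, last: first[:1] + last,
    "FirstNamePeriodLastName": lambda first, last: first + "." + last,
}


def fill_missing_emails(records):
    """Return copies of the records with missing emails filled per domain."""
    # Per domain, count how many known emails each pattern reproduces.
    counts = defaultdict(Counter)
    for rec in records:
        if not rec["email"]:
            continue
        local = rec["email"].split("@")[0].lower()
        for name, build in PATTERNS.items():
            if build(rec["first"], rec["last"]).lower() == local:
                counts[rec["domain"]][name] += 1

    filled = []
    for rec in records:
        rec = dict(rec)  # copy so the input is left untouched
        if not rec["email"] and counts[rec["domain"]]:
            # Use the pattern with the greatest count for this domain.
            best, _ = counts[rec["domain"]].most_common(1)[0]
            rec["email"] = PATTERNS[best](rec["first"], rec["last"]) + "@" + rec["domain"]
        filled.append(rec)
    return filled


# Hypothetical sample data with two companies that use different patterns.
records = [
    {"first": "Jane", "last": "Doe", "domain": "company.com", "email": "Jane.Doe@company.com"},
    {"first": "John", "last": "Smith", "domain": "company.com", "email": None},
    {"first": "Ann", "last": "Lee", "domain": "other.org", "email": "ALee@other.org"},
    {"first": "Bob", "last": "Ray", "domain": "other.org", "email": None},
]

result = fill_missing_emails(records)
for rec in result:
    print(rec["first"], rec["last"], rec["email"])
```

Records whose company has no matching pattern are left empty rather than guessed. If two patterns tie, this sketch simply takes whichever was counted first, so you may want to review those companies by hand.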