This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
We have a file with names in this format: first name(s) always Proper case, surnames are always UPPER case.
Philip Seymour HOFFMAN
Helena BONHAM CARTER
We want to split out first name and surnames into new fields, i.e.
Philip Seymour HOFFMAN
Helena BONHAM CARTER
Is there a formula that could do this? The best workaround I can come up with involves many steps: splitting out each part of the name into a separate field, creating duplicates of each field in proper case, comparing whether each string matches it's Proper-case-duplicate to identify whether it is a portion of the surname or first name(s), and then joining the sections back together. Probably some cross-tab and transpose action in there to help with the comparison and then the piecing back together.
It seems like there should be a more elegant way to do this - does anyone know of one?
You will be able to use the RegEx tool to achieve this.
I would place it in 'Parse mode' which allows you to create 'marked groups' in your case we need two marked groups.
I'm not an expert in RegEx by any means but the following seems to provide me with the result you are after.
Essentially group one takes any words beginning with an uppercase character and proceeded with a lowercase character until the point at which two uppercase characters in a row occur, at which point the second group kicks in which represents our surnames.
That's fantastic, thank you Ben! And thank you for also explaining how to deploy it and a bit about it's working :) I did think the solution might involve RegEx, but I would have no idea where to start, I now feel like it's been demystified a tiny bit for me.
In case anyone else looks to employ this, we did discover there were some anomalies in our data where this didn't work, e.g. names with apostrophes in like O'NEILL. In our case we can come up with some workarounds for these, like stripping them first with the Data Cleaning tool, since we only need to be able to match to existing details. If anyone has a formula that can work with punctuation too then feel free to share it!
I think I've arrived at the solution through pure 'button mashing' and figuring out in reverse why it worked!
Use the following expression in the RegEx tool -
the original expression (\u\l.+[^\u\u]) (\u\u.+) is unable to parse where there is an apostrophe in the surname, as I think \u\u is matching on values where the first 2 characters are 2 consecutive uppercase letters, for example - Helena BONHAM CARTER = Helena and BONHAM CARTER Testing O'NEILL = Error, null values in both columns
I initially updated it to (\u\l.+[^\u.+\u]) (\u.+\u),as \u.+\u indicates a sequence of any characters that both begin and end with uppercase letters, with one or more characters of any type (.+) in between, for example -
Helena BONHAM CARTER = Helena and BONHAM CARTER Testing O'NEILL = Testing and O'NEILL
However, in doing so, I found there was also an issue where the expression was unable to parse the field if an individual had a 3-letter first name, for example - Ian SURNAME = Error, null values in both columns
This comes from the initial \u\l.+ in marked group 1, because a 3-letter first name does not have one or more characters (+) following the single, unspecified character (.) that isn't the initial upper/lower case letters (\u\l) - it only satisfies three of the four match criteria. All of the following would rectify this -
- however, it seems neater to use the code above, as the sequence of letters for first names doesn't need to be specified.
The ^ in the character set of marked group 1, 'negates' the matching characters from the output (essentially, Output 1 = "return this sequence, minus what matches the criteria held in the square brackets", Output 2 = "return what matches the criteria held in the square brackets"). If Uppercase is being used exclusively for the surname values, and you are not trying to further split what remains in output 1, it is enough to return any one or more characters (.+), that don't match the negated group because you already know these are your first name values.
Hi Brzlg , I had the same problem, and applied your trick (impressed by your Regex mastery, btw ;-)), but noticed 2 small remaining problems which I don't know how to solve, as I don't master regex: 1. if the family name consists of 2 uppercase characters (e.g. LI, YE,...) the expression doesn't work 2. if the person has a first name consisting of 2 parts, whereby the second part is in fact an 'initial' only (e.g. "Shirley J" (Full Name being "Shirley J PING"), the expression doesn't manage to have "Shirley J" as first name) Any further ideas? Thanks