This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
We have a file with names in this format: first name(s) always Proper case, surnames are always UPPER case.
Full name |
Philip Seymour HOFFMAN |
Helena BONHAM CARTER |
Catherine ZETA-JONES |
etc... |
We want to split out first name and surnames into new fields, i.e.
Full name | First name | Surname |
Philip Seymour HOFFMAN | Philip Seymour | HOFFMAN |
Helena BONHAM CARTER | Helena | BONHAM CARTER |
Catherine ZETA-JONES | Catherine | ZETA-JONES |
Is there a formula that could do this? The best workaround I can come up with involves many steps: splitting out each part of the name into a separate field, creating duplicates of each field in proper case, comparing whether each string matches it's Proper-case-duplicate to identify whether it is a portion of the surname or first name(s), and then joining the sections back together. Probably some cross-tab and transpose action in there to help with the comparison and then the piecing back together.
It seems like there should be a more elegant way to do this - does anyone know of one?
Solved! Go to Solution.
You will be able to use the RegEx tool to achieve this.
I would place it in 'Parse mode' which allows you to create 'marked groups' in your case we need two marked groups.
I'm not an expert in RegEx by any means but the following seems to provide me with the result you are after.
(\u\l.+[^\u\u]) (\u\u.+)
Essentially group one takes any words beginning with an uppercase character and proceeded with a lowercase character until the point at which two uppercase characters in a row occur, at which point the second group kicks in which represents our surnames.
Example attached.
Ben
That's fantastic, thank you Ben! And thank you for also explaining how to deploy it and a bit about it's working :) I did think the solution might involve RegEx, but I would have no idea where to start, I now feel like it's been demystified a tiny bit for me.
In case anyone else looks to employ this, we did discover there were some anomalies in our data where this didn't work, e.g. names with apostrophes in like O'NEILL. In our case we can come up with some workarounds for these, like stripping them first with the Data Cleaning tool, since we only need to be able to match to existing details. If anyone has a formula that can work with punctuation too then feel free to share it!
Arr! Those pesky O'NEIL'S
Challenge accepted, though I'm know regex master.
I think I've arrived at the solution through pure 'button mashing' and figuring out in reverse why it worked!
Use the following expression in the RegEx tool -
(.+[^\u.+\u]) (\u.+\u)
the original expression (\u\l.+[^\u\u]) (\u\u.+) is unable to parse where there is an apostrophe in the surname, as I think \u\u is matching on values where the first 2 characters are 2 consecutive uppercase letters, for example -
Helena BONHAM CARTER = Helena and BONHAM CARTER
Testing O'NEILL = Error, null values in both columns
I initially updated it to (\u\l.+[^\u.+\u]) (\u.+\u),as \u.+\u indicates a sequence of any characters that both begin and end with uppercase letters, with one or more characters of any type (.+) in between, for example -
Helena BONHAM CARTER = Helena and BONHAM CARTER
Testing O'NEILL = Testing and O'NEILL
However, in doing so, I found there was also an issue where the expression was unable to parse the field if an individual had a 3-letter first name, for example -
Ian SURNAME = Error, null values in both columns
This comes from the initial \u\l.+ in marked group 1, because a 3-letter first name does not have one or more characters (+) following the single, unspecified character (.) that isn't the initial upper/lower case letters (\u\l) - it only satisfies three of the four match criteria. All of the following would rectify this -
(\u.+[^\u.+\u]) (\u.+\u)
(\u.[^\u.+\u]) (\u.+\u)
(\u+[^\u.+\u]) (\u.+\u)
- however, it seems neater to use the code above, as the sequence of letters for first names doesn't need to be specified.
The ^ in the character set of marked group 1, 'negates' the matching characters from the output (essentially, Output 1 = "return this sequence, minus what matches the criteria held in the square brackets", Output 2 = "return what matches the criteria held in the square brackets"). If Uppercase is being used exclusively for the surname values, and you are not trying to further split what remains in output 1, it is enough to return any one or more characters (.+), that don't match the negated group because you already know these are your first name values.
Hope this is useful!