I have concatenated strings from which I need to parse out words in all uppercase. Multiple words can occur and they could be separated by a space or other non-lowercase letter characters. Other words in the string can have the first letter only in uppercase (title case), but those words should not be included.
I have provided a text file of sample data as an attachment, including the desired outputs from the parsing. I have tried using REGEX, but I am struggling with the complexity of the pattern and the presence of title case words that I do not want included. Any assistance with a solution is greatly appreciated. Thank you in advance!
Note: Ideally the uppercase words should be at the beginning only. In other words, I would prefer that those occurring later in the string would not be parsed (see Record 5 for an example). I would welcome an alternate solution as well that would parse them anywhere in the string, however, in the event I need to do so with a future dataset.
Excerpt Sample:
Original | Desired1 | Desired2 |
THIS is a first example 202201.pdf | THIS | is a first example 202201.pdf |
UPPER PENINSULA Sample Data Line | UPPER PENINSULA | Sample Data Line |
Word This Prepayment Form | [Null] | Word This Prepayment Form |
Retailer_s More Other DA-1S | [Null] | Retailer_s More Other DA-1S |
Monthly Word More OTHER A-1246 | [Null] | Monthly Word More OTHER A-1246 |
Screen Shot (see txt attachment):
Regards,
taxtechfl
Solved! Go to Solution.
Hey @taxtechfl ,
Try this:
REGEX_Replace([Field1], "\b\u\l+\b|\b\l+\b", '',0)
It will look for words that are all lowercase or titlecase and will zap them. It will fail miserably on "aBCDe".
I wrote another expression in case you end up with extra spaces:
Regex_Replace(REGEX_Replace([Field1], "\b\u\l+\b|\b\l+\b", '',0),"\s{2,}",' ')
To be sure of all uppercase words I am going to explain the workflow that YOU must build.
1. If you don't have a recordID (unique key), put one on the data.
2. TEXT to COLUMNS tool with space as a delimiter - SET output to ROWS
3. FILTER as: REGEX_Match([Field1], "\u+",0)
4. SUMMARIZE as: Group By Record ID + String (Concatenate Function) with SPACE as a delimiter
5. JOIN the data back to your incoming data
Try this out and see the magic.
Cheers,
Mark
Thank you so much for the response and information. Although I was able to build the alternate workflow to parse all uppercase words, I was not able to get the REGEX solution for beginning uppercase words to work as expected for all records. I have attached the workflow I built. Thank you again for your assistance.
I accidentally replied to my main thread instead of your response, and it won't let me delete it. I'm replying so you get the notification. Thanks.
Hi @taxtechfl
Here's one way to do it.
Once the words are split to rows, number them, check if they're all upper case and then mark the strings of uppercase words from the start only. Concatenate the two sets of words and then join them back on record id and then join this to your original data set, giving you
Note that I modified the regex to check if the word is all upper case to [\u+\-*]* to catch TWO-OR-MORE hyphenated words at the start of the sentence
Dan
Thank you for the assistance. That gets the desired results without pulling uppercase words later in the string. This Community is awesome.