Another day, another RegEx struggle...
I want to pull the name, city and state for an organization issuing a contract. This appears towards the end of the text and is separated by commas (99% of the time). It is also always followed by the phrase "contracting activity."
The problem is the body of text begins with the awardee's name, city and state using the same comma separated construct. I know how to capture crudely the awardee information using the expression
(^.*?)\,.*?([[:upper:]].*?)\,.*?([[:upper:]].*?)\,.*?
But how do I isolate and extract the Org name, city, and sate information. Can I use contracting activity as an anchor to look back or is there another approach?
Awardee name, city, and state, ipsum dolor sit amet, awarded consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco Random Capitalized Word laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt Org Name, Org City, Org State, is the contracting activity id est laborum.
I tried using the expression \bawarded.*([[:upper:]].*)\,\s([[:upper:]].*)\,\s([[:upper:]].*)\,.*$. to avoid the initial awardee information. I can zero in on city and state. But, I am having difficulty with isolating name. The reason being that name can be 2, 3, 4, 5 or 6 + worlds. For example, the name might be U.S. Army Contracting Command or Naval Sea Systems Command, etc.
I would love to figure out how to do this with RegEx. But my alternative is to build a lookup table. Ugh.
Thanks,
Solved! Go to Solution.
Hi @hellyars
Does this work for you?
.*(?=\b\w+\s\w+,)(.*,.*,.*)(?=,.*contracting\sactivity).*
REGEX_Replace([Field1], ".*(?=\b\w+\s\w+,)(.*,.*,.*)(?=,.*contracting\sactivity).*", "$1")
The pattern I took was - 3 commas interpolated by some words, something else, contracting activity.
Cheers,
.*(?=\b\w+\s\w*,)(.*,.*,.*)(?=,.*contracting\sactivity).*
It is almost there. And, this is where I ran into trouble with my expression. How can we adjust the expression to capture all capitalized name segments? The name will always consist of at least two or more segments.
Is there a way to make it find the first capitalized word and then all the capitalized words (or characters in the case of abbreviations such as U.S.) between it and the first common that is then followed by the rest of the pattern (city, state )?
Close. Here is the good news. It works. I modified it slightly to split name, city, and state. But, both versions work. That is, they work on regex101.com environment. I can't get them to work in Alteryx. In Alteryx, its grabbing the awardee name, city, state at the start of the text. It seems to be ignoring everything that works so well on regex101.com.
Use Parse Method in RegEX tool and it will work. Be sure to uncheck the "Case Insensitive" box.
RegEX_Replace in this case is a bit tricky because it tries to replace a part of the string, so the engine works differently.
It does and does not.
If I take the exact example I used in this post and pair it with a RegEx tool using a Text Input, it works.
If I add it to my real data, it does not work.
And this is it where it gets strange. If I copy a body of text from a field in the real data and add it to the Text Input in my first test run, it works.
Huh?
Could you share your original data? At least filtering where it does not work, so we can compare this situation.
Cheers,
Here is an extract from the real data. It will not work hooked up to the RegEx tool. But, you can copy and paste records to the Input Tool and it will work.
Thanks