
SOLVED

Extract second iteration of Name, City and State

hellyars
13 - Pulsar

Another day, another RegEx struggle...

 

I want to pull the name, city and state for an organization issuing a contract.   This appears towards the end of the text and is separated by commas (99% of the time).  It is also always followed by the phrase "contracting activity."

 

The problem is that the body of text begins with the awardee's name, city and state using the same comma-separated construct.  I know how to crudely capture the awardee information using the expression

 

(^.*?)\,.*?([[:upper:]].*?)\,.*?([[:upper:]].*?)\,.*?  

 

But how do I isolate and extract the Org name, city, and state information?  Can I use "contracting activity" as an anchor to look back, or is there another approach?

 

Awardee name, city, and state, ipsum dolor sit amet, awarded consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco Random Capitalized Word laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt Org Name, Org City, Org State, is the contracting activity id est laborum.

 

I tried using the expression \bawarded.*([[:upper:]].*)\,\s([[:upper:]].*)\,\s([[:upper:]].*)\,.*$ to avoid the initial awardee information.  I can zero in on city and state.  But I am having difficulty isolating the name, because the name can be 2, 3, 4, 5 or 6+ words.  For example, the name might be U.S. Army Contracting Command or Naval Sea Systems Command, etc.
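To see why the name group comes back truncated, here is a minimal sketch in Python rather than Alteryx. It assumes [[:upper:]] can be swapped for [A-Z], since Python's re module has no POSIX classes, and the variable names are made up for illustration. The behaviour in the Alteryx RegEx tool should be similar but is not verified here.

```python
import re

# Sample text copied from the post above.
TEXT = (
    "Awardee name, city, and state, ipsum dolor sit amet, awarded consectetur "
    "adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna "
    "aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco Random "
    "Capitalized Word laboris nisi ut aliquip ex ea commodo consequat. Duis aute "
    "irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat "
    "nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa "
    "qui officia deserunt Org Name, Org City, Org State, is the contracting "
    "activity id est laborum."
)

# The attempted expression, with [[:upper:]] written as [A-Z] (an assumption
# made so it runs under Python's re module).
attempt = re.compile(r"\bawarded.*([A-Z].*),\s([A-Z].*),\s([A-Z].*),.*$")

match = attempt.search(TEXT)
attempt_groups = match.groups() if match else None
print(attempt_groups)
```

On the sample text this prints ('Name', 'Org City', 'Org State'). The greedy .* after "awarded" runs as far right as it can, so the first group only starts at the last capital letter before the comma, and "Org" is swallowed.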

 

I would love to figure out how to do this with RegEx.  But my alternative is to build a lookup table.  Ugh.


Thanks,

12 REPLIES
hellyars
13 - Pulsar

@Thableaus 

 

Oops.  Here it is.  Thanks.

Thableaus
17 - Castor

@hellyars 

 

I modified your expression to this:

.*?((?:\b[A-Z]\w*\W*)+),([^,]+),([^,]+)(?=,[^,]+contracting\sactivity).*

 

But I found many situations where this pattern is not followed.

Basically, all I did was translate the pattern you described as the best way to identify this situation into a RegEx.

 

But the fact is that your data is highly variable. Like I said, I don't think a single RegEx will extract everything you want.

As the volume of data grows, more problems are likely to surface.

 

Be aware that my RegEx says this:

Identify a sequence of capitalized words, then a comma, then anything that is not a comma, then another comma, then anything that is not a comma, all of it followed by a comma, anything that is not a comma, and the exact phrase "contracting activity".
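For anyone who wants to try the expression outside Designer, here is a small sketch that runs it in Python against the sample text from the original post. The expression is copied as posted. The variable names and the whitespace stripping are assumptions added for illustration, and Python's re engine may differ from the Alteryx RegEx tool on edge cases.

```python
import re

# Sample text copied from the original post.
TEXT = (
    "Awardee name, city, and state, ipsum dolor sit amet, awarded consectetur "
    "adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna "
    "aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco Random "
    "Capitalized Word laboris nisi ut aliquip ex ea commodo consequat. Duis aute "
    "irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat "
    "nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa "
    "qui officia deserunt Org Name, Org City, Org State, is the contracting "
    "activity id est laborum."
)

# The expression from this reply, unchanged.
PATTERN = re.compile(
    r".*?((?:\b[A-Z]\w*\W*)+),([^,]+),([^,]+)(?=,[^,]+contracting\sactivity).*"
)

match = PATTERN.search(TEXT)
# Groups 2 and 3 keep the space after each comma, so strip every group.
org_fields = tuple(g.strip() for g in match.groups()) if match else None
print(org_fields)
```

On the sample text this prints ('Org Name', 'Org City', 'Org State'). The awardee at the start is skipped because "Awardee" is not directly followed by a comma, and the lookahead only succeeds where "contracting activity" sits in the next comma-separated field.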

 

Cases that I found that do not match this situation:

 

- Some Org Names have "of" in the middle of the name, which is not capitalized;

- "Contracting activity" sometimes does not come after a comma. It starts a new sentence, with a period before it;

- Some records have "contract activity" instead of "contracting activity";

- Some Org Names contain numbers, whereas the expression assumes they are made up entirely of capitalized words.
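Some of the exceptions above could be handled by anchoring on the phrase and working backwards through the comma-separated fields, instead of stretching a single expression. The sketch below is in Python, not an Alteryx workflow, and every name in it (ANCHOR, CONNECTORS, trailing_name, extract_org) is hypothetical. It covers lowercase connectors such as "of", "contract" versus "contracting", and tokens that start with a digit. It does not cover the period-before-anchor case, because the thread shows no example of that layout.

```python
import re

# Accept both "contracting activity" and "contract activity".
ANCHOR = re.compile(r"contract(?:ing)?\s+activity", re.IGNORECASE)
# Lowercase words allowed inside an org name (an assumed, incomplete list).
CONNECTORS = {"of", "the", "and", "for"}

def trailing_name(field):
    """Return the run of name-like tokens at the end of a field."""
    kept = []
    for tok in reversed(field.split()):
        name_like = tok[0].isupper() or tok[0].isdigit()
        if name_like or (kept and tok in CONNECTORS):
            kept.append(tok)
        else:
            break
    # Drop connectors left dangling at the front of the name.
    while kept and kept[-1] in CONNECTORS:
        kept.pop()
    return " ".join(reversed(kept))

def extract_org(text):
    """Return (name, city, state) found before the anchor, or None."""
    m = ANCHOR.search(text)
    if m is None:
        return None
    parts = [p.strip() for p in text[: m.start()].split(",")]
    if len(parts) < 4:
        return None
    # The last part is the filler between the final comma and the anchor.
    return trailing_name(parts[-4]), parts[-3], parts[-2]

# Hypothetical row, not taken from the poster's data.
row = ("Lorem ipsum awarded by Office of Naval Research, Arlington, "
       "Virginia, is the contract activity.")
result = extract_org(row)
print(result)
```

For the hypothetical row this prints ('Office of Naval Research', 'Arlington', 'Virginia'). A capitalized word sitting right before the real name would still be pulled in, so a lookup table remains a sensible check on the output.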

 

Take a further look and tell me what you think.

 

Cheers,

hellyars
13 - Pulsar

 

@Thableaus 

 

This works.  Yes, there are anomalies.  But it captures a large percentage of the information, and that gives me options.  I can iterate on the expression to account for contract vs. contracting and other annoying quirks.  Alternatively, I can use the expression to build a lookup table.  A lookup table might take out some of the variability.

 

THANKS AGAIN FOR YOUR HELP.
