Reference the sample body of text below. I need to extract the contract number from it. The pattern to the contract number is ([A-Z]{1}.{5}-[0-9]{2}-[A-Z]{1}-.{4}), with hyphens separating the different sections. Sometimes the contract number lacks the hyphens. In the sample below, the contract number N0018919D0004 lacks the hyphens.
How can I extract the number and add the hyphens or add the hyphens and extract the number? I need to keep the body of text. There can be more than one contract number -- they will all have or not have hyphens, there is never a mix.
If the hyphens exist, I use the RegEx tool set using the expression ([A-Z]{1}.{5}-[0-9]{2}-[A-Z]{1}-.{4}) to tokenize and split to rows.
Thanks
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Tempus quam pellentesque nec nam. Sapien pellentesque habitant morbi tristique senectus et netus et. Convallis posuere morbi leo urna molestie. Enim sed N0018919D0004 turpis in eu mi bibendum neque egestas. Eu mi bibendum neque egestas congue. Mi ipsum faucibus vitae aliquet nec ullamcorper sit. Id ornare arcu odio ut. Fermentum dui faucibus in ornare quam viverra. Integer malesuada nunc vel risus commodo viverra maecenas accumsan. Consectetur lorem donec massa sapien faucibus et molestie ac. Et magnis dis parturient montes nascetur ridiculus mus mauris.
Solved! Go to Solution.
Hi @hellyars
You could add a question mark after each hyphen, which basically means 0 or 1 hyphen: [A-Z]{1}.{5}-[0-9]{2}-[A-Z]{1}-.{4}
Hope this helps.
Jasper
Hi @hellyars
Does this work well for you?
First I add hyphens to your contract number, then extract them into Rows. Workflow appended.
Cheers,
Thanks. Two things.
1. This works.
2. BUT, it reveals a flaw in my expression. My expression is now picking up spaces, so its picking up word number combinations that meet the same criteria. How do I change my expression to ignore spaces? Is there a way to specify that the .{5} must be either a number or a letter but not a space?
@Thableaus Thanks. Same two things.
1. It works.
2. But I am running into the same problem. My expression is catching other text, space and number combinations that match the pattern. How do I mod the expression to only inlaced alpha numeric characters and ignore spaces? I am still new to RegEx expressions.
Thanks again.
Hi @hellyars
Instead of (.), use \w
REGEX_Replace([Field1], "([A-Z]{1}\w{5})([0-9]{2})([A-Z]{1})(\w{4})", "$1-$2-$3-$4")
\w brings only word characters (letters, digits and also the underscore) and excludes punctuation.
Cheers,
I'm running into some problems translating it to the real world. I can't get it to catch everything. And, I can't sort out why.
Here is some real world data. It includes contract numbers with and without hyphens. It also includes examples with more than one contract Id per entry.
Thanks.
Hi @hellyars
For some weird reason, there is a hyphen difference between some records.
That means - some records have a kind of hyphen, others have another.
Example:
GE Medical Systems Information Technologies Inc., Wauwatosa, Wisconsin, has been awarded a maximum $450,000,000 firm‐fixed‐price, indefinite‐delivery/indefinite‐quantity contract for patient monitoring systems, accessories and training. This was a competitive acquisition with 36 responses received. This is a five-year base contract with one five‐year option period. Location of performance is Wisconsin, with a Jan. 10, 2024, performance completion date. Using customers are Army, Navy, Air Force, Marine Corps, and federal civilian agencies. Type of appropriation is fiscal 2019 through 2024 defense working capital funds. The contracting activity is the Defense Logistics Agency Troop Support, Philadelphia, Pennsylvania (SPE2D1‐19‐D‐0010).
SPE2D1‐19‐D‐0010 has a CHAR(8208), a UNICODE hyphen.
On the other hand
Raytheon Co. (Raytheon) Space and Airborne Systems (SAS), San Diego, California, is being awarded a single award with a contract ceiling of $9,607,811 for an indefinite-delivery/indefinite-quantity contract for Multi-Spectral Targeting System (MTS) sensor support. Raytheon will provide subject matter expertise as a member of a government-led sensor development and demonstration team and will provide research, development, fielding and test support, operations, maintenance, and as-needed repairs on the government-owned MTS-class sensors. Raytheon is the sole designer, developer, and manufacturer of the MTS-class sensor. Work will be performed at El Segundo and San Diego, California. The ordering period and the period of performance is five years from the date of award. The first task order will be awarded at the same time the basic contract is awarded. Fiscal 2018 research, development, test and evaluation funds in the amount of $1,299,520 for the first task order is being obligated at time of award. The award to Raytheon SAS is the result of a proposal submitted in response to a sole-source solicitation (HQ0147-18-R-0013) one offer was received. The Missile Defense Agency, Redstone Arsenal, Alabama, is the contracting activity. (HQ0147-19-D-0013).
HQ0147-19-D-0013 has a CHAR(45) hyphen, an ASCII hyphen
This expression solves it for all - Tokenize - ([A-Z]{1}\w{5}[[:punct:]]?[0-9]{2}[[:punct:]]?[A-Z]{1}[[:punct:]]?\w{4})(?:.)
That's because [[:punct:]] refers to any kind of punctuation (ASCII and UNICODE).
This was a very technical and specific case, I'm sorry you were introduced to RegEX in such a complicated situation.
Workflow with final solution - and also the CHAR difference between hyphens is appended to you.
Cheers,
That's some impressive root cause analysis. Thank you. Everything is working.