All my records are trapped in a single blob of text. This issue is an old nemesis. This is just a different source/twist. This time all the records are in one cell. There are no standard delimiters. See examples below.
The first thing I need to break the individual entries into their own cells. From there I can probably sort out the rest. But, how do I separate the entries?
There is one patter that I can use. Each new entry starts with a number d\{1,3} followed by a space or more (given the quality of the PDF import) and then a title (words) in ALL CAPS. See bold below....
INPUT
144 EXPLANATION OF PROJECT LEVEL ADJUSTMENTS [ in thousands of dollars ] Budget Commitiee Change from Pp Request Recommended Request 11 UH-60 BLACKHAWK ( MYP ) 9,999 9,999 9,999 Program increase - six HH-60 M for the National Guard 1,000 Excess fielding costs 702 14 CH-47 HELICOPTER 9,999 9,999 9,999 Program increase - F Block If 8,000 MH-47 G SLEP unit cost increase -344563 15 CH-47 HELICOPTER ( AP - CY ) 9,999 9,999 9,999 Program increase - F Biock II 29,000 19 GRAY EAGLE MODS2 9,999 9,999 9,999 Program increase - link 16 communications upgrade for MQ-1C 14,000 21 AH-64 MODS 9,999 9,999 9,999 Sensors cost growth 9,740 27 UTILITY HELICOPTER MODS 9,999 9,999 9,999 Program increase - enhanced ballistic protection system 5,000 Program increase - tail rotor drive system 5,000 35 AIRCRAFT SURVIVABILITY EQUIPMENT 9,999 9,999 9,999 Costs previously funded 55,000 40 COMMON GROUND EQUIPMENT 9,999 9,999 9,999 SCAMP if fielding costs previously funded 91,176 41 AIRCREW INTEGRATED SYSTEMS 9,999 9,999 9,999 Airframe kit instal excess growth -32,000 |
DESIRED OUTPUT |
144 EXPLANATION OF PROJECT LEVEL ADJUSTMENTS [ in thousands of dollars ] Budget Commitiee Change from Pp Request Recommended Request |
11 UH-60 BLACKHAWK ( MYP ) 9,999 9,999 9,999 Program increase - six HH-60 M for the National Guard 1,000 Excess fielding costs 702 |
14 CH-47 HELICOPTER 9,999 9,999 9,999 Program increase - F Block If 8,000 MH-47 G SLEP unit cost increase -3963 |
15 CH-47 HELICOPTER ( AP - CY ) 9,999 9,999 9,999 Program increase - F Biock II 29,000 19 GRAY EAGLE MODS2 9,999 9,999 9,999 Program increase - link 16 communications upgrade for MQ-1C 14,000 |
21 AH-64 MODS 9,999 9,999 9,999 Sensors cost growth 9,740 |
27 UTILITY HELICOPTER MODS 9,999 9,999 9,999 Program increase - enhanced ballistic protection system 5,000 Program increase - tail rotor drive system 5,000 |
35 AIRCRAFT SURVIVABILITY EQUIPMENT 9,999 9,999 9,999 Costs previously funded 55,000 |
40 COMMON GROUND EQUIPMENT 9,999 9,999 9,999 SCAMP if fielding costs previously funded 91,176 41 AIRCREW INTEGRATED SYSTEMS 9,999 9,999 9,999 Airframe kit instal excess growth |
Solved! Go to Solution.
I should add....
Although the example does not include it, the source data could have a line that starts with a single digit -- hence the \d{1,3} example.
Hello @hellyars
I believe you can achieve the desired output by using Regex in Tokenize mode, split to rows.
I attached a sample workflow that was used in one of Regex training.
Hi @hellyars ,
I've looked for a while to your text and there is no pattern that could leave your black bold parts together.
Maybe if you have some other file where we can match product names then we can concatenate it back the lines needed.
Here is my partial solution.
BEst,
Fernando Vizcaino
@christine_assaad This is what I want to do. There is just one problem. In your example, each entry you are trying to extract ends with PMC \d{7}. The construct of each entry also seems to be consistent. My data is not. It does end with a \d{1,7}. But, there can be 1-N \d{1,7} references in-between that seem to throw off regex -- put more accurately, I do not know how to address.
@fmvizcaino Cool, but you cheated. You manually added a | at the breakpoints. I am trying to avoid such a step. But, you might be right. That might be the only way to get around the lack of a consistent pattern.
@fmvizcaino You might be on to something. Is there a way to look for the pattern at the start of an entry and insert a | to serve as a delimiter. You would be looking for a pattern of \s\d{1,3}\s[A-Z][A-Z0-9]....
Hi @hellyars ,
I've used regex to add the | before each step. Look at the 2nd formula from the 1st formula tool. The 1st formula is useless, not sure what that is doing there.
Best,
Fernando Vizcaino
@fmvizcaino My bad. I thought I saw a | in the input text tool. It was a [. Your RegexReplace expression works perfectly. THANKS!