Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Parse entries from a single cell of imported text

hellyars
13 - Pulsar

All my records are trapped in a single blob of text.  This issue is an old nemesis.  This is just a different source/twist.  This time all the records are in one cell.  There are no standard delimiters.  See examples below.

 

The first thing I need to break the individual entries into their own cells.  From there I can probably sort out the rest.  But, how do I separate the entries?  

 

There is one patter that I can use.  Each new entry starts with a number d\{1,3} followed by a space or more (given the quality of the PDF import) and then a title (words) in ALL CAPS. See bold below....

 

INPUT

144 EXPLANATION OF PROJECT LEVEL ADJUSTMENTS [ in thousands of dollars ] Budget Commitiee Change from Pp Request Recommended Request 11 UH-60 BLACKHAWK ( MYP ) 9,999 9,999 9,999 Program increase - six HH-60 M for the National Guard 1,000 Excess fielding costs 702 14 CH-47 HELICOPTER 9,999 9,999 9,999 Program increase - F Block If 8,000 MH-47 G SLEP unit cost increase -344563 15 CH-47 HELICOPTER ( AP - CY ) 9,999 9,999 9,999 Program increase - F Biock II 29,000 19 GRAY EAGLE MODS2 9,999 9,999 9,999 Program increase - link 16 communications upgrade for MQ-1C 14,000 21 AH-64 MODS 9,999 9,999 9,999 Sensors cost growth 9,740 27 UTILITY HELICOPTER MODS 9,999 9,999 9,999 Program increase - enhanced ballistic protection system 5,000 Program increase - tail rotor drive system 5,000 35 AIRCRAFT SURVIVABILITY EQUIPMENT 9,999 9,999 9,999 Costs previously funded 55,000 40 COMMON GROUND EQUIPMENT 9,999 9,999 9,999 SCAMP if fielding costs previously funded 91,176 41 AIRCREW INTEGRATED SYSTEMS 9,999 9,999 9,999 Airframe kit instal excess growth -32,000 

 

 

DESIRED OUTPUT
144 EXPLANATION OF PROJECT LEVEL ADJUSTMENTS [ in thousands of dollars ] Budget Commitiee Change from Pp Request Recommended Request
11 UH-60 BLACKHAWK ( MYP ) 9,999 9,999 9,999 Program increase - six HH-60 M for the National Guard 1,000 Excess fielding costs 702
14 CH-47 HELICOPTER 9,999 9,999 9,999 Program increase - F Block If 8,000 MH-47 G SLEP unit cost increase -3963
15 CH-47 HELICOPTER ( AP - CY ) 9,999 9,999 9,999 Program increase - F Biock II 29,000 19 GRAY EAGLE MODS2 9,999 9,999 9,999 Program increase - link 16 communications upgrade for MQ-1C 14,000
21 AH-64 MODS 9,999 9,999 9,999 Sensors cost growth 9,740
27 UTILITY HELICOPTER MODS 9,999 9,999 9,999 Program increase - enhanced ballistic protection system 5,000 Program increase - tail rotor drive system 5,000
35 AIRCRAFT SURVIVABILITY EQUIPMENT 9,999 9,999 9,999 Costs previously funded 55,000
40 COMMON GROUND EQUIPMENT 9,999 9,999 9,999 SCAMP if fielding costs previously funded 91,176 41 AIRCREW INTEGRATED SYSTEMS 9,999 9,999 9,999 Airframe kit instal excess growth 

 

8 REPLIES 8
hellyars
13 - Pulsar

I should add....

 

Although the example does not include it, the source data could have a line that starts with a single digit -- hence the \d{1,3} example.  

 

 

Hello @hellyars 

 

I believe you can achieve the desired output by using Regex in Tokenize mode, split to rows.

 

I attached a sample workflow that was used in one of Regex training.

fmvizcaino
17 - Castor
17 - Castor

Hi @hellyars ,

 

I've looked for a while to your text and there is no pattern that could leave your black bold parts together.

Maybe if you have some other file where we can match product names then we can concatenate it back the lines needed.

 

Here is my partial solution.

BEst,

Fernando Vizcaino

hellyars
13 - Pulsar

@christine_assaad   This is what I want to do.  There is just one problem.  In your example, each entry you are trying to extract ends with PMC \d{7}.  The construct of each entry also seems to be consistent.  My data is not.  It does end with a \d{1,7}.  But, there can be 1-N \d{1,7} references in-between that seem to throw off regex -- put more accurately, I do not know how to address.  

hellyars
13 - Pulsar

@fmvizcaino   Cool, but you cheated.  You manually added a | at the breakpoints.  I am trying to avoid such a step.  But, you might be right. That might be the only way to get around the lack of a consistent pattern. 

hellyars
13 - Pulsar

@fmvizcaino   You might be on to something.  Is there a way to look for the pattern at the start of an entry and insert a | to serve as a delimiter.   You would be looking for a pattern of \s\d{1,3}\s[A-Z][A-Z0-9]....

fmvizcaino
17 - Castor
17 - Castor

Hi @hellyars ,

 

I've used regex to add the | before each step. Look at the 2nd formula from the 1st formula tool. The 1st formula is useless, not sure what that is doing there. 

 

Best,

Fernando Vizcaino

hellyars
13 - Pulsar

@fmvizcaino   My bad.  I thought I saw a | in the input text tool. It was a [.   Your RegexReplace expression works perfectly.  THANKS!  

Labels