SOLVED

How to split into columns garbled text from PDF

hellyars
13 - Pulsar

Okay.  I have a messy PDF table.  It does not want to play nice.   The following example -- see below -- comes across as 1 cell.

 

I tried using the delimiter \n.   That did not work.  

 

The key is the number at the start of a line.   This is the LINE NUMBER.   The LINE NUMBER is followed by the PROGRAM TITLE.  The line or lines between LINE NUMBER / PROGRAM TITLE rows are COMMENTS.  There may be one or more COMMENT rows associated with each LINE NUMBER / PROGRAM TITLE.  How can I split this jumbled mess into these fields (see desired output below)?

 

I also want to get rid of all '.' (periods).   This will leave some OCR errors, but I am okay with that (for now).

 

EXAMPLE of STARTING MESS

2     Utility FNI Aircraft  .......................................................................

Insufficient  budget  justification: Lack  of  supporting  jus- tification  ............................................................................

4    R0-11 (RAVEN)  ......................   ..... ......................  ...... ...... ................  .

Improving funds management: Prior year carl)'OV1!r  ...... ...... .

5    Tactical Unmanned Aircraft System (TUAS) .................  .... ..... .........

Insufficient  budget   justification:  Poor  justification  mate- rials  ...................................................................................

7    Helicopter, Light Utility (LUH(  ............ ....................... .................... .

Program increase, Three aircraft  .........................................

Program increase, Expandable rotorcraft diagnostics  ..........

12    UH-60 Blackhawk M Model (MVP) ................ ................ ...... .......... .

Restoring acquisition  accountability, Unit cost growth ........

16    CH-47 Helicopter-AP  ..................................................................

Program increase ................... ............................................... .

20    Gray Eagle Mods2  ............................  ..............................................   .

Program increase ............ .......................................................

26    EMARSS SEMA Mods (MIP) ............ ................................. ................

Program increase: Performance enhancements  ....................

28                   0

Utility  ;: : :  t:  :  e  s:   ,·uii' ,2·,ii..;i ·;  i;i  ; i.. d··j;; ·: I

uct improvements  ............................... ..............................

 

This is my desired output.  Note I left the OCR errors.

 

LINE NUMBER | PROGRAM TITLE | COMMENT
2 | Utility FNI Aircraft | Insufficient budget justification: Lack of supporting justification
4 | R0-11 (RAVEN) | Improving funds management: Prior year carl)'OV1!r
2 REPLIES
hellyars
13 - Pulsar

The number (LINE NUMBER) can be 1 to 3 digits.
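For anyone working outside Designer, here is a minimal Python sketch of how a LINE NUMBER row could be told apart from a COMMENT row, given the 1 to 3 digit rule. The pattern and the helper name `parse_line_row` are my own assumptions, not something from the workflow; a comment that happens to begin with a short number would be misread as a LINE NUMBER row.

```python
import re

# Assumed pattern: 1 to 3 digits at the start of the row, some whitespace,
# then the rest of the row (the PROGRAM TITLE plus any trailing dots).
LINE_ROW = re.compile(r"^(\d{1,3})\s+(.+)$")


def parse_line_row(row):
    """Return (line_number, rest_of_row) for a LINE NUMBER row, else None."""
    match = LINE_ROW.match(row.strip())
    if match is None:
        return None  # treat the row as a COMMENT
    return int(match.group(1)), match.group(2)
```

The same expression, `^(\d{1,3})\s+(.+)$`, should also be usable in a regex parse step, though I have only sketched it here in Python.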

hellyars
13 - Pulsar

Okay.   I had to change the field size to get the initial split to columns using the delimiter \n to work.  The Data Cleansing tool takes care of the punctuation.   From there, it was just a series of multi-row and RegEx parse steps.
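The same sequence of steps (split on \n, strip the periods, then carry the LINE NUMBER and PROGRAM TITLE down onto the comment rows) can be sketched in Python. This is only an illustration of the logic, not the Alteryx workflow itself; the function names, the hyphen rejoin in `clean`, and the choice to collect comments in a list per record are all assumptions on my part.

```python
import re

# Assumed pattern for a LINE NUMBER row: 1 to 3 digits, whitespace, title.
LINE_ROW = re.compile(r"^(\d{1,3})\s+(.+)$")


def clean(text):
    """Drop periods, rejoin words split as 'jus- tification', collapse spaces."""
    text = text.replace(".", "")
    text = re.sub(r"(\w)- (\w)", r"\1\2", text)  # assumption: '- ' marks a wrapped word
    return re.sub(r"\s+", " ", text).strip()


def parse_cell(cell):
    """Split one jumbled cell into records with their comments attached."""
    records = []
    for raw in cell.split("\n"):
        row = clean(raw)
        if not row:
            continue  # blank row, or a row that was only dots
        match = LINE_ROW.match(row)
        if match:
            # New LINE NUMBER / PROGRAM TITLE row starts a new record.
            records.append({
                "line_number": int(match.group(1)),
                "program_title": match.group(2),
                "comments": [],
            })
        elif records:
            # Any other row is a COMMENT for the most recent record.
            records[-1]["comments"].append(row)
    return records
```

Each physical row is treated as its own comment, so a comment that wraps onto a second row is not rejoined, and OCR errors are left as they are.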
