Help Needed: Formatting PDF to Text Extraction (Line Method) into Table Format

buddhiDB
7 - Meteor

Hi Alteryx Community,

I’m working on a workflow to extract a specific table from a batch of PDFs using the "PDF to Text" tool with the "Line" method.

In the attached Excel file:

  • The "Lines method" tab shows the raw extracted data from the PDFs.

  • The "Table snips" tab includes screenshots of the tables from the PDFs, provided for reference.

  • The "Expected Result" tab shows the desired output format.

I'm struggling to format the extracted data into a structured table. The main issue is that the relevant values are not always aligning correctly under the appropriate headers, likely due to inconsistencies in spacing or formatting in the original PDF files.

Could anyone guide me on how to transform the data in the "Lines method" tab into the desired format shown in the "Expected Result" tab? Any suggestions or example workflows would be greatly appreciated.

Thank you in advance for your support!

Best regards,
Buddhi

2 REPLIES
KGT
13 - Pulsar

I'm not sure why you would use the Lines method output instead of the Table method. The Table method output already has the values split out, so you just need to re-align them. Because the table header is already tagged, you can work out which column is which and join the values back onto the headers. I haven't validated the data, and with a lot more of it I expect you will need longer than the 5 minutes this took to build and test. I also wouldn't be surprised if there are one or two cases where you need to write a rule to get around them.
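
To make the "work out which column is which and join it back on" step concrete, here is a minimal Python sketch of the idea. It is not the workflow in the screenshot: the record layout (row, column position, text, header flag) and the sample values are assumptions made up for illustration, not the actual output schema of the PDF to Text tool.

```python
from collections import defaultdict

# Hypothetical cell records: (row, column position, text, is_header).
# This layout is an assumption for illustration, not the tool's real schema.
cells = [
    (0, 0, "Item", True), (0, 1, "Qty", True), (0, 2, "Price", True),
    (1, 0, "Widget", False), (1, 1, "4", False), (1, 2, "9.50", False),
    (2, 0, "Gadget", False), (2, 2, "3.25", False),  # no Qty cell on this row
]

def realign(cells):
    # Build a lookup from column position to header name using the tagged header cells.
    headers = {col: text for _, col, text, is_header in cells if is_header}
    rows = defaultdict(dict)
    for row, col, text, is_header in cells:
        # Join each data cell back onto its header via the column position.
        if not is_header and col in headers:
            rows[row][headers[col]] = text
    # One dict per data row; headers with no cell on that row get an empty string.
    return [
        {name: rows[r].get(name, "") for name in headers.values()}
        for r in sorted(rows)
    ]

table = realign(cells)
```

In Designer terms this roughly corresponds to filtering the header rows, joining them to the data rows on column position, and cross-tabbing the result, though the exact tools depend on what the Table method output actually looks like.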

 

[Attached screenshot: AlteryxGui_fnuwhX0l8e.png]

Bonnie219Bailey
5 - Atom

Hello!

Since I can't directly access external files or view attachments like the Excel file you mentioned, I can't provide a precise, ready-to-use Alteryx workflow. However, I can offer strategies and Alteryx tools that are commonly used to tackle the challenge of extracting structured data from inconsistently formatted PDF text output, especially when using the "Line" method.

Your problem is a classic text parsing challenge where you need to normalize varying spacing and align data to headers. 
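
As a rough illustration of that idea, here is a minimal Python sketch that splits each extracted line into tokens separated by two or more spaces and assigns every token to the header whose starting position is closest. The header names, sample lines, and the nearest-offset rule are all assumptions for illustration; they are not taken from your attachment, and real PDFs may need extra rules.

```python
import re

# Words separated by a single space stay together as one token;
# a run of two or more spaces is treated as a column gap.
TOKEN = re.compile(r"\S+(?: \S+)*")

def spans(line):
    # Return (start offset, text) for every token on the line.
    return [(m.start(), m.group()) for m in TOKEN.finditer(line)]

def align(header_line, data_lines):
    headers = spans(header_line)
    table = []
    for line in data_lines:
        row = {name: "" for _, name in headers}
        for start, text in spans(line):
            # Assign the token to the header with the nearest start offset.
            _, name = min(headers, key=lambda h: abs(h[0] - start))
            row[name] = (row[name] + " " + text).strip()
        table.append(row)
    return table

# Hypothetical sample data; ljust makes the column offsets explicit.
header = "Item".ljust(12) + "Qty".ljust(6) + "Price"
lines = [
    "Widget".ljust(12) + "4".ljust(6) + "9.50",
    "Gadget".ljust(19) + "3.25",  # Qty missing, Price shifted by one character
]

result = align(header, lines)
```

The same logic could be run in a Python tool inside the workflow or rebuilt with RegEx and Formula tools. Nearest-offset matching tolerates small shifts in spacing, but it will mis-assign values when columns are narrow or heavily misaligned, so check it against a few of your PDFs first.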
