Hi Alteryx Community,
I’m working on a workflow to extract a specific table from a batch of PDFs using the "PDF to Text" tool with the "Line" method.
In the attached Excel file:
The "Lines method" tab shows the raw extracted data from the PDFs.
The "Table snips" tab includes screenshots of the tables from the PDFs, provided for reference.
The "Expected Result" tab shows the desired output format.
I'm struggling to format the extracted data into a structured table. The main issue is that the relevant values are not always aligning correctly under the appropriate headers, likely due to inconsistencies in spacing or formatting in the original PDF files.
Could anyone guide me on how to transform the data in the "Lines method" tab into the desired format shown in the "Expected Result" tab? Any suggestions or example workflows would be greatly appreciated.
Thank you in advance for your support!
Best regards,
Buddhi
I'm not sure why you would use the Lines method output instead of the table method. The table method output already has the table separated out, so you just need to re-align it. Since the table header is already tagged, you can work out which column is which and join the values back on. I haven't validated the data, and with a lot more PDFs I expect you'd need to spend a little longer than 5 minutes building and testing. I also wouldn't be surprised if there are one or two things you need to write a rule to overcome.
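To illustrate the "work out which column is which and join back" idea outside of Alteryx, here is a rough Python sketch. It assumes the table output can be read as a list of rows of cell strings, with the tagged header row at a known position; that layout, and the function name `realign`, are assumptions for illustration and not the tool's actual output format.

```python
def realign(rows, header_index=0):
    """Map each data cell to the header that shares its column position.

    `rows` is a hypothetical layout: a list of rows, each a list of cell
    strings (or None for empty cells). `header_index` marks the header row.
    """
    header = rows[header_index]
    # Column position -> header name, skipping blank header cells.
    col_names = {
        i: name.strip()
        for i, name in enumerate(header)
        if name and name.strip()
    }

    records = []
    for row in rows[header_index + 1:]:
        record = {name: "" for name in col_names.values()}
        for i, cell in enumerate(row):
            if i in col_names and cell is not None:
                record[col_names[i]] = cell.strip()
        # Drop rows that carry no values at all.
        if any(record.values()):
            records.append(record)
    return records
```

In a workflow the same idea would likely be a transpose of the data rows followed by a join on column position against the header row, plus whatever one-off rules the real data needs.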
Hello!
Since I can't directly access external files or view attachments like the Excel file you mentioned, I can't provide a precise, ready-to-use Alteryx workflow. However, I can offer strategies and Alteryx tools that are commonly used to tackle the challenge of extracting structured data from inconsistently formatted PDF text output, especially when using the "Line" method.
Your problem is a classic text parsing challenge where you need to normalize varying spacing and align data to headers.
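As a minimal sketch of that idea (not a tested solution for the attached file), the Python below treats runs of two or more spaces as column gaps, records where each header starts, and assigns every value to the header whose start offset is closest. The function names and the sample header/line are made up for illustration, and it assumes left-aligned columns; right-aligned numeric columns would need to compare end offsets instead.

```python
import re

# A token is one or more words separated by single spaces, so any run of
# two or more spaces acts as a column gap.
TOKEN = re.compile(r"\S+(?: \S+)*")


def column_starts(header_line):
    """Return (header name, start offset) pairs for a header line."""
    return [(m.group(), m.start()) for m in TOKEN.finditer(header_line)]


def align_line(line, headers):
    """Assign each token in `line` to the header with the nearest start offset."""
    record = {name: "" for name, _ in headers}
    for m in TOKEN.finditer(line):
        name = min(headers, key=lambda h: abs(h[1] - m.start()))[0]
        # If two tokens land under the same header, keep both.
        record[name] = (record[name] + " " + m.group()).strip()
    return record


# Hypothetical sample data, not taken from the attached workbook.
header = "Item" + " " * 8 + "Qty" + " " * 4 + "Price"
line = "Widget A" + " " * 5 + "3" + " " * 5 + "9.99"
row = align_line(line, column_starts(header))
```

The same logic could be run inside a Python tool, or approximated in a workflow by parsing on multiple spaces with the RegEx tool and then matching values to headers by position.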