
Alteryx Designer Desktop Discussions

SOLVED

Need Help Formatting PDF Extracted Data into Expected Table Format in Alteryx

buddhiDB
7 - Meteor

Hi Alteryx Experts,

 

I've been working on extracting data from a few PDFs into Alteryx, but the extracted data is misaligned across multiple columns. I’ve tried multiple methods, including Transpose, Cross Tab, Multi-Row Formula, and Text to Columns, but I haven’t been able to get the expected result.

 

Challenges I am Facing:

  • Some headers are misplaced, like "1st Inst" sometimes appearing in Column 4 instead of its correct column.
  • Some values are shifting across multiple columns after extraction.
  • Some values are misaligned due to empty spaces in the PDF structure.
  • Some rows (like "pooling") are entirely empty but need to stay in the correct position.

What I Have Tried So Far:

  1. Filter Tool – Removed unnecessary "Table Header" rows.
  2. Multi-Row Formula Tool – Tried shifting column values where they are misplaced, but couldn't fully align them.
  3. Transpose & Cross Tab Tools – Attempted reshaping the data, but misaligned numbers remained.
  4. Text to Columns Tool – Tried splitting the data correctly but faced inconsistent column placements.

Request for Help

Can anyone guide me on the best approach to correctly format this extracted data in Alteryx? Would really appreciate any suggestions, workflows, or logic to apply!

Thanks in advance! 

3 REPLIES
Gumsmenezes
10 - Fireball

Hi @buddhiDB!

 

With the way it's structured right now, I think you'll find it hard to automate, mostly because there's no standard position for the "1st Inst." figures.

 

For "Aaron John Kedzlie - 2023 - IR3", the values for 2024 provisional tax, 2024 tax pooling and amounts due are not under "1st Inst." They are under column 4, which has no header.

 

If you told me that it's always the case for the first file name, for example, there could be a way. But this happens again with "Mt Roskill Cash 'N Carry Ltd - 2023 - IR4" after skipping "Joanna Maree Kedzlie - 2023 - IR3" and "Kedzlie Home Trust - 2023 - IR6".

 

I would try a different method of extracting the data from the PDF, if possible.

OllieClarke
15 - Aurora

Hey @buddhiDB 

Here's an approach which I've tried to make as flexible as possible
image.png

However, with instances like these, it's often not worth trying to be flexible. If these files only need to be read in once, then it's often quicker to just work out the alignment and use a lookup to align the headers.

The approach I took here is to find the values that sit in a column with no header, then shift them one column to the left and one to the right (+/- 1). Whichever shift ends up with the most values aligned is the one I use to update the column.
It's not perfect: if a value sits between a column that should be null and a column that should hold the value, there's no way of knowing which column it belongs in, at least not without further business context.
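For anyone who finds it easier to read as code, here is a rough Python sketch of that shift-and-score idea. It is not the actual workflow, just an illustration under some assumptions: "aligned" is taken to mean that the shifted value lands in a column that has a header and whose cell in that row is empty, ties go to the left shift, and the `realign` and `count_fits` names, the "Item" and "Total" headers, and all the figures are made up for the example.

```python
from typing import List, Optional

Row = List[Optional[str]]


def count_fits(headers: List[str], rows: List[Row], src: int, dst: int) -> int:
    """Count values in column src that could move to column dst.

    Assumption: a value "fits" when dst exists, has a header,
    and the destination cell in that row is empty.
    """
    if dst < 0 or dst >= len(headers) or not headers[dst]:
        return 0
    return sum(1 for r in rows if r[src] is not None and r[dst] is None)


def realign(headers: List[str], rows: List[Row]) -> List[Row]:
    """Move values out of headerless columns using the best +/- 1 shift."""
    rows = [list(r) for r in rows]  # work on a copy
    for src, name in enumerate(headers):
        if name:
            continue  # column already has a header, leave it alone
        # Score both shifts; on a tie the left shift (-1) wins.
        scores = {s: count_fits(headers, rows, src, src + s) for s in (-1, 1)}
        best = max(scores, key=scores.get)
        if scores[best] == 0:
            continue  # neither neighbour can take the values
        dst = src + best
        for r in rows:
            if r[src] is not None and r[dst] is None:
                r[dst], r[src] = r[src], None
    return rows


# Made-up data: "1st Inst" figures landed in column 4, which has no header.
headers = ["Item", "Total", "1st Inst", ""]
rows = [
    ["2024 provisional tax", "300", None, "100"],
    ["2024 tax pooling", None, None, None],  # empty row stays in place
    ["Amount due", "300", None, "100"],
]
fixed = realign(headers, rows)
```

In this toy case the only valid shift is to the left, so the two stray figures move under "1st Inst" and the empty "pooling" row is untouched. When both neighbours score the same, the sketch has no better information than the tie-break, which is the same ambiguity described above.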

image.png

Anyway, hope that helps,

Ollie

Gumsmenezes
10 - Fireball

I've also explored a bit and this approach should always work, as long as the misalignments are consistent.

 

Screenshot 2025-02-25 145004.png

 

 
