Alteryx Designer Desktop Discussions

PDF Parse using PDF Import not capturing Table Structure

NeilFisk
9 - Comet

I have a lengthy PDF that has tables embedded in it. I have tried the PDF to Text tool from the Intelligence Suite (under Computer Vision), but both the Alteryx Table and the Text Delimited output options give the same, less-than-ideal result. Ideally, I would like the extraction to recognize the tables (i.e., recognize that there are lines and create fields accordingly).

 

Any suggestions?

 

Thanks,

Neil

2 REPLIES
NeilFisk
9 - Comet

Thanks for your response.

 

I have used other tools like PDF2XL, Adobe Acrobat Pro, and Kofax Power PDF, and have looked at Tabula, Camelot, and the like on the Python side. Using them defeats the point of using Alteryx Designer and Intelligence Suite in the first place. Neither the COTS software nor Python provides a perfect solution: the former has limitations on extraction, and the latter requires a lot of coding. The point of the PDF tool within the Intelligence Suite was to provide a no-code solution that, unfortunately, falls short. If Alteryx developers want to understand how they are falling short, they should reach out to users like me who see the shortcomings with real-world data.

 

For what I'm doing, unfortunately, I will fall back to Camelot as a first pass (potentially building it into the Python tool in Alteryx Designer) followed by cleansing in Alteryx.
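
A minimal sketch of what that Camelot first pass might look like, for anyone following along. It assumes the `camelot-py` package is installed; the function name `extract_tables` and the file name `input.pdf` are hypothetical, and the choice of the `lattice` flavor (which relies on ruled lines between cells) is an assumption based on the tables having visible lines.

```python
def extract_tables(pdf_path, pages="1-end", flavor="lattice"):
    """Return one pandas DataFrame per table Camelot detects in the PDF.

    "lattice" uses ruled lines to find cells; "stream" uses whitespace.
    This sketch only accepts those two flavors.
    """
    if flavor not in ("lattice", "stream"):
        raise ValueError(f"unsupported flavor for this sketch: {flavor!r}")

    # Imported here so the module loads even where camelot-py is missing.
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path, pages=pages, flavor=flavor)
    # Each detected table exposes its cells as a DataFrame via .df
    return [table.df for table in tables]


if __name__ == "__main__":
    # "input.pdf" is a placeholder path, not a file from this thread.
    for i, frame in enumerate(extract_tables("input.pdf")):
        print(f"table {i}: {frame.shape[0]} rows x {frame.shape[1]} columns")
```

Inside the Designer Python tool, the resulting frames would then be passed to an output anchor so the cleansing can happen in ordinary Alteryx tools downstream.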

 

Regards,

Neil

KGT
12 - Quasar

Hi,

 

First up, please ignore the post by esther598; it's spam, and the link in it is potentially dangerous.

 

I've had good results with Tables from PDFs using Intelligence Suite. I think I parsed it by Line in the PDF to Text tool and then parsed it further to get it into the right columns. I don't have IS anymore to test it, though when I look at the XML of that workflow I think I used Lines for the successful method.

 

<OutputOptions>
  <OutputString>false</OutputString>
  <OutputLines>true</OutputLines>
  <OutputPipeDelimitedTable>false</OutputPipeDelimitedTable>
  <OutputAlteryxTable>false</OutputAlteryxTable>
</OutputOptions>
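
To illustrate the "parse it further into the right columns" step, here is a rough Python sketch of splitting line output into cells. It assumes that columns in each line are separated by two or more spaces and that real table rows have a known cell count; neither is confirmed for the PDF to Text output, and the function names are made up for the example. In Designer, the same idea would normally be done with the RegEx or Text To Columns tools.

```python
import re


def split_line_into_columns(line, min_gap=2):
    """Split one text line into cells wherever min_gap or more
    whitespace characters occur in a row."""
    pattern = r"\s{%d,}" % min_gap
    return [cell for cell in re.split(pattern, line.strip()) if cell]


def lines_to_rows(lines, expected_columns=None):
    """Turn raw lines into table rows, skipping blank lines and,
    if expected_columns is given, lines with a different cell count
    (page headers, footers, captions)."""
    rows = []
    for line in lines:
        cells = split_line_into_columns(line)
        if not cells:
            continue
        if expected_columns is not None and len(cells) != expected_columns:
            continue
        rows.append(cells)
    return rows


if __name__ == "__main__":
    # Made-up sample lines, not taken from any real PDF.
    sample = [
        "Item      Qty   Price",
        "Widget A  3     9.99",
        "",
        "Page 1 of 20",
    ]
    for row in lines_to_rows(sample, expected_columns=3):
        print(row)
```

Splitting on runs of spaces keeps single-space values such as "Widget A" in one cell, but it breaks down if a cell is empty or if the extraction collapses the spacing, so treat it as a starting point only.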
