
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Alteryx PDF to Text Tool (Beta)

Alteryx

Hello, all. I have seen some interest among the community in having an Alteryx tool that can read in PDF data, parse all text, and push this data downstream.

I went ahead and put together a simple Alteryx tool called "PDF to Text" that uses OCR (through a module called Tika) to do just that. Simply download the .yxi file and install it.

This tool is a personal side project only and not an Alteryx product.

 

Please note that I have not done extensive testing on this tool and results may vary. This tool was originally built with the specific goal of parsing marketing materials for our international markets, and for this very specific job, it seems to be holding up well so far. Also, I have left in some comments and a couple of extra Python modules for future reference and debugging purposes; they aren't being used at the moment.

 

Once the tool has been installed, use the file browse, select "all files", and point it to your target PDF file. The tool will produce a single cell of data that contains all of the text that it was able to parse. I am looking into extending the tool to be able to break out text into smaller chunks, but I haven't really stumbled on a use case that makes sense (yet).
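For anyone curious what the Tika step might look like in plain Python, here is a minimal sketch. It is not the tool's actual source. It assumes the third-party `tika` package (tika-python) and a Java runtime are available, and the function names `extract_content` and `pdf_to_text` are made up for illustration.

```python
def extract_content(parsed):
    """Return the text from a Tika result dict, or '' if nothing was parsed."""
    # tika-python returns a dict with "metadata" and "content" keys;
    # "content" can be None when no text could be extracted.
    content = (parsed or {}).get("content")
    return content.strip() if content else ""


def pdf_to_text(path):
    """Parse one file with Tika and return all of its text as a single string."""
    # Imported lazily: tika is a third-party package and needs Java to run.
    from tika import parser

    return extract_content(parser.from_file(path))
```

Calling `pdf_to_text("report.pdf")` (a hypothetical file name) would give one string, which matches the "single cell" output described above.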

 

Happy PDF-parsing!  Feedback, critiques, and ideas are welcome.

Quasar

Excellent! Maybe change the default file type for the browse to be *.pdf

Alteryx

You read my mind! I am still familiarizing myself with the Python SDK, and that's pretty close to the top of my list. Thanks for the great suggestion.

Atom

Hey Jeremy,

 

 

Great tool!  It has helped me quite a bit with some ETL work.

 

Any plans for developing it a bit further? I'm currently working to parse a 10-page PDF, and it only pulls about 7 of the pages, then cuts off the rest. (It runs properly on both halves if the PDF is split.)

 

Barry

Alteryx

Hey, Barry.  

 

I had to limit the byte size of the cell (since all the PDF data is getting sent to one cell). Let me take a look at getting it to overflow to another cell in the event that the size limitation is reached. Great suggestions, all.
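One way the overflow idea could work is to split the parsed text into pieces that each fit under the byte limit, then send each piece to its own cell. The sketch below is only an illustration: `split_by_bytes` and `max_bytes` are hypothetical names, and the tool's real limit is not stated here.

```python
def split_by_bytes(text, max_bytes):
    """Split text into pieces whose UTF-8 encoding is at most max_bytes each."""
    if max_bytes < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")
    chunks, current, size = [], [], 0
    for ch in text:
        n = len(ch.encode("utf-8"))
        if size + n > max_bytes:
            # The next character would not fit, so close the current piece.
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

Because the split happens on whole characters, joining the pieces back together gives the original text, and no multi-byte character is cut in half.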

Have you looked at using Alteryx and Tika to convert other file types to text/json? I'm really interested in converting docx and pptx in particular.

Alteryx

@coderockride - I have not. But that sounds like a fun little project! The take-home message about these Python tools is that almost anything is possible. I'd say with just some small tweaks to the PDF tool a guy could make docx and pptx a reality. I am not even sure you'd need Tika in this case because we aren't really dealing with images like we are with PDFs, so I am sure there may be even easier solutions out there. Let me know how you do!
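As a rough idea of what a Tika-free route could look like: a .docx file is a zip archive whose main body lives in `word/document.xml`, so the Python standard library can read it. This is a minimal sketch, not part of the PDF tool. The name `docx_to_text` is made up, and it only reads the main body text (no headers, footers, or footnotes).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_text(path_or_file):
    """Return the body text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path_or_file) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> run elements.
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

A .pptx file is also a zip archive of XML parts, so a similar approach should be possible there, though the part names and tags differ.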
