
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Alteryx PDF to Text Tool (Beta)

Alteryx

Hello, all.  I have seen some interest in the community in having an Alteryx tool that can read in PDF data, parse all of the text, and push that data downstream.

I went ahead and put together a simple Alteryx tool called "PDF to Text" that uses OCR (through a module called Tika) to do just that. Simply download the .yxi file and install it.

This tool is a personal side project only and not an Alteryx product.

 

Please note that I have not done extensive testing on this tool, and results may vary.  It was originally built with the specific goal of parsing marketing materials for our international markets, and for that very specific job it seems to be holding up well so far.  Also, I have left in some comments and a couple of extra Python modules for future reference and debugging; they aren't being used at the moment.

Once the tool has been installed, use the file browse, select "all files", and point it to your target PDF file.  The tool will produce a single cell of data that contains all of the text it was able to parse.  I am looking into extending the tool so it can break the text out into smaller chunks, but I haven't stumbled on a use case that makes sense (yet).
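
For anyone curious what the Tika step might look like outside of Alteryx, here is a minimal sketch. It is not the tool's actual code. It assumes the `tika` Python package is installed (that package needs Java available to start its server), and `pdf_to_single_cell` and its `parse` argument are hypothetical names used only for illustration.

```python
def pdf_to_single_cell(path, parse=None):
    """Return all text extracted from `path` as one string (one "cell").

    `parse` is a hypothetical hook so the function can be exercised
    without Tika; by default it uses tika's parser.from_file.
    """
    if parse is None:
        # Imported lazily so the sketch can be loaded without tika installed.
        from tika import parser
        parse = parser.from_file

    parsed = parse(path)
    # The "content" entry can be None when no text could be extracted.
    content = parsed.get("content") or ""
    return content.strip()
```

Calling `pdf_to_single_cell("report.pdf")` (a hypothetical file name) would then return one string, which matches the single-cell output described above.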

 

Happy PDF-parsing!  Feedback, critiques, and ideas are welcome.

Quasar

Excellent! Maybe change the default file type for the file browse to *.pdf.

Alteryx

You read my mind!  I am still familiarizing myself with the Python SDK, and that's pretty close to the top of my list.  Thanks for the great suggestion.

Atom

Hey Jeremy,

 

 

Great tool!  It has helped me quite a bit with some ETL work.

 

Any plans for developing it a bit further?  I'm currently working to parse a 10-page PDF, and it will only pull about 7 of the pages and then cut off the rest. (It runs properly on both halves if the PDF is split.)

Barry

Alteryx

Hey, Barry.  

 

I had to limit the byte size of the cell (since all of the PDF data is sent to one cell).  Let me take a look at getting it to overflow into another cell when the size limit is reached.  Great suggestions, all.
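
The overflow idea could be sketched roughly like this. It is only an assumption about one way to do it, not the tool's implementation: `split_by_bytes` is a hypothetical helper, and the real byte limit of the cell is not stated in this thread, so `max_bytes` is left as a parameter.

```python
def split_by_bytes(text, max_bytes):
    """Split `text` into chunks whose UTF-8 size is at most `max_bytes`.

    Characters are never split, so joining the chunks gives back the
    original text. Each chunk could then go into its own cell.
    """
    if max_bytes < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")

    chunks = []
    current = []   # characters collected for the chunk being built
    size = 0       # UTF-8 size of `current` in bytes
    for ch in text:
        n = len(ch.encode("utf-8"))
        if size + n > max_bytes and current:
            chunks.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += n
    if current:
        chunks.append("".join(current))
    return chunks
```

For example, `split_by_bytes("abcdef", 4)` returns `["abcd", "ef"]`. The character-by-character loop favors clarity over speed, so a very large document might call for a faster approach.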

Have you looked at using Alteryx and Tika to convert other file types to text/json? I'm really interested in converting docx and pptx in particular.

Alteryx

@coderockride - I have not.  But that sounds like a fun little project!  The take-home message about these Python tools is that almost anything is possible.  I'd say that with just some small tweaks to the PDF tool, a guy could make docx and pptx a reality. I am not even sure you'd need Tika in this case, because we aren't really dealing with images like we are with PDFs, so there may well be even easier solutions out there.  Let me know how you do!
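
As one possible illustration of a Tika-free route: a .docx file is a zip archive whose main text lives in `word/document.xml`, so the Python standard library alone can pull out paragraph text. This is a rough sketch rather than a tested tool; `docx_to_text` is a hypothetical name, and headers, footers, and footnotes are ignored.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_text(source):
    """Return the body text of a .docx file, one paragraph per line.

    `source` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(f"{{{W_NS}}}p"):
        # A paragraph's text is spread across one or more <w:t> runs.
        runs = [node.text or "" for node in para.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

A .pptx file is also a zip archive of XML parts, so a similar approach might work there, though the part names and namespaces differ.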

Atom

 

Hi guys, I'm trying to use the tool, but I received a "Traceback" error message related to the miniconda3 app. Does anyone here know how to fix it so I can run the tool?

Capture.JPG

Atom
Jeremy,

Great share!! I'm excited about testing this option, but I'm not exactly sure how to run it once it is installed. I started with the file browse tool and selected arbitrary file types, specifically all files. Then I added an input tool, which I pointed to my test.PDF. Since that file type is not recognized, it throws an error. I'm sure I am missing something simple. Any help would be awesome! Thanks.
Jacob
Atom

I get the following error message. Does anyone have advice on how to fix it?

 

Error: Text to PDF (1): Traceback (most recent call last):
  File "PDF_to_textEngine.py", line 88, in pi_push_all_records
  File "PDF_to_textEngine.py", line 159, in get_data
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\parser.py", line 36, in from_file
    jsonOutput = parse1('all', filename, serverEndpoint, headers=headers)
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\tika.py", line 319, in parse1
    headers, verbose, tikaServerJar, rawResponse=rawResponse)
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\tika.py", line 513, in callServer
    serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath)
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\tika.py", line 571, in checkTikaServer
    raise RuntimeError("Unable to start Tika server.")
RuntimeError: Unable to start Tika server.
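
One thing that may be worth ruling out for this error (an assumption, not a confirmed diagnosis): the `tika` Python package starts a Java-based server, so "Unable to start Tika server" can show up when Java is not installed or is not on the PATH. A small check using only the standard library, with `java_status` as a hypothetical helper name:

```python
import shutil
import subprocess


def java_status():
    """Return the `java -version` output, or None if Java was not found."""
    java = shutil.which("java")
    if java is None:
        return None
    try:
        result = subprocess.run(
            [java, "-version"], capture_output=True, text=True
        )
    except OSError:
        # The executable was found but could not be launched.
        return None
    # Java usually prints its version banner to stderr.
    return (result.stderr or result.stdout).strip()


if __name__ == "__main__":
    status = java_status()
    print(status if status is not None else "Java was not found on the PATH")
```

If this reports that Java is missing, installing it may be a reasonable first step before digging further; if Java is present, the cause lies elsewhere.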

 
