Hello, all. I have seen some interest in the community in having an Alteryx tool that can read in PDF data, parse all of the text, and push it downstream.
I went ahead and put together a simple Alteryx tool called "PDF to Text" that uses OCR (through a module called Tika) to do just that. Simply download the .yxi file and install it.
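For anyone curious what a Tika call looks like from Python, here is a minimal sketch. It is not the tool's actual source. It assumes the `tika` Python package and a Java runtime are installed; `extract_text` and its `parse` parameter are hypothetical names for illustration, and `parse` is injectable so the function can be tried without starting a Tika server.

```python
from typing import Callable, Optional


def extract_text(path: str, parse: Optional[Callable[[str], dict]] = None) -> str:
    """Return all text that the parser can pull from the file at `path`."""
    if parse is None:
        # Imported lazily so this module still loads where tika is not installed.
        from tika import parser
        parse = parser.from_file
    parsed = parse(path)
    # "content" can be None when no text was extracted.
    return (parsed.get("content") or "").strip()
```

Calling `extract_text("my_file.pdf")` (a placeholder file name) would hand the file to Tika and return one string, which matches the single cell of text the tool produces.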
This tool is a personal side project only and not an Alteryx product.
Please note that I have not done extensive testing on this tool and results may vary. This tool was originally built with the specific goal of parsing marketing materials for our international markets, and for this very specific job, it seems to be holding up well so far. Also, I have left in some comments and a couple of extra Python modules for future reference and debugging purposes, but they aren't being used at the moment.
Once the tool has been installed, use the file browse, select "all files", and point it to your target PDF file. The tool will produce a single cell of data that contains all of the text that it was able to parse. I am looking into extending the tool to be able to break out text into smaller chunks, but I haven't really stumbled on a use case that makes sense (yet).
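One possible way to break the text into smaller chunks is to split on blank lines, so that each paragraph could become its own record. This is only a sketch under that assumption; `split_paragraphs` is a hypothetical helper, not something the tool ships with, and PDFs whose extracted text has no blank lines between paragraphs would need a different rule.

```python
import re
from typing import List


def split_paragraphs(text: str) -> List[str]:
    """Split extracted text on blank lines, dropping empty pieces."""
    pieces = re.split(r"\n\s*\n", text)
    # Collapse line breaks and repeated spaces inside each paragraph.
    return [" ".join(piece.split()) for piece in pieces if piece.strip()]
```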
Happy PDF-parsing! Feedback, critiques, and ideas are welcome.
Great tool! It has helped me quite a bit with some ETL work.
Any plans to develop it a bit further? I'm currently trying to parse a 10-page PDF, and it only pulls about 7 of the pages and then cuts off the rest (it runs properly on both halves if the PDF is split).
I had to limit the byte size of the cell (since all the PDF data is getting sent to one cell). Let me take a look at getting it to overflow into another cell when the size limit is reached. Great suggestions, all.
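The overflow idea could look roughly like the sketch below: instead of truncating at the limit, the text is cut into pieces that each fit, and every piece goes into its own cell. The name `split_into_cells` and the `max_bytes` parameter are hypothetical, and the tool's real limit is not stated in this thread, so the number passed in is up to the caller.

```python
from typing import List


def split_into_cells(text: str, max_bytes: int) -> List[str]:
    """Cut text into pieces whose UTF-8 size never exceeds max_bytes."""
    if max_bytes < 4:
        # A single UTF-8 character can take up to 4 bytes.
        raise ValueError("max_bytes must be at least 4")
    cells: List[str] = []
    current: List[str] = []
    current_size = 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        if current_size + ch_size > max_bytes:
            # Current cell is full: close it and start the next one.
            cells.append("".join(current))
            current, current_size = [], 0
        current.append(ch)
        current_size += ch_size
    if current:
        cells.append("".join(current))
    return cells
```

Splitting by character rather than slicing the encoded bytes keeps multi-byte characters whole, which matters for international text. Joining the cells back together gives the original string.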
@coderockride - I have not. But that sounds like a fun little project! The take-home message about these Python tools is that almost anything is possible. I'd say with just some small tweaks to the pdf tool a guy could make docx and pptx a reality. I am not even sure you'd need tika in this case because we aren't really dealing with images like we are with PDFs, so I am sure there may be even easier solutions out there. Let me know how you do!
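As a rough illustration of the "no Tika needed" idea for docx: a .docx file is a zip archive whose main text lives in `word/document.xml`, so the standard library alone can pull it out. This is a sketch, not a tested Alteryx tool; `docx_to_paragraphs` is a hypothetical name, and it ignores headers, footers, and footnotes, and may repeat text that sits inside text boxes.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import BinaryIO, List, Union

# WordprocessingML namespace used by the <w:p> and <w:t> elements.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_paragraphs(source: Union[str, BinaryIO]) -> List[str]:
    """Return the text of each non-empty paragraph in a .docx file."""
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is spread over one or more <w:t> runs.
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text:
            paragraphs.append(text)
    return paragraphs
```

A pptx file is also a zip archive, with one XML file per slide, so a similar approach should work there with the slide files and their text elements in place of `word/document.xml`.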
Great share!! I'm excited to test this option, but I'm not exactly sure how to run it once it is installed. I started with the File Browse tool and selected arbitrary file types, specifically all files. Then I added an Input tool, which I pointed to my test.PDF. Since that file type is not recognized, it throws an error. I'm sure I am missing something simple. Any help would be awesome! Thanks. Jacob
I get the following error message. Does anyone have advice on how to fix it?
Error: Text to PDF (1): Traceback (most recent call last):
  File "PDF_to_textEngine.py", line 88, in pi_push_all_records
  File "PDF_to_textEngine.py", line 159, in get_data
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\parser.py", line 36, in from_file
    jsonOutput = parse1('all', filename, serverEndpoint, headers=headers)
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\tika.py", line 319, in parse1
    headers, verbose, tikaServerJar, rawResponse=rawResponse)
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\tika.py", line 513, in callServer
    serverEndpoint = checkTikaServer(scheme, serverHost, port, tikaServerJar, classpath)
  File "C:\Users\steve.hayden\AppData\Roaming\Alteryx\Engine\../Tools\PDF_to_text\Lib\site-packages\tika\tika.py", line 571, in checkTikaServer
    raise RuntimeError("Unable to start Tika server.")
RuntimeError: Unable to start Tika server.
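One thing that may be worth checking for this error, though it is not confirmed as the cause here: the `tika` Python package launches a Java-based Tika server, so it cannot start if no `java` executable is on the PATH. Another possibility is that the server jar could not be downloaded, for example behind a firewall or proxy. The sketch below only covers the first check; `tika_prerequisite_hint` is a hypothetical helper name.

```python
import shutil


def tika_prerequisite_hint() -> str:
    """Report whether a java executable is visible on the PATH."""
    java = shutil.which("java")
    if java is None:
        return "java not found on PATH; the Tika server needs a Java runtime."
    return f"java found at {java}; check network access for the Tika jar next."


if __name__ == "__main__":
    print(tika_prerequisite_hint())
```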