Hello,
In the Alteryx Knowledge section I read 'Can Alteryx Parse A Word Doc Or PDF?', which was very helpful for getting started. Now I am facing some more challenging topics:
I have hundreds of contractually relevant documents, such as Acceptance Sheets and Change Requests, from which I need to pick relevant commercial data on a regular basis. These documents are stored on our SharePoint in PDF format. Unfortunately, the PDF files are protected. Here is what works on a manual basis in my test environment:
- print the protected PDF with a PDF printer into a non-protected version (in some cases the files need to be unlocked with another tool beforehand)
- save the non-protected PDF in plain text format
- run the Alteryx workflow to collect the relevant data
My questions are:
- Does anyone see a way to automate the entire workflow with Alteryx?
- The DOCTOTEXT tool mentioned in the Knowledge section has not worked for me at all. Are there any known command-line tools that I could use to automate the entire workflow?
- Are there any other alternatives to solve the situation? Manual transformation is not an option for us, as this is an ongoing requirement.
Any input is highly appreciated.
Would the results of a Google search for "crack protected pdf" help you?
Have you tried using the SharePoint List Input tool to pull your .pdf files? Since it requires a username and password, does that provide enough information to retrieve non-protected versions of the protected .pdf files?
Unfortunately not, as I can already access the PDFs manually. I am looking for ways to automate this.
The SharePoint List Input tool does not really help. We are working on a solution to replace the protected PDFs in the future, but as these are contractual documents, we cannot change them right away. I assumed that processing PDFs was a more common task and was hoping for some best practices.
Currently Alteryx doesn't have a way to automate the opening of secured .pdf files. However, you may want to suggest that as a product enhancement (https://community.alteryx.com/t5/Ideas/ct-p/ideas). There may be resources out there that will help you do this (such as: https://community.alteryx.com/t5/Ideas/ct-p/ideas).
If you have (or can get) Adobe Acrobat Pro, that might be a better starting point. You can automate operations like the conversion within that application.
You could also combine a large batch of them at once, then convert that file to a spreadsheet or text file and use Alteryx from there.
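For a command-line route to the unlock and convert steps, here is a minimal Python sketch. It assumes that the open-source tools qpdf and pdftotext (from Poppler or Xpdf) are installed and on the PATH, that the PDFs open without a password and only carry usage restrictions, and that you are authorized to remove those restrictions. Neither tool is mentioned elsewhere in this thread, and the function names and the `_unlocked` file suffix are made up for illustration. A script like this could probably be called from a Run Command tool before the parsing part of the workflow, but I have not tested that setup.

```python
from pathlib import Path
import subprocess


def build_commands(pdf_path, work_dir):
    """Return the two command lines for one PDF: unlock it, then extract text."""
    pdf_path = Path(pdf_path)
    work_dir = Path(work_dir)
    unlocked = work_dir / (pdf_path.stem + "_unlocked.pdf")
    text_out = work_dir / (pdf_path.stem + ".txt")
    return [
        # qpdf --decrypt writes a copy of the file without encryption.
        ["qpdf", "--decrypt", str(pdf_path), str(unlocked)],
        # pdftotext -layout tries to keep the physical column layout.
        ["pdftotext", "-layout", str(unlocked), str(text_out)],
    ]


def convert_folder(source_dir, work_dir):
    """Convert every *.pdf in source_dir and return the text files written."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(Path(source_dir).glob("*.pdf")):
        commands = build_commands(pdf_path, work_dir)
        for command in commands:
            # check=True stops on any non-zero exit code. qpdf exits with 3
            # when it only has warnings, so you may want to relax this.
            subprocess.run(command, check=True)
        written.append(Path(commands[-1][-1]))
    return written
```

Calling `convert_folder("downloads", "unlocked_text")` (both folder names are placeholders) would leave one .txt file per PDF for the Alteryx workflow to read. Files that need a password to open would additionally require qpdf's `--password=` option.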
I have been using WordCleaner (https://wordcleaner.com/) for converting Word documents and PDFs into HTML. It has a text output option as well. In my scenario the HTML output worked better, because I needed the <table>, <td> and <tr> tags, which make the files easier to process. WordCleaner also removes all extra styles from the tags, which is very helpful. It is the best tool I have been able to find for this. Note that it does require a purchase: $99, or $199 for the version with command line and other options.
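To illustrate why the table tags help, here is a minimal sketch that pulls the cell text out of <tr>/<td> markup using only Python's standard library. It is not part of WordCleaner, the class and function names are made up, and the sample row in the usage note is invented data. It also assumes simple, non-nested tables.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html_text):
    """Return all table rows found in html_text as lists of strings."""
    parser = TableExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.rows
```

For example, `extract_rows("<table><tr><td>Change Request 12</td><td> 1,500.00 </td></tr></table>")` gives one row with the two cell values, which can then be written to a CSV file for Alteryx to read.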