
Extracting Data from PDF

SH_94
11 - Bolide

Dear community,

 

I would like to ask how we can extract data (numbers and text) from a multipage PDF in Alteryx. I understand that we can use the PDF tool, but it has to be downloaded and requires a paid license.

 

May I know if there is any way to extract the data without purchasing a license for the PDF-related tools in Intelligence Suite?

 

Thank you.

4 REPLIES
KGT
13 - Pulsar

There are various free solutions around. Search for PDF extraction on the community or in the gallery. As with any PDF parsing, a free solution will have limitations and will need a certain amount of customisation.

 

PDFs, by their very nature, do not share the same structure underneath. That's the whole purpose of PDF: it's a universal presentation layer for data of any format. This means that extraction from two docs that look the same may give different results, depending on the algorithm used, and it's also why a lot of PDF extraction converts the document to an image and then extracts from that.

caltang
17 - Castor

@KGT is right. You can find an example here: https://community.alteryx.com/t5/Community-Gallery/PDF-Reader-Tool/ta-p/937908

 

Though I would suggest for you to tinker and code something together with Alteryx. You can use LLMs to guide you, but stick to standard packages that are maintained.

Calvin Tang
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
pilsworth-bulien-com
13 - Pulsar

Hello @SH_94 

Although this isn't quite as powerful as the PDF to text tool, especially when it comes to template extraction, one method I've used before involves converting the PDFs to .docx files and then reading the .docx files into Alteryx.

The conversion from PDF to .docx can be achieved with a PowerShell script (.ps1 file), which you can then run from the Run Command tool.

Once your documents are converted, you can read the .docx file in as a .zip file to extract all the text. I have detailed these steps in another post here:

https://community.alteryx.com/t5/Alteryx-Designer-Desktop-Discussions/WORD-DOCUMENT-TEMPLATE-INPUT/m...
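
To illustrate why reading a .docx as a .zip works: a .docx file is a zip archive, and the main body text lives in the entry word/document.xml. Below is a minimal Python sketch of the same idea, using only the standard library. It is not the method from the linked post (which uses Alteryx tools); the function name docx_paragraphs and the file name in the usage note are hypothetical, chosen only for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each non-empty paragraph in a .docx file.

    `path` may be a file path or a binary file-like object.
    (Hypothetical helper, written for illustration only.)
    """
    # A .docx is a zip archive; the body text is in word/document.xml.
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text is split across <w:t> runs.
    for para in root.iter(W_NS + "p"):
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs
```

For example, docx_paragraphs("converted.docx") (assuming a file of that name exists) would return a list of strings, one per paragraph, which you could then parse further. Layout such as table structure is flattened by this sketch, so expect some customisation for your own documents.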


If you have any questions, please let me know. 

 

Regards - Pilsner

SH_94
11 - Bolide

Hi @caltang ,

 

Could you please elaborate on the following sentence, and on what I need to do in order to achieve this?

 

"Though I would suggest for you to tinker and code something together with Alteryx."

 

 

 

Thank you.
