PDF to Text

I've got a use case that requires the PDF to Text function and OCR capabilities. The problem is that the files are not standardized because they include human writing: cursive and sometimes unintelligible handwriting, occasionally written over the printed parts of the file.

The end goal is to parse out certain information from each file. I've done a few and got some results, but I'd say that's only about 10% of the full stack...

How would one handle such a use case? Are there any examples out there from the Maveryx community?

P.S.: Sorry, I cannot share the PDFs; they contain sensitive PII that I cannot disclose. Looking for advice and guidance from the community!

Accepted answers

acarter881

Hello, @caltang.

It depends on how many you have to do, how standardized the PDFs are, etc. I don't have much experience with the Intelligence Suite; however, your use case sounds too complex for a standard setup within Designer.

I suggest trying Google's Document AI: https://cloud.google.com/document-ai. You can upload some documents and test how well it performs. There are other solutions, including others from Google, such as Cloud Vision: https://cloud.google.com/vision. If I were to try this in a programming language, I'd go for Python. It will likely involve a lot of setup, iterating, and research.

caltang

I’ll check out Document AI! Unfortunately, I don’t have an R&D team, nor do I think the PDF to Text tool is advanced enough at this stage to do that. I guess I’ll have to look outside Alteryx for an alternative.

Thanks, @acarter881!

All comments

acarter881

You're welcome, @caltang.

Some of the other large tech companies, such as Amazon (https://aws.amazon.com/textract/), have their equivalent services. I've found Document AI to be pretty impressive. Good luck! This is a fascinating topic. AI seems like the solution.

caltang

It seems to be a paid service... I'll have a look and see. Thanks, @acarter881!

gjjadhao

@caltang Python scripts can be useful for extracting text from PDFs. Libraries such as pdfminer, tabula, and camelot can be used for this purpose.

roughchr

@gjjadhao Thanks for the tip. Are you able to share any more specifics, e.g. sample code for extracting text using libraries like pdfminer, tabula, or camelot?
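
For anyone looking for a starting point, here is a minimal sketch in Python, assuming the pdfminer.six package is installed (`pip install pdfminer.six`). It reads the embedded text layer of a PDF and then pulls out fields with regular expressions. The field names and patterns (`reference_no`, `date`) are hypothetical placeholders, not taken from the original poster's files, and would need to be adapted to the actual form layout. Note that pdfminer, like tabula and camelot, only reads text that is already embedded in the PDF; it does not perform OCR, so scanned pages and handwriting would still need an OCR service such as those mentioned above.

```python
import re
from typing import Dict, Optional


def pdf_to_text(path: str) -> str:
    """Return the embedded text layer of a PDF using pdfminer.six."""
    # Imported lazily so extract_fields() can be used without pdfminer.six.
    from pdfminer.high_level import extract_text

    return extract_text(path)


def extract_fields(text: str, patterns: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Apply one regex per field; return the first capture group, or None."""
    results: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE | re.MULTILINE)
        results[name] = match.group(1).strip() if match else None
    return results


# Hypothetical field patterns for illustration only; adjust to your own forms.
PATTERNS = {
    "reference_no": r"Reference\s*(?:No\.?|Number)\s*[:#]?\s*([A-Z0-9-]+)",
    "date": r"Date\s*:\s*(\d{1,2}/\d{1,2}/\d{4})",
}
```

Typical usage would be `extract_fields(pdf_to_text("form.pdf"), PATTERNS)`, where `form.pdf` is a placeholder file name. Keeping the field extraction separate from the text extraction means the same patterns can be reused on text returned by an OCR service if the PDFs turn out to be scans.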
