PDF to Text
I have a use case that requires the PDF to Text function with OCR capabilities. The catch is that the files are not standardized: they include handwriting, some of it cursive or illegible, and it sometimes runs over the printed parts of the page.
The end goal is to parse certain information out of each file. I've done a few and got some results, but I'd say that's about 10% of the full stack.
How would one handle such a use case? Are there any examples out there from the Maveryx community?
P.S. Sorry, I cannot share the PDFs; they contain sensitive PII that I cannot disclose. Looking for advice and guidance from the community!
Alteryx ACE
https://www.linkedin.com/in/calvintangkw/
- Labels:
- Use Case Support
Hello, @caltang.
It depends on how many you have to do, how standardized the PDFs are, etc. I don't have much experience with the Intelligence Suite; however, your use case sounds too complex for a standard setup within Designer.
I suggest trying Google's Document AI: https://cloud.google.com/document-ai. You can upload some documents and test how well it's performing. There are other solutions, even others from Google, such as Cloud Vision: https://cloud.google.com/vision. If I were to try this in a programming language, I'd go for Python. It will likely involve a lot of setup, iterating, and research.
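For anyone who wants to try the Python route, here is a minimal sketch using the Cloud Vision client library (`google-cloud-vision`). It assumes each PDF page has already been rendered to an image file and that Google Cloud credentials are configured. The helper names `ocr_words` and `split_by_confidence`, and the 0.8 threshold, are made up for illustration. The idea is to keep the per-word confidence so that messy handwriting can be routed to manual review instead of being trusted blindly.

```python
from typing import Iterable, List, Tuple

Word = Tuple[str, float]  # (text, confidence between 0 and 1)


def ocr_words(image_path: str) -> List[Word]:
    """Run Cloud Vision document OCR on one page image.

    Assumes the page was already exported as an image (e.g. PNG) and that
    Google Cloud credentials are available in the environment.
    """
    # Imported lazily so the helper below works without the client library.
    from google.cloud import vision

    client = vision.ImageAnnotatorClient()
    with open(image_path, "rb") as fh:
        image = vision.Image(content=fh.read())

    response = client.document_text_detection(image=image)
    if response.error.message:
        raise RuntimeError(response.error.message)

    words: List[Word] = []
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                for word in paragraph.words:
                    text = "".join(symbol.text for symbol in word.symbols)
                    words.append((text, word.confidence))
    return words


def split_by_confidence(
    words: Iterable[Word], threshold: float = 0.8
) -> Tuple[List[Word], List[Word]]:
    """Separate words the OCR is fairly sure about from ones to review by hand.

    The threshold is an arbitrary starting point, not a recommended value.
    """
    words = list(words)
    confident = [w for w in words if w[1] >= threshold]
    uncertain = [w for w in words if w[1] < threshold]
    return confident, uncertain
```

Calling `split_by_confidence(ocr_words("page1.png"))` (the file name is a placeholder) would give two lists: words that can probably go straight into parsing, and words that a person should check. Expect to tune the threshold against a sample of your own documents.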
I’ll check out Document AI! Unfortunately, I don’t have an R&D team, nor do I think the PDF to Text tool is advanced enough at this stage to do that. I guess I’ll have to look outside Alteryx for an alternative.
Thanks, @acarter881!
You're welcome, @caltang.
Some of the other large tech companies, such as Amazon (https://aws.amazon.com/textract/), have their equivalent services. I've found Document AI to be pretty impressive. Good luck! This is a fascinating topic. AI seems like the solution. :)
It seems to be a paid service... I'll have a look and see. Thanks @acarter881 !
@caltang Python scripts can be useful for extracting text from PDFs. Libraries such as pdfminer, tabula, and camelot can be used for this purpose.
@gjjadhao thanks for the tip. Are you able to share any more specifics, e.g. sample code for extracting text using libraries like pdfminer, tabula, or camelot?
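As a rough starting point, something like the sketch below might work. It assumes `pdfminer.six` and `tabula-py` are installed (tabula-py also needs Java). The field names and regex patterns are invented for illustration, so swap in whatever your documents actually contain. One caveat: these libraries read the text layer embedded in the PDF, so they will not pick up handwriting or scanned pages; those need an OCR step first.

```python
import re
from typing import Dict, Optional, Pattern


def pdf_to_text(pdf_path: str) -> str:
    """Return the embedded text layer of a PDF using pdfminer.six."""
    # Imported lazily so parse_fields can be used without pdfminer installed.
    from pdfminer.high_level import extract_text

    return extract_text(pdf_path)


def pdf_tables(pdf_path: str):
    """Return the tables tabula-py detects, as a list of pandas DataFrames."""
    import tabula  # provided by the tabula-py package; requires Java

    return tabula.read_pdf(pdf_path, pages="all")


# Hypothetical fields: adjust the names and patterns to your own forms.
FIELD_PATTERNS: Dict[str, Pattern[str]] = {
    "invoice_no": re.compile(r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*(\S+)", re.I),
    "date": re.compile(r"Date\s*:\s*(\d{4}-\d{2}-\d{2})"),
}


def parse_fields(
    text: str, patterns: Dict[str, Pattern[str]] = FIELD_PATTERNS
) -> Dict[str, Optional[str]]:
    """Pull the first match for each field out of the text, or None if absent."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result
```

Typical use would be `parse_fields(pdf_to_text("form.pdf"))`, where `form.pdf` is a placeholder path. Fields that come back as `None` are a useful signal that the page needs OCR or manual review.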
hi