OCR tool on Python Workflow automated on Alteryx
Hi, I installed a module called pdfplumber for an OCR tool that I am working on. I understand that there are OCR tools available in Alteryx, but I am trying to build this myself and propose it to my company without incurring additional costs.
I am trying to modify the Python code, since the incoming and outgoing connections work differently in Alteryx, but I am having issues with my code.
I have defined the input data as 'df', but I can't seem to use 'df' in the OCR code itself.
Thank you for all your help and I appreciate any feedback.
I would skip the PDF Input tool (written in R) and use the Python tool to open the PDF directly with pdfplumber's .open() call.
After that, you can convert the extracted content to a data frame for use downstream.
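A minimal sketch of that idea, assuming the PDF contains selectable text and sits at a known location. The helper names `pages_to_rows`, `read_pdf_rows`, and `build_dataframe` are hypothetical and chosen only for illustration; the library calls used are `pdfplumber.open()`, `pdf.pages`, `page.extract_text()`, and `pandas.DataFrame`.

```python
def pages_to_rows(page_texts):
    """Build one record per page from a list of page strings (None becomes '')."""
    return [
        {"page": number, "text": text or ""}
        for number, text in enumerate(page_texts, start=1)
    ]


def read_pdf_rows(pdf_path):
    """Open the PDF by its file path and return one record per page."""
    # Third-party import kept local so pages_to_rows can be used without it.
    import pdfplumber

    with pdfplumber.open(pdf_path) as pdf:
        return pages_to_rows([page.extract_text() for page in pdf.pages])


def build_dataframe(pdf_path):
    """Convert the per-page records to a pandas DataFrame for downstream tools."""
    import pandas as pd

    return pd.DataFrame(read_pdf_rows(pdf_path), columns=["page", "text"])
```

Inside the Python tool this might be called as `df = build_dataframe(r"C:\path\to\file.pdf")`, where the path is a placeholder for your own file.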

Hi Paul, thank you for the reply.
In your code, you open your own file directly, but how do I do this with the input data, as in my picture below?
Defining the input data from Alteryx as 'df' did not seem to work.
I would avoid passing a data frame to pdfplumber; it would not be able to open it.
Instead, you can define a variable that points to the PDF directly and, once you have extracted the data, convert it into data frames.
Here is my example (depending on your PDF content, this may not work):
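The original example is not preserved in this thread, so the following is only a sketch of the approach described, assuming the first page of the PDF holds a table whose first row is the header. The helper names and the `page_index` and `anchor` parameters are hypothetical; `page.extract_table()` is pdfplumber's table extractor and `Alteryx.write()` is the Python tool's call for sending a data frame to an output anchor.

```python
def table_to_records(table):
    """Treat the first row as headers and return one dict per remaining row."""
    if not table:
        return []
    header, *body = table
    return [dict(zip(header, row)) for row in body]


def first_table_records(pdf_path, page_index=0):
    """Open the PDF by path (not a data frame) and extract one table."""
    import pdfplumber  # third-party; imported here so the helper above stays standalone

    with pdfplumber.open(pdf_path) as pdf:
        # extract_table() returns a list of rows, or None if no table is detected.
        table = pdf.pages[page_index].extract_table()
    return table_to_records(table)


def send_to_alteryx(records, anchor=1):
    """Wrap the records in a DataFrame and write it to an output anchor."""
    import pandas as pd
    from ayx import Alteryx  # only available inside the Alteryx Python tool

    Alteryx.write(pd.DataFrame(records), anchor)
```

If the page has no detectable table, `first_table_records` returns an empty list, which is the case where this approach will not help.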

Hi Paul,
OK, noted on the input. I understand that the outgoing connection only accepts pandas data frames.
For my PDF document, there are no columns to parse from the file, so it doesn't work. Is there another recommended way?
I have attached the workflow for your reference. You have no idea how much you're helping me. Thank you.
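For a PDF with no table structure, one possible fallback is to extract plain text and emit one row per line, which can then be parsed downstream. This is a sketch under that assumption, with hypothetical helper names. Note that pdfplumber reads text already embedded in the PDF and does not perform OCR, so a scanned, image-only PDF would likely produce no rows here.

```python
def text_to_line_rows(page_number, text):
    """Split one page of text into records, skipping blank lines."""
    rows = []
    for line_number, line in enumerate((text or "").splitlines(), start=1):
        if line.strip():
            rows.append(
                {"page": page_number, "line": line_number, "text": line.strip()}
            )
    return rows


def read_pdf_lines(pdf_path):
    """Return one record per non-blank text line across all pages."""
    import pdfplumber  # third-party; imported here so the helper above stays standalone

    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page_number, page in enumerate(pdf.pages, start=1):
            rows.extend(text_to_line_rows(page_number, page.extract_text()))
    return rows
```

The resulting list of records can be passed to `pandas.DataFrame(...)` to satisfy the outgoing connection.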
