SOLVED

Input PDFs (they are images)

clant
8 - Asteroid

Hello all!

 

I am a bit stuck. I originally posted this in the Designer forums but did not get many responses.

 

The problem I have is this: we currently receive PDF documents that have been faxed to us. These are handwritten order forms. We need to run some form of handwriting OCR on them to help our data input guys out.

 

My idea is to write a Python tool that converts each PDF to a JPG. We can then upload the JPG to either Azure or Google OCR and get the results back. We have tested the Azure OCR and it worked great.

 

So far I have written the Python that does the PDF-to-JPG conversion, and I am trying and failing to make it into a tool in Alteryx. (At the moment I am getting a "TypeError: __init__() takes 2 positional arguments but 4 were given", but I think I can work this out.)

 

Can someone please advise if what I am doing will actually work or if there is a better way to do this?

 

Thank you!

 

Cheers

 

Chris

29 REPLIES
tlarsen7572
11 - Bolide

This is a really cool use case!

 

I think what you want to do is certainly possible; it's just a matter of the implementation details.  At the moment I see 2 possible ways to get the data into the custom tool:

1. Import the PDFs as BLOBs using the Blob Input developer tool.  The Python tool can then use the raw byte data to convert to JPG and send the result to the appropriate cloud API.

2. Send the Python tool a list of file names.  For each file name, the tool can import the PDF, convert it to JPG, and send it to the API.

 

Which method you use really depends on the nature of the API and the PDF-to-JPG package you choose.  Most likely, option 2 will be the easiest.
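
As a rough sketch of how a tool could accept either input style, the hypothetical helper below (`load_pdf_bytes` is an illustrative name, not part of any Alteryx API) takes either raw bytes, as a blob field would supply, or a file name, and returns the PDF bytes after checking the `%PDF` marker that well-formed PDF files start with. The conversion to JPG and the cloud call are left out because they depend on the package and API you pick.

```python
from pathlib import Path


def load_pdf_bytes(source):
    """Return PDF bytes from raw bytes (option 1) or a file name (option 2)."""
    if isinstance(source, (bytes, bytearray)):
        data = bytes(source)              # option 1: blob data passed in directly
    else:
        data = Path(source).read_bytes()  # option 2: read the file named in the record
    # Well-formed PDF files begin with the marker "%PDF"; reject anything else early.
    if not data.startswith(b"%PDF"):
        raise ValueError("input does not look like a PDF")
    return data
```

Either branch ends with the same bytes in hand, so the rest of the tool would not need to care which option fed it.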

 

What kind of data is returned from the Azure/Google API?

 

I would encourage you to give it a go.  If you run into any issues or errors, you can post them in this forum.  There is a bit of a trick to munge your Python code into a custom tool, but there are lots of smart people here to help.

clant
8 - Asteroid

Hi @tlarsen7572 

 

So I have managed to get my Python working, but with an error. I have attached the yxi file. It is ugly; I made it following the beginner's guide to Python tools. It is currently giving me the error below.

 

Error: Python Convert PDF to Jpeg (7): Traceback (most recent call last):
File "PDFtoImage_Engine.py", line 171, in ii_push_record
Boost.Python.ArgumentError: Python argument types in
OutputAnchor.push_record(OutputAnchor, str)
did not match C++ signature:
push_record(class SRC::Python::OutputAnchor {lvalue}, class boost::shared_ptr<struct SRC::Python::ConstRecordRef> record, bool no_auto_close=False)

 

If anyone could help with how to fix this, I would appreciate it! The way this tool works is that you drop a Text Input tool in front of it with the file location of a PDF. The tool then creates a JPG in the same location.

 

With regards to the data from Azure, this comes back as JSON, which should be OK to parse out.
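
For the parsing step, a minimal sketch is below. It assumes the result JSON holds a list of pages, each with a `lines` list whose entries carry a `text` value, nested under `analyzeResult.readResults` (or under a top-level `recognitionResults` list in the older Batch Read response). Those key names should be checked against what your API version actually returns, and `extract_lines` is just an illustrative name.

```python
import json


def extract_lines(result_json):
    """Collect the recognised text lines, page by page, from a result document."""
    result = json.loads(result_json)
    # Assumed layouts: newer responses nest pages under analyzeResult.readResults,
    # older Batch Read responses use a top-level recognitionResults list.
    pages = result.get("analyzeResult", {}).get("readResults")
    if pages is None:
        pages = result.get("recognitionResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]
```

The returned list has one string per recognised line, which maps naturally onto one output record per line.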

 

Cheers

 

Chris

tlarsen7572
11 - Bolide

Very close. See attached for a corrected Python file (with a txt extension, because Alteryx's forum doesn't like me attaching Python scripts...).  All of my changes have comments preceding them, so I hope they are easy to find.  In summary, the changes I made were:

 

  1. When you add the summary field to the record info in ii_init, save the field instance to a property.  We will use that instance later to populate the string value.
  2. In ii_push_record, you are assigning a string to out_record.  This is invalid; out_record needs to be a RecordRef object.
  3. In your if statement in ii_push_record, populate the summary field with the set_from_string function.
  4. Just after the if statement in ii_push_record, create the out_record RecordRef object by calling finalize_record on your record creator.
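
Put together, steps 2 to 4 amount to something like the sketch below. It takes its collaborators as arguments so it can be read outside Alteryx: `record_creator` stands for the record creator built from the outgoing record info, `summary_field` for the field instance saved in ii_init, and `output_anchor` for the tool's output anchor. The helper name `push_summary` is made up for illustration, and the corrected attachment may structure this differently.

```python
def push_summary(record_creator, summary_field, output_anchor, text):
    """Build an output record holding `text` and push it downstream."""
    record_creator.reset()                               # clear values left from the previous record
    summary_field.set_from_string(record_creator, text)  # step 3: write the string into the field
    out_record = record_creator.finalize_record()        # step 4: a record reference, not a str
    output_anchor.push_record(out_record)                # passing a plain str here caused the ArgumentError
    return out_record
```

The point is that the string never goes to push_record directly; it goes into the field, and the finalized record is what gets pushed.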

 

Now everything should work....at least it does on my system.

clant
8 - Asteroid

Thank you so much @tlarsen7572! I have attached my tool, which successfully converts PDFs to JPEGs. I have tested it with the Directory tool and it worked great!

 

Thank you again for your help on this! Here is the tool!

 

Cheers

 

Chris

clant
8 - Asteroid

Hi @tlarsen7572 

 

Just an update for you, as I know you were interested. I have successfully completed this now! I am using the Read API on Azure. This makes it a little more complicated because you have to POST the document, then use a GET to fetch the results, as there is some processing time. This has worked amazingly well.
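
For anyone following along, the post-then-poll pattern can be sketched roughly as below, using only the standard library. This is not the code from my tool. `submit_and_wait`, `read_url` and `key` are illustrative names: the caller supplies the full Read endpoint URL and subscription key for their own resource. The sketch assumes the POST response carries an `Operation-Location` header and that the GET returns JSON with a `status` value.

```python
import json
import time
import urllib.request


def submit_and_wait(pdf_bytes, read_url, key, poll_seconds=1.0, max_polls=30,
                    urlopen=urllib.request.urlopen):
    """POST a document, then poll the returned operation URL until it finishes."""
    auth = {"Ocp-Apim-Subscription-Key": key}
    post = urllib.request.Request(
        read_url, data=pdf_bytes, method="POST",
        headers={**auth, "Content-Type": "application/octet-stream"})
    with urlopen(post) as response:
        # The POST returns no text; it only says where to ask for the result.
        operation_url = response.headers["Operation-Location"]

    for _ in range(max_polls):
        with urlopen(urllib.request.Request(operation_url, headers=auth)) as response:
            result = json.loads(response.read().decode("utf-8"))
        # Status spelling differs between API versions, so compare in lower case.
        if str(result.get("status", "")).lower() in ("succeeded", "failed"):
            return result
        time.sleep(poll_seconds)
    raise TimeoutError("OCR result was not ready after polling")
```

The `urlopen` parameter exists only so the function can be exercised without a live subscription; in normal use it can be left at its default.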

 

Cheers

 

Chris

tlarsen7572
11 - Bolide

Thanks for sharing the update, @clant!  This is a really cool use case for Alteryx; I am glad you were able to get it working.  I am bookmarking this thread in case I ever have to do any kind of OCR extraction.  As a department (Internal Audit), we sometimes have to deal with PDF/paper docs; I could see this being very helpful around things like contract analysis.  Definitely next-level stuff.

Awesomeville
7 - Meteor

Hi @clant, I am currently trying to do exactly what you did, but after reading this thread and downloading your zip file, I still don't really get how the whole thing works in Alteryx. Would you mind sharing your workflow, or elaborating on how the Python script is incorporated into Alteryx before the Read API on Azure is called?

 

Thank you in advance for your help. Appreciate it.

 

Cheers,

Nick

Awesomeville
7 - Meteor

@tlarsen7572 Hello. Since I couldn't reach @clant to ask how his tool actually works, could I trouble you instead? I noticed you are one of the Alteryx dev founders and also took part in the Python tool contest, to say the least. I was hoping you could help me out here.

 

Currently I am trying to use @clant's tool to retrieve text (including handwriting) from PDF files that are actually scanned copies. Or rather, not his tool specifically but the Python tool in general, to perform the conversion and retrieve the result, and I am failing. I was exploring Microsoft's Batch Read File OCR, but I just can't seem to get it working. I don't really see where it goes wrong, because I pretty much followed what is described at https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/2afb498.... That said, I could not obtain the Operation-Location header to even start using Get Read Operation Result, though I probably wouldn't know how to join the two together even if I could.

 

So I am turning to my alternative, which is to convert the scanned PDF files into PNG files before using the OCR to parse the text out (which I am able to do). This is where the Python SDK comes in, and where I am stuck. I would really appreciate it if you could explain how I can begin to use this or get it functional.

 

Cheers,
Nick

tlarsen7572
11 - Bolide

Hey @Awesomeville, thanks for reaching out.  I wouldn't call myself a 'founder' (I'm not an Alteryx employee or anything...).  I just like to code and find the SDKs to be the funnest part of Alteryx.

 

Anyway, I'd be happy to try and help.  My first thought is to try and get the Microsoft Batch Read API working.  Based on my read of the documentation, it should be possible.  But I have dealt with a lot of web APIs lately, and they can be tricky to get right.  I'll see if I can get a free or low-cost subscription key to work with.  We can either start with your code or I can put something together and send it to you to try.  Either way, if I cannot obtain a key, there will likely be a bit of back-and-forth before we get it working.

 

If you want to work on something together, one option for communication would be Slack.  A few of us have a (mostly quiet) Slack channel (alteryxpython.slack.com) for facilitating custom tool work.  Of course, if you prefer, we can always communicate here as well.