topic Re: Input PDFs (they are images) in Dev Space

Input PDFs (they are images)

clant — Thu, 18 Apr 2019 08:46:39 GMT

Hello all!

I am a bit stuck. I originally posted this in the Designer forums but did not get many responses.

The problem I have is this: we currently receive PDF documents which have been faxed to us. These are handwritten order forms. We need to take these forms and run some form of handwriting OCR on them to help our data input guys out.

My idea is to write a Python tool which can convert the PDF to a JPG. We can then take this JPG, upload it to either the Azure or the Google OCR service, and get the results back. We have tested the Azure OCR and it worked great.

So far I have written the Python which does the PDF to JPG conversion, and I am trying and failing to make this into a tool in Alteryx (at the moment I am getting a "TypeError: __init__() takes 2 positional arguments but 4 were given", but I think I can work this out).
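For anyone who hits the same TypeError: one common way to get it is a constructor that accepts fewer arguments than the caller passes. The sketch below reproduces that situation outside Alteryx. It assumes the engine creates the plugin with three arguments (tool id, engine, output anchor manager), as in the legacy Python SDK sample tools. The class names and the `engine_creates` helper are made up for illustration, and the real cause in the tool above may be a different `__init__`.

```python
class BrokenPlugin:
    # Accepts only one argument besides self, so a call with three
    # arguments fails with "takes 2 positional arguments but 4 were given".
    def __init__(self, n_tool_id):
        self.n_tool_id = n_tool_id


class FixedPlugin:
    # Assumed constructor shape for a legacy Python SDK plugin class.
    def __init__(self, n_tool_id, alteryx_engine, output_anchor_mgr):
        self.n_tool_id = n_tool_id
        self.alteryx_engine = alteryx_engine
        self.output_anchor_mgr = output_anchor_mgr


def engine_creates(plugin_class):
    """Hypothetical stand-in for the engine: passes three arguments."""
    return plugin_class(7, object(), object())


if __name__ == "__main__":
    try:
        engine_creates(BrokenPlugin)
    except TypeError as error:
        print("BrokenPlugin:", error)
    print("FixedPlugin tool id:", engine_creates(FixedPlugin).n_tool_id)
```

Counting `self`, three passed arguments make four, which is where the "4 were given" in the message comes from.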

Can someone please advise if what I am doing will actually work or if there is a better way to do this?

Thank you!

Cheers

Chris

Re: Input PDFs (they are images)

tlarsen7572 — Thu, 18 Apr 2019 11:38:46 GMT

This is a really cool use case!

I think what you want to do is certainly very possible, it's just a matter of the implementation details. At the moment I see 2 possible ways to get the data into the custom tool:

1. Import the PDFs as BLOBs using the Blob Input developer tool. The Python tool can then use the raw byte data to convert to jpg and send to the appropriate cloud API

2. Send the Python tool a list of filenames. For each file name, the tool can import the PDF, convert to JPG, and send it to the API

Which method you use really depends on the nature of the API and the PDF-to-JPG package you chose. Most likely, 2 will be the easiest option.
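Option 2 could look roughly like the sketch below. It assumes the third-party `pdf2image` package (which itself needs Poppler installed) as the PDF-to-JPG converter; the thread does not say which package was actually used. The function names and the `_pageN` naming scheme are invented for illustration.

```python
from pathlib import Path


def jpg_path_for(pdf_path, page_number):
    """Return a JPG path next to the PDF, e.g. order.pdf -> order_page1.jpg."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_page{page_number}.jpg")


def convert_pdf_to_jpgs(pdf_path, dpi=200):
    """Write one JPG per PDF page into the PDF's folder and return the paths."""
    # Imported here so the path helper above works without the package.
    # pdf2image is an assumption, not necessarily what the original tool uses.
    from pdf2image import convert_from_path

    written = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for number, page in enumerate(pages, start=1):
        target = jpg_path_for(pdf_path, number)
        # JPEG has no alpha channel, so force RGB before saving.
        page.convert("RGB").save(target, "JPEG")
        written.append(target)
    return written
```

In a custom tool, `convert_pdf_to_jpgs` would be called once per incoming file name, and the returned paths could be passed on to the API call.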

What kind of data is returned from the Azure/Google API?

I would encourage you to give it a go. If you run into any issues or errors, you can post them in this forum. There is a bit of a trick to munge your Python code into a custom tool, but there are lots of smart people here to help.

Re: Input PDFs (they are images)

clant — Thu, 18 Apr 2019 15:24:05 GMT

Hi @tlarsen7572

So I have managed to get my Python working, but with an error. I have attached the yxi file. It is ugly, as I made it by following the beginner's guide to Python tools. It is currently giving me the error below.

Error: Python Convert PDF to Jpeg (7): Traceback (most recent call last):
File "PDFtoImage_Engine.py", line 171, in ii_push_record
Boost.Python.ArgumentError: Python argument types in
OutputAnchor.push_record(OutputAnchor, str)
did not match C++ signature:
push_record(class SRC::Python::OutputAnchor {lvalue}, class boost::shared_ptr<struct SRC::Python::ConstRecordRef> record, bool no_auto_close=False)

If anyone could help with how to fix this, I would appreciate it! The way this tool works is that you drop a Text Input tool in front of it containing the file location of a PDF. The tool then creates a JPG in the same location.

With regards to the data from Azure, this comes back as JSON, which should be OK to parse out.

Cheers

Chris

Re: Input PDFs (they are images)

tlarsen7572 — Thu, 18 Apr 2019 16:00:34 GMT

Very close! See attached for a corrected Python file (with a txt extension, because Alteryx's forum doesn't like me attaching Python scripts...). All of my changes have comments preceding them, so I hope they are easy to find. In summary, the changes I made were:

1. When you add the summary field to the record info in ii_init, save the field instance to a property. We will use that instance later to populate the string value.
2. In ii_push_record, you are assigning the out_record to a string. This is invalid; out_record needs to be a recordref object.
3. In your if statement in ii_push_record, populate the summary field with the set_from_string function.
4. Just after the if statement in ii_push_record, create the out_record recordref object by calling finalize_record on your record creator.

Now everything should work....at least it does on my system.
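For readers without the attachment, the changes listed above boil down to something like the sketch below. It is not the attached file. It assumes the legacy Alteryx Python SDK objects (a record creator, a field and an output anchor), and the function and parameter names are made up for illustration.

```python
# Assumed setup in ii_init (attribute names are illustrative):
#   self.summary_field = self.record_info_out.add_field(...)
#   self.record_creator = self.record_info_out.construct_record_creator()


def push_summary(record_creator, summary_field, output_anchor, text):
    """Build a one-field record containing `text` and push it downstream."""
    # Clear any values left over from the previous record.
    record_creator.reset()
    # Write the string into the field instance that was saved in ii_init.
    summary_field.set_from_string(record_creator, text)
    # finalize_record returns the record object that push_record expects.
    # Passing a plain str instead is what triggers the ArgumentError.
    out_record = record_creator.finalize_record()
    output_anchor.push_record(out_record)
```

Inside ii_push_record this would be called with the saved record creator, the saved field and the tool's output anchor, after the PDF has been converted.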

Re: Input PDFs (they are images)

clant — Tue, 23 Apr 2019 10:40:43 GMT

Thank you so much @tlarsen7572! I have attached my tool, which successfully converts PDFs to JPEGs. I have tested it with the Directory tool and it worked great!

Thank you again for your help on this! Here is the tool!

Cheers

Chris

Re: Input PDFs (they are images)

clant — Wed, 24 Apr 2019 10:52:15 GMT

Hi @tlarsen7572

Just an update for you, as I know you were interested. I have successfully completed this now! I am using the Read API on Azure. This makes it a little more complicated, because you have to POST the file and then use a GET to fetch the results, as the service needs some processing time. This has worked amazingly well.
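The "GET until it is ready" half of that flow can be sketched with the standard library as below. This is not the code from the tool. The `Ocp-Apim-Subscription-Key` header and the `status` field follow my reading of the Azure documentation, and the function names, attempt count and delay are assumptions.

```python
import json
import time
import urllib.request


def get_json(operation_url, subscription_key):
    """GET the operation URL returned by the POST and decode the JSON body."""
    request = urllib.request.Request(
        operation_url,
        headers={"Ocp-Apim-Subscription-Key": subscription_key},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_result(fetch, attempts=30, delay=1.0, sleep=time.sleep):
    """Call fetch() until the reported status is succeeded or failed."""
    for _ in range(attempts):
        result = fetch()
        # Status casing differs between API versions, so compare lowercase.
        status = str(result.get("status", "")).lower()
        if status == "succeeded":
            return result
        if status == "failed":
            raise RuntimeError("Azure reported that the read operation failed")
        sleep(delay)
    raise TimeoutError("read operation did not finish in time")
```

A typical call would be `wait_for_result(lambda: get_json(operation_url, key))`, where `operation_url` comes from the response to the POST.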

Cheers

Chris

Re: Input PDFs (they are images)

tlarsen7572 — Wed, 24 Apr 2019 11:17:22 GMT

Thanks for sharing the update, @clant! This is a really cool use-case for Alteryx, I am glad you were able to get it working. I am bookmarking this thread in case I ever have to do any kind of OCR extraction. As a department (Internal Audit), we sometimes have to deal with PDF/paper docs; I could see this being very helpful around things like contract analysis. Definitely next-level stuff.

Re: Input PDFs (they are images)

Awesomeville — Thu, 12 Sep 2019 09:37:56 GMT

Hi @clant, I am currently trying to do exactly what you did, but after reading this thread and downloading your zip file, I still don't really get how the whole thing works in Alteryx. Would you mind sharing your workflow, or elaborating on how this Python script is incorporated into Alteryx before calling the Read API on Azure?

Thank you in advance for your help. Appreciate it.

Cheers,

Nick

Re: Input PDFs (they are images)

Awesomeville — Tue, 17 Sep 2019 02:37:40 GMT

@tlarsen7572 Hello. Since I couldn't reach @clant to ask how his tool actually works, could I trouble you instead? I noticed you are one of the Alteryx dev founders and also took part in the Python tool contest, to say the least. I was hoping you could help me out here.

Currently I am trying to use @clant's tool to retrieve text (including handwriting) from PDF files which are actually scanned copies. Actually, maybe not his tool specifically, but the Python tool in general, to perform the conversion and retrieve the result, but I am failing. I was exploring Microsoft's Batch Read File OCR to do it, but I just can't seem to get it working. I don't really get where it goes wrong, because I pretty much followed what is described at https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/2afb498089f74080d7ef85eb. That said, I could not obtain the Operation-Location header to even start using Get Read Operation Result, though I probably wouldn't know how to join the two together even if I could.
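For reference, reading the Operation-Location header could be sketched like this with the standard library. It assumes the file is sent as raw bytes with an `application/octet-stream` content type and that `read_url` is the full Batch Read File URL for your resource and API version (for v2.0 I believe the path ends in `vision/v2.0/read/core/asyncBatchAnalyze`, but check the API reference). The function names are made up for illustration.

```python
import urllib.request


def build_read_request(read_url, subscription_key, file_bytes):
    """Build the POST request that submits a PDF or image for reading."""
    return urllib.request.Request(
        read_url,
        data=file_bytes,
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": subscription_key,
            "Content-Type": "application/octet-stream",
        },
    )


def submit_for_reading(read_url, subscription_key, file_bytes):
    """Send the file and return the URL to poll for the result."""
    request = build_read_request(read_url, subscription_key, file_bytes)
    # The result is not in the body: the service answers with a header
    # that points at the operation to query later with a GET.
    with urllib.request.urlopen(request) as response:
        operation_url = response.headers.get("Operation-Location")
    if not operation_url:
        raise RuntimeError("response did not include an Operation-Location header")
    return operation_url
```

If the key or URL is wrong, `urlopen` raises `urllib.error.HTTPError`, and the status code on that exception is usually the quickest clue to what went wrong.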

So I am turning to my alternative, which is to convert the scanned PDF files into PNG files before using the OCR to parse the text out (which I am able to do). This is where the Python SDK comes in, and it is where I am stuck. I would really appreciate it if you could explain how I can even begin to use it or get it functional.

Cheers,
Nick

Re: Input PDFs (they are images)

tlarsen7572 — Tue, 17 Sep 2019 14:54:05 GMT

Hey @Awesomeville, thanks for reaching out. I wouldn't call myself a 'founder' (I'm not an Alteryx employee or anything...). I just like to code and find the SDKs to be the funnest part of Alteryx.

Anyway, I'd be happy to try and help. My first thought is to try and get the Microsoft Batch Read API working. Based on my read of the documentation, it should be possible. But I have dealt with a lot of web APIs lately, and they can be tricky to get right. I'll see if I can get a free or low-cost subscription key to work with. We can either start with your code or I can put something together and send to you to try. Either way, if I cannot obtain a key, there will likely be a bit of back-and-forth before we get it working.

If you want to work on something together, one option for communication would be Slack. A few of us have a (mostly quiet) Slack channel (alteryxpython.slack.com) for facilitating custom tool work. Of course, if you prefer, we can always communicate here as well.

Re: Input PDFs (they are images)

tlarsen7572 — Tue, 17 Sep 2019 21:08:13 GMT

Hey @Awesomeville, so I ended up taking a shot at this Azure service today. I was able to sign up for the free tier and start testing things out.

I was able to get a custom tool working that sends images and PDFs to the Azure endpoint, waits for Azure to process the files, and then downloads and parses the results. I tested this on some handwritten sentences I wrote and scanned to PDF for testing, and am amazed at how well it works. I can see a huge potential use case for my department regarding things like contract analysis. This is a powerful OCR service Microsoft provides. The tool is attached to this message if you want to try it out. Let me know if you run into issues. Also, you can view the code here on GitHub.
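As a rough idea of what the "parse the results" step involves, here is a minimal sketch that pulls the text lines out of a read result. It is not taken from the tool. The field names (`recognitionResults` for the v2.0-style layout, `analyzeResult` and `readResults` for the v3-style layout) are my understanding of the Azure response format and should be checked against the API version in use. The sample data is invented.

```python
def extract_lines(result):
    """Return the recognised text lines, page by page, from a read result."""
    # Assumed v2.0-style layout: pages sit directly under recognitionResults.
    pages = result.get("recognitionResults")
    if pages is None:
        # Assumed v3-style layout: pages sit under analyzeResult.readResults.
        pages = result.get("analyzeResult", {}).get("readResults", [])
    return [line["text"] for page in pages for line in page.get("lines", [])]


if __name__ == "__main__":
    # Invented sample shaped like the assumed v2.0-style response.
    sample = {
        "status": "Succeeded",
        "recognitionResults": [
            {"page": 1, "lines": [{"text": "Order form"}, {"text": "Qty: 12"}]},
        ],
    }
    for line in extract_lines(sample):
        print(line)
```

Each returned string could then be pushed downstream as its own record, with the page number added as a second field if needed.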

If you want to talk about how it works, feel free to start a discussion and I can walk you through the code.

Re: Input PDFs (they are images)

Nick612Haylund — Tue, 17 Sep 2019 22:47:56 GMT

Not too shabby at all @tlarsen7572 (awesome)

Re: Input PDFs (they are images)

MattDuncan — Thu, 19 Sep 2019 08:27:25 GMT

Thanks for the awesome work!

I can't input the tool so will need to use the code on GitHub. Can you walk me through how to put this into a workflow? I'm quite new to the Python SDK world.

If you attach an example workflow showing how to convert a PDF into data, that would be perfect

Re: Input PDFs (they are images)

tlarsen7572 — Thu, 19 Sep 2019 10:09:28 GMT

Hey @MattDuncan, welcome to the Python SDK world!

The easiest place to start would be installing the tool from the yxi. What do you mean by, 'I can't input the tool'? Inside the zip should be a yxi file. Extract it and open it from Alteryx. Alteryx will present an installation dialog. Once you install the tool you can find it in the Laboratory tab:

If you cannot find the Laboratory tab, click the plus sign at the right of the tabs and make sure Laboratory is selected:

Once the tool is installed, start your workflow by creating a list of file paths you want converted. I usually use the Text Input tool or the Directory tool for this:

Add the OCR tool and configure it with the endpoint and key from your Azure portal:

The easiest way to get the endpoint and key is to go to the Overview or Quick start sections on Azure. This is what my Quick start looks like. I can copy the endpoint and key right from this page and paste it into the Alteryx tool:

And that should be it. The beauty of the Python SDK is that there is no configuration required on your end beyond installing the tool with the YXI file. If you are having an error doing so, let us know and we can troubleshoot.

Re: Input PDFs (they are images)

Jamie12 — Tue, 19 Nov 2019 16:41:06 GMT

Hi @tlarsen7572,

Thank you for sharing! I was able to successfully install the OCR tool in Alteryx. However, I've been having trouble locating the endpoint to use in the configuration since my Quick Start section in Azure doesn't look like yours in the screenshot. In an attempt to create an endpoint, I added a virtual machine in Azure with a static IP address and tried to use that as the endpoint. Though, I'm not sure if that is correct or necessary.

I was also unsure if the Subscription Key needed for the configuration is the same as the Subscription ID that I see in Azure. I would greatly appreciate any tips you have on how to overcome this!

Re: Input PDFs (they are images)

tlarsen7572 — Tue, 19 Nov 2019 17:07:00 GMT

Hi @Jamie12! Did you create a Computer Vision resource in your Azure portal? I just checked my Quick Start and it hasn't changed its appearance.

From the home page of your Azure portal, click 'Create a resource'

Search the marketplace for 'computer vision'. You should see something like below

Once you create the Computer Vision resource, you should have access to the Quick Start page that looks like mine and which will provide you with the key and the endpoint.

Does that help, or are you still unable to access the API?

Re: Input PDFs (they are images)

Jamie12 — Tue, 19 Nov 2019 17:31:38 GMT

That did the trick! Thank you so much @tlarsen7572!

Re: Input PDFs (they are images)

trettelap — Wed, 19 Feb 2020 13:52:10 GMT

Awesome tool, @tlarsen7572! Is there any way to see the backend code behind the tool? I know you can do this for macros, but I can't figure it out in this case. I am guessing this was developed using the Python SDK?

Re: Input PDFs (they are images)

tlarsen7572 — Wed, 19 Feb 2020 14:31:59 GMT

Hi @trettelap, glad you like the tool! It certainly was developed using the Python SDK.

The code is available on GitHub here. Also, you can see the code on your local PC at one of the following paths:

If you installed the tool as admin: C:\ProgramData\Alteryx\Tools\OCR

If you installed the tool user-specific: C:\Users\Your User Name\AppData\Roaming\Alteryx\Tools\OCR
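If you are not sure which of the two locations applies, a small script can list the candidate folders. This is only a sketch: it assumes the standard Windows `PROGRAMDATA` and `APPDATA` environment variables point at the folders above, and the function name is made up.

```python
import os
from pathlib import Path


def candidate_tool_dirs(tool_name="OCR", env=None):
    """Return the admin and per-user install folders to look in."""
    env = os.environ if env is None else env
    # PROGRAMDATA -> admin install, APPDATA -> user-specific install.
    roots = [env.get("PROGRAMDATA"), env.get("APPDATA")]
    return [Path(root) / "Alteryx" / "Tools" / tool_name for root in roots if root]


if __name__ == "__main__":
    for folder in candidate_tool_dirs():
        print(folder, "(found)" if folder.is_dir() else "(not found)")
```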

Re: Input PDFs (they are images)

agendel — Mon, 09 Mar 2020 14:29:07 GMT

Hey @tlarsen7572, I downloaded the OCR tool and was able to input a PDF into Alteryx. However, it seems that it's not taking any PDF documents above 200 KB. Do you have any idea why, and how I could fix this if possible? Thanks 🙂