
Alteryx Designer Discussions


Help Utilizing Microsoft Azure Batch Read File in Alteryx

Hi all,

 

I was wondering if anyone has tried using Microsoft Cognitive Services' Batch Read File to parse information from PDFs rather than image files (e.g. png, jpeg).

 

I have gone through the steps from the discussion below, but that method only works for image files and not for PDFs.

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Image-Face-Recognition-Using-Microsoft...

 

Aside from that, there is little documentation on the Batch Read File API, either from Microsoft Azure or elsewhere online. I tried using https://{endpoint}/vision/v2.0/read/core/asyncBatchAnalyze but it does not seem to work. I am getting this response:

Warning: Download (4): No data received from "https://{endpoint}/vision/v2.0/read/core/asyncBatchAnalyze"; Check the headers for details.

ConvError: JSON Parse (5): Error message: The document is empty. at character position: 0

My endpoint is exactly as generated, so that shouldn't be the issue, and I couldn't figure out what is wrong. It also seems this API is supposed to be used together with Get Read Operation Result, by retrieving the Operation-Location, which I don't really understand either because my result does not even return this field.

 

Before I get smashed for not checking other threads on how to parse PDFs to text: the PDFs I have are forms that were printed out, filled in by hand, and then scanned back in as PDF images, hence my idea of using OCR to retrieve the data. At least, that's the only approach I know of that can retrieve the data I require.

Currently, I am working on a use case that needs to parse 6000 PDFs per day and am exploring options for doing so. Feel free to recommend other APIs, or ways to use this particular Azure Batch Read API.

 

Any help is greatly appreciated.

 

Cheers,

Nick

Moderator

Hey @Awesomeville

Are you trying to send a PDF to the Azure API, have it parsed into text, and get the text back? This isn't 'truly' supported by Alteryx, but there are some things I can push your way.

 

https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Can-Alteryx-Parse-A-Word-Doc-Or-PDF...

https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/PDF-Parsing-in-Alteryx-using-R/ta-p...

 

You can also try this library within Jupyter - http://www.unixuser.org/~euske/python/pdfminer/index.html

 

You may need to reach out to Microsoft if there are any issues with the API setup as well.


Hopefully this helps get you in the right direction!

 

Thanks,
TrevorS

Community Moderator

Hey Nick,

 

Continuing from the dev space thread and PM...

 

This sort of thing is where the Python SDK can really shine.  There are 2 ways of running Python code in Alteryx: use the Python tool or create a custom tool that uses Python code as its engine.  In situations like this I prefer creating a custom tool because it kinda hides all the little implementation details.  When dealing with an API that requires multiple trips, like this one, it's mentally easier to have a tool that 'just does OCR' rather than always having to wade through a Jupyter notebook to remember what is going on.  Also, the custom tool setup is a bit cleaner if you want to unit-test your code (something I admittedly didn't do so well with this one...)

 

So, to answer the first question in your PM, that is why there is a tool installer in the zip file (attached here as well).  I packaged my code into a custom tool that now works just like any other Alteryx tool.

 

A basic workflow that uses the tool looks something like this.  First I send the tool a list of file paths I want to run OCR on:

OCR Workflow 1.PNG

 

Then I configure the tool by giving it the endpoint and subscription key from my Azure portal:

OCR Workflow 2.PNG

 

The tool is coded to send the binary file data to the OCR endpoint.  If successful, it gets a response from Microsoft containing the URL where we can download the OCR results when they are finished.  The tool polls that URL every 5 seconds until it gets either a failure or successful response.  If successful, it parses the JSON containing the OCR results and sends it to downstream tools:

OCR Workflow 3.PNG
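
As a rough standalone sketch of that submit-then-poll flow (not the tool's actual code, which follows below): the initial POST is answered with a 202 whose Operation-Location header holds the result URL, and that URL is then polled until the status is Succeeded or Failed. The function names, the max_wait cutoff, and the use of urllib instead of requests are my own choices for illustration, and I have not run this against a live endpoint.

```python
import json
import time
import urllib.request


def submit_file(batch_read_url, key, file_path):
    """POST the raw file bytes and return the Operation-Location result URL."""
    with open(file_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        batch_read_url,
        data=body,
        headers={
            "Content-Type": "application/octet-stream",
            "Ocp-Apim-Subscription-Key": key,
        },
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        if resp.status != 202:
            raise RuntimeError(f"unexpected status {resp.status}")
        # The result URL comes back in a header, not in the response body.
        return resp.headers["Operation-Location"]


def fetch_status(operation_url, key):
    """GET the operation URL once and return the decoded JSON."""
    req = urllib.request.Request(
        operation_url, headers={"Ocp-Apim-Subscription-Key": key}
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())


def wait_for_result(fetch, interval=5.0, max_wait=300.0, sleep=time.sleep):
    """Call fetch() until it reports Succeeded or Failed, or max_wait elapses.

    max_wait is a hypothetical safety cutoff so a stuck job cannot poll forever.
    """
    waited = 0.0
    while True:
        result = fetch()
        status = result.get("status")
        if status == "Succeeded":
            return result
        if status == "Failed":
            raise RuntimeError("The text recognition process failed")
        if waited >= max_wait:
            raise TimeoutError(f"no result after {waited:.0f}s (last status: {status})")
        sleep(interval)
        waited += interval
```

Usage would be along the lines of result_url = submit_file(url, key, path) followed by wait_for_result(lambda: fetch_status(result_url, key)). Passing the fetch step in as a callable keeps the polling loop easy to test without touching the network.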

 

The code that does this is on GitHub in ocr.py.  Most of ocr.py is scaffolding from the SDK so our tool can...be a tool.  The relevant code for actually interacting with the Cognitive Services endpoint is in the IncomingInterface class in the ii_push_record method:

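import json
import time

import requests
import AlteryxPythonSDK as Sdk

import parse_read_operation  # helper module from the same repo; exact import form assumed

# Excerpt: the rest of the class (field setup, output_info, etc.) lives in ocr.py
class IncomingInterface: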
    def ii_push_record(self, in_record: Sdk.RecordRef) -> bool:
        update_only = self.parent.alteryx_engine.get_init_var(self.parent.n_tool_id, 'UpdateOnly') == 'True'
        if update_only:
            return True

        file_path = self.uploadFileField.get_as_string(in_record)
        with open(file_path, mode='rb') as file:
            upload_bytes = file.read()

        key = self._decrypt_value(self.parent.key)
        headers = {"Content-Type": "application/octet-stream", "Ocp-Apim-Subscription-Key": key}
        batch_response = requests.post(self.batch_read_url, data=upload_bytes, headers=headers)
        if batch_response.status_code != 202:
            self.parent.display_error_msg(batch_response.text)
            return False

        time.sleep(5)

        headers = {"Ocp-Apim-Subscription-Key": key}
        operation_location: str = batch_response.headers['Operation-Location']
        # Poll until the operation reports Succeeded (break) or Failed (return)
        while True:
            get_read_response = requests.get(operation_location, headers=headers)
            if get_read_response.status_code != 200:
                self.parent.display_error_msg(get_read_response.text)
                return False
            get_read_json = json.loads(get_read_response.content)
            if get_read_json['status'] == 'Failed':
                self.parent.display_error_msg("The text recognition process failed")
                return False
            if get_read_json['status'] == 'Succeeded':
                break
            time.sleep(5)

        results = parse_read_operation.parse_recognition_results(get_read_json['recognitionResults'])
        for result in results:
            self.output_info.get_field_by_name('FilePath').set_from_string(self.output_creator, file_path)
            self.output_info.get_field_by_name('Page').set_from_int64(self.output_creator, result.page)
            self.output_info.get_field_by_name('ClockwiseOrientation').set_from_double(self.output_creator, result.clockwiseOrientation)
            self.output_info.get_field_by_name('PageWidth').set_from_double(self.output_creator, result.pageWidth)
            self.output_info.get_field_by_name('PageHeight').set_from_double(self.output_creator, result.pageHeight)
            self.output_info.get_field_by_name('Unit').set_from_string(self.output_creator, result.unit)
            self.output_info.get_field_by_name('Text').set_from_string(self.output_creator, result.text)
            self.output_info.get_field_by_name('TopLeftX').set_from_double(self.output_creator, result.topLeftX)
            self.output_info.get_field_by_name('TopLeftY').set_from_double(self.output_creator, result.topLeftY)
            self.output_info.get_field_by_name('TopRightX').set_from_double(self.output_creator, result.topRightX)
            self.output_info.get_field_by_name('TopRightY').set_from_double(self.output_creator, result.topRightY)
            self.output_info.get_field_by_name('BottomRightX').set_from_double(self.output_creator, result.bottomRightX)
            self.output_info.get_field_by_name('BottomRightY').set_from_double(self.output_creator, result.bottomRightY)
            self.output_info.get_field_by_name('BottomLeftX').set_from_double(self.output_creator, result.bottomLeftX)
            self.output_info.get_field_by_name('BottomLeftY').set_from_double(self.output_creator, result.bottomLeftY)
            data = self.output_creator.finalize_record()
            self.parent.output.push_record(data)
            self.output_creator.reset()
        return True
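
The helper parse_read_operation.parse_recognition_results is not shown above. As a hedged sketch of what that flattening step could look like, here is a function that turns the recognitionResults list into one flat row per line of text, using the same field names the tool outputs. The JSON layout (page, clockwiseOrientation, width, height, unit, lines, boundingBox, text) is assumed from the v2.0 response format as I remember it, so check it against a real response before relying on it; flatten_recognition_results is a made-up name, not the function from the repo.

```python
from typing import Any, Dict, Iterator, List


def flatten_recognition_results(
    recognition_results: List[Dict[str, Any]],
) -> Iterator[Dict[str, Any]]:
    """Yield one flat row per recognized line of text."""
    for page in recognition_results:
        for line in page.get("lines", []):
            # Assumption: boundingBox holds 8 numbers, the four corners
            # clockwise starting from the top-left: x1, y1, ..., x4, y4.
            x1, y1, x2, y2, x3, y3, x4, y4 = line["boundingBox"]
            yield {
                "Page": page["page"],
                "ClockwiseOrientation": page["clockwiseOrientation"],
                "PageWidth": page["width"],
                "PageHeight": page["height"],
                "Unit": page["unit"],
                "Text": line["text"],
                "TopLeftX": x1,
                "TopLeftY": y1,
                "TopRightX": x2,
                "TopRightY": y2,
                "BottomRightX": x3,
                "BottomRightY": y3,
                "BottomLeftX": x4,
                "BottomLeftY": y4,
            }
```

Each yielded dict maps directly onto one output record, so pages without any recognized lines simply produce no rows.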

 

Honestly, the code is a bit messy.  It's a minimum viable product that I haven't refactored and didn't do proper TDD with.  But if you have questions about it, feel free to ask.  Also, if you want to learn more about how Alteryx's Python SDK works, check out the docs; they are quite good.
