SOLVED

Input PDFs (they are images)

clant
8 - Asteroid

Hello all!

 

I am a bit stuck. I originally posted this in the Designer forums but did not get many responses.

 

The problem I have is this: we currently receive PDF documents that have been faxed to us. These are handwritten order forms. We need to run some form of handwriting OCR on these forms to help our data input guys out.

 

My idea is to write a Python tool that converts the PDF to a JPG. We can then upload this JPG to either Azure or Google OCR and get the results back. We have tested the Azure OCR and it worked great.
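
A minimal sketch of that conversion step, assuming the pdf2image package (which in turn needs poppler installed). The helper names and the 300 DPI default are illustrative assumptions, not the code from the tool itself:

```python
from pathlib import Path


def page_image_path(pdf_path, page_number):
    """Return a sibling path such as order_page1.jpg for page 1 of order.pdf."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}_page{page_number}.jpg")


def pdf_to_jpgs(pdf_path, dpi=300):
    """Render every page of a PDF to its own JPEG and return the written paths."""
    # Imported here so the naming helper above works without pdf2image installed.
    from pdf2image import convert_from_path  # needs poppler on the PATH

    written = []
    pages = convert_from_path(str(pdf_path), dpi=dpi)
    for number, page in enumerate(pages, start=1):
        target = page_image_path(pdf_path, number)
        page.save(target, "JPEG")  # each page is a PIL image
        written.append(target)
    return written
```

Writing one file per page keeps each image small enough to send to an OCR service on its own.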

 

So far I have written the Python that does the PDF-to-JPG conversion, and I am trying and failing to make this into a tool in Alteryx. (At the moment I am getting "TypeError: __init__() takes 2 positional arguments but 4 were given", but I think I can work this out.)
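
For anyone hitting the same message: it usually means the class is being constructed with more arguments than its __init__ accepts. The sketch below reproduces it in plain Python, assuming the engine builds the plugin with three arguments (tool ID, engine, output anchor manager) as the SDK examples do. The class names here are hypothetical stand-ins:

```python
class BrokenPlugin:
    # Only one parameter besides self, so three arguments are too many.
    def __init__(self, n_tool_id):
        self.n_tool_id = n_tool_id


class FixedPlugin:
    # Accepts the three arguments the engine is assumed to pass.
    def __init__(self, n_tool_id, alteryx_engine, output_anchor_mgr):
        self.n_tool_id = n_tool_id
        self.alteryx_engine = alteryx_engine
        self.output_anchor_mgr = output_anchor_mgr


def try_construct(plugin_class):
    """Build the class the way the engine is assumed to; return any TypeError text."""
    try:
        plugin_class(1, object(), object())
    except TypeError as error:
        return str(error)
    return None


print(try_construct(BrokenPlugin))  # the "takes 2 ... but 4 were given" message
print(try_construct(FixedPlugin))
```

The count in the message includes self, which is why three passed arguments are reported as four.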

 

Can someone please advise whether what I am doing will actually work, or whether there is a better way to do this?

 

Thank you!

 

Cheers

 

Chris

29 REPLIES
harrymp33
5 - Atom

I have a personal list of apps that convert PNG files to PDF very easily; these apps work on both Android and iOS. Here is the list of apps: convert png to pdf apps

Sri9
8 - Asteroid

I couldn't extract the zip file, as it's empty.

GalegO
7 - Meteor

Hi All,

 

Sorry for bringing this topic back to life, but is it possible for someone to update clant's PDFToImage tool (04-23-2019 01:54 AM) to use Python 3.8?

 

If not, could someone explain to me how I can do it?

 

Thank you!

tlarsen7572
11 - Bolide

@GalegO, the easiest way to do this is to package the tool as a YXI. Alteryx will create a venv for the tool when you run it as an installer. The attached YXI should do it.

 

YXI files are just ZIP files with a different file extension. To build the attached YXI, I removed the existing venv files and simplified requirements.txt to the bare minimum required. I tested on my system and everything works.
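
Because a YXI is an ordinary ZIP archive, it can be built and inspected with Python's zipfile module. This is a rough sketch only: the function names and the "venv" folder name are assumptions for illustration, and the layout of your tool folder may differ.

```python
import zipfile
from pathlib import Path


def pack_yxi(source_dir, yxi_path, skip_dirs=("venv",)):
    """Zip source_dir into yxi_path, leaving out any folder named in skip_dirs.

    Keep yxi_path outside source_dir so the archive does not include itself.
    """
    source = Path(source_dir)
    with zipfile.ZipFile(yxi_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file in sorted(source.rglob("*")):
            relative = file.relative_to(source)
            in_skipped_dir = set(relative.parts[:-1]) & set(skip_dirs)
            if file.is_file() and not in_skipped_dir:
                archive.write(file, relative.as_posix())
    return yxi_path


def list_yxi(yxi_path):
    """Return the file names stored in a YXI, e.g. to spot an empty archive."""
    with zipfile.ZipFile(yxi_path) as archive:
        return archive.namelist()
```

Calling list_yxi on a downloaded file is a quick way to check whether the archive really is empty before trying to install it.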

 

Note that you will need to download poppler and add its bin directory to your system path. Otherwise, the tool will not work.

GalegO
7 - Meteor

@tlarsen7572 Thank you for replying. I got the error (attached) during the install.

 

Also, about poppler: when you said to add its bin directory to the system path, did you mean (1) a Windows folder, with the environment variable edited to point to that path, (2) the Alteryx bin folder, (3) the Alteryx custom tools folder, or (4) something else?

 

Thank you again for the help!

GalegO
7 - Meteor

@tlarsen7572 good news, I managed to make it work!

 

The install error was solved by closing and reopening Alteryx. So, no error in the installer.

 

For poppler, I just did the following:

 

  1. Downloaded the latest Windows package (.zip)
  2. Extracted the package
  3. Moved the extracted directory to the desired place on my system
  4. Added the bin/ directory to my PATH system environment variable
  5. Tested that all went well by opening cmd and making sure that I can call pdftoppm -h
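
The same check as step 5 can be done from Python, which is handy because it tests the PATH that Python itself sees. A small sketch using the standard library (the function names are my own):

```python
import shutil


def find_poppler_tool(name="pdftoppm", search_path=None):
    """Return the full path to a poppler executable, or None if it is not found."""
    # search_path=None means "use the PATH environment variable".
    return shutil.which(name, path=search_path)


def poppler_status(name="pdftoppm", search_path=None):
    """Build a short message saying whether the tool can be found."""
    location = find_poppler_tool(name, search_path)
    if location is None:
        return f"{name} not found: add poppler's bin directory to PATH"
    return f"{name} found at {location}"


print(poppler_status())
```

If this reports "not found" while cmd can run pdftoppm, restarting the application that launches Python is worth trying, since a running process may still hold the old PATH.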

I see that I need to update some code because my PDF has multiple pages, but this won't be difficult.

 

Thank you for the help!

tlarsen7572
11 - Bolide

I'm glad you were able to get it working! I'm not 100% sure why the YXI install process sometimes errors out, but restarting Alteryx and opening the YXI from Alteryx usually does the trick for me, too.

GalegO
7 - Meteor

@tlarsen7572 I was able to change the code to output images for all the pages, but I could not work out how to do the same for the tool's output, which shows only the last page.

tlarsen7572
11 - Bolide

You should be able to do this by indenting the code that generates the records so that it falls under the for loop in ii_push_record. Something like this:

 

def ii_push_record(self, in_record: object) -> bool:
	"""
	Responsible for pushing records out
	Called when an input record is being sent to the plugin.
	:param in_record: The data for the incoming record.
	:return: False if method calling limit (record_cnt) is hit.
	"""
	# Copy the data from the incoming record into the outgoing record.
	self.record_creator.reset()
	self.record_copier.copy(self.record_creator, in_record)
										   
	if self.parent.input_field.get_as_string(in_record) is not None:
		input = self.parent.input_field.get_as_string(in_record)
		#pages = convert_from_path(Path,325)
		pages = convert_from_path(input,325)
		index = 0
		for page in pages:
			index += 1
			output = input + str(index) +'test.jpg'
			page.save(output,'JPEG')

			# DELETED ********************************************************************************************************
			#out_record = output

			# ADDED *********************************************************************************************************
			self.summary_field.set_from_string(self.record_creator, output)

			#ADDED *********************************************************************************************************
			out_record = self.record_creator.finalize_record()

			# Push the record downstream and quit if there's a downstream error.
			if not self.parent.output_anchor.push_record(out_record):
				return False
	 
	return True
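
To see why the indentation matters, here is a stripped-down sketch with no Alteryx dependencies. The push callback is a hypothetical stand-in for output_anchor.push_record, and the page names are made up:

```python
def push_after_loop(pages, push):
    """Push sits outside the loop, so only the last page is sent."""
    output = None
    for index, _page in enumerate(pages, start=1):
        output = f"page{index}.jpg"
    if output is not None:
        push(output)


def push_inside_loop(pages, push):
    """Push is indented under the loop, so there is one record per page."""
    for index, _page in enumerate(pages, start=1):
        output = f"page{index}.jpg"
        if not push(output):
            return False  # stop early if downstream refuses the record
    return True


sent = []
push_inside_loop(["p1", "p2", "p3"], lambda record: sent.append(record) or True)
print(sent)
```

The first function mirrors the "only the last page" behavior; the second mirrors the indented version above.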

 

GalegO
7 - Meteor

Wow, I had only indented up to this line:

			#ADDED *********************************************************************************************************
			out_record = self.record_creator.finalize_record()

 

And I was wondering whether there was something to do in ii_close.

 

Thank you again, this tool will be a life saver :)

 

PS: The code to generate the pages was the same as mine, apart from the name of the variable that counts the pages.