Alteryx Designer Desktop Discussions

Alteryx OCR Tools

WellyLiyanto
8 - Asteroid

Hi All

 

Just thought I would share this tool I made with you all, based on this post:

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Is-anybody-using-OCR-optical-character...

 

I built this tool with the Python pytesseract library. Before using it, you need to install Tesseract OCR to get the language data, available from https://github.com/UB-Mannheim/tesseract/wiki

 

For now, the Python code points to the default Tesseract OCR install folder, C:\Program Files\Tesseract-OCR. Since Alteryx is installed on 64-bit Windows, this should be the same for every Windows user with the default C: drive. Feel free to change the path in the Python code if needed.
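
A minimal sketch of looking up the Tesseract executable instead of relying on a single hardcoded folder. It assumes nothing about the tool's actual code beyond the `pytesseract.pytesseract.tesseract_cmd` setting; the helper name `find_tesseract` and the candidate folder list are hypothetical.

```python
import os
import shutil

# Default install folder mentioned in the post; add your own folders as needed.
DEFAULT_DIRS = (r"C:\Program Files\Tesseract-OCR",)


def find_tesseract(candidate_dirs=DEFAULT_DIRS):
    """Return the path to a tesseract executable, or None if none is found."""
    for folder in candidate_dirs:
        for name in ("tesseract.exe", "tesseract"):
            path = os.path.join(folder, name)
            if os.path.isfile(path):
                return path
    # Fall back to whatever is on the PATH (returns None when absent).
    return shutil.which("tesseract")


tesseract_path = find_tesseract()
print("Tesseract found at:", tesseract_path)

# In the Python tool, the result would then be assigned like this:
#     import pytesseract
#     pytesseract.pytesseract.tesseract_cmd = tesseract_path
```

If the function returns None, Tesseract OCR is most likely not installed, or it is installed in a folder that is not in the candidate list.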

 

clipboard_image_0.png

 

I also included sample images in English.rar so you can see which files can be scanned and which cannot (those return null).
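
A rough sketch of how a loop over images can return null for files that cannot be scanned. This is not the tool's actual code: the function names `read_text` and `ocr_images` are hypothetical, and it assumes that Pillow and pytesseract are installed when the default reader is used.

```python
def read_text(image_path):
    """OCR a single image with pytesseract (imported lazily)."""
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def ocr_images(image_paths, ocr_func=read_text):
    """Return one text result per path; None where OCR fails or finds nothing."""
    results = []
    for path in image_paths:
        try:
            text = ocr_func(path)
        except Exception:
            # Unreadable file, unsupported format, missing Tesseract, etc.
            text = None
        # Treat empty or whitespace-only output as "nothing recognized".
        results.append(text if text and text.strip() else None)
    return results
```

Passing a different `ocr_func` makes it possible to try the loop without Tesseract installed.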

18 REPLIES
upul_guna_19
5 - Atom

I need to read engineering drawings and pull length, width, and diameter data from AutoCAD or PDF files. Will this OCR tool work for this application?

 

Thank you for sharing.

 

- Upul

WellyLiyanto
8 - Asteroid

Hi @upul_guna_19,

Sorry for the late response.

 

Just to make sure: is the data from your AutoCAD or PDF files in the form of a DWG file that contains images and design data? The OCR tool can only read JPG, PNG, or BMP files.
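
A small sketch of filtering files down to the three types mentioned above before sending them to the OCR tool. The function name `is_supported_image` is hypothetical, and the extension set simply mirrors the list in this reply.

```python
from pathlib import Path

# File types the OCR tool is said to read: jpg, png, bmp.
SUPPORTED_EXTENSIONS = {".jpg", ".png", ".bmp"}


def is_supported_image(file_path):
    """Return True when the file extension is one the OCR tool can read."""
    return Path(file_path).suffix.lower() in SUPPORTED_EXTENSIONS


files = ["scan1.JPG", "drawing.dwg", "page.pdf", "logo.png"]
print([f for f in files if is_supported_image(f)])
```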

 

For PDFs, you can use this tool from the Gallery:

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

Of course, you will need to do a lot of parsing to get specific data, since it only reads the text in the image. To get the attributes of the data, you will need to look further into how to read DWG files.
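
To give an idea of the parsing involved, here is a minimal sketch that pulls length, width, and diameter values out of extracted text with a regular expression. The label format (for example "Length: 120.5") and the sample text are assumptions for illustration; real drawings will likely need different patterns.

```python
import re

# Matches e.g. "Length: 120.5", "Width = 40", "Diameter 12.7" (case-insensitive).
DIMENSION_PATTERN = re.compile(
    r"\b(length|width|diameter)\b\s*[:=]?\s*(\d+(?:\.\d+)?)",
    re.IGNORECASE,
)


def extract_dimensions(text):
    """Return a dict with the first value found for each dimension label."""
    found = {}
    for name, value in DIMENSION_PATTERN.findall(text):
        found.setdefault(name.lower(), float(value))
    return found


# Hypothetical text, as it might come out of a PDF or OCR step.
sample = "LENGTH: 120.5 mm\nWidth = 40\nDiameter 12.7"
print(extract_dimensions(sample))
```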

 

Also, for reading DWG files, you could try converting them to SHP files to extract the information.

 

Cheers

 

Welly

Saraabdi955
8 - Asteroid

Hi @WellyLiyanto 
I ran the OCR workflow and install_pill, but I still get this error:

Error: OCR (5): Tool #1: ---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-1-94adb072d5fc> in <module>
5
6 if Package.isPackageInstalled("pytesseract") == False:
----> 7 Package.installPackages(['tesseract','pytesseract'])
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Package.py in installPackages(package, install_type, debug)
200 print(pip_install_result["msg"])
201 if not pip_install_result["success"]:
--> 202 raise pip_install_result["err"]
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Utils.py in runSubprocess(args_list, debug)
118
119 try:
--> 120 result = subprocess.check_output(args_list, stderr=subprocess.STDOUT)
121 if debug:
122 print("[Subprocess success!]")
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
409 kwargs['input'] = '' if kwargs.get('universal_newlines', False) else b''
410
--> 411 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
412 **kwargs).stdout
413
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
510 retcode = process.poll()
511 if check and retcode:
--> 512 raise CalledProcessError(retcode, process.args,
513 output=stdout, stderr=stderr)
514 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['c:\\program files\\alteryx\\bin\\miniconda3\\envs\\jupytertool_venv\\python.exe', '-I', '-m', 'pip', 'install', 'tesseract', 'pytesseract']' returned non-zero exit status 1.

Error: OCR (5): Tool #1: ---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-b47efc3914ed> in <module>
1 from ayx import Alteryx
2 from PIL import Image
----> 3 import pytesseract
4
5 pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
ModuleNotFoundError: No module named 'pytesseract'



I would be very grateful for your help.
Regards
Sara

 

WellyLiyanto
8 - Asteroid

Hi @Saraabdi955 

 

I'm sorry for the late reply. It's been a busy week, haha.

 

From what I have seen, I believe you actually only need to try to run the tool as an administrator first to install the python package.

 

If you still have an error after that, please let me know.
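
A quick sketch for checking from the Python tool whether the package is actually available after the install step. The helper name `is_installed` is hypothetical; the `Package.installPackages` call in the comment is copied from the traceback above.

```python
import importlib.util


def is_installed(module_name):
    """Return True when a top-level module can be found by the interpreter."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("pytesseract"):
    print("pytesseract is available")
else:
    # The tool installs it with the call below, which may need
    # Designer to be started as an administrator:
    #     from ayx import Package
    #     Package.installPackages(['tesseract','pytesseract'])
    print("pytesseract is missing")
```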

 

Best Regards

 

Welly

Idyllic_Data_Geek
8 - Asteroid

@WellyLiyanto I have a requirement to convert a scanned image to text...will your tool solve for it?

WellyLiyanto
8 - Asteroid

@Idyllic_Data_Geek Yes, this tool should provide the basic capability to scan text in images, though it is not as advanced as Alteryx Intelligence Suite, haha.

paredesg
5 - Atom

All my img_text results are Null. Is this correct?

Capturaocr.JPG

WellyLiyanto
8 - Asteroid

Hi @paredesg, can you send me 1 sample image?

NeilFisk
9 - Comet

Like some others who have posted, I am not able to run the tool. I am running the non-admin version of Alteryx Designer, and my installation of Tesseract is in a different path, so that is all I changed. I'm not a coder/programmer, and I'm a novice with Python. The error I get is as follows:

 

Error: OCR (5): Tool #1: ---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-a7b0f53046c8> in <module>
5 pytesseract.pytesseract.tesseract_cmd = r'C:\Users\username\AppData\Local\Tesseract-OCR\tesseract'
6
----> 7 img_list = Alteryx.read("#1")
8 img_text = []
9
c:\users\username\appdata\local\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\ayx\export.py in read(incoming_connection_name, debug, **kwargs)
33 When running the workflow in Alteryx, this function will convert incoming data streams to pandas dataframes when executing the code written in the Python tool. When called from the Jupyter notebook interactively, it will read in a copy of the incoming data that was cached on the previous run of the Alteryx workflow.
34 """
---> 35 return __CachedData__(debug=debug).read(incoming_connection_name, **kwargs)
36
37
c:\users\username\appdata\local\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\ayx\CachedData.py in read(self, incoming_connection_name)
304 try:
305 # get the data from the sql db (if only one table exists, no need to specify the table name)
--> 306 data = db.getData()
307 # print success message
308 print("".join(["SUCCESS: ", msg_action]))
c:\users\username\appdata\local\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\ayx\Datafiles.py in getData(self, data, metadata)
498 if data is None:
499 # read in data as a list of numpy ndarrays
--> 500 data = self.connection.read_nparrays()
501 # check if data is a list of numpy structs
502 elif isinstance(data, list) and all(
RuntimeError: DataWrap2WrigleyDb::GoRecord: Attempt to seek past the end of the file
