Alteryx OCR Tools
Hi all,
I thought I would share these tools I made with you all, based on this post:
I built them with the Python pytesseract library. Before using them, you need to install Tesseract OCR to get the language data, from https://github.com/UB-Mannheim/tesseract/wiki
For now, the Python code points to the default Tesseract OCR installation folder, C:\Program Files\Tesseract-OCR (since Alteryx is installed on 64-bit Windows, this should be the same for every Windows user with a default C: drive). Feel free to change the path in the Python code if needed.
I also included sample images in English.rar so you can see which files can be scanned and which cannot (those return null).
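For anyone curious what a tool like this does under the hood, here is a minimal sketch. It is not the actual code from the macro. It assumes the pytesseract and Pillow packages are installed, and `find_tesseract` and `image_to_text` are hypothetical helper names made up for illustration. Only the default folder and the `pytesseract.pytesseract.tesseract_cmd` setting come from this thread.

```python
import os

# Default install folder named in the post; change it if Tesseract lives elsewhere.
DEFAULT_TESSERACT_DIR = r"C:\Program Files\Tesseract-OCR"


def find_tesseract(install_dirs=(DEFAULT_TESSERACT_DIR,)):
    """Return the path to the tesseract executable, or None if it is not found.

    Hypothetical helper: checks each folder for the executable.
    """
    for folder in install_dirs:
        for name in ("tesseract.exe", "tesseract"):
            candidate = os.path.join(folder, name)
            if os.path.isfile(candidate):
                return candidate
    return None


def image_to_text(image_path, tesseract_cmd):
    """Return the text found in one image, or None if no text was read.

    Hypothetical helper: mirrors the "returns null" behaviour described above.
    """
    # Imported here so find_tesseract() still works before the packages are installed.
    from PIL import Image
    import pytesseract

    # Tell pytesseract where the Tesseract executable is.
    pytesseract.pytesseract.tesseract_cmd = tesseract_cmd
    try:
        with Image.open(image_path) as img:
            text = pytesseract.image_to_string(img)
    except pytesseract.TesseractError:
        # Tesseract ran but failed on this image.
        return None
    text = text.strip()
    return text or None


if __name__ == "__main__":
    print("Tesseract found at:", find_tesseract())
```

If `find_tesseract()` prints `None`, the folder in the code does not match where Tesseract was installed, and that is the path to change.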
- Labels:
- Custom Tools
- Input
- Macros
- Python
I need to read engineering drawings and pull length, width, and diameter data from AutoCAD or PDF files. Will this OCR tool work for this application?
Thank you for sharing.
- Upul
Hi @upul_guna_19,
Sorry for the late response.
Just to make sure: is the data from your AutoCAD or PDF files stored as DWG files that contain images and designs? The OCR tools can only read JPG, PNG, or BMP files.
For PDFs, you can use this tool from the Gallery:
https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b
Of course, you will need to do a lot of parsing to get specific data, since it only reads the text in the image. To get the attributes of the data, you will need to look further into how to read DWG files.
For reading DWG files, you could also try converting them to SHP files and extracting the information from those.
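To illustrate the file type limit mentioned above, here is a small sketch that separates files the OCR tools can read from those they cannot, before anything is sent to OCR. The function name `split_by_support` and the sample file names are hypothetical. The extension list contains only the three types named in this reply; whether variants such as `.jpeg` also work is not confirmed here.

```python
from pathlib import Path

# The three image types named in this reply.
SUPPORTED_EXTENSIONS = {".jpg", ".png", ".bmp"}


def split_by_support(paths):
    """Split file paths into (supported, unsupported) lists by extension."""
    supported, unsupported = [], []
    for path in paths:
        # Compare case-insensitively so "scan.JPG" is treated like "scan.jpg".
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


if __name__ == "__main__":
    ok, skipped = split_by_support(["scan.JPG", "plan.dwg", "report.pdf", "label.png"])
    print("Send to OCR:", ok)
    print("Needs another tool:", skipped)
```

Files in the second list, such as DWG or PDF, would need a different route, for example the PDF Input tool linked above.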
Cheers
Welly
Hi @WLAYX
I ran the OCR workflow and install_pill, but I still get this error:
Error: OCR (5): Tool #1: ---------------------------------------------------------------------------
CalledProcessError Traceback (most recent call last)
<ipython-input-1-94adb072d5fc> in <module>
5
6 if Package.isPackageInstalled("pytesseract") == False:
----> 7 Package.installPackages(['tesseract','pytesseract'])
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Package.py in installPackages(package, install_type, debug)
200 print(pip_install_result["msg"])
201 if not pip_install_result["success"]:
--> 202 raise pip_install_result["err"]
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Utils.py in runSubprocess(args_list, debug)
118
119 try:
--> 120 result = subprocess.check_output(args_list, stderr=subprocess.STDOUT)
121 if debug:
122 print("[Subprocess success!]")
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
409 kwargs['input'] = '' if kwargs.get('universal_newlines', False) else b''
410
--> 411 return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
412 **kwargs).stdout
413
c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in run(input, capture_output, timeout, check, *popenargs, **kwargs)
510 retcode = process.poll()
511 if check and retcode:
--> 512 raise CalledProcessError(retcode, process.args,
513 output=stdout, stderr=stderr)
514 return CompletedProcess(process.args, retcode, stdout, stderr)
CalledProcessError: Command '['c:\\program files\\alteryx\\bin\\miniconda3\\envs\\jupytertool_venv\\python.exe', '-I', '-m', 'pip', 'install', 'tesseract', 'pytesseract']' returned non-zero exit status 1.
Error: OCR (5): Tool #1: ---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<ipython-input-2-b47efc3914ed> in <module>
1 from ayx import Alteryx
2 from PIL import Image
----> 3 import pytesseract
4
5 pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract'
ModuleNotFoundError: No module named 'pytesseract'
I would be very grateful for your help.
Regards
Sara
Hi @Saraabdi955
I'm sorry for the late reply. It's been a busy week, haha.
From what I have seen, you should only need to run the tool as an administrator once so that the Python package can be installed.
If you still have an error after that, please let me know.
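If you want to check this yourself first, here is a rough diagnostic sketch using only the Python standard library. It rests on an assumption, not a confirmed diagnosis: that the install fails because the current user cannot write to the environment's package folder. `is_installed` and `can_write_to` are hypothetical helper names.

```python
import importlib.util
import sysconfig
import tempfile


def is_installed(module_name):
    """Return True if a top-level module can be found by the current interpreter."""
    return importlib.util.find_spec(module_name) is not None


def can_write_to(folder):
    """Return True if the current user can create a file in the folder."""
    try:
        # Create and immediately discard a temporary file in the folder.
        with tempfile.TemporaryFile(dir=folder):
            return True
    except OSError:
        # Covers missing folders and permission errors.
        return False


if __name__ == "__main__":
    site_packages = sysconfig.get_paths()["purelib"]
    print("pytesseract installed:", is_installed("pytesseract"))
    print("Package folder:", site_packages)
    print("Writable by current user:", can_write_to(site_packages))
```

If the last line prints `False`, that would be consistent with needing administrator rights for the install step.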
Best Regards
Welly
@WLAYX I have a requirement to convert a scanned image to text. Will your tool handle that?
@Idyllic_Data_Geek Yes, this tool should provide the basic capability to read text from images, though it is not as advanced as the Alteryx Intelligence Suite, haha.
All my img_text results are Null. Is this correct?
Hi @paredesg, can you send me 1 sample image?
Like some others who have posted, I am not able to run the tool. I am running the non-admin version of Alteryx Designer and my installation of Tesseract is in a different path, so that is all I changed. I'm not a coder or programmer, and I'm a novice with Python. The error I get is as follows:
Error: OCR (5): Tool #1: ---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
<ipython-input-2-a7b0f53046c8> in <module>
5 pytesseract.pytesseract.tesseract_cmd = r'C:\Users\username\AppData\Local\Tesseract-OCR\tesseract'
6
----> 7 img_list = Alteryx.read("#1")
8 img_text = []
9
c:\users\username\appdata\local\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\ayx\export.py in read(incoming_connection_name, debug, **kwargs)
33 When running the workflow in Alteryx, this function will convert incoming data streams to pandas dataframes when executing the code written in the Python tool. When called from the Jupyter notebook interactively, it will read in a copy of the incoming data that was cached on the previous run of the Alteryx workflow.
34 """
---> 35 return __CachedData__(debug=debug).read(incoming_connection_name, **kwargs)
36
37
c:\users\username\appdata\local\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\ayx\CachedData.py in read(self, incoming_connection_name)
304 try:
305 # get the data from the sql db (if only one table exists, no need to specify the table name)
--> 306 data = db.getData()
307 # print success message
308 print("".join(["SUCCESS: ", msg_action]))
c:\users\username\appdata\local\alteryx\bin\miniconda3\envs\designerbasetools_venv\lib\site-packages\ayx\Datafiles.py in getData(self, data, metadata)
498 if data is None:
499 # read in data as a list of numpy ndarrays
--> 500 data = self.connection.read_nparrays()
501 # check if data is a list of numpy structs
502 elif isinstance(data, list) and all(
RuntimeError: DataWrap2WrigleyDb::GoRecord: Attempt to seek past the end of the file
