Last updated on 2 April 2015
A recent project of mine called for optical character recognition (OCR). After a brief Google search and a personal recommendation, I decided to use tesseract because it is cross-platform, under active development, and has a Python API (pytesseract).
Installing these was surprisingly easy:
tesseract has a Windows installer which comes with the English language data available here.
pytesseract can be installed using pip:
pip install pytesseract
pytesseract states that it requires the Python Imaging Library (PIL). However, that project no longer appears to be active, so I used its maintained fork, pillow. This can be installed using pip:
pip install pillow
And that’s it!
You should now be able to do some optical character recognition with Python:
import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('test.jpg')))
As always, if you have any comments or suggestions please feel free to get in touch.
Nice and easy explained. Thanks !
Hello, thanks for the guide. Do you know how to do an image to Spanish_language string using pytesseract?
Hi Sesha
I’m sorry for the slow reply, I hope the following can still be of use to you or others:
I have not tested this, but if you have the Spanish language package for tesseract installed, you can specify the language to use in pytesseract like this:
print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='spa'))
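If you are not sure whether the 'spa' data is installed, a rough way to check is to ask the tesseract binary itself with its --list-langs option. The sketch below is untested against every tesseract release; the helper names parse_langs and installed_langs are my own, and it assumes the output is a header line followed by one language code per line.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumes a header line starting with "List of" followed by one code per line.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs(cmd="tesseract"):
    """Run the tesseract binary and return the language codes it reports."""
    proc = subprocess.run(
        [cmd, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # Both streams are combined because, as far as I know, some releases
    # write the list to stderr rather than stdout.
    return parse_langs(proc.stdout + "\n" + proc.stderr)


# Usage (requires tesseract on the system path):
#   if "spa" in installed_langs():
#       print("Spanish language data is available")
```

Any warnings tesseract prints would also end up in the returned list, so treat this as a quick check rather than a robust parser.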
hello grimhacker, i tried the steps accordingly but when i try to run the code it gives me this error below.
WindowsError: [Error 2] The system cannot find the file specified
my actual code is:
image = cv2.imread("test.png",0)
cv2.imshow("text", image)
img = Image.fromarray(image)
print pytesseract.image_to_string(img)
cv2.waitKey(0)
any suggestion?
Hi Paul,
I have seen that error previously when the python library is installed but the tesseract binary is not in your system path.
Make sure you have tesseract installed and that it is working, using the information here: https://github.com/tesseract-ocr/tesseract/wiki
Let me know if that works for you 🙂
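For anyone else hitting this, here is a minimal sketch of how you might check for the binary from Python before calling pytesseract. The helper name find_executable and the example install path are my own assumptions, not part of pytesseract; tesseract_cmd is the module-level setting pytesseract uses for the command it runs.

```python
import os
import shutil


def find_executable(name="tesseract", candidates=()):
    """Return the path to an executable, or None if it cannot be found.

    Looks on the system path first, then falls back to explicit candidate paths.
    """
    found = shutil.which(name)
    if found:
        return found
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


# Usage (assumes pytesseract is installed; the path below is a hypothetical
# install location, adjust it to wherever the installer put tesseract.exe):
#   import pytesseract
#   path = find_executable("tesseract", [r"C:\Program Files\Tesseract-OCR\tesseract.exe"])
#   if path is None:
#       print("tesseract binary not found; install it or add it to your path")
#   else:
#       pytesseract.pytesseract.tesseract_cmd = path
```

Note that tesseract_cmd needs to point at the tesseract executable itself, not at a directory or at the Python package.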
hi, i’m having a problem with pytesseract. i have been through paul’s error however, another error sprung.
“Traceback (most recent call last):
File “”, line 1, in
a()
File “”, line 5, in a
p.image_to_string(img)
File “C:\Python34\lib\site-packages\pytesseract\pytesseract.py”, line 163, in image_to_string
errors = get_errors(error_string)
File “C:\Python34\lib\site-packages\pytesseract\pytesseract.py”, line 110, in get_errors
error_lines = tuple(line for line in lines if line.find(‘Error’) >= 0)
File “C:\Python34\lib\site-packages\pytesseract\pytesseract.py”, line 110, in
error_lines = tuple(line for line in lines if line.find(‘Error’) >= 0)
TypeError: ‘str’ does not support the buffer interface”
i have read in other webpages about bytes, strings and encodings. your reply is very much appreciated.
cheers!
Hi John
I think this is a known issue in pytesseract that occurs when the image you are converting contains characters in a different encoding: https://github.com/madmaze/pytesseract/issues/32
I don’t think there is anything I can do to help you solve that one, sorry 🙁
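For what it is worth, the traceback suggests a bytes versus str mismatch: on Python 3 a subprocess returns bytes, and calling find('Error') with a str argument on a bytes line raises a TypeError. The sketch below only illustrates that general pattern by decoding before searching; it is not pytesseract's actual code or fix, and the function name and the choice of UTF-8 are my assumptions.

```python
def find_error_lines(raw_output, encoding="utf-8"):
    """Return the lines of tesseract's error output that mention 'Error'.

    Accepts bytes or str so the same check works on Python 2 and 3.
    """
    if isinstance(raw_output, bytes):
        # Decode first so the search compares text with text;
        # undecodable bytes are replaced instead of raising.
        text = raw_output.decode(encoding, errors="replace")
    else:
        text = raw_output
    return tuple(line for line in text.splitlines() if "Error" in line)
```

Calling find_error_lines(b"Error opening data file") returns a one-element tuple containing that line as text.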
A pytesseract installation using pip, in March 2017, did not appear to include the changes from the latest merged pull request, number 33. PR 33 handles potential encoding issues in the output of Tesseract-OCR.
The pytesseract project page – https://pypi.python.org/pypi/pytesseract, appears to reflect an upload date of 2015-03-19. Is that the date of the files installed when using pip? How does pip get “forced” to update to and install the latest “official” version from Github?
Pip’s a nice tool for convenient installation, but if packages aren’t installed using the current version, that seems to diminish the value.
Hi
As far as I am aware PyPI does not generate packages, it is up to the maintainer to upload the latest version of their module.
You might consider opening an Issue on the project’s GitHub page requesting PyPI be updated with the latest version, or if you have a pressing need to use the latest version, follow the instructions in the readme to install from source:
$> git clone git@github.com:madmaze/pytesseract.git
$ (env)> python setup.py install
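To see which version pip actually installed, you can ask Python for the package metadata. This is a small sketch assuming Python 3.8 or later (for importlib.metadata); the helper name installed_version is my own.

```python
from importlib import metadata  # available from Python 3.8


def installed_version(package):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version pip installed, or None if pytesseract is not installed.
print(installed_version("pytesseract"))
```

Comparing that version with the one listed on PyPI and the one in the GitHub repository should show whether the install is behind. pip can also install straight from a Git URL, for example pip install git+https://github.com/madmaze/pytesseract.git, which avoids the manual clone.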
Hi, I’m trying to use pytesseract but I’m having the same problem on Windows 8 and 10, with Python 3.4 and 3.6.
Python 3.6.1 (v3.6.1:69c0db5, Mar 21 2017, 17:54:52) [MSC v.1900 32 bit (Intel)]
on win32
Type “help”, “copyright”, “credits” or “license” for more information.
>>> from PIL import Image
>>> import pytesseract
>>> img = Image.open('C:\\Users\\User\\Desktop\\Docs\\20170124_184232.jpg')
>>> img
>>> pytesseract.pytesseract.tesseract_cmd = pytesseract.__path__
>>> print(pytesseract.image_to_string(img))
Traceback (most recent call last):
File “”, line 1, in
File “C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pytesseract\pytesseract.py”, line 109, in image_to_string
config=config)
File “C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\site-packages\pytesseract\pytesseract.py”, line 42, in run_tesseract
stderr=subprocess.PIPE)
File “C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py”, line 707, in __init__
restore_signals, start_new_session)
File “C:\Users\User\AppData\Local\Programs\Python\Python36-32\lib\subprocess.py”, line 990, in _execute_child
startupinfo)
PermissionError: [WinError 5] Acesso negado
>>>
I already tried to run as admin and change the directory of the image.
Thanks! 🙂