Installing pytesseract – practically painless

A recent project of mine called for optical character recognition (OCR). After a brief Google search and a personal recommendation, I decided to use tesseract because it is cross-platform, under active development, and has a Python wrapper (pytesseract).

Installing these was surprisingly easy:

tesseract has a Windows installer, available here, which comes with the English language data.

pytesseract can be installed using pip:

pip install pytesseract

pytesseract states that it requires the Python Imaging Library (PIL). However, that project no longer appears to be active, so I used pillow, its maintained fork. This can also be installed using pip:

pip install pillow

And that’s it!

You should now be able to do some optical character recognition with Python:

import pytesseract
from PIL import Image
print(pytesseract.image_to_string(Image.open('test.jpg')))
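
If the call above fails with a "file not found" style error, one likely cause is that the tesseract executable is not on your PATH. The sketch below is one way to handle that. Note that find_tesseract, ocr_image, and the example install path in the usage note are illustrative names and guesses of mine, not part of pytesseract; the pytesseract.pytesseract.tesseract_cmd attribute is the setting the pytesseract readme describes for pointing at the binary.

```python
import os
import shutil


def find_tesseract(candidates=()):
    """Return a path to the tesseract executable, or None if none is found.

    The PATH is checked first, then each explicit candidate path in order.
    """
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def ocr_image(image_path, candidates=()):
    """Run OCR on one image, telling pytesseract where the binary lives.

    Requires pytesseract, pillow and tesseract itself to be installed.
    """
    import pytesseract
    from PIL import Image

    exe = find_tesseract(candidates)
    if exe is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = exe
    return pytesseract.image_to_string(Image.open(image_path))
```

For example, ocr_image('test.jpg', [r'C:\Program Files\Tesseract-OCR\tesseract.exe']) would try the PATH first and then that assumed install location; adjust the path to wherever the installer actually put tesseract.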

 


As always, if you have any comments or suggestions please feel free to get in touch.

9 thoughts on “Installing pytesseract – practically painless”

    1. Hi Sesha
      I’m sorry for the slow reply, I hope the following can still be of use to you or others:
      I have not tested this, but if you have the Spanish language package for tesseract installed, you can specify the language to use in pytesseract like this:

      print(pytesseract.image_to_string(Image.open('test-european.jpg'), lang='spa'))
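
      Tesseract also accepts several languages at once, joined with a plus sign (for example spa+eng), provided the data for each language is installed. A minimal sketch of building that string follows; build_lang and ocr_with_languages are hypothetical helper names, not pytesseract functions.

```python
def build_lang(codes):
    """Join Tesseract language codes into the 'a+b' form used by the lang option."""
    cleaned = [code.strip() for code in codes if code and code.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_with_languages(image_path, codes):
    """Run OCR with the given languages.

    Requires pytesseract, pillow, and the matching tesseract language data.
    """
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_string(Image.open(image_path), lang=build_lang(codes))
```

      Calling ocr_with_languages('test-european.jpg', ['spa', 'eng']) would then pass lang='spa+eng' to pytesseract.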

  1. hello grimhacker, i tried the steps accordingly but when i try to run the code it gives me this error below.
    WindowsError: [Error 2] The system cannot find the file specified

    my actual code is:
    image = cv2.imread("test.png",0)
    cv2.imshow("text", image)
    img = Image.fromarray(image)
    print pytesseract.image_to_string(img)
    cv2.waitKey(0)
    any suggestion?

  2. hi, i’m having a problem with pytesseract. i have been through paul’s error however, another error sprung.
    "Traceback (most recent call last):
    File "", line 1, in
    a()
    File "", line 5, in a
    p.image_to_string(img)
    File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 163, in image_to_string
    errors = get_errors(error_string)
    File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 110, in get_errors
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
    File "C:\Python34\lib\site-packages\pytesseract\pytesseract.py", line 110, in
    error_lines = tuple(line for line in lines if line.find('Error') >= 0)
    TypeError: 'str' does not support the buffer interface"

    i have read in other webpages about bytes, strings and encodings. your reply is very much appreciated.
    cheers!
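
    This TypeError is a common Python 3 symptom of mixing bytes and text: the output captured from the tesseract process most likely arrives as bytes, and searching it with a str pattern fails. A minimal sketch of the general fix is to decode before searching. get_error_lines is a hypothetical stand-in, not pytesseract's actual code, and UTF-8 is an assumed encoding.

```python
def get_error_lines(error_output, encoding="utf-8"):
    """Return the lines of process output that mention 'Error'.

    Accepts bytes or str; bytes are decoded first so the str search
    below works on Python 3.
    """
    if isinstance(error_output, bytes):
        # errors="replace" avoids a second failure on undecodable bytes.
        error_output = error_output.decode(encoding, errors="replace")
    return tuple(line for line in error_output.splitlines() if "Error" in line)
```

    Passing either b'Error: something' or 'Error: something' to this helper works, because the bytes case is converted to text before the search.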

  3. A pytesseract installation using pip, in March 2017, did not appear to include updates from the latest merged pull request, number 33. PR 33 handles potential encoding issues in the output of Tesseract-OCR.

    The pytesseract project page (https://pypi.python.org/pypi/pytesseract) appears to show an upload date of 2015-03-19. Is that the date of the files installed when using pip? How can pip be "forced" to update to and install the latest "official" version from GitHub?

    Pip’s a nice tool for convenient installation, but if packages aren’t installed using the current version, that seems to diminish the value.

    1. Hi
      As far as I am aware PyPI does not generate packages, it is up to the maintainer to upload the latest version of their module.
      You might consider opening an Issue on the project’s GitHub page requesting PyPI be updated with the latest version, or if you have a pressing need to use the latest version, follow the instructions in the readme to install from source:
      $> git clone git@github.com:madmaze/pytesseract.git
      $ (env)> python setup.py install
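
      Before deciding whether you need the source install, it can help to see which version pip actually gave you. A small sketch using the standard library's importlib.metadata (Python 3.8 or later); installed_version is an illustrative helper name of mine.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version pip installed, or None if pytesseract is not installed.
print(installed_version("pytesseract"))
```

      You can then compare the printed version against the releases listed on the project's PyPI and GitHub pages.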
