Ocropus Windows Installer

OCRopus is a collection of document analysis programs, not a turn-key OCR system. In order to apply it to your documents, you may need to do some image preprocessing, and possibly also train new models.

In addition to the recognition scripts themselves, there are a number of scripts for ground truth editing and correction, measuring error rates, determining confusion matrices, etc. OCRopus commands will generally print a stack trace along with an error message; this is not generally indicative of a problem (in a future release, we'll suppress the stack trace by default since it seems to confuse too many users).

Installing

To install OCRopus dependencies system-wide:
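The commands that belong here are missing from this copy. As a rough sketch only, a system-wide install run from the root of an ocropy checkout could look like the function below. It assumes the checkout contains a PACKAGES file listing the required apt packages, a models/ directory, and setup.py. The function name and the MODEL_URL variable are placeholders introduced for this example; the download location of en-default.pyrnn.gz is not given in the text, so you have to supply it.

```shell
# Sketch of a system-wide install; run from the root of an ocropy checkout.
# Assumptions: PACKAGES lists the apt packages, models/ already exists,
# and MODEL_URL (placeholder) points at en-default.pyrnn.gz.
install_ocropy_systemwide() {
    # Install the distribution packages named in PACKAGES.
    sudo apt-get install $(cat PACKAGES) || return 1
    # Download the default English model and move it into models/.
    wget -nd "$MODEL_URL" || return 1
    mv en-default.pyrnn.gz models/ || return 1
    # Install the Python package and the ocropus-* command line tools.
    sudo python setup.py install
}
```

To use it, set MODEL_URL to wherever you obtain the model and then call install_ocropy_systemwide from inside the checkout.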

Alternatively, dependencies can be installed into a Python Virtual Environment:
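The original commands are not preserved here. A minimal sketch, assuming the checkout ships a requirements.txt and that the virtualenv tool is installed, might look like this. The function name and the environment directory ocropus_venv are illustrative choices, not something the text prescribes.

```shell
# Sketch: install the Python dependencies into a virtual environment.
# Run from the root of an ocropy checkout (assumes requirements.txt exists).
install_ocropy_deps_virtualenv() {
    # Create the environment; the directory name is arbitrary.
    virtualenv ocropus_venv/ || return 1
    # Activate it in the current shell ("." is the portable form of "source").
    . ocropus_venv/bin/activate || return 1
    # Install the Python packages the checkout asks for.
    pip install -r requirements.txt
}
```

Because the environment is activated in the current shell, a later python setup.py install (without sudo) would install OCRopus into ocropus_venv rather than system-wide.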

An additional method using Conda is also possible:
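The Conda commands are also missing from this copy. The following is a hedged sketch, assuming Python 2.7 (the version used elsewhere on this page) and a requirements.txt in the checkout. The environment name ocropus_env and the function name are made up for the example; newer Conda releases use conda activate in place of source activate.

```shell
# Sketch: install the dependencies into a Conda environment.
# Run from the root of an ocropy checkout (assumes requirements.txt exists).
install_ocropy_deps_conda() {
    # Create an environment with a Python 2.7 interpreter.
    conda create -n ocropus_env python=2.7 || return 1
    # Activate it (older Conda syntax; newer versions: conda activate ocropus_env).
    source activate ocropus_env || return 1
    # Install the packages listed in requirements.txt.
    conda install --file requirements.txt
}
```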

To test the recognizer, run:

Running

To recognize pages of text, you need to run separate commands: binarization, page layout analysis, and text line recognition. The default parameters and settings of OCRopus assume 300 dpi binary black-on-white images. If your images are scanned at a different resolution, the simplest thing to do is to downscale/upscale them to 300 dpi. The text line recognizer is fairly robust to different resolutions, but the layout analysis is quite resolution dependent.
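As a small illustration of the rescaling arithmetic, the helper below computes the resize factor (in percent) that brings a scan of a known resolution to 300 dpi. The function name scale_percent is hypothetical and not part of OCRopus.

```shell
# Resize factor, in percent, that maps a scan at src dpi onto 300 dpi.
scale_percent() {
    awk -v src="$1" 'BEGIN { printf "%.1f\n", 300 / src * 100 }'
}

scale_percent 200   # prints 150.0 (upscale a 200 dpi scan)
scale_percent 600   # prints 50.0  (downscale a 600 dpi scan)
```

If ImageMagick happens to be installed (an assumption; it is not part of OCRopus), the factor can be applied with something like convert scan.png -resize "$(scale_percent 200)%" scan-300dpi.png.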

Here is an example for a page of Fraktur text (German); you need to download the Fraktur model from tmbdev.net/ocropy/fraktur.pyrnn.gz to run this example:
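The example commands did not survive in this copy. The sketch below strings together the stages named above (binarization, page layout analysis, text line recognition) plus hOCR output. It assumes the ocropus-* tools are on your PATH after installation. The wrapper function and its two arguments are invented for illustration, and the output names book and ersch.html are arbitrary.

```shell
# Sketch of the recognition pipeline for one page.
#   $1: scanned page image (for example tests/ersch.png from the checkout)
#   $2: the downloaded Fraktur model (for example models/fraktur.pyrnn.gz)
recognize_fraktur_page() {
    page=$1
    model=$2
    # 1. Binarization: writes numbered .bin.png pages into book/.
    ocropus-nlbin "$page" -o book || return 1
    # 2. Page layout analysis: cuts every page into text line images.
    #    The glob is quoted so that the tool expands it itself.
    ocropus-gpageseg 'book/????.bin.png' || return 1
    # 3. Text line recognition with the given model, on four cores (-Q 4).
    ocropus-rpred -Q 4 -m "$model" 'book/????/??????.bin.png' || return 1
    # 4. Collect the recognized lines into an hOCR (HTML) file.
    ocropus-hocr 'book/????.bin.png' -o ersch.html
}
```

Calling recognize_fraktur_page tests/ersch.png models/fraktur.pyrnn.gz should leave the result in ersch.html, which can be opened in any browser.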

There are some things the currently trained models for ocropus-rpred will not handle well, largely because they are nearly absent in the current training data. That includes all-caps text, some special symbols (including '?'), typewriter fonts, and subscripts/superscripts. This will be addressed in a future release, and, of course, you are welcome to contribute new, trained models.

You can also generate training data using ocropus-linegen:

This will create a directory 'linegen/...' containing training data suitable for training OCRopus with synthetic data.

Roadmap

Project Announcements

  • The text line recognizer has been ported to C++ and is now a separate project, the CLSTM project, available here: https://github.com/tmbdev/clstm
  • New GPU-capable text line recognizers and deep-learning based layout analysis methods are in the works and will be published as separate projects some time in 2017.
  • Please welcome @zuphilip and @kba as additional project maintainers. @tmb is busy developing new DNN models for document analysis (among other things). (10/15/2016)

A lot of excellent packages have become available for deep learning, vision, and GPU computing over the last few years. At the same time, it has become feasible now to address problems like layout analysis and text line following through attentional and reinforcement learning mechanisms. I (@tmb) am planning on developing new software using these new tools and techniques for the traditional document analysis tasks. These will become available as separate projects.

Note that for text line recognition and language modeling, you can also use the CLSTM command line tools. Except for taking different command line options, they are otherwise drop-in replacements for the Python-based text line recognizer.

Contributing

OCRopy and CLSTM are both command line driven programs. The best way to contribute is to create new command line programs using the same (simple) persistent representations as the rest of OCRopus.

The biggest needs are in the following areas:

  • text/image segmentation
  • text line detection and extraction
  • output generation (hOCR and hOCR-to-* transformations)

CLSTM vs OCRopy

The CLSTM project (https://github.com/tmbdev/clstm) is a replacement for ocropus-rtrain and ocropus-rpred in C++ (it used to be a subproject of ocropy but has been moved into a separate project now). It is significantly faster than the Python versions and has minimal library dependencies, so it is suitable for embedding into C++ programs.

Python and C++ models cannot be interchanged, both because the save file formats are different and because the text line normalization is slightly different. Error rates are about the same.

In addition, the C++ command line tool (clstmctc) has different command line options and currently requires loading training data into HDF5 files, instead of being trained off a list of image files directly (image file-based training will be added to clstmctc soon).

The CLSTM project also provides LSTM-based language modeling that works very well with post-processing and correcting OCR output, as well as solving a number of other OCR-related tasks, such as dehyphenation or changes in orthography (see our publications). You can train language models using clstmtext.

Generally, your best bet for CLSTM and OCRopy is to rely only on the command line tools; that makes it easy to replace different components. In addition, you should keep your OCR training data in .png/.gt.txt files so that you can easily retrain models as better recognizers become available.
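One way to keep such a collection of .png/.gt.txt files healthy is to check regularly that every line image still has its transcription. The sketch below assumes a naming scheme in which name.png or name.bin.png sits next to name.gt.txt; the function name is hypothetical and the scheme should be adapted to your own data.

```shell
# Print every line image under a directory that lacks a ground truth file.
# Assumed layout: <name>.png or <name>.bin.png alongside <name>.gt.txt.
missing_ground_truth() {
    find "$1" -name '*.png' | sort | while read -r img; do
        base=${img%.png}     # strip the image extension
        base=${base%.bin}    # strip a ".bin" marker if present
        [ -f "$base.gt.txt" ] || echo "$img"
    done
}
```

Running missing_ground_truth book prints nothing when the training set is complete, which makes it easy to use as a check before retraining.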

After making CLSTM a full replacement for ocropus-rtrain/ocropus-rpred, the next step will be to replace the binarization, text/image segmentation, and layout analysis in OCRopus with trainable 2D LSTM models.

Ocropus Windows Installer

I'm working on a project and need to use OCRopus. I tried to install it on Windows but failed, so I moved to Ubuntu. I'm not very experienced with Ubuntu, so I'm stuck now.

I have installed Python 2.7 and all the requirements (1 and 2), and I've also installed OpenCV.

Then I tried to install ocropy as written in this link:

But it failed at this line: mv en-default.pyrnn.gz models/

I got the following message:
mv: cannot move ‘en-default.pyrnn.gz’ to ‘models/’: Not a directory

I actually don't understand the command: the previous line downloads a .gz file, then we are supposed to move it to a models directory (which has not been created yet!), and then we need to run setup.py, which is not there.

So I don't know if I'm missing something.

Why is this happening?

Hendk

1 Answer

The steps are misleading as written; do this instead:

You need to clone the master tree in order to proceed with the installation.
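To see why the mv step fails outside the source tree, here is a small reproduction in a scratch directory. The file is an empty stand-in for the real model, and the directory is a temporary one created just for the demonstration.

```shell
# Reproduce the failure in a scratch directory, then show what changes it.
demo=$(mktemp -d)
touch "$demo/en-default.pyrnn.gz"    # empty stand-in for the downloaded model

# There is no models/ directory yet, so mv has nowhere to put the file.
if ! mv "$demo/en-default.pyrnn.gz" "$demo/models/" 2>/dev/null; then
    echo "mv failed: $demo/models/ does not exist"
fi

# Once the directory exists, the very same command succeeds.
mkdir "$demo/models"
mv "$demo/en-default.pyrnn.gz" "$demo/models/"
ls "$demo/models"
```

The install steps assume they are run where models/ and setup.py are already present, which is what cloning the source tree gives you.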

You need git, which you can install via apt-get install git.
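Put together, fetching the source could look roughly like the sketch below. The repository URL is an assumption based on where the project was hosted (alongside the CLSTM repository); adjust it if the project has moved. The function name is made up for the example.

```shell
# Sketch: install git, clone the source tree, and step into it.
fetch_ocropy() {
    # git is needed for the clone; harmless if it is already installed.
    sudo apt-get install git || return 1
    # Assumed repository location; change it if the project has moved.
    git clone https://github.com/tmbdev/ocropy.git || return 1
    cd ocropy || return 1
    # Both should be listed now, so the mv and setup.py steps have what they need.
    ls -d models setup.py
}
```

After fetch_ocropy returns you are inside the checkout, and the remaining installation steps can be run from there.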

Aizuddin Zali