README.txt

Timestamp:

2020-06-16T15:00:39+12:00 (4 years ago)

Author:

ak19

Message:

In order to get tika + tesseract to OCR PDFs (note that tesseract can't OCR PDFs on its own), need to pass a tika-config.xml file to tika that is configured to use txt OR hocr as outputType, and if outputType=hocr then need to have the tesseract/tessdata/configs folder contain a file called hocr at minimum. Now the build process ensures that the tessdata/configs and other tessdata subfolders in the extracted tesseract source package get copied across into the GEXTTESS_INSTALLED install location. Updating the README with the notes and the tesseract bin tarball.

File:

1 edited

gs2-extensions/tesseract/trunk/README.txt (modified) (3 diffs)

gs2-extensions/tesseract/trunk/README.txt

-              r34184
+              r34186
 -------------------------------------------------
+COMPILING TESSERACT GS2-EXTENSION
+CONTENTS
+-------------------------------------------------
+In this file:
+A. COMPILING TESSERACT GS2-EXTENSION
+& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
+B. GETTING TIKA AND TESSERACT TO OCR A PDF
+-------------------------------------------------
+A. COMPILING TESSERACT GS2-EXTENSION
 & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
 -------------------------------------------------
 …
    cat out.txt
+. If successful, create a folder at the same level as src alled tesseract
+. If successful,
+a. create a folder at the same level as src called tesseract
    cd src
    cd ..
    mkdir tesseract
 COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
+b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
    cp src/setup.ba* tesseract/.
    mv src/linux tesseract/.
 COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
 Amercan spelling!) from src/packages into the cut-down tesseract/linux:
+c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
+American spelling!) from src/packages into the cut-down tesseract/linux:
    cp src/packages/*LICENSE.txt tesseract/linux/.
+d. REMOVE folder "man" from tesseract/linux:
+   rm -rf tesseract/linux/man
 . Create a tarball of the cut down tesseract folder named tesseract-<os>-<arch>.tar.gz:
 …
    svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz
 -------------------------------------------------
+B. GETTING TIKA AND TESSERACT TO OCR A PDF
+-------------------------------------------------
+Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
+Trying to do so, you'll see:
+       tesseract pdf05-notext.pdf notext
+       Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
+       Error in pixReadStream: Pdf reading is not supported
+       Error in pixRead: pix not read
+       Error during processing.
+Tesseract can OCR the individual images constituting a page of the PDF, but to OCR PDFs
+with Tesseract, you need an additional tool to split a PDF into its pages and extract images
+from them, feed each page's image into Tesseract to get it OCR-ed and then create an html or
+txt file collating all the individual OCR-ed page content.
+Tika does this.
+By default, if Tika is on the environment and TESSDATA_PREFIX is set to the tessdata folder
+containing the language files, Tika is able to get Tesseract to OCR images out of the box,
+but not PDFs. For PDFs, Tika's (x)html/txt output will contain only the already extractable
+text and no OCR content until the following is correct.
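The out-of-the-box precondition described above can be checked with a small shell sketch.
The function name check_tessdata is made up for illustration, and the assumption that
language files are named <lang>.traineddata reflects how Tesseract usually names them
rather than anything stated in this README.

```shell
#!/bin/sh
# Hypothetical helper (not part of the extension): reports whether
# TESSDATA_PREFIX points at a folder that holds Tesseract language files.
check_tessdata() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set"
        return 1
    fi
    if [ ! -d "$TESSDATA_PREFIX" ]; then
        echo "TESSDATA_PREFIX is not a folder: $TESSDATA_PREFIX"
        return 1
    fi
    # Assumption: language files are named <lang>.traineddata
    for f in "$TESSDATA_PREFIX"/*.traineddata; do
        if [ -f "$f" ]; then
            echo "found language file: $f"
            return 0
        fi
    done
    echo "no .traineddata files in $TESSDATA_PREFIX"
    return 1
}
```

Calling check_tessdata before running Tika gives an early hint when image OCR is unlikely
to work; it says nothing about whether PDF OCR is configured.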
+To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, needed to do 2 more
+things:
+. Have to run the tika-app-*.jar with a --config=/path/to/a/tika-config.xml file
+configured correctly for the TesseractOCRParser and PDFParser
+. The tika-config.xml file passed to tika-app-*.jar should set the "outputType" param
+of the TesseractOCRParser as follows:
+   a. Set the "outputType" param to "txt" so that Tesseract produces its OCR output in .txt
+      format, which Tika will intercept and process,
+   b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set
+   the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
+   $TESSDATA_PREFIX/configs containing the following (taken from
+   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
+               tessedit_create_hocr 1
+               hocr_font_info 0
+. In order to have a tessdata/configs/hocr, needed to correct the Tesseract we were
+cascade-making to get it to put the "configs" subfolder inside the installed Tesseract's
+tessdata folder. The source version of tesseract has this folder, but it wasn't getting
+included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
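
The hocr variant of the steps above can be sketched in shell as follows. The layout of the
tika-config.xml is an assumption based on the parser configuration format of Tika 1.x and
is not taken from this changeset. The "ocrStrategy" param on the PDFParser, the function
names and the file locations are likewise illustrative. Only the two lines written to
configs/hocr come from the README text above.

```shell
#!/bin/sh
# Sketch only: function names and paths are illustrative, not from the README.

# Write the two-line hocr config quoted in the README into
# <tessdata>/configs/hocr, unless a file of that name is already there.
ensure_hocr_config() {
    tessdata_dir="$1"
    mkdir -p "$tessdata_dir/configs"
    if [ ! -f "$tessdata_dir/configs/hocr" ]; then
        printf 'tessedit_create_hocr 1\nhocr_font_info 0\n' \
            > "$tessdata_dir/configs/hocr"
    fi
}

# Write a tika-config.xml with the given outputType (txt or hocr).
# Assumption: Tika 1.x <properties>/<parsers> config layout.
write_tika_config() {
    config_file="$1"
    output_type="$2"
    cat > "$config_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">$output_type</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- assumed setting, not from the README -->
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}
```

With TESSDATA_PREFIX exported, a possible use of the two helpers is:
   ensure_hocr_config "$TESSDATA_PREFIX"
   write_tika_config ./tika-config.xml hocr
   java -jar tika-app-*.jar --config=./tika-config.xml pdf05-notext.pdf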

Changeset 34186 for gs2-extensions/tesseract/trunk/README.txt
