Timestamp:
2020-06-16T15:00:39+12:00 (4 years ago)
Author:
ak19
Message:

To get Tika + Tesseract to OCR PDFs (note that Tesseract can't OCR PDFs on its own), a tika-config.xml file must be passed to Tika that is configured to use txt OR hocr as outputType. If outputType=hocr, the tesseract/tessdata/configs folder must contain, at minimum, a file called hocr. The build process now ensures that the tessdata/configs and other tessdata subfolders in the extracted tesseract source package get copied across into the GEXTTESS_INSTALLED install location. Updating the README with the notes and the tesseract bin tarball.

File:
1 edited

  • gs2-extensions/tesseract/trunk/README.txt

    r34184 r34186  
    11-------------------------------------------------
    2 COMPILING TESSERACT GS2-EXTENSION
     2CONTENTS
     3-------------------------------------------------
     4In this file:
     5
     6A. COMPILING TESSERACT GS2-EXTENSION
     7& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
     8
     9B. GETTING TIKA AND TESSERACT TO OCR A PDF
     10
     11
     12-------------------------------------------------
     13A. COMPILING TESSERACT GS2-EXTENSION
    314& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
    415-------------------------------------------------
     
    3243   cat out.txt
    3344
    34 5. If successful, create a folder at the same level as src alled tesseract
     455. If successful,
     46
     47a. create a folder at the same level as src called tesseract
    3548   cd src
    3649   cd ..
    3750   mkdir tesseract
    3851
    39 COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
     52b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
    4053
    4154   cp src/setup.ba* tesseract/.
    4255   mv src/linux tesseract/.
    4356
    44 COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
    45 Amercan spelling!) from src/packages into the cut-down tesseract/linux:
     57c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
     58American spelling!) from src/packages into the cut-down tesseract/linux:
    4659
    4760   cp src/packages/*LICENSE.txt tesseract/linux/.
    4861
     62d. REMOVE folder "man" from tesseract/linux:
     63   rm -rf tesseract/linux/man
    4964
    50656. Create a tarball of the cut down tesseract folder named tesseract-<os>-<arch>.tar.gz:
     
    5772   svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz
    5873
    59  
     74
    6075-------------------------------------------------
     76B. GETTING TIKA AND TESSERACT TO OCR A PDF
     77-------------------------------------------------
     78Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
     79Trying to do so, you'll see:
     80       tesseract pdf05-notext.pdf notext
     81       Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
     82       Error in pixReadStream: Pdf reading is not supported
     83       Error in pixRead: pix not read
     84       Error during processing.
     85
     86Tesseract can OCR the individual page images of a PDF, but to OCR a whole PDF with
     87Tesseract you need an additional tool that splits the PDF into its pages, extracts the
     88image of each page, feeds each image into Tesseract to get it OCR-ed, and then collates
     89all the individual OCR-ed page content into a single html or txt file.
     90
     91Tika does this.
     92
     93By default, if Tika is in the environment and TESSDATA_PREFIX is set to the tessdata folder
     94containing the language files, Tika is able to get Tesseract to OCR images out of the box,
     95but not PDFs. Until the following is set up correctly, Tika extracts only the already
     96extractable text from PDFs and produces no OCR content in its (x)html/txt output.
     97
     98To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, we needed to do the
     99following:
     1001. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the
     101tika-config.xml file is configured correctly for the TesseractOCRParser and PDFParser.
     1022. In the tika-config.xml file passed to tika-app-*.jar, configure the "outputType"
     103param of the TesseractOCRParser as follows:
     104   a. Set the "outputType" param to "txt" so that Tesseract produces its OCR output in
     105      .txt format, which Tika will intercept and process,
     106   b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set
     107      the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
     108      $TESSDATA_PREFIX/configs containing the following (taken from
     109      https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
     110               tessedit_create_hocr 1
     111               hocr_font_info 0
     112
     1133. In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
     114cascade-making so that it puts the "configs" subfolder inside the installed Tesseract's
     115tessdata folder. The source version of tesseract has this folder, but it wasn't getting
     116included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
     117
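
If an installed tessdata folder is still missing configs/hocr, the file described in
section B can be created by hand. The following is only a minimal sketch: the function
name write_hocr_config is hypothetical (it is not part of this extension or of Tesseract),
and it assumes the two config lines quoted in section B are all that is needed.

```shell
# Hypothetical helper (not part of the gs2-extension): writes the minimal
# "hocr" config file into <tessdata>/configs.
# Usage: write_hocr_config [/path/to/tessdata]   (defaults to $TESSDATA_PREFIX)
write_hocr_config() {
    tessdata_dir="${1:-${TESSDATA_PREFIX:-}}"
    if [ -z "$tessdata_dir" ]; then
        echo "No tessdata folder given and TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    # Create the configs subfolder if the build did not install it
    mkdir -p "$tessdata_dir/configs" || return 1
    # File contents as quoted in section B (from the tessconfigs repository)
    printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' \
        > "$tessdata_dir/configs/hocr"
}
```

For example, run write_hocr_config "$TESSDATA_PREFIX" once before running tika-app-*.jar
with a tika-config.xml whose "outputType" is set to "hocr".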