Timestamp:
2020-06-16T15:22:04+12:00 (4 years ago)
Author:
ak19
Message:

Committing the tika-config.xml that sets up Tika's PDFParser and TesseractOCRParser to OCR PDFs. Without this, despite Tika detecting Tesseract, PDFs weren't getting OCR-ed. This problem wasn't documented anywhere either, and only by chance did I find what was needed: that a correctly configured tika-config.xml was compulsory to get PDFs OCR-ed by Tika+Tesseract, and that the Tesseract installation I created had been missing TESSDATA_PREFIX/configs/hocr.

File:
1 edited

  • gs2-extensions/gstika/trunk/GS_TIKA_README.txt

    r34177 r34187  
     1--------------------------------------------------------------
     2CONTENTS:
     3--------------------------------------------------------------
     4
     5A. Some background information on Apache Tika and related:
     6B. Here are some examples of running Tika on the command line:
     7C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
     8D. THE --encoding= FLAG TO TIKA
     9E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
     10F. COMPILING TIKA FROM SOURCE
     11G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
     12
    113--------------------------------------------------------------
    214A. Some background information on Apache Tika and related:
     
    274286
    275287--------------------------------------------------------------
     288G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
     289--------------------------------------------------------------
     290
     291If you have Tesseract installed correctly, with its bin folder on PATH and the TESSDATA_PREFIX
     292environment variable set, the current version of Tika (tika-app-1.24.x.jar) will
     293turn on Tesseract OCR automatically for images.
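The prerequisites above can be checked from a shell before running Tika. The following is only a sketch: the function name check_tesseract_env is made up for illustration and is not part of Tika, Tesseract or Greenstone.

```shell
# Hypothetical helper (not part of Tika/Tesseract/Greenstone):
# report whether the Tesseract prerequisites described above are met.
check_tesseract_env() {
    rc=0
    # The tesseract binary must be findable on PATH
    if ! command -v tesseract >/dev/null 2>&1; then
        echo "tesseract not found on PATH"
        rc=1
    fi
    # TESSDATA_PREFIX must be set and point to an existing directory
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set"
        rc=1
    elif [ ! -d "$TESSDATA_PREFIX" ]; then
        echo "TESSDATA_PREFIX does not point to a directory: $TESSDATA_PREFIX"
        rc=1
    fi
    return $rc
}
```

Possible usage: check_tesseract_env && echo "Tesseract environment looks OK"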
     294
     295But Tika is not configured out of the box to work with Tesseract to OCR PDFs (Tesseract
     296on its own does not OCR PDFs, only images).
     297
     298To get Tika to work with Tesseract to OCR PDFs:
     2991. You must pass a tika-config.xml file to Tika, in which the TesseractOCRParser and PDFParser are
     300configured correctly. Run as:
     301       java -jar tika-app-*.jar --config=<tika-config.xml>
     302       
     3032. The "outputType" param of the TesseractOCRParser in this config file must have one of
     304these 2 values:
     305      a. "txt" - which requests Tesseract to output OCR as text
     306      b. "hocr" - which asks Tesseract to output OCR as html (hence format called hocr)
     307
     308For the "hocr" value to work, the $TESSDATA_PREFIX/configs/hocr file must exist on the
     309Tesseract end and contain the values below; otherwise the PDF pages will not be OCR-ed.
     310The values are given at
     311https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr:
     312    tessedit_create_hocr 1
     313    hocr_font_info 0
     314
     315The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.
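If an existing Tesseract installation lacks this file, it can be created by hand. A minimal sketch, assuming TESSDATA_PREFIX is already set as described above; the function name ensure_hocr_config is hypothetical:

```shell
# Hypothetical helper: create $TESSDATA_PREFIX/configs/hocr if it is missing,
# with the two values listed above. An existing file is left untouched.
ensure_hocr_config() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    hocr_file="$TESSDATA_PREFIX/configs/hocr"
    if [ -f "$hocr_file" ]; then
        echo "already present: $hocr_file"
        return 0
    fi
    mkdir -p "$TESSDATA_PREFIX/configs" || return 1
    printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' > "$hocr_file"
    echo "created: $hocr_file"
}
```

The helper does not overwrite a configs/hocr file that already exists, so it is safe to run against the newer tarball as well.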
     316
     317
     318I'm committing an appropriate tika-config.xml file (based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/) for GSTika, containing:
     319
     320*************************************************************
     321<?xml version="1.0" encoding="UTF-8" standalone="no"?>
     322<!--
     323    (XML comments are only allowed after the XML declaration.)
     324
     325    https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
     326    which links to their sample tika-config.xml (copied below) which configures the PDF and OCR Parsers to behave just as the old PDFParser.props and OCR Parser properties files did.
     327   
     328    - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
     329    - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser
     330   
     331    https://tika.apache.org/1.16/configuring.html
     332    https://issues.apache.org/jira/browse/TIKA-2624
     333-->
     334<properties>
     335  <parsers>
     336    <parser class="org.apache.tika.parser.DefaultParser">
     337      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
     338      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
     339    </parser>
     340    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
     341      <params>
     342    <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
     343         on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
     344        <!--
     345        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
     346            <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
     347    -->
     348
     349    <!-- IMPORTANT!! -->
     350        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
     351    <!-- hocr is preferred as Tesseract produces nicely formatted html that better reflects
     352         the placement of the original text in the scanned page. (Compare by running with hocr vs txt.)
     353         
     354         However, initially, the above value had to be fixed as "txt", as outputType value = hocr prevented
     355         Tika+Tesseract from OCR-ing pdfs (no OCR output), until
     356         $GEXT_INSTALLED/tessdata/configs/hocr (tesseract config file) was created containing the specific
     357         property values listed in point 2b below.
     358
     359         To get Tika to work with Tesseract to OCR pages of a scanned PDF:
     360         1. always pass in this file as __config=/path/to/tika-config.xml to tika-app-*.jar cmd,
     361         2. AND do one of the following:
     362            a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
     363            b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set the outputType above
     364        to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
     365        (taken from
     366        https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
     367               tessedit_create_hocr 1
     368               hocr_font_info 0
     369
     370        More information about tesseract config options by running:
     371           tesseract __print-parameters
     372    -->
     373        <param name="language" type="string">eng</param>
     374        <param name="pageSegMode" type="string">1</param>
     375      </params>
     376    </parser>
     377    <parser class="org.apache.tika.parser.pdf.PDFParser">
     378      <params>
     379        <param name="ocrStrategy" type="string">ocr_and_text</param>
     380      </params>
     381    </parser>
     382
     383  </parsers>
     384</properties>
     385*************************************************************
     386
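With the config file above in place, running Tika on a scanned PDF could be wrapped roughly as follows. This is a sketch only: the function name tika_ocr_pdf and the TIKA_JAR and TIKA_CONFIG variables are made up for illustration (Tika itself does not read them), the default jar name is a guess at a 1.24.x file name, and --html is just one of Tika's output options.

```shell
# Hypothetical wrapper around the tika-app command line described in this section.
# TIKA_JAR and TIKA_CONFIG are assumed variable names, not something Tika reads.
tika_ocr_pdf() {
    pdf="$1"
    jar="${TIKA_JAR:-tika-app-1.24.jar}"
    config="${TIKA_CONFIG:-tika-config.xml}"
    # Without the config file, PDFs are not OCR-ed, so refuse to continue
    if [ ! -f "$config" ]; then
        echo "config file not found: $config" >&2
        return 1
    fi
    java -jar "$jar" --config="$config" --html "$pdf"
}
```

Possible usage: TIKA_CONFIG=/path/to/tika-config.xml tika_ocr_pdf scanned.pdf > scanned.html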
     387
     388--------------------------------------------------------------