GS_TIKA_README.txt

Timestamp:

2020-06-16T15:22:04+12:00 (4 years ago)

Author:

ak19

Message:

Committing the tika-config.xml that sets up Tika's PDFParser and TesseractOCRParser to OCR PDFs. Without this, despite Tika detecting Tesseract, PDFs weren't getting OCR-ed. This problem wasn't documented anywhere either, and only by chance did I find what was needed: that a correctly configured tika-config.xml was compulsory to get PDFs OCR-ed by Tika+Tesseract, and that the Tesseract installation I created had been missing TESSDATA_PREFIX/configs/hocr.

File:

1 edited

gs2-extensions/gstika/trunk/GS_TIKA_README.txt (modified) (2 diffs)


gs2-extensions/gstika/trunk/GS_TIKA_README.txt

-              r34177
+              r34187
+--------------------------------------------------------------
+CONTENTS:
+--------------------------------------------------------------
+A. Some background information on Apache Tika and related:
+B. Here are some examples of running Tika on the command line:
+C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
+D. THE --encoding= FLAG TO TIKA
+E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
+F. COMPILING TIKA FROM SOURCE
+G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
 --------------------------------------------------------------
 A. Some background information on Apache Tika and related:
 …
 --------------------------------------------------------------
+G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
+--------------------------------------------------------------
+If you have Tesseract installed correctly, with its bin folder on PATH and the TESSDATA_PREFIX
+environment variable set, the current version of Tika (tika-app-1.24.x.jar) will
+turn on Tesseract OCR automatically for images.
+But Tika is not configured out of the box to work with Tesseract to OCR PDFs (Tesseract
+on its own does not OCR PDFs, only images).
+To get Tika to work with Tesseract to OCR PDFs:
+. You must pass a tika-config.xml file to Tika, in which the TesseractOCRParser and PDFParser are
+configured correctly. Run as:
+       java -jar tika-app-*.jar --config=<tika-config.xml>
+. The "outputType" param of the TesseractOCRParser in this config file must have one of
+these 2 values:
+      a. "txt" - which requests Tesseract to output OCR as text
+      b. "hocr" - which asks Tesseract to output OCR as html (hence format called hocr)
+For the "hocr" value to work, the $TESSDATA_PREFIX/configs/hocr file must exist on the
+Tesseract side (otherwise the PDF pages will not be OCR-ed) and must contain
+these values (given at
+https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
+    tessedit_create_hocr 1
+    hocr_font_info 0
+The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.
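The prerequisites described above can be checked before running Tika. What follows is only a minimal shell sketch and is not part of the committed README: the function names ensure_hocr_config and check_tesseract_on_path are made up for illustration, and the sketch assumes, as the text above does, that the config lives at $TESSDATA_PREFIX/configs/hocr. It writes the two-line hocr config only when the file is missing.

```shell
#!/bin/sh
# Hypothetical helpers (not part of GSTika or Tesseract).

# Make sure <tessdata>/configs/hocr exists; create it with the two values
# listed above if it is missing.
# Usage: ensure_hocr_config "$TESSDATA_PREFIX"
ensure_hocr_config() {
    tessdata_dir="$1"
    if [ -z "$tessdata_dir" ]; then
        echo "TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    hocr_file="$tessdata_dir/configs/hocr"
    if [ -f "$hocr_file" ]; then
        echo "found $hocr_file"
        return 0
    fi
    mkdir -p "$tessdata_dir/configs" || return 1
    printf 'tessedit_create_hocr 1\nhocr_font_info 0\n' > "$hocr_file" || return 1
    echo "created $hocr_file"
}

# Report whether the tesseract binary can be found on PATH.
check_tesseract_on_path() {
    if command -v tesseract >/dev/null 2>&1; then
        echo "tesseract found on PATH"
    else
        echo "tesseract NOT found on PATH" >&2
        return 1
    fi
}
```

For example, after sourcing the Greenstone environment, one could run: check_tesseract_on_path && ensure_hocr_config "$TESSDATA_PREFIX"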
+I'm committing an appropriate tika-config.xml file (based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/) for GSTika, containing:
+*************************************************************
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<!--
+    (XML comments are only allowed after the XML declaration.)
+    https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
+    which links to their sample tika-config.xml (copied below) which configures the PDF and OCR Parsers to behave just as the old PDFParser.props and OCR Parser properties files did.
+    - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
+    - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser
+    https://tika.apache.org/1.16/configuring.html
+    https://issues.apache.org/jira/browse/TIKA-2624
+-->
+<properties>
+  <parsers>
+    <parser class="org.apache.tika.parser.DefaultParser">
+      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
+      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
+    </parser>
+    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
+      <params>
+        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
+             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
+        <!--
+        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
+        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
+        -->
+        <!-- IMPORTANT!! -->
+        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
+        <!-- hocr is preferred as Tesseract produces nicely formatted html that better reflects
+             the placement of the original text in the scanned page. (Compare running with hocr vs txt.)
+             However, initially the above value had to be fixed as "txt", because outputType = hocr
+             prevented Tika+Tesseract from OCR-ing PDFs (no OCR output). That remained the case until
+             $GEXT_INSTALLED/tessdata/configs/hocr (a tesseract config file) was created containing the
+             specific property values given in point 2b below.
+             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
+             1. always pass in this file as __config=/path/to/tika-config.xml to the tika-app-*.jar cmd
+                (on the command line, type two hyphens in place of __; a double hyphen cannot appear
+                inside an XML comment),
+             2. AND do one of the following:
+                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format,
+                   and things should work,
+                b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set
+                   the outputType above to "hocr" AND ensure a config file also called hocr exists in
+                   $TESSDATA_PREFIX/configs containing the following (taken from
+                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
+                       tessedit_create_hocr 1
+                       hocr_font_info 0
+             More information about tesseract config options can be found by running:
+                tesseract __print-parameters
+        -->
+        <param name="language" type="string">eng</param>
+        <param name="pageSegMode" type="string">1</param>
+      </params>
+    </parser>
+    <parser class="org.apache.tika.parser.pdf.PDFParser">
+      <params>
+        <param name="ocrStrategy" type="string">ocr_and_text</param>
+      </params>
+    </parser>
+  </parsers>
+</properties>
+*************************************************************
+--------------------------------------------------------------
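Putting section G together, a run might look like the sketch below. It is an illustration only, not the command GSTika itself uses: the function name tika_ocr_pdf, the jar name tika-app-1.24.jar and the file names are placeholders. The --config flag is the one described above, and --html / --text are Tika's standard output flags.

```shell
#!/bin/sh
# Sketch: OCR a scanned PDF through Tika using the tika-config.xml above.
# JAVA and TIKA_JAR are placeholders that can be overridden from the environment.
JAVA="${JAVA:-java}"
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.jar}"

# Usage: tika_ocr_pdf <tika-config.xml> <input.pdf> [html|text]
tika_ocr_pdf() {
    config="$1"
    pdf="$2"
    format="${3:-html}"
    if [ -z "$config" ] || [ -z "$pdf" ]; then
        echo "usage: tika_ocr_pdf <tika-config.xml> <input.pdf> [html|text]" >&2
        return 2
    fi
    case "$format" in
        html) out_flag="--html" ;;
        text) out_flag="--text" ;;
        *) echo "unknown format: $format" >&2; return 2 ;;
    esac
    # Without --config, Tika detects Tesseract but does not OCR the PDF pages.
    "$JAVA" -jar "$TIKA_JAR" --config="$config" "$out_flag" "$pdf"
}
```

A call such as tika_ocr_pdf /path/to/tika-config.xml scanned.pdf html > scanned.html would then write the OCR result as HTML, provided outputType is set to hocr or txt as explained above.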


Changeset 34187 for gs2-extensions/gstika/trunk/GS_TIKA_README.txt
