source: main/trunk/greenstone2/ext/tika/tika-config.xml@ 34188

Last change on this file since 34188 was 34188, checked in by ak19, 4 years ago

Tika config file to get Tika+Tesseract to OCR PDFs. This file must be passed into tika-app with the config= flag whenever GS3 wants Tika to OCR PDFs.
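
A minimal sketch of such an invocation, written in Python for illustration only. The jar name and the file paths are placeholders, and `build_tika_command` and `run_tika` are hypothetical helpers, not part of Greenstone. The only Tika-specific detail relied on is the `--config=` option of tika-app.

```python
import subprocess
from typing import List


def build_tika_command(tika_jar: str, config_path: str, pdf_path: str) -> List[str]:
    """Build a tika-app command line that loads a custom tika-config.xml."""
    # Assumption: `java` is on the PATH and tika_jar points at a tika-app jar.
    return ["java", "-jar", tika_jar, f"--config={config_path}", pdf_path]


def run_tika(tika_jar: str, config_path: str, pdf_path: str) -> str:
    """Run tika-app and return whatever it writes to standard output."""
    result = subprocess.run(
        build_tika_command(tika_jar, config_path, pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if tika-app exits with an error
    )
    return result.stdout


if __name__ == "__main__":
    # Placeholder paths; only the command is printed here, nothing is executed.
    print(" ".join(build_tika_command("tika-app.jar", "tika-config.xml", "scanned.pdf")))
```

For OCR to work, the Tesseract binary must also be on the PATH and TESSDATA_PREFIX must be set, which the comments in the file say happens when the Greenstone environment is sourced.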

File size: 3.3 KB
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
  (XML comments are only allowed after the XML declaration.)

  https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
  That post links to their sample tika-config.xml (copied below), which configures the PDF and OCR parsers to behave just as the old PDFParser.props and OCR parser properties files did.

  - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
  - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser

  https://tika.apache.org/1.16/configuring.html
  https://issues.apache.org/jira/browse/TIKA-2624
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract. -->
        <!--
        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
        -->

        <!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
        <!-- hocr is preferred, as Tesseract then produces nicely formatted HTML that better reflects
             the placement of the original text on the scanned page. (Compare a run with hocr against one with txt.)

             Initially, however, the value above had to be fixed at "txt", because outputType = hocr prevented
             Tika+Tesseract from OCR-ing PDFs (no OCR output). That changed once
             $GEXT_INSTALLED/tessdata/configs/hocr (a Tesseract config file) was created containing the
             property values listed in point 2b below.

             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
             1. always pass in this file as __config=/path/to/tika-config.xml to the tika-app-*.jar cmd
                (on the real command line the two leading underscores are two hyphens, which cannot be
                written inside an XML comment),
             2. AND do one of the following:
                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
                b. OR, if you want the OCR generated by Tesseract to be in hocr (html ocr) format, set the outputType above
                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
                   (taken from
                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
                     tessedit_create_hocr 1
                     hocr_font_info 0

             For more information about Tesseract config options, run
             (again with two hyphens in place of the two underscores):
               tesseract __print-parameters
        -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>

  </parsers>
</properties>
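
Point 2b in the comment above requires a Tesseract config file named hocr under $TESSDATA_PREFIX/configs. The following is a rough sketch of how that file could be created, assuming TESSDATA_PREFIX is set as described in the comments. `ensure_hocr_config` is a hypothetical helper and not part of Greenstone, Tika or Tesseract; the two property lines are the ones quoted in the comment.

```python
import os
from pathlib import Path
from typing import Optional

# The two properties quoted in point 2b of the tika-config.xml comment.
HOCR_CONFIG = "tessedit_create_hocr 1\nhocr_font_info 0\n"


def ensure_hocr_config(tessdata_prefix: Optional[str] = None) -> Path:
    """Create <tessdata_prefix>/configs/hocr if it does not exist yet.

    Falls back to the TESSDATA_PREFIX environment variable when no
    directory is passed in. Returns the path of the config file.
    """
    prefix = tessdata_prefix or os.environ["TESSDATA_PREFIX"]
    configs_dir = Path(prefix) / "configs"
    configs_dir.mkdir(parents=True, exist_ok=True)
    target = configs_dir / "hocr"
    if not target.exists():  # leave an existing config file untouched
        target.write_text(HOCR_CONFIG, encoding="ascii")
    return target


if __name__ == "__main__":
    if "TESSDATA_PREFIX" in os.environ:
        print(ensure_hocr_config())
    else:
        print("TESSDATA_PREFIX is not set; source the Greenstone environment first.")
```

An existing hocr file is deliberately not overwritten, so a config shipped with the Tesseract installation stays as it is.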