<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
  (XML comments are only allowed after the XML declaration.)

  See https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
  which links to their sample tika-config.xml (copied below). It configures the PDF and OCR Parsers
  to behave just as the old PDFParser.props and OCR Parser properties files did.

  - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
  - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser

  https://tika.apache.org/1.16/configuring.html
  https://issues.apache.org/jira/browse/TIKA-2624
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
        <!--
        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
        -->

        <!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
        <!-- hocr is preferred, as Tesseract then produces nicely formatted HTML that better reflects
             the placement of the original text in the scanned page. (Compare the output of running with hocr vs txt.)

             Initially, however, the above value had to be fixed at "txt", because outputType = hocr prevented
             Tika+Tesseract from OCR-ing PDFs (no OCR output). This remained the case until
             $GEXT_INSTALLED/tessdata/configs/hocr (a Tesseract config file) was created containing the specific
             property values in point 2b below.

             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
             1. Always pass in this file as __config=/path/to/tika-config.xml to the tika-app-*.jar command
                (__ stands for two hyphens, which cannot be written inside an XML comment),
             2. AND do one of the following:
                a. Set the above outputType param to "txt" so that Tesseract produces the OCR in .txt format, and things should work,
                b. OR, if you want the OCR generated by Tesseract to be in hocr (HTML OCR) format, set the outputType above
                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
                   (taken from
                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
                     tessedit_create_hocr 1
                     hocr_font_info 0

             For more information about Tesseract config options, run:
                tesseract __print-parameters
        -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>

  </parsers>
</properties>