source: gs2-extensions/gstika/trunk/java/ocr-pdfs-config.xml@ 34195

Last change on this file since 34195 was 34195, checked in by ak19, 4 years ago

Renaming config files so that one is configured for OCR-ing PDFs and the other for turning off OCR when Tesseract is installed (otherwise Tika autodetects whether OCR applies when Tesseract is installed; there may be some minor savings in overhead with a no-ocr-config.xml). With no config flag passed to Tika, it will by default perform OCR only where it applies and only if Tesseract is installed.

File size: 3.7 KB
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
  (XML comments are only allowed after the XML declaration.)

  https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
  links to a sample tika-config.xml (copied below) that configures the PDF and OCR parsers to behave just as the old PDFParser.props and OCR Parser properties files did.

  - new way, with one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
  - old way, with 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser

  Further useful information on configuring Tika for OCR (or no OCR):
  - https://tika.apache.org/1.16/configuring.html
  - https://issues.apache.org/jira/browse/TIKA-2624
  - https://stackoverflow.com/questions/51655510/how-do-you-enable-the-tesseractocrparser-using-tikaconfig-and-the-tika-command-l#51668962 (out of date?)
  - https://stackoverflow.com/questions/56232720/is-there-a-way-to-disable-ocr-mode-in-tika-without-uninstalling-tesseract
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract. -->
        <!--
        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
        -->

        <!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
        <!-- hocr is preferred, as Tesseract then produces nicely formatted html that better reflects
             the placement of the original text in the scanned page. (Compare a run with hocr against one with txt.)

             Initially, however, the above value had to be fixed at "txt", because outputType = hocr prevented
             Tika+Tesseract from OCR-ing PDFs (no OCR output). That remained the case until
             $GEXT_INSTALLED/tessdata/configs/hocr (a Tesseract config file) was created containing the
             property values listed in point 2b below.

             To get Tika to work with Tesseract to OCR the pages of a scanned PDF:
             1. always pass in this file as &#45;&#45;config=/path/to/tika-config.xml to the tika-app-*.jar cmd,
             2. AND do one of the following:
                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
                b. OR, if you want the OCR generated by Tesseract to be in hocr (html ocr) format, set the outputType above
                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
                   (taken from
                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
                     tessedit_create_hocr 1
                     hocr_font_info 0

             For more information about Tesseract config options, run:
               tesseract &#45;&#45;print-parameters
        -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
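<!--
  Illustrative sketch only, and an assumption rather than the actual contents of the
  companion no-ocr-config.xml: with Tika 1.x, a minimal config that turns OCR off while
  Tesseract stays installed could simply exclude the OCR parser from the DefaultParser
  and configure nothing else, along these lines:

  <properties>
    <parsers>
      <parser class="org.apache.tika.parser.DefaultParser">
        <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      </parser>
    </parsers>
  </properties>

  Such a file would be passed to the tika-app-*.jar cmd through the same config flag as this one.
-->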