tika-config.xml@ 34187

Last change on this file since 34187 was 34187, checked in by ak19, 4 years ago

Committing the tika-config.xml that sets up Tika's PDFParser and TesseractOCRParser to OCR PDFs. Without this, despite Tika detecting Tesseract, PDFs weren't getting OCR-ed. This problem wasn't documented anywhere either, and only by chance did I find what was needed: that a correctly configured tika-config.xml was compulsory to get PDFs OCR-ed by Tika+Tesseract, and that the Tesseract installation I created had been missing TESSDATA_PREFIX/configs/hocr.

File size: 3.3 KB

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
  (XML comments are only allowed after the XML declaration.)

  https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
  which links to their sample tika-config.xml (copied below), which configures the PDF and OCR parsers to behave just as the old PDFParser.props and OCR parser properties files did.

  - new way, one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
  - old way, 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser

  https://tika.apache.org/1.16/configuring.html
  https://issues.apache.org/jira/browse/TIKA-2624
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract. -->
        <!--
        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
        -->

        <!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
        <!-- hocr is preferred, as Tesseract then produces nicely formatted HTML that better reflects
             the placement of the original text in the scanned page. (Compare by running with hocr vs txt.)

             Initially, however, the above value had to be fixed as "txt", because outputType = hocr prevented
             Tika+Tesseract from OCR-ing PDFs (no OCR output) until $GEXT_INSTALLED/tessdata/configs/hocr
             (a Tesseract config file) was created containing the property values listed in point 2b below.

             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
             1. Always pass in this file as __config=/path/to/tika-config.xml to the tika-app-*.jar command
                (the real flag starts with two hyphens; underscores are used here because XML comments
                cannot contain a double hyphen),
             2. AND do one of the following:
                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
                b. OR, if you want the OCR generated by Tesseract to be in hocr (HTML OCR) format, set the outputType above
                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
                   (taken from
                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
                     tessedit_create_hocr 1
                     hocr_font_info 0

             For more information about Tesseract config options, run (again with two leading hyphens on the flag):
               tesseract __print-parameters
        -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>

  </parsers>
</properties>
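
The two conditions described in the comment above (a valid outputType and, for hocr, a Tesseract config file named hocr under the tessdata directory) can be checked before running Tika. The following is only a minimal sketch, not part of the committed file: the function names `find_param` and `check_ocr_setup` are hypothetical, and it assumes the caller passes the directory that TESSDATA_PREFIX points to.

```python
import os
import xml.etree.ElementTree as ET

# Class name as it appears in tika-config.xml.
TESSERACT = "org.apache.tika.parser.ocr.TesseractOCRParser"


def find_param(root, parser_class, name):
    """Return the text of <param name=...> under the given <parser>, or None."""
    for parser in root.iter("parser"):
        if parser.get("class") == parser_class:
            for param in parser.iter("param"):
                if param.get("name") == name:
                    return (param.text or "").strip()
    return None


def check_ocr_setup(config_path, tessdata_prefix):
    """Return a list of human-readable problems; an empty list means none found.

    Hypothetical helper: it only checks what the comment in tika-config.xml
    describes, not whether Tesseract itself is installed or on the PATH.
    """
    problems = []
    root = ET.parse(config_path).getroot()

    output_type = find_param(root, TESSERACT, "outputType")
    if output_type not in ("hocr", "txt"):
        problems.append(f"outputType should be hocr or txt, found {output_type!r}")

    # Point 2b: hocr output needs a Tesseract config file called "hocr".
    if output_type == "hocr":
        hocr_config = os.path.join(tessdata_prefix, "configs", "hocr")
        if not os.path.isfile(hocr_config):
            problems.append(f"missing Tesseract config file: {hocr_config}")

    return problems
```

For example, `check_ocr_setup("/path/to/tika-config.xml", os.environ.get("TESSDATA_PREFIX", ""))` returns an empty list when nothing is missing. XML comments are skipped by the parser, so the commented-out tesseractPath and tessdataPath params are ignored.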

source: gs2-extensions/gstika/trunk/java/tika-config.xml@ 34187