source: gs2-extensions/gstika/trunk/GS_TIKA_README.txt@ 34187

Last change on this file since 34187 was 34187, checked in by ak19, 4 years ago

Committing the tika-config.xml that sets up Tika's PDFParser and TesseractOCRParser to OCR PDFs. Without this, despite Tika detecting Tesseract, PDFs weren't getting OCR-ed. This problem wasn't documented anywhere either, and only by chance did I find what was needed: that a correctly configured tika-config.xml was compulsory to get PDFs OCR-ed by Tika+Tesseract, and that the Tesseract installation I created had been missing TESSDATA_PREFIX/configs/hocr.

--------------------------------------------------------------
CONTENTS:
--------------------------------------------------------------

A. Some background information on Apache Tika and related:
B. Here are some examples of running Tika on the command line:
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
D. THE --encoding= FLAG TO TIKA
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
F. COMPILING TIKA FROM SOURCE
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for available cmd line options

* https://tika.apache.org/download.html
is where the tika-app-1.24.1.jar was downloaded from
(We don't need any of the other jars, as explained under heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
  https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
  https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
https://tika.apache.org/1.8/examples.html
https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

  a. PLAIN TEXT WITH META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

  b. JUST META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc) into the specified dir (pass -z or --extract, then specify a dir with --extract-dir)
GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx


--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>


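On the console, the warnings in the transcripts above are interleaved with the extracted text. Below is a minimal Java sketch (not part of GSTika) of launching Tika so that the two streams land in separate files. It assumes the warnings are written to STDERR, which is where java.util.logging sends console output by default. The class name TikaRunSketch and the output file names are hypothetical. Note that ProcessBuilder does no shell globbing, so the real jar file name has to be passed in rather than tika-app-*.jar.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class TikaRunSketch {

    /** Builds the argument list: java -jar <tikaJar> <flags...> <inputFile> */
    static List<String> buildCommand(String tikaJar, String inputFile, String... flags) {
        List<String> cmd = new ArrayList<>(Arrays.asList("java", "-jar", tikaJar));
        cmd.addAll(Arrays.asList(flags));
        cmd.add(inputFile);
        return cmd;
    }

    /** Runs the command, sending STDOUT and STDERR to separate files. Returns the exit code. */
    static int run(List<String> cmd, File stdoutFile, File stderrFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(cmd);
        pb.redirectOutput(stdoutFile); // extracted text or (x)html
        pb.redirectError(stderrFile);  // warnings, assuming they go to STDERR
        return pb.start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: TikaRunSketch <path/to/tika-app.jar> <inputfile>");
            return;
        }
        List<String> cmd = buildCommand(args[0], args[1], "--text-main");
        // Hypothetical output file names, chosen for illustration only.
        int exit = run(cmd, new File("output.txt"), new File("tika-warnings.log"));
        System.out.println("Tika exited with code " + exit);
    }
}
```

Keeping the warnings out of the output file matters most for --html and --text-main runs whose output is fed to further processing.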
--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
> java -jar tika-app-*.jar --help
    ...
    -eX or --encoding=X    Use output encoding X
    ...

Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika warns and falls back to UTF-8, see (5).
The flag's value seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my tests only converted docs that contain ASCII, the one place where it's obvious that
the encoding flag has been taken into account at all is xhtml output, which is the default
(or pass in -x or --xml to get xhtml out).


COMPARE, noting also the case of the encoding in the Tika command, vs in the output:

(1) >java -jar tika-app-*.jar --encoding=utf-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    ...

(2) >java -jar tika-app-*.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(3) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(4) >java -jar tika-app-*.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(5) >java -jar tika-app-*.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(6) (Output to html)
    > java -jar tika-app-*.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...
The warning to STDERR is all that indicates that the encoding flag is taken into account
when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
encoding in the file.

(7) (Output to html case 2)
    > java -jar tika-app-*.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    <meta name="Total-Time" content="5"/>
    ...
No warnings, but also no mention of the encoding in the html output.


The warning messages in (6) indicate that the output encoding is also taken into account when
the output format is set to html, by passing in the flag --html to tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it therefore seems meaningful to set --encoding=UTF-8.

We also pass in --pretty-print to get supposedly better formatted output.


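The case-insensitivity and the "not supported by the Java runtime" warning observed above are consistent with how the JDK itself looks up charset names. The following is a small illustration using only java.nio.charset. It is not taken from Tika's source, and the class and method names (EncodingCheckSketch, resolveOrUtf8) are made up for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

final class EncodingCheckSketch {

    /**
     * Looks up a charset by name (the JDK treats names case-insensitively)
     * and falls back to UTF-8 when the runtime does not know the name.
     */
    static Charset resolveOrUtf8(String name) {
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return StandardCharsets.UTF_8;
        }
    }

    public static void main(String[] args) {
        String[] names = {"utf-8", "UTF-8", "iso-8859-1", "ISO-8859-1", "nonexistent"};
        for (String name : names) {
            // Prints the requested name next to the canonical name the JDK resolved it to.
            System.out.println(name + " -> " + resolveOrUtf8(name).name());
        }
    }
}
```

The JDK reports the canonical (upper-case) name for whichever spelling was requested, so a check like this can be run before handing a value to --encoding.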
--------------------------------------------------------------
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
--------------------------------------------------------------

The default Tika cli app accepts --html and --xml (for xhtml) flags to output html and xhtml respectively.
To extract images, the Tika cli app needs to be run separately with the --extract flag and an optional --extract-dir=<dir>.
However, running --html and then --extract sequentially does not produce an html file that refers to the extracted
images: the extracted images are renamed to rId<digit>_<imagefilename>.<ext>, while the generated html file
uses "embedded:<imagefilename>.<ext>" as the value of the src attributes of its image elements.

So the problem is two-fold:
- Need to not prefix anything to the file names of the extracted images
- Need to remove the "embedded:" prefix from the img src attributes in the html produced. Ideally the string
"embedded:" would never be prefixed at all, but that would require editing many source files in the Tika project rather than just one.

The solution turned out not to require compiling up apache-tika from source at all, but having a source checkout
to locate and modify code was handy.


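The two-fold mismatch described above can be illustrated with plain string handling. This is only a sketch of the naming problem, not the fix that went into GSTika. It assumes the prefix added to extracted files is "rId", one or more digits, then an underscore. The class name ImageNameSketch and the sample file names are hypothetical.

```java
import java.util.regex.Pattern;

final class ImageNameSketch {

    // Assumption: extracted files are prefixed with "rId" + digits + "_".
    private static final Pattern RID_PREFIX = Pattern.compile("^rId\\d+_");
    private static final String EMBEDDED = "embedded:";

    /** "rId7_image1.png" becomes "image1.png"; names without the prefix are returned unchanged. */
    static String stripRidPrefix(String extractedFileName) {
        return RID_PREFIX.matcher(extractedFileName).replaceFirst("");
    }

    /** "embedded:image1.png" becomes "image1.png"; other values are returned unchanged. */
    static String stripEmbeddedPrefix(String srcValue) {
        return srcValue.startsWith(EMBEDDED) ? srcValue.substring(EMBEDDED.length()) : srcValue;
    }

    /** True when an img src value and an extracted file name denote the same image. */
    static boolean refersTo(String srcValue, String extractedFileName) {
        return stripEmbeddedPrefix(srcValue).equals(stripRidPrefix(extractedFileName));
    }

    public static void main(String[] args) {
        String src = "embedded:image1.png"; // hypothetical src value in the html
        String file = "rId7_image1.png";    // hypothetical name of the extracted file
        System.out.println(src.equals(file));    // false: the html does not point at the file
        System.out.println(refersTo(src, file)); // true once both prefixes are removed
    }
}
```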
SOLUTION TO OUTPUT (X)HTML WITH IMAGES EXTRACTED IN THE SAME LOCATION:
1. I wrote org.greenstone.tika.GSTikaCLI.java, which is based on Tika's TikaCLI.java
with some minor modifications documented below.

2. It stands alone and can be compiled and run against the tika-app-*.jar file on the classpath:
To compile:
    GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
To run:
    GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html

(Can pass existing flags, e.g. --html for html without images extracted)

To compile code that lives in a directory called "src" into a directory called "build":

    GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java

To run the compiled class that's now in folder "build":
    GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html


3. GSTikaCLI.java is based on TikaCLI.java, with the modifications marked by comments mentioning "GSDL".

a. The major change is that the inner class method FileEmbeddedDocumentExtractor.getOutputFile() no longer
adds the unwanted "rId_" prefix to the filenames of the extracted images.

b. The return type of the static method getTransformerHandler() is no longer TransformerHandler, but its superclass ContentHandler.

When the new --html-with-imgs (or --xhtml-with-imgs) flag is passed into GSTikaCLI, the function getTransformerHandler() further processes the html/xml result it already generates, by removing the "embedded:" prefixes in img src attributes. This is done with code copied from tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java and modified (look for the code about a ContentHandlerDecorator in TikaGUI.java).

c. Other changes support the 2 new input flags --html-with-imgs and --xhtml-with-imgs, additionally call the image extraction functions, and ensure that an extraction directory flag is still supported in this mode. (When it is not provided, the images are extracted to the same level as the input file.)


4. Next added a makeGSTikaCLI.sh script for compiling and the GSTikaCLI.sh script for minor simplification of running.


cd gs2build/ext/gstika
./makeGSTikaCLI.sh
./GSTikaCLI.sh --html-with-images <inputfile> > <outputfile>
e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html


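The attribute rewriting described in step 3b can be sketched without Tika on the classpath. The version below swaps Tika's ContentHandlerDecorator for the JDK's org.xml.sax.helpers.XMLFilterImpl, which plays a similar role: it sits between the producer of SAX events and the downstream handler and edits img src attributes as they pass through. It is an approximation of the idea, not the code in GSTikaCLI.java, and the names EmbeddedPrefixFilter and imgSources are invented for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

final class EmbeddedPrefixFilter extends XMLFilterImpl {

    private static final String PREFIX = "embedded:";

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName)) {
            // Copy the attributes so they can be edited, then drop the prefix from src.
            AttributesImpl copy = new AttributesImpl(atts);
            for (int i = 0; i < copy.getLength(); i++) {
                if ("src".equals(copy.getQName(i)) && copy.getValue(i).startsWith(PREFIX)) {
                    copy.setValue(i, copy.getValue(i).substring(PREFIX.length()));
                }
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts);
    }

    /** Parses the xhtml through the filter and returns the img src values seen downstream. */
    static List<String> imgSources(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        EmbeddedPrefixFilter filter = new EmbeddedPrefixFilter();
        filter.setParent(reader);

        final List<String> seen = new ArrayList<>();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes a) {
                if ("img".equals(localName)) {
                    seen.add(a.getValue("src"));
                }
            }
        });
        filter.parse(new InputSource(new StringReader(xhtml)));
        return seen;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical input resembling the xhtml described in this section.
        String xhtml = "<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<img src=\"embedded:image1.png\"/><img src=\"other.png\"/>"
                + "</body></html>";
        System.out.println(imgSources(xhtml));
    }
}
```

Rewriting at the SAX level means the html never has to be re-parsed after Tika has produced it.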
--------------------------------------------------------------
F. COMPILING TIKA FROM SOURCE
--------------------------------------------------------------

Refer to https://github.com/apache/tika

(a) Need Maven 3 to compile up Tika.
    export MAVEN_HOME=/Path/To/apache-maven3
    export PATH=$MAVEN_HOME/bin:$PATH

(b) Need to configure Maven to grab artifacts using https, since some are only available over https.
Refer to https://stackoverflow.com/questions/25393298/what-is-the-correct-way-of-forcing-maven-to-use-https-for-maven-central
which instructs adding the following to the <profiles> section of your $MAVEN_HOME/conf/settings.xml:

    <profile>
        <id>maven-https</id>
        <activation>
            <activeByDefault>true</activeByDefault>
        </activation>
        <repositories>
            <repository>
                <id>central</id>
                <url>https://repo1.maven.org/maven2</url>
                <snapshots>
                    <enabled>false</enabled>
                </snapshots>
            </repository>
        </repositories>
        <pluginRepositories>
            <pluginRepository>
                <id>central</id>
                <url>https://repo1.maven.org/maven2</url>
                <snapshots>
                    <enabled>false</enabled>
                </snapshots>
            </pluginRepository>
        </pluginRepositories>
    </profile>

(c) Grab tika from git and attempt to compile it with maven
    > git clone https://github.com/apache/tika.git
    > cd tika
    > mvn clean install
Takes 42-45 mins to compile up!


This compiles up a version 2.0.0 tika-app jar file, whereas the precompiled downloadable jar is version 1.24.1.

Compiling Tika wasn't necessary to compile or run GSTikaCLI.java!
However, having the source code to base GSTikaCLI.java off of TikaCLI.java
was useful.

--------------------------------------------------------------
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
--------------------------------------------------------------

If you have Tesseract installed correctly, its bin folder on PATH and the TESSDATA_PREFIX
environment variable set, the current version of Tika (tika-app-1.24.x.jar) will
turn on Tesseract OCR automatically for images.

But Tika is not configured out of the box to work with Tesseract to OCR PDFs (Tesseract
on its own does not OCR PDFs, only images).

To get Tika to work with Tesseract to OCR PDFs:
1. Must pass a tika-config.xml file to Tika, in which the TesseractOCRParser and PDFParser are
configured correctly. Run as:
    java -jar tika-app-*.jar --config=<tika-config.xml>

2. The "outputType" param of the TesseractOCRParser in this config file must have one of
these 2 values:
    a. "txt" - which requests Tesseract to output OCR as text
    b. "hocr" - which asks Tesseract to output OCR as html (hence format called hocr)

For the "hocr" value to have any effect, the $TESSDATA_PREFIX/configs/hocr file must exist
on the Tesseract end and contain the values below, otherwise the PDF pages will not be OCR-ed
(values given at
https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
    tessedit_create_hocr 1
    hocr_font_info 0

The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.


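Since a missing $TESSDATA_PREFIX/configs/hocr file fails silently (no OCR output for PDFs), it can be worth checking for it before running Tika. Here is a minimal Java sketch of such a check. The class name HocrConfigCheck is made up, and the check only looks for the tessedit_create_hocr line quoted above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class HocrConfigCheck {

    /** True if <tessdataPrefix>/configs/hocr exists and contains "tessedit_create_hocr 1". */
    static boolean hocrConfigLooksValid(Path tessdataPrefix) throws IOException {
        Path hocr = tessdataPrefix.resolve("configs").resolve("hocr");
        if (!Files.isRegularFile(hocr)) {
            return false;
        }
        for (String line : Files.readAllLines(hocr)) {
            // Each config line is "<name> <value>", separated by whitespace.
            String[] parts = line.trim().split("\\s+");
            if (parts.length == 2
                    && parts[0].equals("tessedit_create_hocr")
                    && parts[1].equals("1")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        String prefix = System.getenv("TESSDATA_PREFIX");
        if (prefix == null) {
            System.err.println("TESSDATA_PREFIX is not set");
            return;
        }
        System.out.println("configs/hocr usable: " + hocrConfigLooksValid(Paths.get(prefix)));
    }
}
```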
I'm committing an appropriate tika-config.xml file (based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/) for GSTika, containing:

*************************************************************
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
    (XML comments are only allowed after the xml declaration.)

    https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
    which links to their sample tika-config.xml (copied below) which configures the PDF and OCR Parsers to behave just as the old PDFParser.props and OCR Parser properties files did.

    - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
    - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser

    https://tika.apache.org/1.16/configuring.html
    https://issues.apache.org/jira/browse/TIKA-2624
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
        <!--
        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
        -->

        <!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
        <!-- hocr is preferred as Tesseract produces nicely formatted html that better reflects
             the placement of the original text in the scanned page. (Can compare running with hocr vs txt)

             However, initially, the above value had to be fixed as "txt", as outputType value = hocr prevented
             Tika+Tesseract from OCR-ing pdfs (no OCR output),
             until $GEXT_INSTALLED/tessdata/configs/hocr (tesseract config file) was created containing the specific
             property values in point 2b below.

             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
             1. always pass in this file as __config=/path/to/tika-config.xml to the tika-app-*.jar cmd
                (written here with a double underscore, since XML comments cannot contain a double hyphen;
                on the command line use two hyphens),
             2. AND do one of the following:
                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
                b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set the outputType above
                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
                   (taken from
                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
                       tessedit_create_hocr 1
                       hocr_font_info 0

             More information about tesseract config options by running (again with two leading hyphens on the command line):
                 tesseract __print-parameters
        -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>

  </parsers>
</properties>
*************************************************************


--------------------------------------------------------------
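Because a wrong outputType or a missing ocrStrategy again fails silently, it may help to print what a given tika-config.xml actually sets before blaming Tesseract. The sketch below reads the <param> values for one parser class using only the JDK's DOM parser. It does not use or emulate Tika's own config loading, and the names TikaConfigPeek and paramsFor are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class TikaConfigPeek {

    /** Returns name -> value for every <param> under the <parser> with the given class attribute. */
    static Map<String, String> paramsFor(String configXml, String parserClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(configXml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList ps = parser.getElementsByTagName("param");
            for (int j = 0; j < ps.getLength(); j++) {
                Element p = (Element) ps.item(j);
                params.put(p.getAttribute("name"), p.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 1) {
            System.err.println("usage: TikaConfigPeek <path/to/tika-config.xml>");
            return;
        }
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        System.out.println(paramsFor(xml, "org.apache.tika.parser.ocr.TesseractOCRParser"));
        System.out.println(paramsFor(xml, "org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

Params that sit inside XML comments, such as tesseractPath and tessdataPath in the file above, are not reported, which matches the fact that they are inactive.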