
GS_TIKA_README.txt@ 34187

Last change on this file since 34187 was 34187, checked in by ak19, 4 years ago

Committing the tika-config.xml that sets up Tika's PDFParser and TesseractOCRParser to OCR PDFs. Without this, despite Tika detecting Tesseract, PDFs weren't getting OCR-ed. This problem wasn't documented anywhere either, and only by chance did I find what was needed: that a correctly configured tika-config.xml was compulsory to get PDFs OCR-ed by Tika+Tesseract, and that the Tesseract installation I created had been missing TESSDATA_PREFIX/configs/hocr

File size: 18.1 KB

1	--------------------------------------------------------------
2	CONTENTS:
3	--------------------------------------------------------------
4
5	A. Some background information on Apache Tika and related:
6	B. Here are some examples of running Tika on the command line:
7	C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
8	D. THE --encoding= FLAG TO TIKA
9	E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
10	F. COMPILING TIKA FROM SOURCE
11	G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
12
13	--------------------------------------------------------------
14	A. Some background information on Apache Tika and related:
15	--------------------------------------------------------------
16	* https://tika.apache.org/1.5/gettingstarted.html
17	Refer to the heading "Using Tika as a command line utility" for available cmd line options
18
19	* https://tika.apache.org/download.html
20	is where the tika-app-1.24.1.jar was downloaded from
21	(We don't need any of the other jars, as explained under heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)
22
23	* Apache 2.0 license
24	https://tika.apache.org/license.html
25
26	* Mime-types for docx and other office suite docs:
27	https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
28
29	* Tesseract for OCR with Tika:
30	https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
31	Use Tika 1.14 to extract text from image by Tesseract OCR
32
33	* API usage examples - if modifying Tika code:
34	https://tika.apache.org/1.8/examples.html
35	https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
36
37	--------------------------------------------------------------
38	B. Here are some examples of running Tika on the command line:
39	--------------------------------------------------------------
40	1. HTML:
41
42	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
43
44	2. XHTML - looks the same as HTML:
45
46	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
47
48	3. PLAIN TEXT CONTENT - NO META:
49
50	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
51
52	a. PLAIN TEXT WITH META:
53
54	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
55
56	b. JUST META:
57
58	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
59
60	4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
61
62	Extracts all attachments (images etc) into specified dir (-z or --extract and then specify a dir for it)
63	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
64
65
66	--------------------------------------------------------------
67	C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
68	--------------------------------------------------------------
69	* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
70
71	INFO As a convenience, TikaCLI has turned on extraction of
72	inline images for the PDFParser (TIKA-2374).
73	Aside from the -z option, this is not the default behavior
74	in Tika generally or in tika-server.
75	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
76	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
77	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
78	for optional dependencies.
79
80	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
81	WARNING: org.xerial's sqlite-jdbc is not loaded.
82	Please provide the jar on your classpath to parse sqlite files.
83	See tika-parsers/pom.xml for the correct version.
84
85
86	* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx
87
88	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
89	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
90	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
91	for optional dependencies.
92
93	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
94	WARNING: org.xerial's sqlite-jdbc is not loaded.
95	Please provide the jar on your classpath to parse sqlite files.
96	See tika-parsers/pom.xml for the correct version.
97	<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
98
99
100	--------------------------------------------------------------
101	D. THE --encoding= FLAG TO TIKA
102	--------------------------------------------------------------
103	> java -jar tika-app-*.jar --help
104	...
105	-eX or --encoding=X Use output encoding X
106	...
107
108	Invalid encodings (e.g. --encoding=nonexistent) are rejected: Tika warns and falls back to UTF-8, see (5) below.
109	The encoding name appears to be case-insensitive, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
110
111	My tests converted docs that contain only ASCII, so the only visible evidence that the encoding
112	flag is taken into account is in xhtml output, where it shows up in the XML declaration. Xhtml is
113	the default output format (or can pass in -x or --xml to get xhtml out).
114
115
116	COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
117
118	(1) >java -jar tika-app-*.jar --encoding=utf-8 /Scratch/ak19/testword.docx
119	<?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
120	<head>
121	<meta name="date" content="2013-09-18T02:46:00Z"/>
122	...
123
124	(2) >java -jar tika-app-*.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
125	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
126	<head>
127	...
128
129	(3) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
130	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
131	<head>
132	...
133
134	(4) >java -jar tika-app-*.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
135	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
136	<head>
137	...
138
139	(5) >java -jar tika-app-*.jar --encoding=nonexistent /Scratch/ak19/testword.docx
140	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
141	Warning: encoding "nonexistent" not supported, using UTF-8
142	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
143	<head>
144	...
145
146	(6) (Output to html)
147	> java -jar tika-app-*.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
148	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
149	Warning: encoding "nonexistent" not supported, using UTF-8
150	<html xmlns="http://www.w3.org/1999/xhtml">
151	<head>
152	...
153	The warning to STDERR is all that indicates that the encoding flag is taken into account
154	when the --html flag is turned on. The actual html output sent to STDOUT makes no mention
155	of any encoding.
156
157	(7) (Output to html case 2)
158	> java -jar tika-app-*.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
159	<html xmlns="http://www.w3.org/1999/xhtml">
160	<head>
161	<meta name="date" content="2013-09-18T02:46:00Z"/>
162	<meta name="Total-Time" content="5"/>
163	...
164	No warnings, but also no mention of the encoding in the html output.
165
166
167	The warning messages in (6) indicate that the output encoding is also taken into account when
168	the output format is set to html, by passing in the flag --html to tika.
169	Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
170	to work with, it therefore seems meaningful to set --encoding=UTF-8.
171
172	We also pass in --pretty-print to get supposedly better formatted output.
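
The comparisons in (1) to (7) can be repeated mechanically. The following is a minimal sketch and not part of Tika or Greenstone: the helper name declared_encoding is hypothetical, and it assumes the XML declaration sits on the first line of the output, as in the examples above. It prints the encoding named in that declaration, or nothing when there is no declaration (the --html case).

```shell
#!/bin/sh
# Hypothetical helper (not part of Tika or Greenstone): read Tika output on
# stdin and print the encoding named in the XML declaration on the first line.
# Prints nothing if the first line has no XML declaration (as with --html).
declared_encoding() {
    sed -n '1s/^<?xml[^>]*encoding="\([^"]*\)".*/\1/p'
}
```

Possible usage, discarding the warnings on STDERR:
java -jar tika-app-*.jar --encoding=iso-8859-1 /PATH/TO/testword.docx 2>/dev/null | declared_encoding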
173
174
175	--------------------------------------------------------------
176	E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
177	--------------------------------------------------------------
178
179	The default Tika cli app accepts --html and --xml (for xhtml) flags to output html and xhtml respectively.
180	To extract images, the Tika cli app needs to be run separately with a --extract flag and optional --extract-dir=<dir>
181	However, running --html and then --extract sequentially does not produce an html file referring to the extracted
182	images because the extracted images are renamed to rId<digit>_<imagefilename>.<ext>, while the html file generated
183	refers to "embedded:<imagefilename>.<ext>" as the value for the src attributes of image elements.
184
185	So the problem is two-fold:
186	- Need to not be prefixing anything to the extracted images
187	- Need to remove "embedded:" prefix from the img src attributes in the html produced. Ideally don't want the string
188	"embedded:" prefixed at all, but that would require editing many source files in the Tika project rather than just one.
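
Both mismatches are plain string prefixes, which a short sketch makes concrete. The functions below are for illustration only and are NOT the approach taken in the rest of this section (which changes the Tika CLI code in Java): they patch the results after the fact with sed and mv. The function names are hypothetical, and the patterns assume exactly the naming described above, i.e. an rId<digits>_ prefix on extracted files and src="embedded:..." in the html.

```shell
#!/bin/sh
# Illustrative post-processing sketch; all helper names are hypothetical.

# Print the name an extracted file would have without its rId<digits>_ prefix.
strip_rid_prefix() {
    printf '%s\n' "$1" | sed 's/^rId[0-9][0-9]*_//'
}

# Read html on stdin; drop the "embedded:" prefix from src attribute values.
strip_embedded_prefix() {
    sed 's/src="embedded:/src="/g'
}

# Rename every rId<digits>_* file in directory $1 to its unprefixed name.
rename_extracted_images() {
    for f in "$1"/rId[0-9]*_*; do
        [ -e "$f" ] || continue                  # glob matched nothing
        base=$(basename "$f")
        new=$(strip_rid_prefix "$base")
        # Skip names the sed pattern left unchanged, otherwise rename.
        [ "$new" = "$base" ] || mv -- "$f" "$1/$new"
    done
}
```

Note that the rename silently overwrites if two extracted files differ only in their rId prefix, which is one reason a fix at the source is preferable.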
189
190	The solution turned out not to require compiling up apache-tika from source at all, but having a source checkout
191	to locate and modify code was handy.
192
193
194	SOLUTION TO OUTPUT (X)HTML WITH IMAGES EXTRACTED IN THE SAME LOCATION:
195	1. I wrote org.greenstone.tika.GSTikaCLI.java, which is based on Tika's TikaCLI.java
196	with some minor modifications documented below.
197
198	2. It stands alone and can be compiled and run against the tika-app-*.jar file on the classpath:
199	To compile
200	GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
201	To run:
202	GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html
203
204	(Can pass existing flags, e.g. --html for html without images extracted)
205
206	To compile code that lives in a directory called "src" and compile it into a directory called "build":
207
208	GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java
209
210	To run the compiled class that's now in folder "build":
211	GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html
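
One detail to watch in the quoted classpaths above: inside double quotes the shell does not expand tika-app-*.jar, so Java receives a literal asterisk, and Java's own classpath wildcard is only documented for the directory/* form. A minimal sketch of one way to resolve the jar name first; find_tika_jar is a hypothetical helper, and the lib/ and build/ layout is taken from the commands above.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the first tika-app-*.jar found in
# directory $1, or return non-zero if there is none.
find_tika_jar() {
    for jar in "$1"/tika-app-*.jar; do
        [ -f "$jar" ] && { printf '%s\n' "$jar"; return 0; }
    done
    return 1
}

# Possible usage (same class and flag as the run command above):
#   TIKA_JAR=$(find_tika_jar "$PWD/lib") &&
#     java -cp "$TIKA_JAR:$PWD/build" org.greenstone.tika.GSTikaCLI --html-with-images <inputfilepath> > output.html
```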
212
213
214	3. GSTikaCLI.java is based on TikaCLI.java, with the modifications marked by comments mentioning "GSDL".
215
216	a. The major change is that the inner class method FileEmbeddedDocumentExtractor.getOutputFile() no longer
217	adds the unwanted "rId_" prefix to the filenames of the extracted images.
218
219	b. The return type of the static method getTransformerHandler() is no longer TransformerHandler, but its superclass ContentHandler.
220
221	When the new --html-with-imgs (or --xhtml-with-imgs) flag is passed into GSTikaCLI, getTransformerHandler() further processes the html/xml result it generates by removing the "embedded:" prefixes from img src attributes. This is done with code copied and modified from tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java (look for the code about a ContentHandlerDecorator in TikaGUI.java).
222
223	c. Other changes support the 2 new input flags --html-with-imgs and --xhtml-with-imgs, additionally call the image extraction functions, and ensure an extraction directory flag is still supported in this mode. (When it is not provided, the images are extracted into the same directory as the input file.)
224
225
226	4. Next added a makeGSTikaCLI.sh script for compiling and the GSTikaCLI.sh script for minor simplification of running.
227
228
229	cd gs2build/ext/gstika
230	./makeGSTikaCLI.sh
231	./GSTikaCLI.sh --html-with-images <inputfile> > <outputfile>
232	e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html
233
234
235	--------------------------------------------------------------
236	F. COMPILING TIKA FROM SOURCE
237	--------------------------------------------------------------
238
239	Refer to https://github.com/apache/tika
240
241	(a) Need Maven 3 to compile up Tika.
242	export MAVEN_HOME=/Path/To/apache-maven3
243	export PATH=$MAVEN_HOME/bin:$PATH
244
245	(b) Need to configure Maven to grab artifacts using https, since some are only available over https.
246	Refer to https://stackoverflow.com/questions/25393298/what-is-the-correct-way-of-forcing-maven-to-use-https-for-maven-central
247	which says to add the following to the <profiles> section of your $MAVEN_HOME/conf/settings.xml:
248
249	<profile>
250	  <id>maven-https</id>
251	  <activation>
252	    <activeByDefault>true</activeByDefault>
253	  </activation>
254	  <repositories>
255	    <repository>
256	      <id>central</id>
257	      <url>https://repo1.maven.org/maven2</url>
258	      <snapshots>
259	        <enabled>false</enabled>
260	      </snapshots>
261	    </repository>
262	  </repositories>
263	  <pluginRepositories>
264	    <pluginRepository>
265	      <id>central</id>
266	      <url>https://repo1.maven.org/maven2</url>
267	      <snapshots>
268	        <enabled>false</enabled>
269	      </snapshots>
270	    </pluginRepository>
271	  </pluginRepositories>
272	</profile>
273
274	(c) Grab tika from git and attempt to compile it with maven
275	> git clone https://github.com/apache/tika.git
276	> cd tika
277	> mvn clean install
278	Takes 42-45 mins to compile up!
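
Given how long the build takes, it may be worth confirming the Maven requirement from (a) before starting. A minimal sketch with hypothetical helper names; it assumes the first line of `mvn -version` has the usual "Apache Maven X.Y.Z" form, and it treats any major version from 3 upward as acceptable, which is an assumption since only Maven 3 is mentioned above.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: read `mvn -version` output on stdin and
# print the major version number from the "Apache Maven X.Y.Z" line.
maven_major_version() {
    sed -n 's/^Apache Maven \([0-9][0-9]*\)\..*/\1/p' | head -n 1
}

# Succeed only if the mvn found on PATH reports major version 3 or later
# (assumption: later majors are fine; the text above only says Maven 3).
have_maven3() {
    major=$(mvn -version 2>/dev/null | maven_major_version)
    [ -n "$major" ] && [ "$major" -ge 3 ]
}
```

Possible usage: have_maven3 && mvn clean install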
279
280
281	This compiles up version 2.0.0 tika-app jar file, whereas the precompiled downloadable jar is version 1.24.1.
282
283	Compiling this wasn't necessary to compile or run GSTikaCLI.java!
284	However, having the source code to base GSTikaCLI.java off of TikaCLI.java
285	was useful.
286
287	--------------------------------------------------------------
288	G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
289	--------------------------------------------------------------
290
291	If you have Tesseract installed correctly, with its bin folder on PATH and the TESSDATA_PREFIX
292	environment variable set, the current version of Tika (tika-app-1.24.x.jar) will
293	turn on Tesseract OCR automatically for images.
294
295	But Tika is not configured out of the box to work with Tesseract to OCR PDFs (Tesseract
296	on its own does not OCR PDFs, only images).
297
298	To get Tika to work with Tesseract to OCR PDFs:
299	1. Must pass a tika-config.xml file to Tika, in which the TesseractOCRParser and PDFParser are
300	configured correctly. Run as:
301	java -jar tika-app-*.jar --config=<tika-config.xml>
302
303	2. The "outputType" param of the TesseractOCRParser in this config file must have one of
304	these 2 values:
305	a. "txt" - which requests Tesseract to output OCR as text
306	b. "hocr" - which asks Tesseract to output OCR as html (hence format called hocr)
307
308	For the hocr value to have any effect (otherwise the PDF pages will not be OCR-ed), the
309	$TESSDATA_PREFIX/configs/hocr file must exist on the Tesseract side and contain
310	these values (given at
311	https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
312	tessedit_create_hocr 1
313	hocr_font_info 0
314
315	The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.
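
Since a missing configs/hocr file fails silently (no OCR output), a quick check before running Tika can save time. A minimal sketch with hypothetical helper names; it only tests for the two settings listed above and takes the tessdata directory as its argument.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the Tesseract hocr config file exists under
# the tessdata directory $1 and contains both settings listed above.
hocr_config_ok() {
    cfg="$1/configs/hocr"
    [ -f "$cfg" ] &&
        grep -q '^tessedit_create_hocr[[:space:]][[:space:]]*1[[:space:]]*$' "$cfg" &&
        grep -q '^hocr_font_info[[:space:]][[:space:]]*0[[:space:]]*$' "$cfg"
}

# Hypothetical helper: (over)write the hocr config file with those two settings.
write_hocr_config() {
    mkdir -p "$1/configs" &&
        printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' > "$1/configs/hocr"
}
```

Possible usage: hocr_config_ok "$TESSDATA_PREFIX" || write_hocr_config "$TESSDATA_PREFIX"
Note that write_hocr_config replaces any existing configs/hocr file.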
316
317
318	I'm committing an appropriate tika-config.xml file (based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/) for GSTika, containing:
319
320	*************************************************************
321	<?xml version="1.0" encoding="UTF-8" standalone="no"?>
322	<!--
323	(XML comments only allowed after xml processor instruction.)
324
325	https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
326	which links to their sample tika-config.xml (copied below) which configures the PDF and OCR Parsers to behave just as the old PDFParser.props and OCR Parser properties files did.
327
328	- new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
329	- old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser
330
331	https://tika.apache.org/1.16/configuring.html
332	https://issues.apache.org/jira/browse/TIKA-2624
333	-->
334	<properties>
335	  <parsers>
336	    <parser class="org.apache.tika.parser.DefaultParser">
337	      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
338	      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
339	    </parser>
340	    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
341	      <params>
342	        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
343	             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
344	        <!--
345	        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
346	        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
347	        -->
348
349	        <!-- IMPORTANT!! -->
350	        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
351	        <!-- hocr is preferred as Tesseract produces nicely formatted html that better reflects
352	             the placement of the original text in the scanned page. (Can compare running with hocr vs txt)
353
354	             However, initially, the above value had to be fixed as "txt", as outputType value = hocr prevented
355	             Tika+Tesseract from OCR-ing pdfs (no OCR output).
356	             Until $GEXT_INSTALLED/tessdata/configs/hocr (tesseract config file) was created containing specific
357	             property values in point 2b below.
358
359	             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
360	             1. always pass in this file as __config=/path/to/tika-config.xml to tika-app-*.jar cmd,
361	             2. AND do one of the following:
362	                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
363	                b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set the outputType above
364	                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
365	                   (taken from
366	                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
367	                   tessedit_create_hocr 1
368	                   hocr_font_info 0
369
370	             More information about tesseract config options by running:
371	             tesseract __print-parameters
372	        -->
373	        <param name="language" type="string">eng</param>
374	        <param name="pageSegMode" type="string">1</param>
375	      </params>
376	    </parser>
377	    <parser class="org.apache.tika.parser.pdf.PDFParser">
378	      <params>
379	        <param name="ocrStrategy" type="string">ocr_and_text</param>
380	      </params>
381	    </parser>
382
383	  </parsers>
384	</properties>
385	*************************************************************
386
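
Whether the hocr prerequisite applies depends on the outputType value in the config file actually passed to Tika. A minimal sketch for reading that value back; config_output_type is a hypothetical helper, and it is a line-based text match that assumes the param sits on a single line as in the listing above, not a real XML parse.

```shell
#!/bin/sh
# Hypothetical helper: print the value of the outputType param found in the
# Tika config file given as $1. Line-based text match, not a real XML parse;
# a param inside a commented-out block would be reported too.
config_output_type() {
    sed -n 's/.*<param name="outputType"[^>]*>\([^<]*\)<\/param>.*/\1/p' "$1" | head -n 1
}
```

Possible usage, with /PATH/TO/scanned.pdf standing in for a real input file:
config_output_type tika-config.xml
java -jar tika-app-*.jar --config=tika-config.xml --html /PATH/TO/scanned.pdf > /PATH/TO/scanned.html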
387
388	--------------------------------------------------------------


source: gs2-extensions/gstika/trunk/GS_TIKA_README.txt@ 34187
