source: main/trunk/greenstone2/ext/tika/README.txt@ 34194

Last change on this file since 34194 was 34172, checked in by ak19, 4 years ago
Some minor improvements to the UnknownConverterPlugin settings for tika's conversion (of docx files) to html. Also documenting the reasoning.
File size: 7.2 KB

1	--------------------------------------------------------------
2	A. Some background information on Apache Tika and related:
3	--------------------------------------------------------------
4	* https://tika.apache.org/1.5/gettingstarted.html
5	Refer to the heading "Using Tika as a command line utility" for available cmd line options
6
7	* https://tika.apache.org/download.html
8	is where the tika-app-1.24.1.jar was downloaded from
9	(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)
10
11	* Apache 2.0 license
12	https://tika.apache.org/license.html
13
14	* Mime-types for docx and other office suite docs:
15	https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
16
17	* Tesseract for OCR with Tika:
18	https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
19	Use Tika 1.14 to extract text from image by Tesseract OCR
20
21	* API usage examples - if modifying Tika code:
22	https://tika.apache.org/1.8/examples.html
23	https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
24
25	--------------------------------------------------------------
26	B. Here are some examples of running Tika on the command line:
27	--------------------------------------------------------------
28	1. HTML:
29
30	GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
31
32	2. XHTML - looks the same as HTML:
33
34	GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
35
36	3. PLAIN TEXT CONTENT - NO META:
37
38	GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
39
40	a. PLAIN TEXT WITH META:
41
42	GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
43
44	b. JUST META:
45
46	GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
47
48	4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
49
50	Extracts all attachments (images etc) into the specified dir (pass -z or --extract, and give the dir with --extract-dir)
51	GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
52
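Since HTML and images can't be produced in one step, a wrapper has to call Tika twice. The following is only a minimal sketch of that two-pass idea, not how Greenstone itself invokes Tika. The names TIKA_JAR, JAVA and convert_with_images are made up for illustration, and the jar is assumed to be in the current directory.

```shell
#!/bin/sh
# Assumed, illustrative names: TIKA_JAR, JAVA, convert_with_images.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"
JAVA="${JAVA:-java}"

# convert_with_images INPUT OUTDIR
# Runs Tika twice on INPUT: once for the HTML, once for the attachments.
convert_with_images() {
    input=$1
    outdir=$2
    name=$(basename "$input")
    name=${name%.*}    # strip the extension: testword.docx -> testword

    mkdir -p "$outdir" || return 1
    # Pass 1: the HTML is written to stdout, so redirect it into a file.
    "$JAVA" -jar "$TIKA_JAR" --html "$input" > "$outdir/$name.htm" || return 1
    # Pass 2: attachments (images etc) are written into the extract dir.
    "$JAVA" -jar "$TIKA_JAR" --extract --extract-dir="$outdir" "$input"
}
```

Example call, using the same placeholder paths as above:
convert_with_images /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp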
53
54	--------------------------------------------------------------
55	C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
56	--------------------------------------------------------------
57	* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
58
59	INFO As a convenience, TikaCLI has turned on extraction of
60	inline images for the PDFParser (TIKA-2374).
61	Aside from the -z option, this is not the default behavior
62	in Tika generally or in tika-server.
63	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
64	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
65	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
66	for optional dependencies.
67
68	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
69	WARNING: org.xerial's sqlite-jdbc is not loaded.
70	Please provide the jar on your classpath to parse sqlite files.
71	See tika-parsers/pom.xml for the correct version.
72
73
74	* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx
75
76	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
77	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
78	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
79	for optional dependencies.
80
81	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
82	WARNING: org.xerial's sqlite-jdbc is not loaded.
83	Please provide the jar on your classpath to parse sqlite files.
84	See tika-parsers/pom.xml for the correct version.
85	<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
86
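In the transcripts above, the log messages appear on the terminal together with the document text. Assuming the INFO/WARNING messages go to STDERR and only the converted document goes to STDOUT (the encoding warnings in section D do go to STDERR; for the other messages this is an assumption), the two can be kept apart with plain shell redirection. tika_split is a made-up helper name, shown only as a sketch.

```shell
#!/bin/sh
# tika_split OUT LOG CMD [ARGS...]
# Hypothetical helper: runs CMD, sending its stdout to OUT and its
# stderr (where the warnings are assumed to go) to LOG.
tika_split() {
    out=$1
    log=$2
    shift 2
    "$@" > "$out" 2> "$log"
}
```

Example call:
tika_split testword.txt tika.log java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx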
87
88	--------------------------------------------------------------
89	D. THE --encoding= FLAG TO TIKA
90	--------------------------------------------------------------
91	> java -jar tika-app-1.24.1.jar --help
92	...
93	-eX or --encoding=X Use output encoding X
94	...
95
96	Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika prints a warning and falls back to UTF-8, see (5) below.
97	It seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
98
99	My tests only converted docs containing ASCII text using Tika, so the effect of the
100	encoding flag is only visible in the output when the output format is xhtml,
101	which is the default (or pass in -x or --xml to get xhtml out).
102
103
104	COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
105
106	(1) >java -jar tika-app-1.24.1.jar --encoding=utf-8 /Scratch/ak19/testword.docx
107	<?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
108	<head>
109	<meta name="date" content="2013-09-18T02:46:00Z"/>
110	...
111
112	(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
113	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
114	<head>
115	...
116
117	(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
118	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
119	<head>
120	...
121
122	(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
123	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
124	<head>
125	...
126
127	(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
128	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
129	Warning: encoding "nonexistent" not supported, using UTF-8
130	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
131	<head>
132	...
133
134	(6) (Output to html)
135	> java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
136	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
137	Warning: encoding "nonexistent" not supported, using UTF-8
138	<html xmlns="http://www.w3.org/1999/xhtml">
139	<head>
140	...
141	The warning to STDERR is all that indicates that the encoding flag is taken into account
142	when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
143	encoding in the file.
144
145	(7) (Output to html case 2)
146	> java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
147	<html xmlns="http://www.w3.org/1999/xhtml">
148	<head>
149	<meta name="date" content="2013-09-18T02:46:00Z"/>
150	<meta name="Total-Time" content="5"/>
151	...
152	No warnings, but also no mention of the encoding in the html output.
153
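The comparisons above were made by reading the first line of the output by eye. A small filter along the following lines could do the same check in a script. It is only a sketch: declared_encoding is a made-up name, and it assumes the XML declaration, if present, is on the first line of the output as in cases (1) to (5).

```shell
#!/bin/sh
# declared_encoding: reads Tika output on stdin and prints the encoding
# named in the XML declaration on line 1, upper-cased so that utf-8 and
# UTF-8 compare equal. Prints nothing if line 1 has no XML declaration
# (as with --html output).
declared_encoding() {
    sed -n '1s/^<?xml[^>]*encoding="\([^"]*\)".*/\1/p' | tr '[:lower:]' '[:upper:]'
}
```

Example call:
java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx | declared_encoding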
154
155	The warning messages in (6) indicate that the output encoding is also taken into account when
156	the output format is set to html, by passing in the flag --html to tika.
157	Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
158	to work with, it therefore seems meaningful to set --encoding=UTF-8.
159
160	Also passing in --pretty-print to get supposedly better formatted output.
161
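Putting the three flags discussed above together, the conversion command could be wrapped as below. This is a sketch of the flag combination only, not a copy of the actual UnknownConverterPlugin settings. TIKA_JAR, JAVA and tika_to_html are assumed names.

```shell
#!/bin/sh
# Assumed, illustrative names: TIKA_JAR, JAVA, tika_to_html.
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"
JAVA="${JAVA:-java}"

# tika_to_html INPUT OUTPUT
# Converts INPUT to pretty-printed html with UTF-8 output encoding,
# writing the result to the file OUTPUT.
tika_to_html() {
    "$JAVA" -jar "$TIKA_JAR" --html --encoding=UTF-8 --pretty-print "$1" > "$2"
}
```

Example call:
tika_to_html /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.htm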
162
163	--------------------------------------------------------------
