GS-README.txt@ 37201

Last change on this file since 37201 was 35402, checked in by anupama, 3 years ago
File renaming tika ext's README to GS-README, as it was our own custom readme file and not any official one by tika.
File size: 7.5 KB

1	--------------------------------------------------------------
2	About tika-app.jar:
3	--------------------------------------------------------------
4	Last updated version is currently 1.24.1 (tika-app-1.24.1.jar),
5	which is reported in the final line of output when running,
6	on Windows:
7	java -jar %GSDLHOME%\ext\tika\tika-app.jar --version
8	or on Linux:
9	java -jar $GSDLHOME/ext/tika/tika-app.jar --version
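
If only the version string is wanted, the final line can be picked out with tail.
This is a minimal sketch, assuming a POSIX shell on Linux. The function name
tika_version is hypothetical and not part of Tika or Greenstone.

```shell
# Hypothetical helper (not part of Tika or Greenstone): print only the
# final line of the --version output, which is where the version appears.
tika_version() {
    # $1: path to tika-app.jar; defaults to the Greenstone location above.
    jar="${1:-$GSDLHOME/ext/tika/tika-app.jar}"
    # Merge STDERR into STDOUT so that the "final line" is taken over
    # everything the command prints, then keep only that last line.
    java -jar "$jar" --version 2>&1 | tail -n 1
}
```

Usage: tika_version, or tika_version /PATH/TO/tika-app.jar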
10
11
12
13	--------------------------------------------------------------
14	A. Some background information on Apache Tika and related:
15	--------------------------------------------------------------
16	* https://tika.apache.org/1.5/gettingstarted.html
17	Refer to the heading "Using Tika as a command line utility" for available cmd line options
18
19	* https://tika.apache.org/download.html
20	is where the tika-app-1.24.1.jar was downloaded from
21	(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)
22
23	* Apache 2.0 license
24	https://tika.apache.org/license.html
25
26	* Mime-types for docx and other office suite docs:
27	https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
28
29	* Tesseract for OCR with Tika:
30	https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
31	Use Tika 1.14 to extract text from image by Tesseract OCR
32
33	* API usage examples - if modifying Tika code:
34	https://tika.apache.org/1.8/examples.html
35	https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
36
37	--------------------------------------------------------------
38	B. Here are some examples of running Tika on the command line:
39	--------------------------------------------------------------
40	1. HTML:
41
42	GS3/gs2build/ext/tika>java -jar tika-app.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
43
44	2. XHTML - looks the same as HTML:
45
46	GS3/gs2build/ext/tika>java -jar tika-app.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
47
48	3. PLAIN TEXT CONTENT - NO META:
49
50	GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
51
52	a. PLAIN TEXT WITH META:
53
54	GS3/gs2build/ext/tika>java -jar tika-app.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
55
56	b. JUST META:
57
58	GS3/gs2build/ext/tika>java -jar tika-app.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
59
60	4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
61
62	Extracts all attachments (images etc) into the specified dir (pass in -z or --extract, then --extract-dir to specify the dir)
63	GS3/gs2build/ext/tika>java -jar tika-app.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
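
The output-format examples 1 to 3 above differ only in the flag passed to Tika,
so they can be folded into one helper. This is a rough sketch only: the function
name tika_convert and its format keywords are made up for illustration, while
the flags are the ones used in the examples above.

```shell
# Hypothetical wrapper (not part of Tika or Greenstone) around the
# conversions listed above. Only the flags come from the examples; the
# function name and the format keywords are invented here.
tika_convert() {
    # Usage: tika_convert FORMAT INPUT OUTPUT
    case "$1" in
        html)      flag=--html ;;
        xhtml)     flag=--xml ;;
        text)      flag=--text ;;
        text-main) flag=--text-main ;;
        meta)      flag=--metadata ;;
        *) echo "tika_convert: unknown format '$1'" >&2; return 2 ;;
    esac
    # Assumes the current directory contains tika-app.jar, as in the
    # examples above. Tika writes the result to STDOUT.
    java -jar tika-app.jar "$flag" "$2" > "$3"
}
```

Usage: tika_convert html /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.htm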
64
65
66	--------------------------------------------------------------
67	C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
68	--------------------------------------------------------------
69	* GS3/gs2build/ext/tika>java -jar tika-app.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
70
71	INFO As a convenience, TikaCLI has turned on extraction of
72	inline images for the PDFParser (TIKA-2374).
73	Aside from the -z option, this is not the default behavior
74	in Tika generally or in tika-server.
75	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
76	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
77	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
78	for optional dependencies.
79
80	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
81	WARNING: org.xerial's sqlite-jdbc is not loaded.
82	Please provide the jar on your classpath to parse sqlite files.
83	See tika-parsers/pom.xml for the correct version.
84
85
86	* GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx
87
88	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
89	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
90	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
91	for optional dependencies.
92
93	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
94	WARNING: org.xerial's sqlite-jdbc is not loaded.
95	Please provide the jar on your classpath to parse sqlite files.
96	See tika-parsers/pom.xml for the correct version.
97	<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
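
To keep the log messages out of the extracted text, the two streams can be
redirected separately. This is a sketch that assumes the INFO/WARNING lines
above are written to STDERR and the document text to STDOUT. The function name
tika_text_main is hypothetical.

```shell
# Hypothetical helper: keep the extracted text and the log messages apart.
tika_text_main() {
    # Usage: tika_text_main INPUT TEXT_OUT LOG_OUT
    # STDOUT (the document text) goes to TEXT_OUT; STDERR (where the
    # WARNING lines are assumed to be written) goes to LOG_OUT.
    java -jar tika-app.jar --text-main "$1" > "$2" 2> "$3"
}
```

Usage: tika_text_main /PATH/TO/testword.docx testword.txt tika.log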
98
99
100	--------------------------------------------------------------
101	D. THE --encoding= FLAG TO TIKA
102	--------------------------------------------------------------
103	> java -jar tika-app.jar --help
104	...
105	-eX or --encoding=X Use output encoding X
106	...
107
108	Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika warns and falls back to UTF-8, see (5) below.
109	The encoding name seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
110
111	Since my tests only converted docs containing ASCII text using Tika,
112	the effect of the encoding flag is only obvious when the output is xhtml,
113	which is the default output format (-x or --xml can also be passed in to get xhtml out).
114
115
116	COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
117
118	(1) >java -jar tika-app.jar --encoding=utf-8 /Scratch/ak19/testword.docx
119	<?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
120	<head>
121	<meta name="date" content="2013-09-18T02:46:00Z"/>
122	...
123
124	(2) >java -jar tika-app.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
125	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
126	<head>
127	...
128
129	(3) >java -jar tika-app.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
130	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
131	<head>
132	...
133
134	(4) >java -jar tika-app.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
135	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
136	<head>
137	...
138
139	(5) >java -jar tika-app.jar --encoding=nonexistent /Scratch/ak19/testword.docx
140	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
141	Warning: encoding "nonexistent" not supported, using UTF-8
142	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
143	<head>
144	...
145
146	(6) (Output to html)
147	> java -jar tika-app.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
148	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
149	Warning: encoding "nonexistent" not supported, using UTF-8
150	<html xmlns="http://www.w3.org/1999/xhtml">
151	<head>
152	...
153	The warning on STDERR is the only indication that the encoding flag is taken into account
154	when the --html flag is passed in. The actual html output sent to STDOUT makes no mention of any
155	encoding.
156
157	(7) (Output to html case 2)
158	> java -jar tika-app.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
159	<html xmlns="http://www.w3.org/1999/xhtml">
160	<head>
161	<meta name="date" content="2013-09-18T02:46:00Z"/>
162	<meta name="Total-Time" content="5"/>
163	...
164	No warnings, but also no mention of the encoding in the html output.
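
The comparison above can be repeated mechanically by pulling the declared
encoding out of the first line of Tika's output. This is a sketch only: the
function name declared_encoding is hypothetical, and it relies on the XML
declaration being on the first line, as in cases (1) to (5).

```shell
# Hypothetical helper: print the encoding named in the XML declaration on
# the first line of STDIN, or nothing if that line declares no encoding.
declared_encoding() {
    sed -n '1s/.*encoding="\([^"]*\)".*/\1/p'
}
```

Usage: java -jar tika-app.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx | declared_encoding
Going by (3), this should print ISO-8859-1. With --html added, as in (7), it
should print nothing, since that output carries no XML declaration.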
165
166
167	The warning messages in (6) indicate that the output encoding is also taken into account when
168	the output format is set to html, by passing in the flag --html to tika.
169	Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
170	to work with, it therefore seems meaningful to set --encoding=UTF-8.
171
172	We also pass in --pretty-print, which is meant to produce better formatted output.
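
Putting the flags discussed in this section together gives a command of the
following shape. This is only a sketch and not necessarily how Greenstone
itself invokes Tika. The function name gs_tika_html is hypothetical.

```shell
# Hypothetical wrapper combining the flags discussed in this section.
gs_tika_html() {
    # Usage: gs_tika_html INPUT OUTPUT
    # Assumes the current directory contains tika-app.jar.
    java -jar tika-app.jar --html --encoding=UTF-8 --pretty-print "$1" > "$2"
}
```

Usage: gs_tika_html /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.htm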
173
174
175	--------------------------------------------------------------

source: main/trunk/greenstone2/ext/tika/GS-README.txt@ 37201