source: gs2-extensions/gstika/trunk/GS_TIKA_README.txt@ 34175

Last change on this file since 34175 was 34175, checked in by ak19, 4 years ago
Minor changes to folder names
1	--------------------------------------------------------------
2	A. Some background information on Apache Tika and related:
3	--------------------------------------------------------------
4	* https://tika.apache.org/1.5/gettingstarted.html
5	Refer to the heading "Using Tika as a command line utility" for available cmd line options
6
7	* https://tika.apache.org/download.html
8	is where the tika-app-1.24.1.jar was downloaded from
9	(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)
10
11	* Apache 2.0 license
12	https://tika.apache.org/license.html
13
14	* Mime-types for docx and other office suite docs:
15	https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
16
17	* Tesseract for OCR with Tika:
18	https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
19	Use Tika 1.14 to extract text from image by Tesseract OCR
20
21	* API usage examples - if modifying Tika code:
22	https://tika.apache.org/1.8/examples.html
23	https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
24
25	--------------------------------------------------------------
26	B. Here are some examples of running Tika on the command line:
27	--------------------------------------------------------------
28	1. HTML:
29
30	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm
31
32	2. XHTML - looks the same as HTML:
33
34	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
35
36	3. PLAIN TEXT CONTENT - NO META:
37
38	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
39
40	a. PLAIN TEXT WITH META:
41
42	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
43
44	b. JUST META:
45
46	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html
47
48	4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):
49
50	Extracts all attachments (images etc) into specified dir (-z or --extract and then specify a dir for it)
51	GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
52
53
54	--------------------------------------------------------------
55	C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
56	--------------------------------------------------------------
57	* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
58
59	INFO As a convenience, TikaCLI has turned on extraction of
60	inline images for the PDFParser (TIKA-2374).
61	Aside from the -z option, this is not the default behavior
62	in Tika generally or in tika-server.
63	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
64	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
65	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
66	for optional dependencies.
67
68	Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
69	WARNING: org.xerial's sqlite-jdbc is not loaded.
70	Please provide the jar on your classpath to parse sqlite files.
71	See tika-parsers/pom.xml for the correct version.
72
73
74	* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx
75
76	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
77	WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
78	See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
79	for optional dependencies.
80
81	Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
82	WARNING: org.xerial's sqlite-jdbc is not loaded.
83	Please provide the jar on your classpath to parse sqlite files.
84	See tika-parsers/pom.xml for the correct version.
85	<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
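
The two runs above interleave log messages with the real output. A minimal Java sketch of keeping them apart when calling Tika from code, assuming the INFO/WARNING lines go to STDERR (as the encoding warnings in section D do). The class name TikaRunSketch, the method textMain and all four file parameters are made up for illustration; only the "java -jar <jar> --text-main <input>" command line comes from this README.

```java
import java.io.File;

/** Sketch only: separates Tika's extracted text (STDOUT) from its log messages (STDERR). */
final class TikaRunSketch {

    /**
     * Prepares (but does not start) "java -jar <tikaJar> --text-main <input>".
     * All four File parameters are hypothetical paths chosen by the caller.
     */
    static ProcessBuilder textMain(File tikaJar, File input, File textOut, File warningsLog) {
        ProcessBuilder pb = new ProcessBuilder(
                "java", "-jar", tikaJar.getPath(), "--text-main", input.getPath());
        pb.redirectOutput(textOut);    // STDOUT: the extracted text only
        pb.redirectError(warningsLog); // STDERR: assumed to carry the INFO/WARNING lines
        return pb;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("Usage: TikaRunSketch <tika-app.jar> <input> <text-out> <warnings-log>");
            return;
        }
        Process p = textMain(new File(args[0]), new File(args[1]),
                             new File(args[2]), new File(args[3])).start();
        System.out.println("Tika exited with status " + p.waitFor());
    }
}
```

On the command line the same separation is just a matter of redirecting STDOUT and STDERR to different files.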
86
87
88	--------------------------------------------------------------
89	D. THE --encoding= FLAG TO TIKA
90	--------------------------------------------------------------
91	> java -jar tika-app-*.jar --help
92	...
93	-eX or --encoding=X Use output encoding X
94	...
95
96	Invalid encodings (e.g. --encoding=nonexistent) are not honoured: Tika prints a warning and falls back to UTF-8, see (5) below.
97	The encoding name seems to be case-insensitive, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
98
99	Since my tests only converted docs containing ASCII text using Tika,
100	the only obvious sign that the encoding flag has been taken into account at all is in xhtml output,
101	which is the default (or pass in -x or --xml to get xhtml out): the XML declaration names the encoding.
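
The case-insensitivity noted above matches how the Java runtime itself looks up charset names, which is plausibly what Tika relies on (the warning in (5) mentions the Java runtime). A small JDK-only sketch to check a name before passing it to --encoding. The class EncodingNameCheck and its resolve() method are made up for illustration and are not Tika code; the UTF-8 fallback merely copies what the warnings in (5) report.

```java
import java.nio.charset.Charset;

/** Sketch: how the Java runtime resolves charset names (JDK only, no Tika involved). */
final class EncodingNameCheck {

    /** Returns the canonical charset name, or "UTF-8" if the runtime does not support the name. */
    static String resolve(String requested) {
        // Charset.isSupported() matches names case-insensitively and also accepts aliases.
        if (Charset.isSupported(requested)) {
            return Charset.forName(requested).name();
        }
        // Copies the fallback reported in the warnings; this is not Tika's own code.
        return "UTF-8";
    }

    public static void main(String[] args) {
        for (String name : new String[] {"utf-8", "UTF-8", "iso-8859-1", "nonexistent"}) {
            System.out.println(name + " -> " + resolve(name));
        }
    }
}
```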
102
103
104	COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
105
106	(1) >java -jar tika-app-*.jar --encoding=utf-8 /Scratch/ak19/testword.docx
107	<?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
108	<head>
109	<meta name="date" content="2013-09-18T02:46:00Z"/>
110	...
111
112	(2) >java -jar tika-app-*.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
113	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
114	<head>
115	...
116
117	(3) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
118	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
119	<head>
120	...
121
122	(4) >java -jar tika-app-*.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
123	<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
124	<head>
125	...
126
127	(5) >java -jar tika-app-*.jar --encoding=nonexistent /Scratch/ak19/testword.docx
128	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
129	Warning: encoding "nonexistent" not supported, using UTF-8
130	<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
131	<head>
132	...
133
134	(6) (Output to html)
135	> java -jar tika-app-*.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
136	Warning: The encoding 'nonexistent' is not supported by the Java runtime.
137	Warning: encoding "nonexistent" not supported, using UTF-8
138	<html xmlns="http://www.w3.org/1999/xhtml">
139	<head>
140	...
141	The warning to STDERR is all that indicates that the encoding flag is taken into account
142	when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
143	encoding in the file.
144
145	(7) (Output to html case 2)
146	> java -jar tika-app-*.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
147	<html xmlns="http://www.w3.org/1999/xhtml">
148	<head>
149	<meta name="date" content="2013-09-18T02:46:00Z"/>
150	<meta name="Total-Time" content="5"/>
151	...
152	No warnings, but also no mention of the encoding in the html output.
153
154
155	The warning messages in (6) indicate that the output encoding is also taken into account when
156	the output format is set to html, by passing in the flag --html to tika.
157	Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
158	to work with, it therefore seems meaningful to set --encoding=UTF-8.
159
160	We also pass in --pretty-print to get supposedly better formatted output.
161
162
163	--------------------------------------------------------------
164	E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
165	--------------------------------------------------------------
166
167	The default Tika cli app accepts --html and --xml (for xhtml) flags to output html and xhtml respectively.
168	To extract images, the Tika cli app needs to be run separately with the --extract flag and an optional --extract-dir=<dir>.
169	However, running --html and then --extract sequentially does not produce an html file referring to the extracted
170	images because the extracted images are renamed to rId<digit>_<imagefilename>.<ext>, while the html file generated
171	refers to "embedded:<imagefilename>.<ext>" as the value for the src attributes of image elements.
172
173	So the problem is two-fold:
174	- Need to stop anything being prefixed to the filenames of the extracted images
175	- Need to remove "embedded:" prefix from the img src attributes in the html produced. Ideally don't want the string
176	"embedded:" prefixed at all, but that would require editing many source files in the Tika project rather than just one.
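
To make the mismatch concrete, here is a minimal string-level sketch of the two prefixes described above. The class ImageNameMismatch, its method names and the sample file name image1.png are made up for illustration; this is not the code that ended up in the customised CLI, it only shows why the html and the extracted files don't line up until both prefixes are gone.

```java
import java.util.regex.Pattern;

/** Sketch: the two prefixes that stop the html from matching the extracted image files. */
final class ImageNameMismatch {

    // "rId" + digits + "_": the prefix given to extracted files, as described above.
    private static final Pattern REL_ID_PREFIX = Pattern.compile("^rId\\d+_");
    // The prefix found in img src values in the generated html.
    private static final String EMBEDDED_PREFIX = "embedded:";

    static String stripRelIdPrefix(String extractedFileName) {
        return REL_ID_PREFIX.matcher(extractedFileName).replaceFirst("");
    }

    static String stripEmbeddedPrefix(String src) {
        return src.startsWith(EMBEDDED_PREFIX) ? src.substring(EMBEDDED_PREFIX.length()) : src;
    }

    public static void main(String[] args) {
        String onDisk = "rId7_image1.png";      // hypothetical extracted file name
        String inHtml = "embedded:image1.png";  // hypothetical img src value
        System.out.println(onDisk.equals(inHtml)); // false: the html can't find the image
        System.out.println(stripRelIdPrefix(onDisk).equals(stripEmbeddedPrefix(inHtml))); // true
    }
}
```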
177
178	The solution turned out not to require compiling up apache-tika from source at all, but having a source checkout
179	to locate and modify code was handy.
180
181
182	SOLUTION TO OUTPUT (X)HTML WITH IMAGES EXTRACTED IN THE SAME LOCATION:
183	1. I wrote org.greenstone.tika.GSTikaCLI.java, which is based on Tika's TikaCLI.java
184	with some minor modifications documented below.
185
186	2. It stands alone and can be compiled and run against the tika-app-*.jar file on the classpath:
187	To compile:
188	GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
189	To run:
190	GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-imgs <inputfilepath> > output.html
191
192	(Can pass existing flags, e.g. --html for html without images extracted)
193
194	To compile code that lives in a directory called "src" and compile it into a directory called "build":
195
196	GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java
197
198	To run the compiled class that's now in folder "build":
199	GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" org.greenstone.tika.GSTikaCLI --html-with-imgs <inputfilepath> > output.html
200
201
202	3. GSTikaCLI.java is based on TikaCLI.java, with the modifications marked by comments mentioning "GSDL".
203
204	a. The major change is that the inner class method FileEmbeddedDocumentExtractor.getOutputFile() no longer
205	adds the unwanted "rId<digit>_" prefix to the filenames of the extracted images.
206
207	b. The return type of the static method getTransformerHandler() is no longer TransformerHandler, but its superinterface ContentHandler.
208
209	When the new --html-with-imgs (or --xhtml-with-imgs) flag is passed into GSTikaCLI, getTransformerHandler() further processes the html/xml result it generates, by removing the "embedded:" prefixes from img src attributes. This is done with code copied from tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java and modified (look for the code about a ContentHandlerDecorator in TikaGUI.java).
210
211	c. Other changes are to support the 2 new input flags --html-with-imgs and --xhtml-with-imgs, to additionally call the image extraction functions, and to ensure an extraction directory flag is still supported in this mode. (When it is not provided, the images are extracted into the same level as the input file.)
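
The decorator idea in (b) can be sketched with JDK classes alone. Note the assumptions: this uses org.xml.sax.helpers.XMLFilterImpl as a stand-in for Tika's ContentHandlerDecorator so that it compiles without the tika-app jar, and the class EmbeddedSrcStripper with its decorate() and rewrite() methods is made up for illustration. It is not the actual code in the customised CLI.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

/** Sketch: a SAX decorator that drops the "embedded:" prefix from img src attributes. */
final class EmbeddedSrcStripper extends XMLFilterImpl {
    private static final String PREFIX = "embedded:";

    /** Wraps the real output handler; callers then only see a plain ContentHandler. */
    static ContentHandler decorate(ContentHandler downstream) {
        EmbeddedSrcStripper stripper = new EmbeddedSrcStripper();
        stripper.setContentHandler(downstream);
        return stripper;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        if ("img".equals(localName) || "img".equals(qName)) {
            AttributesImpl copy = new AttributesImpl(atts); // the incoming Attributes is read-only
            for (int i = 0; i < copy.getLength(); i++) {
                boolean isSrc = "src".equals(copy.getLocalName(i)) || "src".equals(copy.getQName(i));
                if (isSrc && copy.getValue(i).startsWith(PREFIX)) {
                    copy.setValue(i, copy.getValue(i).substring(PREFIX.length()));
                }
            }
            atts = copy;
        }
        super.startElement(uri, localName, qName, atts); // forward to the wrapped handler
    }

    /** Demo helper: parses an xhtml string and serialises it again through the decorator. */
    static String rewrite(String xhtml) {
        try {
            SAXTransformerFactory tf = (SAXTransformerFactory) TransformerFactory.newInstance();
            TransformerHandler serializer = tf.newTransformerHandler();
            serializer.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.setResult(new StreamResult(out));

            SAXParserFactory pf = SAXParserFactory.newInstance();
            pf.setNamespaceAware(true);
            XMLReader reader = pf.newSAXParser().getXMLReader();
            reader.setContentHandler(decorate(serializer));
            reader.parse(new InputSource(new StringReader(xhtml)));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not rewrite xhtml", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rewrite("<html xmlns=\"http://www.w3.org/1999/xhtml\"><body>"
                + "<img src=\"embedded:image1.png\"/></body></html>"));
    }
}
```

Returning a ContentHandler from decorate() is the same trade-off as in (b): once the handler is wrapped, the caller no longer holds a TransformerHandler, only the more general interface.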
212
213
214	4. Next, I added a makeGSTikaCLI.sh script for compiling and a GSTikaCLI.sh script to slightly simplify running.
215
216
217	cd gs2build/ext/gstika
218	./makeGSTikaCLI.sh
219	./GSTikaCLI.sh --html-with-imgs <inputfile> > <outputfile>
220	e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html
221
222
223	--------------------------------------------------------------
224	F. COMPILING TIKA FROM SOURCE
225	--------------------------------------------------------------
226
227	Refer to https://github.com/apache/tika
228
229	(a) Need Maven 3 to compile up Tika.
230	export MAVEN_HOME=/Path/To/apache-maven3
231	export PATH=$MAVEN_HOME/bin:$PATH
232
233	(b) Need to configure Maven to grab artifacts using https, since some are only available over https.
234	Refer to https://stackoverflow.com/questions/25393298/what-is-the-correct-way-of-forcing-maven-to-use-https-for-maven-central
235	which instructs adding the following to the <profiles> section of your $MAVEN_HOME/conf/settings.xml:
236
237	<profile>
238	  <id>maven-https</id>
239	  <activation>
240	    <activeByDefault>true</activeByDefault>
241	  </activation>
242	  <repositories>
243	    <repository>
244	      <id>central</id>
245	      <url>https://repo1.maven.org/maven2</url>
246	      <snapshots>
247	        <enabled>false</enabled>
248	      </snapshots>
249	    </repository>
250	  </repositories>
251	  <pluginRepositories>
252	    <pluginRepository>
253	      <id>central</id>
254	      <url>https://repo1.maven.org/maven2</url>
255	      <snapshots>
256	        <enabled>false</enabled>
257	      </snapshots>
258	    </pluginRepository>
259	  </pluginRepositories>
260	</profile>
261
262	(c) Grab tika from git and attempt to compile it with maven
263	> git clone https://github.com/apache/tika.git
264	> cd tika
265	> mvn clean install
266	Takes 42-45 mins to compile up!
267
268
269	This compiles up a version 2.0.0 tika-app jar file, whereas the precompiled downloadable jar is version 1.24.1.
270
271	Compiling this wasn't necessary to compile or run GSTikaCLI.java!
272	However, having the source code to base GSTikaCLI off of was useful.
273
274	--------------------------------------------------------------
