source: gs2-extensions/gstika/trunk/GS_TIKA_README.txt@ 34175

Last change on this file since 34175 was 34175, checked in by ak19, 4 years ago

Minor changes to folder names

--------------------------------------------------------------
A. Some background information on Apache Tika and related topics:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for available cmd line options

* https://tika.apache.org/download.html
is where the tika-app-1.24.1.jar was downloaded from
(We don't need any of the other jars, as explained under heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
  https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
  https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
https://tika.apache.org/1.8/examples.html
https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

   a. PLAIN TEXT WITH META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

   b. JUST META:

GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc) into the specified dir (-z or --extract, then specify a dir for it with --extract-dir)
GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx
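
The flags in the examples above map one-to-one onto output modes, so a small wrapper can pick the flag from a mode name. This is a minimal sketch, assuming a POSIX shell. The names tika_flag_for, run_tika and TIKA_JAR are made up for illustration and are not part of Tika or Greenstone:

```shell
# Map a mode name to the Tika CLI flag used in the examples above.
# Prints the flag on stdout; returns non-zero for an unknown mode.
tika_flag_for() {
    case "$1" in
        html)      printf '%s\n' '--html' ;;
        xhtml)     printf '%s\n' '--xml' ;;
        text)      printf '%s\n' '--text' ;;
        text-main) printf '%s\n' '--text-main' ;;
        metadata)  printf '%s\n' '--metadata' ;;
        *)         printf 'unknown mode: %s\n' "$1" >&2; return 1 ;;
    esac
}

# Hypothetical wrapper: run_tika MODE INPUT OUTPUT
# Assumption: TIKA_JAR is set by the caller to the actual tika-app jar file.
run_tika() {
    flag=$(tika_flag_for "$1") || return 1
    java -jar "$TIKA_JAR" "$flag" "$2" > "$3"
}
```

Possible usage, with the jar name from section A:
    TIKA_JAR=tika-app-1.24.1.jar
    run_tika html /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.htm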


--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/gstika>java -jar tika-app-*.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
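
In the second run above, the warnings and the document text arrive interleaved on the terminal. Section D notes that Tika's encoding warnings go to STDERR. Assuming the startup messages shown here do too (not verified in this README), redirecting the two streams separately keeps the extracted text clean. A minimal sketch, where run_split is a hypothetical helper name:

```shell
# run_split OUTFILE LOGFILE COMMAND [ARGS...]
# Runs COMMAND, writing its stdout to OUTFILE and its stderr to LOGFILE.
run_split() {
    out=$1
    log=$2
    shift 2
    "$@" > "$out" 2> "$log"
}
```

Possible usage:
    run_split testword.txt tika-warnings.log java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx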


--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
> java -jar tika-app-*.jar --help
    ...
    -eX or --encoding=X    Use output encoding X
    ...

You can't usefully specify invalid encodings (e.g. --encoding=nonexistent): Tika warns and falls back to UTF-8, see (5) below.
The encoding name seems to be case-insensitive, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my tests only converted docs containing ASCII text, the only visible evidence that
the encoding flag is taken into account at all is in xhtml output,
which is the default (or pass in -x or --xml to get xhtml out).


COMPARE, noting also the case of the encoding in the Tika command, vs in the output:

(1) >java -jar tika-app-*.jar --encoding=utf-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    ...

(2) >java -jar tika-app-*.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(3) >java -jar tika-app-*.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(4) >java -jar tika-app-*.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(5) >java -jar tika-app-*.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(6) (Output to html)
    > java -jar tika-app-*.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...
The warning to STDERR is all that indicates that the encoding flag is taken into account
when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
encoding in the file.

(7) (Output to html case 2)
    > java -jar tika-app-*.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    <meta name="Total-Time" content="5"/>
    ...
No warnings, but also no mention of the encoding in the html output.


The warning messages in (6) indicate that the output encoding is also taken into account when
the output format is set to html by passing the --html flag to Tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it therefore seems meaningful to set --encoding=UTF-8.

We also pass in --pretty-print to get supposedly better formatted output.
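
The comparison in (1)-(7) was done by eye. A small filter can pull the declared encoding out of Tika's xhtml output so that the check can be scripted. This is a minimal sketch, assuming the XML declaration sits on the first line of output as in the listings above. The names declared_encoding and same_encoding are hypothetical helpers:

```shell
# Reads Tika output on stdin and prints the encoding named in the XML
# declaration on the first line. Prints nothing when there is no
# declaration (as with --html output).
declared_encoding() {
    sed -n '1s/^[[:space:]]*<?xml[^>]*encoding="\([^"]*\)".*/\1/p'
}

# same_encoding A B: succeeds when the two names match ignoring case.
# This is a plain string comparison; charset aliases are not resolved.
same_encoding() {
    a=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    b=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
    [ "$a" = "$b" ]
}
```

Possible usage:
    enc=$(java -jar tika-app-1.24.1.jar --encoding=UTF-8 /PATH/TO/testword.docx 2>/dev/null | declared_encoding)
    same_encoding "$enc" utf-8 && echo "output declares UTF-8"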


--------------------------------------------------------------
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
--------------------------------------------------------------

The default Tika cli app accepts --html and --xml (for xhtml) flags to output html and xhtml respectively.
To extract images, the Tika cli app needs to be run separately with a --extract flag and optional --extract-dir=<dir>.
However, running --html and then --extract sequentially does not produce an html file referring to the extracted
images, because the extracted images are renamed to rId<digit>_<imagefilename>.<ext>, while the html file generated
refers to "embedded:<imagefilename>.<ext>" as the value for the src attributes of image elements.

So the problem is two-fold:
- Need to avoid prefixing anything to the filenames of the extracted images
- Need to remove the "embedded:" prefix from the img src attributes in the html produced. Ideally we don't want the string
"embedded:" prefixed at all, but that would require editing many source files in the Tika project rather than just one.
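
To make the mismatch concrete, the two fixes can be sketched as plain text transformations applied after running the stock Tika cli twice. This only illustrates the problem and is not the approach taken here, where the fix is made in Java as described below. It assumes double-quoted src attributes (as in the output listings in section D) and the rId<digit>_ naming described above. All three function names are hypothetical:

```shell
# Filter: remove the "embedded:" prefix from src attribute values in html on stdin.
strip_embedded_prefix() {
    sed 's/src="embedded:/src="/g'
}

# Print FILENAME without a leading rId<digits>_ prefix, if it has one.
strip_rid_prefix() {
    printf '%s\n' "$1" | sed 's/^rId[0-9][0-9]*_//'
}

# Rename every rId<digits>_* file in DIR to its unprefixed name.
# A file is left alone if the target name already exists.
unprefix_extracted_images() {
    for path in "$1"/rId[0-9]*_*; do
        [ -e "$path" ] || continue
        target="$1/$(strip_rid_prefix "$(basename "$path")")"
        [ -e "$target" ] || mv "$path" "$target"
    done
}
```

Possible usage, after running --html and --extract into the same directory:
    strip_embedded_prefix < tmp/testword.htm > tmp/testword-fixed.htm
    unprefix_extracted_images tmp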

The solution turned out not to require compiling up apache-tika from source at all, but having a source checkout
to locate and modify code was handy.


SOLUTION TO OUTPUT (X)HTML WITH IMAGES EXTRACTED IN THE SAME LOCATION:
1. I wrote org.greenstone.tika.GSTikaCLI.java, which is based on Tika's TikaCLI.java
with some minor modifications documented below.

2. It stands alone and can be compiled and run against the tika-app-*.jar file on the classpath:
To compile:
   GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar org/greenstone/tika/GSTikaCLI.java
To run:
   GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:." org.greenstone.tika.GSTikaCLI --html-with-imgs <inputfilepath> > output.html

(Can pass existing flags, e.g. --html for html without images extracted)

To compile code that lives in a directory called "src" into a directory called "build":

   GS3/gs2build/ext/gstika>javac -cp `pwd`/lib/tika-app-*.jar -d `pwd`/build src/org/greenstone/tika/GSTikaCLI.java

To run the compiled class that's now in folder "build":
   GS3/gs2build/ext/gstika>java -cp "`pwd`/lib/tika-app-*.jar:`pwd`/build" org.greenstone.tika.GSTikaCLI --html-with-imgs <inputfilepath> > output.html


3. GSTikaCLI.java is based on TikaCLI.java, with the modifications marked by comments mentioning "GSDL".

a. The major change is that the inner class method FileEmbeddedDocumentExtractor.getOutputFile() no longer
adds the unwanted "rId<digit>_" prefix to the filenames of the extracted images.

b. The return type of the static method getTransformerHandler() is no longer TransformerHandler, but its superclass ContentHandler.

When the new --html-with-imgs (or --xhtml-with-imgs) flag is passed into GSTikaCLI, getTransformerHandler() further processes the html/xml result it generates, removing the "embedded:" prefixes from img src attributes. This is done with code copied from tika-app/src/main/java/org/apache/tika/gui/TikaGUI.java and then modified (look for the code about a ContentHandlerDecorator in TikaGUI.java).

c. Other changes support the 2 new input flags --html-with-imgs and --xhtml-with-imgs, additionally call the image extraction functions, and ensure that an extraction directory flag is still supported in this mode. (When it is not provided, the images are extracted into the same directory as the input file.)


4. Next, I added a makeGSTikaCLI.sh script for compiling and a GSTikaCLI.sh script to slightly simplify running.


cd gs2build/ext/gstika
./makeGSTikaCLI.sh
./GSTikaCLI.sh --html-with-imgs <inputfile> > <outputfile>
e.g. ./GSTikaCLI.sh --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html
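
The compile and run commands in step 2 rely on tika-app-*.jar resolving to a single jar file. Inside double quotes the shell does not expand that pattern, so a wrapper script may prefer to resolve the jar name first and then build the classpath from it. This is a minimal sketch of that one step. It is not the content of GSTikaCLI.sh (which is not reproduced in this README), and gstika_classpath is a hypothetical name:

```shell
# gstika_classpath EXT_DIR
# Prints "<first lib/tika-app-*.jar found>:<EXT_DIR>/build" on stdout.
# Returns non-zero (with a message on stderr) if no such jar exists.
gstika_classpath() {
    ext_dir=$1
    for jar in "$ext_dir"/lib/tika-app-*.jar; do
        if [ -f "$jar" ]; then
            printf '%s:%s\n' "$jar" "$ext_dir/build"
            return 0
        fi
    done
    printf 'no tika-app jar found under %s/lib\n' "$ext_dir" >&2
    return 1
}
```

Possible usage from gs2build/ext/gstika:
    cp=$(gstika_classpath "`pwd`") && java -cp "$cp" org.greenstone.tika.GSTikaCLI --html-with-imgs --pretty-print --encoding=UTF-8 tmp/<file>.docx > tmp/<file>.html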


--------------------------------------------------------------
F. COMPILING TIKA FROM SOURCE
--------------------------------------------------------------

Refer to https://github.com/apache/tika

(a) Need Maven 3 to compile up Tika.
    export MAVEN_HOME=/Path/To/apache-maven3
    export PATH=$MAVEN_HOME/bin:$PATH

(b) Need to configure Maven to grab artifacts using https, since some are only available over https.
Refer to https://stackoverflow.com/questions/25393298/what-is-the-correct-way-of-forcing-maven-to-use-https-for-maven-central
which says to add the following to the <profiles> section of your $MAVEN_HOME/conf/settings.xml:

    <profile>
      <id>maven-https</id>
      <activation>
        <activeByDefault>true</activeByDefault>
      </activation>
      <repositories>
        <repository>
          <id>central</id>
          <url>https://repo1.maven.org/maven2</url>
          <snapshots>
            <enabled>false</enabled>
          </snapshots>
        </repository>
      </repositories>
      <pluginRepositories>
        <pluginRepository>
          <id>central</id>
          <url>https://repo1.maven.org/maven2</url>
          <snapshots>
            <enabled>false</enabled>
          </snapshots>
        </pluginRepository>
      </pluginRepositories>
    </profile>

(c) Grab tika from git and attempt to compile it with maven
    > git clone https://github.com/apache/tika.git
    > cd tika
    > mvn clean install
Takes 42-45 mins to compile up!
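
Because the build takes this long, it can be worth confirming the Maven version required in step (a) before starting. This is a minimal sketch that parses the first line of `mvn -version` output, assuming that line begins with "Apache Maven <version>". The name maven_major_version is a hypothetical helper:

```shell
# Reads `mvn -version` output on stdin and prints the major version number.
# Prints nothing if the first line does not look like "Apache Maven X.Y...".
maven_major_version() {
    sed -n '1s/^Apache Maven \([0-9][0-9]*\)\..*/\1/p'
}
```

Possible usage:
    [ "$(mvn -version | maven_major_version)" = "3" ] && mvn clean install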


This compiles up a version 2.0.0 tika-app jar file, whereas the precompiled downloadable jar is version 1.24.1.

Compiling this wasn't necessary to compile or run GSTikaCLI.java!
However, having the source code to base GSTikaCLI.java off of was useful.

--------------------------------------------------------------