source: main/trunk/greenstone2/ext/tika/GS-README.txt@ 37201

Last change on this file since 37201 was 35402, checked in by anupama, 3 years ago

File renaming tika ext's README to GS-README, as it was our own custom readme file and not any official one by tika.

--------------------------------------------------------------
About tika-app.jar:
--------------------------------------------------------------
The jar was last updated to version 1.24.1 (tika-app-1.24.1.jar).
The version number can be found in the final line of output when running,
on Windows:
 java -jar %GSDLHOME%\ext\tika\tika-app.jar --version
or on Linux:
 java -jar $GSDLHOME/ext/tika/tika-app.jar --version



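To pick out just that final line in a script, the output can be passed through a small filter. This is a minimal sketch and not part of Greenstone: last_nonempty_line is a hypothetical name, and it assumes a POSIX shell with awk available.

```shell
# Hypothetical helper: print the last non-empty line read from stdin.
last_nonempty_line() {
    awk 'NF { last = $0 } END { if (last != "") print last }'
}

# Usage on Linux, assuming GSDLHOME is set (not run here):
#   java -jar "$GSDLHOME/ext/tika/tika-app.jar" --version | last_nonempty_line
```

Blank and whitespace-only lines are skipped, so a trailing empty line in the output does not hide the version line.
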
--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
Refer to the heading "Using Tika as a command line utility" for the available command line options.

* https://tika.apache.org/download.html
is where tika-app-1.24.1.jar was downloaded from.
(We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license:
 https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
 https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
"Use Tika 1.14 to extract text from image by Tesseract OCR"

* API usage examples - if modifying Tika code:
https://tika.apache.org/1.8/examples.html
https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/tika>java -jar tika-app.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/tika>java -jar tika-app.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.txt

 a. PLAIN TEXT WITH META:

GS3/gs2build/ext/tika>java -jar tika-app.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.txt

 b. JUST META:

GS3/gs2build/ext/tika>java -jar tika-app.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.txt

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc.) into the specified dir (pass -z or --extract, then specify the dir with --extract-dir):
GS3/gs2build/ext/tika>java -jar tika-app.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx


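The output flags used in the examples above could be collected in a small wrapper. This is a sketch only: tika_flag and tika_cmd are hypothetical names, the mode names are made up for illustration, and the command is printed rather than run so that it can be checked first.

```shell
# Hypothetical helper: map an output mode to the Tika flag used in section B.
tika_flag() {
    case "$1" in
        html)      echo "--html" ;;
        xhtml)     echo "--xml" ;;
        text-main) echo "--text-main" ;;
        text)      echo "--text" ;;
        meta)      echo "--metadata" ;;
        *)         echo "unknown mode: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the Tika command for a given mode and input file.
tika_cmd() {
    flag=$(tika_flag "$1") || return 1
    echo "java -jar tika-app.jar $flag $2"
}

tika_cmd html /PATH/TO/testword.docx
```

Image extraction is left out of the wrapper on purpose, since it takes --extract-dir as well and, as noted in 4, can't be combined with the other output flags in one step.
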
--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/tika>java -jar tika-app.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>


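When only the document text is wanted, the log messages can be kept out of the captured output. The sketch below assumes the WARNING lines are written to STDERR (as the encoding warnings in section D are) and has not been checked against every message Tika prints. fake_tika is a hypothetical stand-in, so the redirection can be tried without the jar.

```shell
# Hypothetical stand-in for a Tika run: a warning on stderr, text on stdout.
fake_tika() {
    echo "WARNING: J2KImageReader not loaded." >&2
    echo "ACTUAL TEXT IN INPUT DOCUMENT"
}

# Capture only stdout; send the warnings to a log file instead of the terminal.
log=$(mktemp)
text=$(fake_tika 2>"$log")

echo "$text"
```

With the real jar the same redirection would be written as, for example,
 java -jar tika-app.jar --text-main /PATH/TO/testword.docx 2>tika-warnings.log
where tika-warnings.log is just an example file name.
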
--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
> java -jar tika-app.jar --help
 ...
 -eX or --encoding=X Use output encoding X
 ...

Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika warns and falls back to UTF-8, see (5).
The flag seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my tests only converted docs that contain ASCII, the only visible sign that the encoding
flag has been taken into account at all is in the xhtml output, which is the default
(-x or --xml can also be passed in to get xhtml out).


COMPARE, noting also the case of the encoding in the Tika command vs in the output:

(1) >java -jar tika-app.jar --encoding=utf-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    ...

(2) >java -jar tika-app.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(3) >java -jar tika-app.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(4) >java -jar tika-app.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(5) >java -jar tika-app.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...

(6) (Output to html)
    > java -jar tika-app.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    Warning: The encoding 'nonexistent' is not supported by the Java runtime.
    Warning: encoding "nonexistent" not supported, using UTF-8
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    ...
The warning to STDERR is all that indicates that the encoding flag is taken into account
when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
encoding in the file.

(7) (Output to html case 2)
    > java -jar tika-app.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <meta name="date" content="2013-09-18T02:46:00Z"/>
    <meta name="Total-Time" content="5"/>
    ...
No warnings, but also no mention of the encoding in the html output.


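To compare the declared encoding across runs like (1) to (5) without reading the output by eye, the XML declaration can be picked apart with sed. A sketch only: declared_encoding is a hypothetical name, and it looks at the first line of its input and nothing else.

```shell
# Hypothetical helper: print the encoding named in an XML declaration on the
# first line of stdin; print nothing if that line has no declaration.
declared_encoding() {
    sed -n '1s/^<?xml[^>]*encoding="\([^"]*\)".*/\1/p'
}

# Fed with the first output line from (3):
printf '%s\n' '<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">' | declared_encoding
```

For --html output as in (6) and (7) the helper prints nothing, matching the observation that the html output makes no mention of the encoding.
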
The warning messages in (6) indicate that the output encoding is also taken into account when
the output format is set to html by passing the --html flag to tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it therefore seems meaningful to set --encoding=UTF-8.

We also pass in --pretty-print, which is supposed to give better formatted output.


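Putting those choices together, the resulting invocation would look something like the following. This is a sketch only: it is not claimed to be the exact command Greenstone runs, tika_html_cmd is a hypothetical name, and the command is printed rather than executed.

```shell
# Hypothetical helper: print a Tika command asking for HTML output in UTF-8
# with pretty printing, using only flags discussed in this file.
tika_html_cmd() {
    echo "java -jar tika-app.jar --html --encoding=UTF-8 --pretty-print $1"
}

tika_html_cmd /PATH/TO/testword.docx
```
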
--------------------------------------------------------------