--------------------------------------------------------------
A. Some background information on Apache Tika and related:
--------------------------------------------------------------
* https://tika.apache.org/1.5/gettingstarted.html
  Refer to the heading "Using Tika as a command line utility" for the available command line options.

* https://tika.apache.org/download.html
  is where the tika-app-1.24.1.jar was downloaded from.
  (We don't need any of the other jars, as explained under the heading "Build artifacts" at https://tika.apache.org/1.5/gettingstarted.html)

* Apache 2.0 license
  https://tika.apache.org/license.html

* Mime-types for docx and other office suite docs:
  https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc

* Tesseract for OCR with Tika:
  https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
  Use Tika 1.14 to extract text from image by Tesseract OCR

* API usage examples - if modifying Tika code:
  https://tika.apache.org/1.8/examples.html
  https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika

--------------------------------------------------------------
B. Here are some examples of running Tika on the command line:
--------------------------------------------------------------
1. HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --html /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.htm

2. XHTML - looks the same as HTML:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --xml /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

3. PLAIN TEXT CONTENT - NO META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

a. PLAIN TEXT WITH META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

b. JUST META:

GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --metadata /PATH/TO/testword.docx > /PATH/TO/GS3/gs2build/ext/tmp/testword.html

4. IMAGES (CAN'T DO HTML + IMAGES IN ONE STEP by throwing in any of the above flags in addition):

Extracts all attachments (images etc.) into the specified dir: pass -z or --extract, then specify the dir with --extract-dir=
GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --extract --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx


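Since item 4 notes that HTML and images can't be produced in one invocation, the two steps can be wrapped together. The following is a minimal sketch and not part of the Greenstone code: the function name tika_html_and_images and the TIKA_JAVA / TIKA_JAR variables are hypothetical, and only the --html, --extract and --extract-dir flags shown above are used.

```shell
# Hypothetical wrapper: run Tika twice, once for HTML and once for attachments.
# TIKA_JAVA and TIKA_JAR are assumed names, overridable from the environment.
TIKA_JAVA="${TIKA_JAVA:-java}"
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: tika_html_and_images <input doc> <output dir>
tika_html_and_images() {
    local doc="$1" outdir="$2"
    local base
    base="$(basename "${doc%.*}")"     # /PATH/TO/testword.docx -> testword

    mkdir -p "$outdir" || return 1
    # Step 1: HTML goes to STDOUT, so redirect it into the output dir
    "$TIKA_JAVA" -jar "$TIKA_JAR" --html "$doc" > "$outdir/$base.html" || return 1
    # Step 2: attachments (images etc.) are written into the same dir
    "$TIKA_JAVA" -jar "$TIKA_JAR" --extract --extract-dir="$outdir" "$doc"
}
```

For example: tika_html_and_images /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp
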
--------------------------------------------------------------
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
--------------------------------------------------------------
* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar -z --extract-dir=/PATH/TO/GS3/gs2build/ext/tmp /PATH/TO/testword.docx

INFO As a convenience, TikaCLI has turned on extraction of
inline images for the PDFParser (TIKA-2374).
Aside from the -z option, this is not the default behavior
in Tika generally or in tika-server.
Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:28:17 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.


* GS3/gs2build/ext/tika>java -jar tika-app-1.24.1.jar --text-main /PATH/TO/testword.docx

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Jun 14, 2020 1:29:42 AM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>


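The INFO and WARNING lines above are diagnostics rather than document content. Assuming Tika writes them to STDERR (as section D notes for the encoding warning) while the extracted text goes to STDOUT, the two streams can be captured separately. A minimal sketch under that assumption; the function name run_tika_split and the TIKA_JAVA / TIKA_JAR variables are hypothetical.

```shell
# TIKA_JAVA and TIKA_JAR are assumed names, overridable from the environment.
TIKA_JAVA="${TIKA_JAVA:-java}"
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: run_tika_split <logfile> <outfile> <tika args...>
run_tika_split() {
    local log="$1" out="$2"
    shift 2
    # STDOUT (document content) -> $out, STDERR (INFO/WARNING lines) -> $log
    "$TIKA_JAVA" -jar "$TIKA_JAR" "$@" > "$out" 2> "$log"
}
```

For example: run_tika_split tika.log testword.txt --text-main /PATH/TO/testword.docx
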
--------------------------------------------------------------
D. THE --encoding= FLAG TO TIKA
--------------------------------------------------------------
> java -jar tika-app-1.24.1.jar --help
...
-eX or --encoding=X Use output encoding X
...

Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika prints a warning and falls back to UTF-8, see (5).
The flag seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1

Since my tests only converted docs containing ASCII with Tika,
the effect of the encoding flag is only visible when the output is xhtml, where the XML declaration names the encoding.
xhtml is the default (or pass in -x or --xml to get xhtml out).


COMPARE, noting also the case of the encoding in the Tika command vs. in the output:

(1) >java -jar tika-app-1.24.1.jar --encoding=utf-8 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2013-09-18T02:46:00Z"/>
...

(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
Warning: The encoding 'nonexistent' is not supported by the Java runtime.
Warning: encoding "nonexistent" not supported, using UTF-8
<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
...

(6) (Output to html)
> java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
Warning: The encoding 'nonexistent' is not supported by the Java runtime.
Warning: encoding "nonexistent" not supported, using UTF-8
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
...
The warning to STDERR is all that indicates that the encoding flag is taken into account
when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
encoding.

(7) (Output to html case 2)
> java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="date" content="2013-09-18T02:46:00Z"/>
<meta name="Total-Time" content="5"/>
...
No warnings, but also no mention of the encoding in the html output.


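Because Tika falls back to UTF-8 rather than failing, as in (5) and (6), a caller that wants to catch a mistyped encoding has to look at STDERR. A minimal sketch under that assumption: the function name tika_strict_encoding and the TIKA_JAVA / TIKA_JAR variables are hypothetical, and the match string is taken from the warning text shown in (5) and (6).

```shell
# TIKA_JAVA and TIKA_JAR are assumed names, overridable from the environment.
TIKA_JAVA="${TIKA_JAVA:-java}"
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: tika_strict_encoding <encoding> <outfile> <tika args...>
# Returns 2 if Tika reported the encoding as unsupported,
# otherwise the exit status of the Tika run itself.
tika_strict_encoding() {
    local enc="$1" out="$2" errfile status
    shift 2
    errfile="$(mktemp)" || return 1
    "$TIKA_JAVA" -jar "$TIKA_JAR" --encoding="$enc" "$@" > "$out" 2> "$errfile"
    status=$?
    cat "$errfile" >&2                  # pass the diagnostics through unchanged
    if grep -qF "encoding \"$enc\" not supported" "$errfile"; then
        status=2                        # Tika fell back to UTF-8
    fi
    rm -f "$errfile"
    return "$status"
}
```

For example: tika_strict_encoding UTF-8 testword.html --html /Scratch/ak19/testword.docx
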
The warning messages in (6) indicate that the output encoding is also taken into account when
the output format is set to html by passing the --html flag to Tika.
Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
to work with, it therefore makes sense to set --encoding=UTF-8.

We also pass in --pretty-print, which is meant to give better formatted output.


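Putting the conclusions of this section together, the invocation would combine --html, --encoding=UTF-8 and --pretty-print. This is a sketch only: the function name tika_to_utf8_html and the TIKA_JAVA / TIKA_JAR variables are hypothetical and the paths are placeholders; the flags are the ones discussed above.

```shell
# TIKA_JAVA and TIKA_JAR are assumed names, overridable from the environment.
TIKA_JAVA="${TIKA_JAVA:-java}"
TIKA_JAR="${TIKA_JAR:-tika-app-1.24.1.jar}"

# Usage: tika_to_utf8_html <input doc> <output html>
tika_to_utf8_html() {
    # HTML output, UTF-8 output encoding, pretty-printed, written to $2
    "$TIKA_JAVA" -jar "$TIKA_JAR" --html --encoding=UTF-8 --pretty-print "$1" > "$2"
}
```

For example: tika_to_utf8_html /PATH/TO/testword.docx /PATH/TO/GS3/gs2build/ext/tmp/testword.html
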
--------------------------------------------------------------