Changeset 34172

Timestamp:

2020-06-14T19:11:13+12:00

Author:

ak19

Message:

Some minor improvements to the UnknownConverterPlugin settings for tika's conversion (of docx files) to html. Also documenting the reasoning.

Location:

main/trunk/greenstone2

Files:

2 edited

collect/modelcol/etc/collectionConfig.xml (modified) (1 diff)
ext/tika/README.txt (modified) (2 diffs)

main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml

r34169	r34172
87	87	<!-- Configuring an UnknownConverterPlugin for docx processing with Tika -->
88	88	<plugin name="UnknownConverterPlugin">
89		<option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html %%INPUT_FILE > %%OUTPUT"/>
	89	<option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html --pretty-print --encoding=UTF-8 %%INPUT_FILE > %%OUTPUT"/>
90	90	<option name="-convert_to" value="html"/>
91	91	<option name="-mime_type" value="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
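A minimal sketch of how the -exec_cmd template above might be expanded before it is run. It assumes that %%INPUT_FILE and %%OUTPUT are placeholders filled in with the source document and the conversion target, and that $GSDLHOME is the Greenstone install directory. The helper name `expand_exec_cmd` and the example paths are hypothetical; this is not Greenstone's own (Perl) implementation.

```python
import shlex

# The -exec_cmd value from r34172 of collectionConfig.xml.
TEMPLATE = (
    "java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar "
    "--html --pretty-print --encoding=UTF-8 %%INPUT_FILE > %%OUTPUT"
)


def expand_exec_cmd(template, input_file, output_file, gsdlhome):
    """Hypothetical helper: fill in the placeholders of an -exec_cmd template.

    File paths are shell-quoted so that names containing spaces survive.
    """
    cmd = template.replace("$GSDLHOME", gsdlhome)
    cmd = cmd.replace("%%INPUT_FILE", shlex.quote(input_file))
    cmd = cmd.replace("%%OUTPUT", shlex.quote(output_file))
    return cmd


# Example paths are assumptions, chosen only for illustration.
print(expand_exec_cmd(TEMPLATE, "/tmp/my doc.docx", "/tmp/out.html", "/opt/greenstone"))
```

The redirect (>) in the template means the expanded string has to be run through a shell, since Tika writes the converted html to STDOUT.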

main/trunk/greenstone2/ext/tika/README.txt

-              r34171
+              r34172
 * Mime-types for docx and other office suite docs:
     https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
+* Tesseract for OCR with Tika:
+https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
+Use Tika 1.14 to extract text from image by Tesseract OCR
+* API usage examples - if modifying Tika code:
+https://tika.apache.org/1.8/examples.html
+https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
 --------------------------------------------------------------
 …
 <ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
 --------------------------------------------------------------
+D. THE --encoding= FLAG TO TIKA
+--------------------------------------------------------------
+> java -jar tika-app-1.24.1.jar --help
+  ...
+  -eX or --encoding=X    Use output encoding X
+  ...
+Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika warns and falls back to UTF-8, see (5) below.
+The encoding name seems to be case-insensitive, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
+Since my tests only converted docs containing ASCII text with Tika,
+the effect of the encoding flag is only visible when the output is xhtml,
+which is the default (passing in -x or --xml also produces xhtml).
+COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
+(1) >java -jar tika-app-1.24.1.jar --encoding=utf-8 /Scratch/ak19/testword.docx
+  <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+  <meta name="date" content="2013-09-18T02:46:00Z"/>
+  ...
+(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
+    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
+    <head>
+    ...
+(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
+    <head>
+    ...
+(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
+    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
+    <head>
+     ...
+(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
+    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
+    Warning: encoding "nonexistent" not supported, using UTF-8
+    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
+    <head>
+    ...
+(6) (Output to html)
+    > java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
+    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
+    Warning: encoding "nonexistent" not supported, using UTF-8
+    <html xmlns="http://www.w3.org/1999/xhtml">
+    <head>
+    ...
+The warning on STDERR is all that indicates that the encoding flag is taken into account
+when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
+encoding.
+(7) (Output to html case 2)
+    > java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
+    <html xmlns="http://www.w3.org/1999/xhtml">
+    <head>
+    <meta name="date" content="2013-09-18T02:46:00Z"/>
+    <meta name="Total-Time" content="5"/>
+    ...
+No warnings, but also no mention of the encoding in the html output.
+The warning messages in (6) indicate that the output encoding is also taken into account when
+the output format is set to html, by passing in the flag --html to tika.
+Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
+to work with, it therefore seems meaningful to set --encoding=UTF-8.
+We also pass in --pretty-print, which is supposed to produce better formatted output.
+--------------------------------------------------------------
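The comparison in section D of the README can be automated with a small check. The following is a sketch only: it reads the encoding named in the XML declaration at the start of Tika's xhtml output, and returns None for --html output, which carries no declaration. The function name `declared_encoding` is hypothetical, and the sample strings are the first output lines quoted in the README.

```python
import re

# Matches an XML declaration and captures the value of its encoding attribute.
XML_DECL = re.compile(r'<\?xml[^>]*\bencoding="([^"]+)"')


def declared_encoding(output):
    """Return the encoding named in a leading XML declaration, or None."""
    match = XML_DECL.match(output.lstrip())
    return match.group(1) if match else None


# xhtml output (the default): the declaration echoes the flag, including its case.
xhtml_line = '<?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">'
# --html output: no declaration, so nothing to read back.
html_line = '<html xmlns="http://www.w3.org/1999/xhtml">'

print(declared_encoding(xhtml_line))
print(declared_encoding(html_line))
```

With --html there is nothing in the output to inspect, so the STDERR warnings remain the only visible sign that the flag was parsed.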
