Timestamp:
2020-06-14T19:11:13+12:00
Author:
ak19
Message:

Some minor improvements to the UnknownConverterPlugin settings for tika's conversion (of docx files) to html. Also documenting the reasoning.

File:
1 edited

  • main/trunk/greenstone2/ext/tika/README.txt

    r34171 r34172  
    1414* Mime-types for docx and other office suite docs: 
    1515    https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
    16    
     16
     17* Tesseract for OCR with Tika:
     18https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
     19Use Tika 1.14 to extract text from image by Tesseract OCR
     20
     21* API usage examples - if modifying Tika code:
     22https://tika.apache.org/1.8/examples.html
     23https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
    1724
    1825--------------------------------------------------------------
     
    7885<ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
    7986
     87
    8088--------------------------------------------------------------
     89D. THE --encoding= FLAG TO TIKA
     90--------------------------------------------------------------
     91> java -jar tika-app-1.24.1.jar --help
     92  ...
     93  -eX or --encoding=X    Use output encoding X
     94  ...
     95
     96Invalid encodings (e.g. --encoding=nonexistent) are not accepted: Tika warns and falls back to UTF-8.
     97The encoding name seems to be insensitive to case, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
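
The "not supported by the Java runtime" warning quoted in (5) below suggests the encoding name is resolved by the JVM. The following is a minimal sketch, not Tika's own code: the class and method names are made up for illustration, and it only shows how java.nio.charset.Charset treats the names tested here. It does not explain how the name is echoed in the xml declaration.

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of Tika or Greenstone.
class EncodingNameCheck {

    // Returns the canonical name the Java runtime uses for the given
    // encoding label, or null if the runtime does not support the label.
    static String canonicalName(String label) {
        try {
            return Charset.isSupported(label) ? Charset.forName(label).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names
            return null;
        }
    }

    public static void main(String[] args) {
        String[] labels = {"utf-8", "UTF-8", "iso-8859-1", "ISO-8859-1", "nonexistent"};
        for (String label : labels) {
            System.out.println(label + " -> " + canonicalName(label));
        }
    }
}
```

Charset lookup in Java is case-insensitive, which is consistent with the behaviour observed for the --encoding flag.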
     98
     99Since my tests only used Tika to convert docs that contain ASCII,
     100the only case where the output itself shows that the encoding flag is taken into account is when the output is
     101xhtml, which is the default (or pass in -x or --xml to get xhtml out).
     102
     103
     104COMPARE, noting also the case of the encoding in the Tika command, vs in the output:
     105
     106(1) >java -jar tika-app-1.24.1.jar --encoding=utf-8 /Scratch/ak19/testword.docx
     107  <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
     108  <head>
     109  <meta name="date" content="2013-09-18T02:46:00Z"/>
     110  ...
     111
     112(2) >java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
     113    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
     114    <head>
     115    ...
     116
     117(3) >java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
     118    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
     119    <head>
     120    ...
     121 
     122(4) >java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
     123    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
     124    <head>
     125     ...
     126
     127(5) >java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
     128    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
     129    Warning: encoding "nonexistent" not supported, using UTF-8
     130    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
     131    <head>
     132    ...
     133
     134(6) (Output to html)
     135    > java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
     136    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
     137    Warning: encoding "nonexistent" not supported, using UTF-8
     138    <html xmlns="http://www.w3.org/1999/xhtml">
     139    <head>
     140    ...
     141The warning to STDERR is all that indicates that the encoding flag is taken into account
     142when the --html flag is turned on. The actual html output sent to STDOUT makes no mention of any
     143encoding in the file.
     144
     145(7) (Output to html case 2)
     146    > java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
     147    <html xmlns="http://www.w3.org/1999/xhtml">
     148    <head>
     149    <meta name="date" content="2013-09-18T02:46:00Z"/>
     150    <meta name="Total-Time" content="5"/>
     151    ...
     152No warnings, but also no mention of the encoding in the html output.
     153
     154
     155The warning messages in (6) indicate that the output encoding is also taken into account when
     156the output format is set to html, by passing in the flag --html to tika.
     157Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
     158to work with, it therefore seems meaningful to set --encoding=UTF-8.
     159
     160Also passing in --pretty-print to get supposedly better formatted output.
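
As a rough sketch of the resulting invocation, the flags settled on above (--html, --encoding=UTF-8, --pretty-print) can be assembled and run as below. This is only an illustration: the class and method names, the jar location and the input path are assumptions, and Greenstone itself passes these flags through the UnknownConverterPlugin settings rather than through Java code like this.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper, not part of Tika or Greenstone.
class TikaCommandSketch {

    // Builds the command line: html output, UTF-8 output encoding, pretty printing.
    static List<String> buildCommand(String tikaJar, String inputFile) {
        return new ArrayList<>(Arrays.asList(
                "java", "-jar", tikaJar,
                "--html", "--encoding=UTF-8", "--pretty-print",
                inputFile));
    }

    // Runs Tika and decodes its STDOUT as UTF-8, matching --encoding=UTF-8.
    // Warnings (such as the unsupported-encoding ones) stay on STDERR.
    static String convert(String tikaJar, String inputFile)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(buildCommand(tikaJar, inputFile));
        pb.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = pb.start();
        byte[] out = process.getInputStream().readAllBytes(); // Java 9+
        process.waitFor();
        return new String(out, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        // Assumed usage: java TikaCommandSketch <path-to-tika-app.jar> <input-file>
        if (args.length == 2) {
            System.out.println(convert(args[0], args[1]));
        } else {
            System.out.println(String.join(" ",
                    buildCommand("tika-app-1.24.1.jar", "testword.docx")));
        }
    }
}
```

Decoding STDOUT with the same charset that was passed to --encoding keeps the two ends consistent, which matters once the input contains non-ASCII text.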
     161
     162
     163--------------------------------------------------------------