Changeset 34172

Timestamp:
14.06.2020 19:11:13 (4 weeks ago)
Author:
ak19
Message:

Some minor improvements to the UnknownConverterPlugin settings for Tika's conversion (of docx files) to html. Also documenting the reasoning.

Location:
main/trunk/greenstone2
Files:
2 modified

  • main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml

    r34169 r34172
    @@ -87,5 +87,5 @@
         <!-- Configuring an UnknownConverterPlugin for docx processing with Tika -->
         <plugin name="UnknownConverterPlugin">
    -      <option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html %%INPUT_FILE &gt; %%OUTPUT"/>
    +      <option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html --pretty-print --encoding=UTF-8 %%INPUT_FILE &gt; %%OUTPUT"/>
           <option name="-convert_to" value="html"/>
           <option name="-mime_type" value="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
  • main/trunk/greenstone2/ext/tika/README.txt

    r34171 r34172
    @@ -14,5 +14,12 @@
     * Mime-types for docx and other office suite docs:
         https://stackoverflow.com/questions/4212861/what-is-a-correct-mime-type-for-docx-pptx-etc
    -
    +
    +* Tesseract for OCR with Tika:
    +https://dingyuliang.me/use-tika-1-14-extract-text-image-tesseract-ocr/
    +Use Tika 1.14 to extract text from image by Tesseract OCR
    +
    +* API usage examples - if modifying Tika code:
    +https://tika.apache.org/1.8/examples.html
    +https://stackoverflow.com/questions/38577468/convert-a-word-documents-to-html-with-embedded-images-by-tika
     
     --------------------------------------------------------------
     
    @@ -78,3 +85,79 @@
     <ACTUAL TEXT IN INPUT DOCUMENT OUTPUT HERE>
     
    +
     --------------------------------------------------------------
    +D. THE --encoding= FLAG TO TIKA
    +--------------------------------------------------------------
    +> java -jar tika-app-1.24.1.jar --help
    +  ...
    +  -eX or --encoding=X    Use output encoding X
    +  ...
    +
    +An unsupported encoding (e.g. --encoding=nonexistent) is not accepted: Tika warns and falls back to UTF-8.
    +The flag seems to be case-insensitive, e.g. --encoding=UTF-8, --encoding=utf-8, --encoding=iso-8859-1
    +
    +Since my tests only converted docs containing ASCII text using Tika, the only visible evidence
    +that the encoding flag is taken into account at all is in the XML declaration of xhtml output.
    +Xhtml is the default output format (or pass in -x or --xml to get xhtml out).
    +
    +
    +COMPARE, noting also the case of the encoding in the Tika command vs. in the output:
    +
    +(1) > java -jar tika-app-1.24.1.jar --encoding=utf-8 /Scratch/ak19/testword.docx
    +    <?xml version="1.0" encoding="utf-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    +    <head>
    +    <meta name="date" content="2013-09-18T02:46:00Z"/>
    +    ...
    +
    +(2) > java -jar tika-app-1.24.1.jar --encoding=UTF-8 /Scratch/ak19/testword.docx
    +    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    +    <head>
    +    ...
    +
    +(3) > java -jar tika-app-1.24.1.jar --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    +    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    +    <head>
    +    ...
    +
    +(4) > java -jar tika-app-1.24.1.jar --encoding=ISO-8859-1 /Scratch/ak19/testword.docx
    +    <?xml version="1.0" encoding="ISO-8859-1"?><html xmlns="http://www.w3.org/1999/xhtml">
    +    <head>
    +    ...
    +
    +(5) > java -jar tika-app-1.24.1.jar --encoding=nonexistent /Scratch/ak19/testword.docx
    +    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
    +    Warning: encoding "nonexistent" not supported, using UTF-8
    +    <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
    +    <head>
    +    ...
    +
    +(6) (Output to html)
    +    > java -jar tika-app-1.24.1.jar --encoding=nonexistent --html /Scratch/ak19/testword.docx
    +    Warning:  The encoding 'nonexistent' is not supported by the Java runtime.
    +    Warning: encoding "nonexistent" not supported, using UTF-8
    +    <html xmlns="http://www.w3.org/1999/xhtml">
    +    <head>
    +    ...
    +The warning sent to STDERR is the only indication that the encoding flag is taken into account
    +when the --html flag is set. The actual html output sent to STDOUT makes no mention of any
    +encoding in the file.
    +
    +(7) (Output to html case 2)
    +    > java -jar tika-app-1.24.1.jar --html --encoding=iso-8859-1 /Scratch/ak19/testword.docx
    +    <html xmlns="http://www.w3.org/1999/xhtml">
    +    <head>
    +    <meta name="date" content="2013-09-18T02:46:00Z"/>
    +    <meta name="Total-Time" content="5"/>
    +    ...
    +No warnings, but also no mention of the encoding in the html output.
    +
    +
    +The warning messages in (6) indicate that the output encoding is also taken into account when
    +the output format is set to html by passing the --html flag to Tika.
    +Since we use --html as the output format, and UTF-8 is the character encoding Greenstone prefers
    +to work with, it therefore seems meaningful to set --encoding=UTF-8.
    +
    +We also pass in --pretty-print, which is supposed to produce better formatted output.
    +
    +
    +--------------------------------------------------------------
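
The case-insensitivity and the UTF-8 fallback noted in section D are consistent with how the Java runtime itself resolves charset names. The following is a minimal sketch of that lookup, not Tika's actual code: whether Tika or its XML serializer goes through these calls is an assumption, and the class name EncodingCheck and the method resolve are hypothetical names used only for illustration. It also returns the JVM's canonical name, so it does not reproduce how Tika writes the encoding into the XML declaration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Tika or Greenstone.
class EncodingCheck {

    /**
     * Returns the JVM's canonical name for the requested charset,
     * or "UTF-8" when the JVM does not support the requested name.
     */
    static String resolve(String requested) {
        try {
            // Charset names are matched case-insensitively by the JVM.
            if (Charset.isSupported(requested)) {
                return Charset.forName(requested).name();
            }
        } catch (IllegalArgumentException e) {
            // Syntactically illegal (or null) charset name: fall through to the default.
        }
        // Assumed fallback, mirroring the "using UTF-8" warning seen in the tests above.
        return StandardCharsets.UTF_8.name();
    }

    public static void main(String[] args) {
        String[] names = {"utf-8", "UTF-8", "iso-8859-1", "ISO-8859-1", "nonexistent"};
        for (String name : names) {
            System.out.println(name + " -> " + resolve(name));
        }
    }
}
```

Running it with any standard JDK shows which of the names passed to --encoding the runtime accepts, which may help when choosing a value other than UTF-8 for the -exec_cmd option.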