Changeset 34169

Timestamp:
14.06.2020 03:40:21 (3 weeks ago)
Author:
ak19
Message:

Everything GS3 needs to convert docx files to basic HTML (no images) out of the box.
1. Adding the Tika jar with its Apache 2.0 licence, a handcrafted notice derived from the licence, and a Readme with links and examples of its use.
2. Updating the model collectionConfig.xml with a pre-configured UnknownConverterPlugin that uses the Tika jar to convert docx to basic HTML, so all future GS3 collections will have this set up in the document pipeline and be ready for docx files.
When the chance arises, need to set up a model collection for GS2 that uses the UnknownConverterPlugin in this way too.

Location:
main/trunk/greenstone2
Files:
5 added
1 modified

  • main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml

    --- r33740
    +++ r34169
    @@ -85,4 +85,12 @@
               <plugin name="EmailPlugin"/>
               <plugin name="PDFv2Plugin"/>
    +          <!-- Configuring an UnknownConverterPlugin for docx processing with Tika -->
    +          <plugin name="UnknownConverterPlugin">
    +            <option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html %%INPUT_FILE &gt; %%OUTPUT"/>
    +            <option name="-convert_to" value="html"/>
    +            <option name="-mime_type" value="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
    +            <option name="-srcicon" value="icondocx"/>
    +            <option name="-process_extension" value="docx"/>
    +          </plugin>
               <plugin name="RTFPlugin"/>
               <plugin name="WordPlugin"/>
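
To illustrate what the -exec_cmd option above amounts to for a single file, here is a minimal Python sketch that fills in the command template by hand. It is not how the plugin itself is implemented: the helper name expand_exec_cmd, the example paths, and the assumption that $GSDLHOME, %%INPUT_FILE and %%OUTPUT are replaced by plain string substitution are all hypothetical and only serve to show the resulting command line. The template string is copied from the config, with &gt; unescaped to >.

```python
import shlex

# Command template copied from the -exec_cmd option in collectionConfig.xml
# (the XML entity &gt; is written here as a literal > redirect).
EXEC_CMD = (
    "java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar "
    "--html %%INPUT_FILE > %%OUTPUT"
)


def expand_exec_cmd(template, gsdlhome, input_file, output_file):
    """Hypothetical helper: substitute the placeholders in the template.

    Assumption: simple textual replacement. Each value is shell-quoted so
    that paths containing spaces remain a single shell word.
    """
    return (
        template
        .replace("$GSDLHOME", shlex.quote(gsdlhome))
        .replace("%%INPUT_FILE", shlex.quote(input_file))
        .replace("%%OUTPUT", shlex.quote(output_file))
    )


if __name__ == "__main__":
    # Example paths are made up for illustration.
    print(expand_exec_cmd(EXEC_CMD, "/opt/greenstone", "report.docx", "report.html"))
```

Running the printed command manually in a shell (with a real Greenstone install path and a real docx file) is one way to check that the Tika jar produces HTML before building a collection.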