Timestamp:
2020-06-14T03:40:21+12:00
Author:
ak19
Message:

Everything GS3 needs to convert docx files to basic HTML (no images) out of the box.
1. Adds the Tika jar with its Apache 2.0 licence, a handcrafted notice derived from the licence, and a Readme with links and examples of its use.
2. Updates the model collectionConfig.xml with a pre-configured UnknownConverterPlugin that uses the Tika jar to convert docx to basic HTML, so all future GS3 collections will have this set up in the document pipeline and be ready for docx files.
Still to do when the chance arises: set up a model collection for GS2 that uses the UnknownConverterPlugin in the same way.

File:
1 edited

  • main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml

    r33740  r34169
    85      85      <plugin name="EmailPlugin"/>
    86      86      <plugin name="PDFv2Plugin"/>
            87      <!-- Configuring an UnknownConverterPlugin for docx processing with Tika -->
            88      <plugin name="UnknownConverterPlugin">
            89        <option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar --html %%INPUT_FILE &gt; %%OUTPUT"/>
            90        <option name="-convert_to" value="html"/>
            91        <option name="-mime_type" value="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
            92        <option name="-srcicon" value="icondocx"/>
            93        <option name="-process_extension" value="docx"/>
            94      </plugin>
    87      95      <plugin name="RTFPlugin"/>
    88      96      <plugin name="WordPlugin"/>
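
To make the -exec_cmd value above easier to read, here is a minimal Python sketch of how the command template might look once its placeholders are filled in. It is an illustration only, not Greenstone's own (Perl) implementation: the function name expand_exec_cmd, the example paths, and the explicit substitution of $GSDLHOME (which in practice would presumably be resolved from the environment) are all assumptions made for this example. The only details taken from the changeset are the template string and the %%INPUT_FILE and %%OUTPUT placeholders.

```python
import shlex

# Command template copied from the -exec_cmd option in the diff, with the
# XML entity &gt; decoded to a literal ">" (shell output redirection).
EXEC_CMD = (
    "java -jar $GSDLHOME/ext/tika/tika-app-1.24.1.jar "
    "--html %%INPUT_FILE > %%OUTPUT"
)


def expand_exec_cmd(template, gsdlhome, input_file, output_file):
    """Return the template with its placeholders filled in.

    Hypothetical helper for illustration; file names are shell-quoted so
    that names containing spaces stay a single argument.
    """
    replacements = {
        "$GSDLHOME": gsdlhome,
        "%%INPUT_FILE": shlex.quote(input_file),
        "%%OUTPUT": shlex.quote(output_file),
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template


if __name__ == "__main__":
    # Example paths are made up; substitute your own installation directory.
    print(expand_exec_cmd(EXEC_CMD, "/path/to/gsdlhome",
                          "my report.docx", "my report.html"))
```

Running the resulting command by hand against a sample docx file is a quick way to check that the Tika jar is in place before rebuilding a collection.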