Changeset 34191


Timestamp:
2020-06-16T17:44:17+12:00
Author:
ak19
Message:

Model collectionConfig.xml with a commented-out UnknownConverterPlugin configured to use Tika with Tesseract to OCR PDFs consisting of images.

File:
1 edited

  • main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml

    r34173 r34191  
    93 93             <option name="-process_extension" value="docx"/>
    94 94           </plugin>
     95            <!-- If you have Tesseract installed (for linux 64 bit machines, there's a tesseract tarball available
     96                 for download from http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk/tesseract-linux-x64.tar.gz
     97                 Untested: for windows, you can try installing Tesseract from Win binaries at https://github.com/UB-Mannheim/tesseract/wiki
     98                 For Windows and Mac, be sure to add tesseract's bin folder to your PATH and also set the TESSDATA_PREFIX environment variable to
     99                 the folder "tessdata" wherein you also need to have the "<3-letter-langcode>.traineddata" files for the languages you want
     100                 to OCR.) The Linux 64 bit tesseract extension tarball already does all this for you.
     101                 Once you have Tesseract installed, you can activate the following UnknownConverterPlugin to use Tika with Tesseract to OCR PDFs
     102                 that contain images by removing the XML comment symbols.
     103            -->
     104            <!--
     105            <plugin name="UnknownConverterPlugin">
     106                <option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-*.jar &#45;&#45;config=$GSDLHOME/ext/tika/tika-config.xml &#45;&#45;html %%INPUT_FILE > %%OUTPUT"/>
     107                <option name="-convert_to" value="html"/>
     108                <option name="-mime_type" value="application/pdf"/>
     109                <option name="-srcicon" value="iconpdf"/>
     110                <option name="-process_extension" value="pdf"/>
     111            </plugin>
     112            -->
    95 113           <plugin name="RTFPlugin"/>
    96 114           <plugin name="WordPlugin"/>
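
For reference, a sketch of what the activated plugin would look like once the XML comment symbols on new lines 104 and 112 are removed, as the added comment describes. This is the same block as in the diff, unchanged apart from the removed comment markers and the added explanatory comment; it assumes Tesseract is installed and that the Tika jar and tika-config.xml live under $GSDLHOME/ext/tika, as the command line implies. It has not been tested here.

```xml
<!-- Sketch only: the UnknownConverterPlugin block from r34191 with its
     enclosing comment markers removed. The &#45;&#45; character references
     can stay as written; in an attribute value they resolve to "--". -->
<plugin name="UnknownConverterPlugin">
    <option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-*.jar &#45;&#45;config=$GSDLHOME/ext/tika/tika-config.xml &#45;&#45;html %%INPUT_FILE > %%OUTPUT"/>
    <option name="-convert_to" value="html"/>
    <option name="-mime_type" value="application/pdf"/>
    <option name="-srcicon" value="iconpdf"/>
    <option name="-process_extension" value="pdf"/>
</plugin>
```

The hyphens are written as &#45;&#45; in the changeset presumably because a literal double hyphen is not permitted inside an XML comment, so the commented-out form would otherwise make collectionConfig.xml ill-formed.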