Changeset 34191
- Timestamp: 2020-06-16T17:44:17+12:00 (4 years ago)
- File: 1 edited
main/trunk/greenstone2/collect/modelcol/etc/collectionConfig.xml
r34173 r34191
@@ -93,4 +93,22 @@
       <option name="-process_extension" value="docx"/>
     </plugin>
+    <!-- If you have Tesseract installed (for linux 64 bit machines, there's a tesseract tarball available
+         for download from http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk/tesseract-linux-x64.tar.gz
+         Untested: for windows, you can try installing Tesseract from Win binaries at https://github.com/UB-Mannheim/tesseract/wiki
+         For Windows and Mac, be sure to add tesseract's bin folder to your PATH and also set the TESSDATA_PREFIX environment variable to
+         the folder "tessdata" wherein you also need to have the "<3-letter-langcode>.traineddata" files for the languages you want
+         to OCR.) The Linux 64 bit tesseract extension tarball already does all this for you.
+         Once you have Tesseract installed, you can activate the following UnknownConverterPlugin to use Tika with Tesseract to OCR PDFs
+         that contain images by removing the XML comment symbols.
+    -->
+    <!--
+    <plugin name="UnknownConverterPlugin">
+      <option name="-exec_cmd" value="java -jar $GSDLHOME/ext/tika/tika-app-*.jar --config=$GSDLHOME/ext/tika/tika-config.xml --html %%INPUT_FILE > %%OUTPUT"/>
+      <option name="-convert_to" value="html"/>
+      <option name="-mime_type" value="application/pdf"/>
+      <option name="-srcicon" value="iconpdf"/>
+      <option name="-process_extension" value="pdf"/>
+    </plugin>
+    -->
     <plugin name="RTFPlugin"/>
     <plugin name="WordPlugin"/>
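The comment added in this changeset asks Windows and Mac users to put tesseract's bin folder on PATH and to point TESSDATA_PREFIX at a "tessdata" folder holding the needed "<3-letter-langcode>.traineddata" files. A minimal sketch of a check one might run before uncommenting the plugin follows. It is not part of Greenstone or of this changeset: the function name check_tesseract_setup and the sample language code "eng" are hypothetical choices made for illustration, and the executable name "tesseract" is an assumption about how the binary is installed.

```python
import os
import shutil
from pathlib import Path


def check_tesseract_setup(languages, env=None):
    """Return a list of problems found; an empty list means every check passed.

    Hypothetical helper, not part of Greenstone. `languages` is an iterable of
    3-letter language codes, `env` a mapping of environment variables
    (defaults to the current process environment).
    """
    env = os.environ if env is None else env
    problems = []

    # tesseract's bin folder must be on PATH so the executable can be found.
    if shutil.which("tesseract", path=env.get("PATH", "")) is None:
        problems.append("tesseract executable not found on PATH")

    # TESSDATA_PREFIX must name the existing "tessdata" folder.
    tessdata = env.get("TESSDATA_PREFIX")
    if not tessdata:
        problems.append("TESSDATA_PREFIX is not set")
        return problems
    tessdata_dir = Path(tessdata)
    if not tessdata_dir.is_dir():
        problems.append(f"TESSDATA_PREFIX is not a directory: {tessdata}")
        return problems

    # Each language to OCR needs its own <langcode>.traineddata file.
    for lang in languages:
        if not (tessdata_dir / f"{lang}.traineddata").is_file():
            problems.append(f"missing {lang}.traineddata in {tessdata}")
    return problems


if __name__ == "__main__":
    for problem in check_tesseract_setup(["eng"]):
        print(problem)
```

Run as a script, this prints one line per problem and nothing when the setup looks complete. According to the comment, the Linux 64 bit extension tarball already takes care of this setup, so the check is mainly of interest on Windows and Mac.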