Timestamp: 2020-06-16T15:00:39+12:00
Author: ak19
Message:

To get Tika + Tesseract to OCR PDFs (Tesseract can't OCR PDFs on its own), Tika needs to be passed a tika-config.xml file that is configured to use either txt or hocr as outputType. If outputType=hocr, the tesseract/tessdata/configs folder must contain, at minimum, a file called hocr. The build process now ensures that tessdata/configs and the other tessdata subfolders in the extracted tesseract source package get copied across into the GEXTTESS_INSTALLED install location. Also updating the README with these notes, and the tesseract bin tarball.
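The hocr requirement described above can be checked before Tika is pointed at the install. The following is a minimal sketch, not part of TESSERACT.sh: the function name `has_hocr_config` is hypothetical, and it assumes only that the installed tessdata lives under `$GEXTTESS_INSTALLED/tessdata`, as in the build script.

```shell
#!/bin/sh
# Hypothetical helper (not part of TESSERACT.sh): succeeds only if the given
# tessdata directory contains the configs/hocr file that outputType=hocr needs.
has_hocr_config() {
    tessdata_dir="$1"
    [ -f "$tessdata_dir/configs/hocr" ]
}

# Example use, assuming GEXTTESS_INSTALLED is set as in the build scripts.
# Nothing is printed if the variable is unset.
if [ -n "${GEXTTESS_INSTALLED:-}" ]; then
    if has_hocr_config "$GEXTTESS_INSTALLED/tessdata"; then
        echo "hocr config present"
    else
        echo "hocr config missing: use outputType=txt or copy tessdata/configs"
    fi
fi
```

A check like this only confirms that the file exists; it does not verify that Tika's configuration actually selects hocr output.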

File: 1 edited

  • gs2-extensions/tesseract/trunk/src/packages/CASCADE-MAKE/TESSERACT.sh

--- TESSERACT.sh (r34178)
+++ TESSERACT.sh (r34186)
@@ -25,5 +25,5 @@
 opt_run_untar $force_untar $auto_untar $package $version
 
-# Need to do this for TESSERACT, before we can do configure/make/make install
+# Need to do this for TESSERACT, before we can do configure->make->make install
 pushd $package$version;
 libtoolize
@@ -47,13 +47,26 @@
 cp $GEXTTESS_DEVEL/packages/tessdata-langs.tar.gz $GEXTTESS_INSTALLED/.
 pushd $GEXTTESS_INSTALLED
-#mkdir tessdata
 tar -xvzf tessdata-langs.tar.gz
 rm tessdata-langs.tar.gz
+mkdir -p tessdata/tessconfigs
 popd
+
+# Not sure why source package's tessdata didn't get installed in installdir
+# despite exporting TESSDATA_PREFIX at the start at cascade-make process.
+cp -r $package$version/tessdata/configs $GEXTTESS_INSTALLED/tessdata/
+cp $package$version/tessdata/eng.user-patterns $GEXTTESS_INSTALLED/tessdata/.
+cp $package$version/tessdata/eng.user-words $GEXTTESS_INSTALLED/tessdata/.
+cp $package$version/tessdata/tessconfigs/*batch* $GEXTTESS_INSTALLED/tessdata/tessconfigs/.
+cp $package$version/tessdata/tessconfigs/*demo* $GEXTTESS_INSTALLED/tessdata/tessconfigs/.
+
 
 echo "Done installing basic tesseract languages"
 echo "Visit https://github.com/tesseract-ocr/tessdata for a full list of trained language data."
-echo "To download support for any other language(s), note the 3 letter code of that language"
+echo "To download support for any specific language(s), note the 3 letter code of that language"
 echo "Go into your $GEXTTESS_INSTALLED/tessdata and for each language run: "
-echo "wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-lang-code>.traineddata"
+echo "   wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-lang-code>.traineddata"
+echo "To get all languages currently supported by Tesseract, delete"
+echo "$GEXTTESS_INSTALLED/tessdata"
+echo "and in $GEXTTES_INSTALLED run:"
+echo "   git clone https://github.com/tesseract-ocr/tessdata"
 echo ""
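The per-language download that the echoed instructions describe could be scripted roughly as follows. This is a sketch under stated assumptions: `traineddata_url` and `fetch_langs` are hypothetical helper names that do not exist in TESSERACT.sh, and the URL pattern is copied from the script's own echo line without checking that the branch name is still current. Note also that the added echo line reading `and in $GEXTTES_INSTALLED run:` spells the variable with one S; this looks like a typo for `$GEXTTESS_INSTALLED` and would most likely expand to an empty string.

```shell
#!/bin/sh
# Hypothetical helper: print the download URL for one 3-letter language code,
# following the URL pattern echoed by TESSERACT.sh.
traineddata_url() {
    echo "https://github.com/tesseract-ocr/tessdata/raw/master/$1.traineddata"
}

# Hypothetical helper: download each listed language into a tessdata directory.
# Usage: fetch_langs <tessdata-dir> <lang> [<lang> ...]
fetch_langs() {
    dest="$1"
    shift
    for lang in "$@"; do
        # wget -P saves the file under the given directory prefix.
        wget -P "$dest" "$(traineddata_url "$lang")" || return 1
    done
}
```

For example, `fetch_langs "$GEXTTESS_INSTALLED/tessdata" fra deu` would attempt to fetch French and German, stopping at the first failed download.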