Timestamp:
2020-06-16T17:20:50+12:00
Author:
ak19
Message:
  1. The tessdata folder was already being created when compiling tesseract, so it needn't be created and populated manually (except for the lang files), which leaves less work for CASCADE-MAKE/TESSERACT.sh to do. However, the tessdata folder was being created in the linux/share folder. 'share' is probably where people expect tesseract's tessdata to be by default, so I am updating the setup scripts to work with that location, as I've done with CASCADE-MAKE/TESSERACT.sh.
  2. Adding instructions for users on getting OCR support for more languages and scripts in the new file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt, now included in the tesseract binary tarball too. Adjusted the README for us.
  3. Removing sample.jpg, which was converted from a sample.tif that I'd downloaded from online and whose copyright status I don't know. Replacing it with a new sample.tif, a 96 DPI TIF file at 1870x2420 resolution produced from the first page of pdf05-notext.pdf by www.sejda.com/pdf-to-jpg. This sample file contains lots of text, in 2 columns, not just 4 words like the original sample file, so it is good for testing a tesseract built from CASCADE-MAKE. Also including pdf05-notext-ocr-with-tikaTesseract.pdf itself from the tutorial sample files. As indicated in the filename, only Tika with Tesseract can work on PDFs, not Tesseract by itself.
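  The per-language download that the script's closing messages describe could be scripted roughly as follows. This is only a sketch: the helper names `traineddata_url` and `fetch_langs` are hypothetical and are not part of TESSERACT.sh or of GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt, and it assumes `wget` is available and that TESSDATA_PREFIX points at the tessdata folder.

```shell
#!/bin/sh
# URL pattern taken from the messages printed by TESSERACT.sh.
TESSDATA_RAW_URL="https://github.com/tesseract-ocr/tessdata/raw/master"

# Print the download URL for one language code (e.g. "fra").
traineddata_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_RAW_URL" "$1"
}

# Hypothetical helper: download each requested language file
# into the folder named by TESSDATA_PREFIX.
fetch_langs() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX must be set" >&2
        return 1
    fi
    for lang in "$@"; do
        # -P tells wget which directory to save the file in.
        wget -P "$TESSDATA_PREFIX" "$(traineddata_url "$lang")" || return 1
    done
}
```

  For example, `fetch_langs fra deu` would attempt to fetch fra.traineddata and deu.traineddata into $TESSDATA_PREFIX.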
File:
1 edited

  • gs2-extensions/tesseract/trunk/src/packages/CASCADE-MAKE/TESSERACT.sh

    --- gs2-extensions/tesseract/trunk/src/packages/CASCADE-MAKE/TESSERACT.sh (r34186)
    +++ gs2-extensions/tesseract/trunk/src/packages/CASCADE-MAKE/TESSERACT.sh (r34190)
    @@ -45,28 +45,26 @@
     
     echo "Installing basic tesseract languages support (tessdata)"
    -cp $GEXTTESS_DEVEL/packages/tessdata-langs.tar.gz $GEXTTESS_INSTALLED/.
    -pushd $GEXTTESS_INSTALLED
    +# Untar OCR language support tarball one level above TESSDATA_PREFIX ($GEXTTESS_INSTALLED/shared),
    +# Then go into that folder to finish setting up language files.
    +cp $GEXTTESS_DEVEL/packages/tessdata-langs.tar.gz $TESSDATA_PREFIX/../.
    +pushd $TESSDATA_PREFIX/..
     tar -xvzf tessdata-langs.tar.gz
    +# Above creates linux/shared/tessdata-langs folder - move files there into
    +# linux/shared/tessdata (i.e. TESSDATA_PREFIX) and delete both tarball and temporary
    +# tessdata-langs folder created at current location of one level up from TESSDATA_PREFIX
    +mv tessdata-langs/*.traineddata $TESSDATA_PREFIX/.
     rm tessdata-langs.tar.gz
    -mkdir -p tessdata/tessconfigs
    +rm -rf tessdata-langs
     popd
     
    -# Not sure why source package's tessdata didn't get installed in installdir
    -# despite exporting TESSDATA_PREFIX at the start at cascade-make process.
    -cp -r $package$version/tessdata/configs $GEXTTESS_INSTALLED/tessdata/
    -cp $package$version/tessdata/eng.user-patterns $GEXTTESS_INSTALLED/tessdata/.
    -cp $package$version/tessdata/eng.user-words $GEXTTESS_INSTALLED/tessdata/.
    -cp $package$version/tessdata/tessconfigs/*batch* $GEXTTESS_INSTALLED/tessdata/tessconfigs/.
    -cp $package$version/tessdata/tessconfigs/*demo* $GEXTTESS_INSTALLED/tessdata/tessconfigs/.
     
    -
    -echo "Done installing basic tesseract languages"
    -echo "Visit https://github.com/tesseract-ocr/tessdata for a full list of trained language data."
    -echo "To download support for any specific language(s), note the 3 letter code of that language"
    -echo "Go into your $GEXTTESS_INSTALLED/tessdata and for each language run: "
    +echo "Done installing basic tesseract languages for OCR (Optical Character Recognition, to recognise text from images)."
    +echo "Visit https://github.com/tesseract-ocr/tessdata for a full list of trained language data for OCR."
    +echo "To download OCR support for any specific language(s), note the 3 letter code of that language"
    +echo "Go into your $TESSDATA_PREFIX folder and for each language you want OCR abilities for, run: "
     echo "   wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-lang-code>.traineddata"
    -echo "To get all languages currently supported by Tesseract, delete"
    -echo "$GEXTTESS_INSTALLED/tessdata"
    -echo "and in $GEXTTES_INSTALLED run:"
    +echo "To get all languages currently supported by Tesseract (beware, this may be a few Gigabytes), delete"
    +echo "$TESSDATA_PREFIX"
    +echo "and in $GEXTTES_INSTALLED/shared run:"
     echo "   git clone https://github.com/tesseract-ocr/tessdata"
     echo ""
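
The language-file steps added in r34190 (unpack the tarball one level above TESSDATA_PREFIX, move the .traineddata files into it, then clean up) can be sketched as a standalone function. This is an illustration only, not the code in TESSERACT.sh: the function name `install_tessdata_langs` and its two arguments are hypothetical, it extracts with `tar -C` instead of copying the tarball and using pushd/popd, and it runs `mkdir -p` on the target folder, which the real script does not need because compiling tesseract already creates tessdata. It assumes, as the diff comments state, that the tarball unpacks to a `tessdata-langs` folder.

```shell
#!/bin/sh
# Hypothetical helper, not part of TESSERACT.sh: unpack a language tarball
# next to the tessdata folder and move the .traineddata files into it.
install_tessdata_langs() {
    tarball=$1        # e.g. $GEXTTESS_DEVEL/packages/tessdata-langs.tar.gz
    tessdata_dir=$2   # e.g. the value of TESSDATA_PREFIX

    [ -f "$tarball" ] || { echo "No such tarball: $tarball" >&2; return 1; }
    # Assumption: create the target folder if it is missing.
    mkdir -p "$tessdata_dir" || return 1

    # Extract one level above the tessdata folder, without changing directory.
    parent_dir=$(dirname "$tessdata_dir")
    tar -xzf "$tarball" -C "$parent_dir" || return 1

    # Move the language files into place, then delete the temporary folder.
    mv "$parent_dir"/tessdata-langs/*.traineddata "$tessdata_dir"/ || return 1
    rm -rf "$parent_dir/tessdata-langs"
}
```

Quoting every path keeps the function working when the install location contains spaces, and returning early on failure avoids deleting the temporary folder before its files have been moved.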