2020-06-16T17:20:50+12:00
  1. The tessdata folder was being created when compiling tesseract, and needn't be created and populated manually (except for the lang files), so there's less work for CASCADE-MAKE/TESSERACT.sh to do. However, the tessdata folder was being created in the linux/share folder. 'share' is probably where people expect tesseract's tessdata to be by default, so am updating the setup scripts to work with that, as I've done with CASCADE-MAKE/TESSERACT.sh.
  2. Adding useful instructions for users on getting support for more OCR languages and scripts in new file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt, now included in the tesseract binary tarball too. Adjusted the README for us.
  3. Removing the sample.jpg, converted from sample.tif which I'd downloaded online and whose copyright status I don't know. Replacing with sample.tif, a 96 DPI TIF file at 1870x2420 resolution produced from the first page of pdf05-notext.pdf by www.sejda.com/pdf-to-jpg. Moreover, this sample file contains lots of text, in 2 columns, not just 4 words like the original sample file. Good for testing a tesseract built from CASCADE-MAKE on. Also including the pdf05-notext-ocr-with-tikaTesseract.pdf itself from the tutorial sample files, but only Tika with Tesseract can work on PDFs and not Tesseract by itself, as indicated in the filename.
1 edited


  • gs2-extensions/tesseract/trunk/README.txt

  r34186  r34190
      19      19  at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README
      21      22  1. Find a location on your machine
      23      25  2. Check out the tesseract extension from gs2-extensions
      24      26     svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract
      26      29  3. Compile it all up (tesseract and dependencies):
      27      30     cd tesseract
      28      31     ./CASCADE-MAKE.sh
      30      34  4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:
      43      47     cat out.txt
              49  If you run Tesseract with the hocr config file, you can get the OCR output in
              50  nicely formatted html more representative of the input structure:
              52         tesseract sample.tif hocrtest
              54  The OCR output in html format will be in hocrtest.hocr:
              56      cat hocrtest.hocr
      45      59  5. If successful,
      47           a. create a folder at the same level as src called tesseract
              61   a. create a folder at the same level as src called tesseract
      48      62     cd src
      49      63     cd ..
      50      64     mkdir tesseract
      52           b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
              66   b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
      54      68     cp src/setup.ba* tesseract/.
      55      69     mv src/linux tesseract/.
      57           c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
      58           American spelling!) from src/packages into the cut-down tesseract/linux:
              71   c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
              72   American spelling!) from src/packages into the cut-down tesseract/linux:
      60      74     cp src/packages/*LICENSE.txt tesseract/linux/.
      62           d. REMOVE folder "man" from tesseract/linux:
              76   d. Copy the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cutdown tesseract:
              77     cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.
              79   e. REMOVE folder "man" from tesseract/linux:
      63      80     rm -rf tesseract/linux/man
              82   f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
              83   (The other things in that  location are either unnecessary or created by tesseract's dependencies).
      65      86  6. Create a tarball of the cut down tesseract folder named tesseract-<os>-<arch>.tar.gz:
      66      87     tar -cvzf tesseract-linux-x64.tar.gz tesseract
      68      90  7. (Add/SVN up and) commit that to svn:
     111     133                 hocr_font_info 0
     113           2. In order to have a tessdata/configs/hocr, needed to correct the Tesseract we were
             137  In order to have a tessdata/configs/hocr, needed to correct the Tesseract we were
     114     138  cascade-making to get it to put the "configs" subfolder inside the installed Tessereact's
     115     139  tessdata folder. The source version of tesseract has this folder, but it wasn't getting
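Steps 5 and 6 of the README as of r34190 could be scripted roughly as follows. This is only a sketch under stated assumptions: the function name make_cutdown_tarball and its single argument (the checkout folder that contains src/ and GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt) are hypothetical, the find command for step 5f is one possible way to do what the README describes in words only, and tar is run with -czf rather than the README's -cvzf to keep the output quiet.

```shell
#!/bin/sh
# Hypothetical helper, not part of the tesseract extension.
# $1: the checkout folder holding src/ (assumed already built by
#     CASCADE-MAKE.sh) and GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt.
# Call it as a plain command so that "set -e" in the subshell takes effect.
make_cutdown_tarball() {
    top="$1"
    (
        set -e
        cd "$top"

        # 5a. create a folder at the same level as src called tesseract
        mkdir tesseract

        # 5b. COPY the setup files, MOVE the installed folder
        cp src/setup.ba* tesseract/.
        mv src/linux tesseract/.

        # 5c. COPY the license files into the cut-down tesseract/linux
        cp src/packages/*LICENSE.txt tesseract/linux/.

        # 5d. copy the language-support instructions
        cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

        # 5e. REMOVE the man folder
        rm -rf tesseract/linux/man

        # 5f. REMOVE everything except tessdata from linux/share
        #     (assumption: the README gives no command for this step;
        #     -mindepth/-maxdepth need GNU or BSD find)
        find tesseract/linux/share -mindepth 1 -maxdepth 1 \
            ! -name tessdata -exec rm -rf {} +

        # 6. create the tarball (linux-x64 name taken from the README example)
        tar -czf tesseract-linux-x64.tar.gz tesseract
    )
}
```

For example, from inside the checkout after a successful build: make_cutdown_tarball . (the commit in step 7 is left as a manual step).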