Timestamp:
2020-06-16T17:20:50+12:00 (4 years ago)
Author:
ak19
Message:
  1. The tessdata folder is created when compiling tesseract, so it needn't be created and populated manually (except for the lang files), which leaves less work for CASCADE-MAKE/TESSERACT.sh to do. However, the tessdata folder is created in the linux/share folder. 'share' is probably where people expect tesseract's tessdata to be by default, so am updating the setup scripts to work with that, as I've done with CASCADE-MAKE/TESSERACT.sh.
  2. Adding useful instructions for users on getting OCR support for more languages and scripts in new file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt, now included in the tesseract binary tarball too. Adjusted the README for us.
  3. Removing the sample.jpg, converted from a sample.tif which I'd downloaded from online and whose copyright status I don't know. Replacing with sample.tif, a 96 DPI TIF file at 1870x2420 resolution produced from the first page of pdf05-notext.pdf by www.sejda.com/pdf-to-jpg. Moreover, this sample file contains lots of text, in 2 columns, not just 4 words like the original sample file. Good for testing a tesseract built from CASCADE-MAKE on. Also including pdf05-notext-ocr-with-tikaTesseract.pdf itself from the tutorial sample files. Only Tika with Tesseract can work on PDFs, not Tesseract by itself, as indicated in the filename.
File:
1 edited

  • gs2-extensions/tesseract/trunk/README.txt

    r34186 r34190  
    1919at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README
    2020
     21
    21221. Find a location on your machine
     23
    2224
    23252. Check out the tesseract extension from gs2-extensions
    2426   svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract
    2527
     28
    26293. Compile it all up (tesseract and dependencies):
    2730   cd tesseract
    2831   ./CASCADE-MAKE.sh
     32
    2933
    30344. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:
     
    4347   cat out.txt
    4448
     49If you run Tesseract with the hocr config file, you can get the OCR output in
     50nicely formatted html more representative of the input structure:
     51
     52       tesseract sample.tif hocrtest hocr
     53
     54The OCR output in html format will be in hocrtest.hocr:
     55
     56    cat hocrtest.hocr
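
As a quicker check than reading the whole file, here is a minimal sketch that tests whether an output file looks like hOCR and counts the words it recognised. The function name hocr_summary is hypothetical and not part of this extension. It assumes Tesseract's hOCR output marks pages with the class ocr_page and words with the class ocrx_word.

```shell
# Hypothetical helper (not part of the tesseract extension): report whether
# a file looks like hOCR output and how many recognised words it contains.
# Assumes pages carry the class "ocr_page" and words the class "ocrx_word".
hocr_summary() {
    hocr_file="$1"
    # No ocr_page element means this is probably plain text, not hOCR.
    if ! grep -q "ocr_page" "$hocr_file"; then
        echo "not hOCR"
        return 1
    fi
    # grep -o prints one line per match, so wc -l counts the word elements.
    words=$(grep -o "ocrx_word" "$hocr_file" | wc -l | tr -d ' ')
    echo "pages found, words: $words"
}
```

For example, once hocrtest.hocr exists, run: hocr_summary hocrtest.hocr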
     57
     58
    45595. If successful,
    4660
    47 a. create a folder at the same level as src called tesseract
     61 a. create a folder at the same level as src called tesseract
    4862   cd src
    4963   cd ..
    5064   mkdir tesseract
    5165
    52 b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
     66 b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
    5367
    5468   cp src/setup.ba* tesseract/.
    5569   mv src/linux tesseract/.
    5670
    57 c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
    58 American spelling!) from src/packages into the cut-down tesseract/linux:
     71 c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
     72 American spelling!) from src/packages into the cut-down tesseract/linux:
    5973
    6074   cp src/packages/*LICENSE.txt tesseract/linux/.
    6175
    62 d. REMOVE folder "man" from tesseract/linux:
     76 d. Copy the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cutdown tesseract:
     77   cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.
     78
     79 e. REMOVE folder "man" from tesseract/linux:
    6380   rm -rf tesseract/linux/man
     81
     82 f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
     83 (The other things in that location are either unnecessary or created by tesseract's dependencies).
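
Step f gives no command, so here is one possible sketch of it. The function name prune_share is hypothetical, and the use of find with -mindepth and -maxdepth assumes GNU or BSD find. It refuses to delete anything unless a tessdata folder is actually present.

```shell
# Hypothetical helper (an assumption, not the README's own command):
# delete every top-level entry of a share folder except "tessdata".
prune_share() {
    share_dir="$1"
    # Safety check: do nothing if tessdata is missing, which would suggest
    # that the wrong directory was given.
    if [ ! -d "$share_dir/tessdata" ]; then
        echo "no tessdata folder in $share_dir, nothing removed" >&2
        return 1
    fi
    # Match only direct children of share_dir that are not named tessdata.
    find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}
```

From the folder that contains the cut-down tesseract folder, this would be run as: prune_share tesseract/linux/share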
     84
    6485
    65866. Create a tarball of the cut down tesseract folder named tesseract-<os>-<arch>.tar.gz:
    6687   tar -cvzf tesseract-linux-x64.tar.gz tesseract
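
The tesseract-<os>-<arch>.tar.gz pattern can be filled in from uname output, as in the sketch below. The function name tarball_name is hypothetical. Mapping x86_64 to x64 matches the linux-x64 name used above, but how other platforms are named is an assumption.

```shell
# Hypothetical helper: build a tarball name following the
# tesseract-<os>-<arch>.tar.gz pattern from an OS name and a machine name.
tarball_name() {
    # Lowercase the OS name, e.g. "Linux" becomes "linux".
    os=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    # Assumption: 64-bit Intel/AMD machines are labelled "x64";
    # any other machine name is passed through unchanged.
    case "$2" in
        x86_64|amd64) arch=x64 ;;
        *) arch="$2" ;;
    esac
    printf 'tesseract-%s-%s.tar.gz\n' "$os" "$arch"
}
```

Usage could look like: tar -cvzf "$(tarball_name "$(uname -s)" "$(uname -m)")" tesseract
The result can then be inspected without unpacking by using tar -tzf on the new file.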
     88
    6789
    68907. (Add/SVN up and) commit that to svn:
     
    111133               hocr_font_info 0
    112134           
    113 2. In order to have a tessdata/configs/hocr, needed to correct the Tesseract we were
     135
     136
     137In order to have a tessdata/configs/hocr, needed to correct the Tesseract we were
    114138cascade-making to get it to put the "configs" subfolder inside the installed Tesseract's
    115139tessdata folder. The source version of tesseract has this folder, but it wasn't getting