Changeset 34190 for gs2-extensions/tesseract/trunk/README.txt
Timestamp: 2020-06-16T17:20:50+12:00
File: 1 edited
--- gs2-extensions/tesseract/trunk/README.txt (revision 34186)
+++ gs2-extensions/tesseract/trunk/README.txt (revision 34190)
@@ -19,12 +19,16 @@
 at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README
 
 
+1. Find a location on your machine
+
 
 2. Check out the tesseract extension from gs2-extensions
 svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract
 
+
 3. Compile it all up (tesseract and dependencies):
 cd tesseract
 ./CASCADE-MAKE.sh
+
 
 4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:
@@ -43,26 +47,44 @@
 cat out.txt
 
+If you run Tesseract with the hocr config file, you can get the OCR output in
+nicely formatted html more representative of the input structure:
+
+tesseract sample.tif hocrtest
+
+The OCR output in html format will be in hocrtest.hocr:
+
+cat hocrtest.hocr
+
+
 5. If successful,
 
-a. create a folder at the same level as src called tesseract
+a. create a folder at the same level as src called tesseract
 cd src
 cd ..
 mkdir tesseract
 
-b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
+b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
 
 cp src/setup.ba* tesseract/.
 mv src/linux tesseract/.
 
-c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
-American spelling!) from src/packages into the cut-down tesseract/linux:
+c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
+American spelling!) from src/packages into the cut-down tesseract/linux:
 
 cp src/packages/*LICENSE.txt tesseract/linux/.
 
-d. REMOVE folder "man" from tesseract/linux:
+d. Copy the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cutdown tesseract:
+cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.
+
+e. REMOVE folder "man" from tesseract/linux:
 rm -rf tesseract/linux/man
+
+f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
+(The other things in that location are either unnecessary or created by tesseract's dependencies).
+
 
 6. Create a tarball of the cut down tesseract folder named tesseract-<os>-<arch>.tar.gz:
 tar -cvzf tesseract-linux-x64.tar.gz tesseract
+
 
 7. (Add/SVN up and) commit that to svn:
@@ -111,5 +133,7 @@
 hocr_font_info 0
 
-2. In order to have a tessdata/configs/hocr, needed to correct the Tesseract we were
+
+
+In order to have a tessdata/configs/hocr, needed to correct the Tesseract we were
 cascade-making to get it to put the "configs" subfolder inside the installed Tessereact's
 tessdata folder. The source version of tesseract has this folder, but it wasn't getting
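
For reference, packaging steps 5a to 5f and 6 as they read in r34190 could be collected into one shell function along the following lines. This is only a sketch and not part of the changeset: the function name package_tesseract, its two arguments, and the use of find for step 5f are assumptions made for illustration. The README only says to remove everything except "tessdata" from tesseract/linux/share and does not say how. The sketch also assumes GNU or BSD find, since -mindepth and -maxdepth are not POSIX.

```shell
#!/bin/sh
# Sketch of README steps 5a-5f and 6 (r34190). Hypothetical helper,
# not shipped with the extension.
#   $1  checkout root: the folder that contains src/ and
#       GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt
#   $2  tarball name (defaults to the linux x64 name used in the README)
package_tesseract() {
    root=$1
    tarball=${2:-tesseract-linux-x64.tar.gz}
    pkg="$root/tesseract"

    mkdir "$pkg" || return 1                                    # 5a
    cp "$root"/src/setup.ba* "$pkg"/ || return 1                # 5b: copy setup files
    mv "$root/src/linux" "$pkg"/ || return 1                    # 5b: move installed folder
    cp "$root"/src/packages/*LICENSE.txt "$pkg/linux/" || return 1          # 5c
    cp "$root/GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt" "$pkg"/ || return 1   # 5d
    rm -rf "$pkg/linux/man" || return 1                         # 5e

    # 5f: delete every direct child of share/ except tessdata.
    find "$pkg/linux/share" -mindepth 1 -maxdepth 1 \
        ! -name tessdata -exec rm -rf {} + || return 1

    # 6: archive the cut-down folder; -C keeps paths relative to the root.
    tar -cvzf "$root/$tarball" -C "$root" tesseract
}
```

It would be called from anywhere as `package_tesseract /path/to/checkout`, after step 4 has confirmed that the build in src/linux/bin works. Each command returns early on failure so that a missing file does not lead to a partial tarball.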