source: gs2-extensions/tesseract/trunk/README.txt@ 34198

Last change on this file since 34198 was 34198, checked in by ak19, 4 years ago
  1. Added a script to generate the cut-down ('binary only') tesseract binary tarball and zip. 2. Also adding the tesseract binary zip itself to svn.
-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:

A. COMPILING TESSERACT GS2-EXTENSION
& CREATING THE CUT-DOWN BINARY-ONLY TARBALL

B. GETTING TIKA AND TESSERACT TO OCR A PDF


-------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION
& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------

To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run
Tesseract, we follow an equivalent version of the instructions for the imagemagick gs2-extension
at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README


1. Choose a working directory on your machine for the checkout


2. Check out the tesseract extension from gs2-extensions
    svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract


3. Compile it all up (tesseract and dependencies):
    cd tesseract
    ./CASCADE-MAKE.sh


4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:

    cd src
    source ./setup.bash

This should set up the environment variables that Tesseract needs, such as GEXTTESS,
GEXTTESS_INSTALLED, and TESSDATA_PREFIX.

    tesseract --list-langs
    tesseract sample.tif out

The second command OCRs sample.tif and generates out.txt from it.

    cat out.txt

If you run Tesseract with the hocr config file, you can get the OCR output in
nicely formatted html more representative of the input structure:

    tesseract sample.tif hocrtest hocr

The OCR output in html format will be in hocrtest.hocr:

    cat hocrtest.hocr

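The environment check in step 4 can also be scripted. The following is a minimal sketch and
not part of the extension: the function name check_tess_env is made up for illustration, and
the only thing it assumes is that setup.bash exports the three variables named above.

```shell
# Hypothetical helper (not shipped with the extension): reports which of the
# variables that setup.bash is expected to export are unset or empty.
check_tess_env() {
    missing=""
    for name in GEXTTESS GEXTTESS_INSTALLED TESSDATA_PREFIX; do
        # Indirectly read the variable whose name is stored in $name.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            missing="$missing $name"
        fi
    done
    if [ -n "$missing" ]; then
        echo "not set:$missing"
        return 1
    fi
    echo "all set"
}
```

Calling check_tess_env right after "source ./setup.bash" shows whether the setup script took
effect in the current terminal before any tesseract command is tried.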

5. If successful, create the cut-down tesseract binary zip and tarball by running the following
at the toplevel of the extension checkout:

    ./makedists.sh <linux-x64|linux>


If manually creating the cut-down tesseract zip and tarball instead, then:
 a. Create a folder called tesseract at the same level as src:
    cd src
    cd ..
    mkdir tesseract

 b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:

    cp src/setup.ba* tesseract/.
    mv src/linux tesseract/.

 c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note the
 American spelling of LICENSE!) from src/packages into the cut-down tesseract/linux:

    cp src/packages/*LICENSE.txt tesseract/linux/.

 d. COPY the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cut-down tesseract:
    cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

 e. REMOVE the "man" folder from tesseract/linux:
    rm -rf tesseract/linux/man

 f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
 (The other things in that location are either unnecessary or created by tesseract's dependencies).

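Step f is the only manual step without a command. One possible way to do it is sketched
below. The function name prune_share is hypothetical, and the sketch assumes a find that
supports -mindepth and -maxdepth (GNU and BSD find both do).

```shell
# Hypothetical helper: delete every entry directly inside the given share
# folder except the one named "tessdata", which is left untouched.
prune_share() {
    share_dir="$1"
    find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}
```

From the toplevel of the checkout this would be called as: prune_share tesseract/linux/share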

6. Create a tarball of the cut-down tesseract folder named tesseract-<os>-<arch>.tar.gz:
    tar -cvzf tesseract-linux-x64.tar.gz tesseract

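Before committing, it may be worth confirming that the tarball still contains the tessdata
folder. This is a rough sketch only: tarball_has_tessdata is a made-up name, and the path it
looks for assumes the tesseract/linux/share layout described in the manual steps.

```shell
# Hypothetical check: succeeds if the archive lists anything under
# tesseract/linux/share/tessdata/, fails otherwise (or if tar cannot read it).
tarball_has_tessdata() {
    listing=$(tar -tzf "$1") || return 1
    case "$listing" in
        *tesseract/linux/share/tessdata/*) return 0 ;;
        *) return 1 ;;
    esac
}
```

For example: tarball_has_tessdata tesseract-linux-x64.tar.gz && echo "tessdata present"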

7. (Add/SVN up and) commit that to svn:
    svn up
    svn add tesseract-linux-x64.tar.gz
    (or, if an earlier version is already in svn, run
    svn diff tesseract-linux-x64.tar.gz to confirm that it is modified)
    svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
Trying to do so, you'll see:
    tesseract pdf05-notext.pdf notext
    Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
    Error in pixReadStream: Pdf reading is not supported
    Error in pixRead: pix not read
    Error during processing.

Tesseract can OCR the individual images that make up a page of a PDF, but to OCR whole PDFs
with Tesseract you need an additional tool that splits a PDF into its pages, extracts the
images from them, feeds each page's image into Tesseract to get it OCR-ed, and then creates
an html or txt file collating all the individual OCR-ed page content.

Tika does this.

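To make the per-page idea concrete, here is a rough sketch of the OCR-and-collate part only.
It is not how Tika is implemented. It assumes some other tool has already written one image
per PDF page into a folder as page-*.png (a made-up naming scheme), and the names
ocr_pages_to_text and OCR_CMD are hypothetical.

```shell
# Illustrative only: Tika does this work internally and differently.
# $1 = folder holding page-*.png images, $2 = collated text file to create.
# OCR_CMD may name a different OCR program; it defaults to tesseract.
ocr_pages_to_text() {
    pages_dir="$1"
    out_file="$2"
    ocr="${OCR_CMD:-tesseract}"
    : > "$out_file"                      # start with an empty result file
    for img in "$pages_dir"/page-*.png; do
        [ -e "$img" ] || continue        # the glob matched nothing
        base="${img%.png}"
        # "tesseract IMAGE BASE" writes the recognised text to BASE.txt
        "$ocr" "$img" "$base" >/dev/null 2>&1 || return 1
        cat "$base.txt" >> "$out_file"
    done
}
```

Pages are collated in the shell's glob order, so zero-padded page numbers (page-01.png,
page-02.png, ...) are needed to keep them in sequence.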
By default, if Tika is on the environment and TESSDATA_PREFIX is set to the tessdata folder
containing the language files, Tika is able to get Tesseract to OCR images out of the box,
but not PDFs. Until the following configuration is in place, Tika extracts only the already
extractable text from PDFs, and its (x)html/txt output contains no OCR content.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, 2 more things were
needed:
1. Run tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml
file is configured correctly for the TesseractOCRParser and PDFParser.
2. The tika-config.xml file passed to tika-app-*.jar should configure the "outputType"
param of the TesseractOCRParser as follows:
 a. Set the "outputType" param to "txt" so that Tesseract produces its OCR output in .txt
 format, which Tika will intercept and process,
 b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set
 the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
 $TESSDATA_PREFIX/configs containing the following (taken from
 https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
    tessedit_create_hocr 1
    hocr_font_info 0

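If the hocr config file is missing from an installation, it could be created by hand with
something like the sketch below. The function name write_hocr_config is hypothetical. The two
settings are the ones quoted above, and the argument is assumed to be the tessdata folder
that TESSDATA_PREFIX points to.

```shell
# Hypothetical helper: create <tessdata>/configs/hocr with the two settings
# quoted above, unless a file of that name already exists.
write_hocr_config() {
    configs_dir="$1/configs"
    mkdir -p "$configs_dir"
    if [ -e "$configs_dir/hocr" ]; then
        return 0                         # leave an existing config untouched
    fi
    printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' > "$configs_dir/hocr"
}
```

With the environment set up, this would be called as: write_hocr_config "$TESSDATA_PREFIX"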


In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
cascade-making to get it to put the "configs" subfolder inside the installed Tesseract's
tessdata folder. The source version of tesseract has this folder, but it wasn't getting
included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
