source: gs2-extensions/tesseract/trunk/README.txt@ 34203

Last change on this file since 34203 was 34203, checked in by ak19, 4 years ago

Reminder.

TODO:
- Also turn the CASCADE-MAKE/*.sh files shared with imagemagick (ZLIB, LIBPNG, TIFF, JPEG, JPEG2000) into svn:externals
+ DONE: Since zlib, libpng, tiff, jpeg and jpeg2000 all come from imagemagick, maybe use svn:externals
to bring them into packages?
svn:externals on individual files is possible, see
https://stackoverflow.com/questions/1355956/can-we-set-a-single-file-as-external-in-subversion

-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:

A. COMPILING TESSERACT GS2-EXTENSION
& CREATING THE CUT-DOWN BINARY-ONLY TARBALL

B. GETTING TIKA AND TESSERACT TO OCR A PDF


-------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION
& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------

To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run
Tesseract, we follow an equivalent version of the instructions for the imagemagick gs2-extension
at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README


1. Find a location on your machine for the checkout


2. Check out the tesseract extension from gs2-extensions
    svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract


3. Compile it all up (tesseract and dependencies):
    cd tesseract
    ./CASCADE-MAKE.sh


4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:

    cd src
    source ./setup.bash

This should have set up env vars like GEXTTESS, GEXTTESS_INSTALLED, and TESSDATA_PREFIX,
which Tesseract needs to have set.

    tesseract --list-langs
    tesseract sample.tif out

This OCRs sample.tif and generates out.txt from it.

    cat out.txt

If you run Tesseract with the hocr config file, you can get the OCR output in
nicely formatted html more representative of the input structure:

    tesseract sample.tif hocrtest hocr

The OCR output in html format will be in hocrtest.hocr:

    cat hocrtest.hocr

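A quick way to confirm that sourcing setup.bash did its job is to check the three variables
named in step 4 before running tesseract. The following is only a sketch: the function name
check_tess_env is made up for illustration, and setup.bash may set further variables that
are not checked here.

```shell
# Report which of the variables named in step 4 are unset or empty.
# Returns 0 when all three are set, 1 otherwise.
# (check_tess_env is a hypothetical helper, not part of the extension.)
check_tess_env() {
    missing=0
    for var in GEXTTESS GEXTTESS_INSTALLED TESSDATA_PREFIX; do
        # Indirect lookup of the variable whose name is held in $var
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            echo "not set: $var" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Run it right after "source ./setup.bash", for example: check_tess_env && tesseract --list-langs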

5. If successful, create the cut-down tesseract binary zip and tarball by running the following
at the toplevel of the extension checkout:

    ./makedists.sh <linux-x64|linux>


If manually creating the cut-down tesseract zip and tarball, then:
    a. Create a folder called tesseract at the same level as src:
        cd src
        cd ..
        mkdir tesseract

    b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:

        cp src/setup.ba* tesseract/.
        mv src/linux tesseract/.

    c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
    American spelling!) from src/packages into the cut-down tesseract/linux:

        cp src/packages/*LICENSE.txt tesseract/linux/.

    d. COPY the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cut-down tesseract:
        cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

    e. REMOVE folder "man" from tesseract/linux:
        rm -rf tesseract/linux/man

    f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
    (The other things in that location are either unnecessary or created by tesseract's dependencies).

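Step f gives no command, so here is one possible way to do it, offered as a sketch rather than
as what makedists.sh actually does. The function name prune_share is made up for illustration.
It refuses to delete anything unless a tessdata folder is present, and it relies on the
-mindepth/-maxdepth options of GNU find.

```shell
# Remove every entry in <cutdown>/linux/share except the tessdata folder.
# The default path "tesseract" matches the folder created in step a.
# (prune_share is a hypothetical helper, not part of the extension.)
prune_share() {
    share_dir="${1:-tesseract}/linux/share"
    # Safety check: bail out if this does not look like the expected folder
    if [ ! -d "$share_dir/tessdata" ]; then
        echo "no tessdata folder in $share_dir" >&2
        return 1
    fi
    # Delete the direct children of share/ whose name is not tessdata
    find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}
```

From the toplevel of the extension checkout this would be run as: prune_share tesseract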

6. Create a tarball of the cut-down tesseract folder named tesseract-<os>-<arch>.tar.gz:
    tar -cvzf tesseract-linux-x64.tar.gz tesseract

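Before committing, it may be worth listing the tarball to confirm the cut-down steps took
effect. The sketch below checks only two things from step 5: that tessdata is in the tarball
and that the man folder is not. The function name check_tarball is made up for illustration.

```shell
# Check a cut-down tarball: tessdata must be present, the man folder must not be.
# Returns 0 if both checks pass, 1 if a check fails, 2 if the tarball cannot be read.
# (check_tarball is a hypothetical helper, not part of the extension.)
check_tarball() {
    listing=$(tar -tzf "$1") || return 2
    if ! printf '%s\n' "$listing" | grep '^tesseract/linux/share/tessdata/' >/dev/null; then
        echo "tessdata missing from $1" >&2
        return 1
    fi
    if printf '%s\n' "$listing" | grep '^tesseract/linux/man/' >/dev/null; then
        echo "man folder still present in $1" >&2
        return 1
    fi
    return 0
}
```

For example: check_tarball tesseract-linux-x64.tar.gz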

7. (Add/SVN up and) commit that to svn:
    svn up
    svn add tesseract-linux-x64.tar.gz
    (or, if an earlier version was already in svn, run svn diff tesseract-linux-x64.tar.gz
    to confirm it has been modified)
    svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
If you try, you'll see:
    tesseract pdf05-notext.pdf notext
    Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
    Error in pixReadStream: Pdf reading is not supported
    Error in pixRead: pix not read
    Error during processing.

Tesseract can OCR the individual images that make up a page of a PDF. To OCR a whole PDF
with Tesseract, you need an additional tool that splits the PDF into its pages, extracts the
images from them, feeds each page's image into Tesseract to get it OCR-ed, and then creates an
html or txt file collating all the individual OCR-ed page content.

Tika does this.

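For illustration only, the split / OCR / collate steps that Tika takes care of could be done
by hand roughly as follows. This is a sketch under assumptions that are not part of this
extension: it uses pdftoppm from poppler-utils, and it renders each whole page to a PNG
image rather than extracting the images embedded in the page. The function name
ocr_pdf_by_hand is made up.

```shell
# Sketch of the split / OCR / collate steps for one PDF.
# Assumes pdftoppm (poppler-utils, NOT part of this extension) and tesseract are on PATH.
# Usage: ocr_pdf_by_hand input.pdf output.txt
ocr_pdf_by_hand() {
    pdf="$1"
    out="$2"
    workdir=$(mktemp -d) || return 1

    # 1. Split: render each page of the PDF to an image (page-1.png, page-2.png, ...).
    #    pdftoppm zero-pads the page number, so the glob below stays in page order.
    pdftoppm -r 300 -png "$pdf" "$workdir/page" || return 1

    # 2. OCR: tesseract writes <outputbase>.txt for each page image
    for img in "$workdir"/page-*.png; do
        tesseract "$img" "${img%.png}" || return 1
    done

    # 3. Collate: concatenate the per-page text files into one output file
    cat "$workdir"/page-*.txt > "$out" || return 1
    rm -rf "$workdir"
}
```

For example: ocr_pdf_by_hand pdf05-notext.pdf notext.txt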
By default, if Tika is set up in the environment and TESSDATA_PREFIX is set to the tessdata
folder containing the language files, Tika is able to get Tesseract to OCR images out of the box,
but not PDFs. For a PDF, Tika extracts only whatever text is directly extractable and
produces no OCR output in its (x)html/txt until the following is set up correctly.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, two more things are
needed:
1. Run tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml
file is configured correctly for the TesseractOCRParser and PDFParser.
2. The tika-config.xml file passed to tika-app-*.jar should set the "outputType" param
of the TesseractOCRParser in one of the following ways:
    a. Set the "outputType" param to "txt", so that Tesseract produces its OCR output in .txt
    format, which Tika will intercept and process,
    b. OR, if you want the OCR generated by Tesseract to be in hocr (html ocr) format, set
    the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
    $TESSDATA_PREFIX/configs containing the following (taken from
    https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
        tessedit_create_hocr 1
        hocr_font_info 0

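The two requirements above could be scripted along the following lines. Treat this as a
sketch, not as the tika-config.xml actually used with this extension: the overall XML layout
and the PDFParser "ocrStrategy" param are assumptions based on Tika 1.x configuration
conventions, and only the "outputType" param and the hocr config contents come from this
README. The function names write_tika_config and write_hocr_config are made up.

```shell
# Write a tika-config.xml with the given TesseractOCRParser outputType (txt or hocr).
# ASSUMPTION: the PDFParser "ocrStrategy" param below is not taken from this README.
# Usage: write_tika_config <txt|hocr> /path/to/tika-config.xml
write_tika_config() {
    output_type="$1"
    config_file="$2"
    cat > "$config_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">$output_type</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}

# Create <tessdata>/configs/hocr with the two settings listed in 2b.
# Defaults to $TESSDATA_PREFIX when no folder is given.
write_hocr_config() {
    tessdata_dir="${1:-$TESSDATA_PREFIX}"
    mkdir -p "$tessdata_dir/configs" || return 1
    printf 'tessedit_create_hocr 1\nhocr_font_info 0\n' > "$tessdata_dir/configs/hocr"
}

# Example invocation (jar name and paths are placeholders):
#   write_tika_config txt /path/to/tika-config.xml
#   java -jar tika-app-*.jar --config=/path/to/tika-config.xml -t pdf05-notext.pdf
```

If "outputType" is set to "hocr", run write_hocr_config as well, so that the hocr config
file exists before Tika calls Tesseract.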


In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
cascade-making so that the "configs" subfolder ends up inside the installed Tesseract's
tessdata folder. The source version of tesseract has this folder, but it wasn't getting
included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
