-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:

A. COMPILING TESSERACT GS2-EXTENSION
   & CREATING THE CUT-DOWN BINARY-ONLY TARBALL

B. GETTING TIKA AND TESSERACT TO OCR A PDF


-------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION
   & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------

To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run
Tesseract, we follow an equivalent version of the instructions for the imagemagick gs2-extension
at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README


1. Find a location on your machine


2. Check out the tesseract extension from gs2-extensions
      svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract


3. Compile it all up (tesseract and dependencies):
      cd tesseract
      ./CASCADE-MAKE.sh


4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:

      cd src
      source ./setup.bash

   This should have set up env vars like GEXTTESS, GEXTTESS_INSTALLED, and TESSDATA_PREFIX,
   which Tesseract needs to have set.

      tesseract --list-langs
      tesseract sample.tif out

   This OCRs sample.tif and generates out.txt from it.

      cat out.txt

   If you run Tesseract with the hocr config file, you can get the OCR output in
   nicely formatted html that is more representative of the input structure:

      tesseract sample.tif hocrtest hocr

   The OCR output in html format will be in hocrtest.hocr:

      cat hocrtest.hocr


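The environment check in step 4 can be made explicit. The following is a minimal sketch, not part of the extension: the function name check_tess_env is hypothetical, and it only tests whether the three variables named above are non-empty, not whether they point at valid folders.

```shell
# Hypothetical helper (not part of the extension): report which of the
# variables that setup.bash is expected to export are unset or empty.
check_tess_env() {
    missing=0
    for var in GEXTTESS GEXTTESS_INSTALLED TESSDATA_PREFIX; do
        # Look up the value of the variable whose name is stored in $var
        eval "val=\${$var:-}"
        if [ -z "$val" ]; then
            echo "not set: $var" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage, from the src folder:
#   source ./setup.bash && check_tess_env && tesseract --list-langs
```

The function returns non-zero if any variable is missing, so it can gate the tesseract test commands.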
5. If successful,

   a. create a folder at the same level as src called tesseract
      cd src
      cd ..
      mkdir tesseract

   b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:

      cp src/setup.ba* tesseract/.
      mv src/linux tesseract/.

   c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
      American spelling!) from src/packages into the cut-down tesseract/linux:

      cp src/packages/*LICENSE.txt tesseract/linux/.

   d. COPY the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cut-down tesseract:
      cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

   e. REMOVE folder "man" from tesseract/linux:
      rm -rf tesseract/linux/man

   f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
      (The other things in that location are either unnecessary or created by tesseract's dependencies).


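Step 5f gives no command. One way to do it is sketched below, assuming a find that supports -mindepth and -maxdepth (GNU and BSD find both do). The function name prune_share is hypothetical. As a safety measure, it refuses to delete anything if the given folder has no tessdata subfolder, which guards against running it on the wrong path.

```shell
# Hypothetical helper: delete every entry directly under the given folder
# except the one named "tessdata".
prune_share() {
    share_dir="$1"
    # Refuse to run if this does not look like the expected share folder
    if [ ! -d "$share_dir/tessdata" ]; then
        echo "no tessdata folder in $share_dir" >&2
        return 1
    fi
    # Depth 1 only: tessdata's own contents are never visited
    find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}

# Usage, from the folder that contains the cut-down tesseract folder:
#   prune_share tesseract/linux/share
```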
6. Create a tarball of the cut-down tesseract folder named tesseract-<os>-<arch>.tar.gz:
      tar -cvzf tesseract-linux-x64.tar.gz tesseract


7. Update your working copy, add the tarball (if it is new) and commit it to svn:
      svn up
      svn add tesseract-linux-x64.tar.gz
   (If an earlier version of the tarball is already under version control, skip the add and
   run "svn diff tesseract-linux-x64.tar.gz" instead to confirm that it has been modified.)
      svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


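Before committing, it can be worth confirming that the tarball has the layout produced by steps 5 and 6. The following is a sketch only: the function name check_tarball is hypothetical, and the list of expected paths is inferred from the steps above (setup.bash copied in 5b, the binary in linux/bin, tessdata kept in 5f, man removed in 5e), not an official manifest.

```shell
# Hypothetical helper: list a tarball's contents and check for the paths
# that the cut-down tesseract folder is expected to contain.
check_tarball() {
    tarball="$1"
    listing=$(tar -tzf "$tarball") || return 1
    status=0
    for path in tesseract/setup.bash \
                tesseract/linux/bin/tesseract \
                tesseract/linux/share/tessdata/; do
        if ! printf '%s\n' "$listing" | grep -q "^$path"; then
            echo "missing from $tarball: $path" >&2
            status=1
        fi
    done
    # The man folder should have been removed in step 5e
    if printf '%s\n' "$listing" | grep -q '^tesseract/linux/man/'; then
        echo "unexpected in $tarball: tesseract/linux/man" >&2
        status=1
    fi
    return "$status"
}

# Usage:
#   check_tarball tesseract-linux-x64.tar.gz && svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz
```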
-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
Trying to do so, you'll see:
      tesseract pdf05-notext.pdf notext
      Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
      Error in pixReadStream: Pdf reading is not supported
      Error in pixRead: pix not read
      Error during processing.

Tesseract can OCR the images that make up the pages of a PDF, but it cannot read the PDF
itself. To OCR a PDF with Tesseract, you need an additional tool that splits the PDF into
its pages, extracts the image(s) from each page, feeds each image into Tesseract to get it
OCR-ed, and then creates an html or txt file collating the OCR-ed content of all the pages.

Tika does this.

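For illustration only, the split / OCR / collate pipeline that Tika automates could be done by hand roughly as follows. This is a sketch under assumptions that are not part of this extension: it uses pdftoppm (from Poppler) to render the pages, it assumes the Tesseract build can read PNG images, and the function name ocr_pdf_by_page is hypothetical. It is not how Tika itself is implemented.

```shell
# Hypothetical helper: OCR a PDF by rendering each page to an image first.
# Assumes pdftoppm (from Poppler) and tesseract are both on the PATH.
ocr_pdf_by_page() {
    pdf="$1"
    out_txt="$2"
    workdir=$(mktemp -d) || return 1

    # Render every page to a PNG at 300 dpi: page-1.png, page-2.png, ...
    # (pdftoppm zero-pads the page number for longer documents)
    pdftoppm -r 300 -png "$pdf" "$workdir/page" || { rm -rf "$workdir"; return 1; }

    : > "$out_txt"
    # The glob expands in sorted order, which matches page order
    for img in "$workdir"/page-*.png; do
        # tesseract appends .txt to the output base name it is given
        tesseract "$img" "${img%.png}" >/dev/null 2>&1 || { rm -rf "$workdir"; return 1; }
        cat "${img%.png}.txt" >> "$out_txt"
    done
    rm -rf "$workdir"
}

# Usage:
#   ocr_pdf_by_page pdf05-notext.pdf notext.txt
```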
If Tika is available in the environment and TESSDATA_PREFIX is set to the tessdata folder
containing the language files, Tika can get Tesseract to OCR images out of the box, but
not PDFs. Until the configuration described below is in place, Tika performs no OCR on
PDFs: its (x)html/txt output contains only the text that was already extractable from the
PDF and no OCR-ed text.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, we needed to do 2 more
things:
1. Run the tika-app-*.jar with --config=/path/to/tika-config.xml, where the tika-config.xml
   file is configured correctly for the TesseractOCRParser and PDFParser.
2. In the tika-config.xml file passed to tika-app-*.jar, configure the "outputType" param
   of the TesseractOCRParser as follows:
   a. Set the "outputType" param to "txt" so that Tesseract produces its OCR output in .txt
      format, which Tika will intercept and process,
   b. OR, if you want the OCR generated by Tesseract to be in hocr (html ocr) format, set
      the "outputType" param value to "hocr" AND ensure a config file also called hocr exists
      in $TESSDATA_PREFIX/configs containing the following (taken from
      https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
         tessedit_create_hocr 1
         hocr_font_info 0



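If the hocr config file is missing, it can be created by hand with the two settings quoted above. The following is a minimal sketch: the function name ensure_hocr_config is hypothetical, and it assumes TESSDATA_PREFIX points at the tessdata folder, as described earlier. It leaves an existing file untouched.

```shell
# Hypothetical helper: create $TESSDATA_PREFIX/configs/hocr with the two
# settings quoted above, unless the file already exists.
ensure_hocr_config() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    mkdir -p "$TESSDATA_PREFIX/configs" || return 1
    if [ ! -f "$TESSDATA_PREFIX/configs/hocr" ]; then
        printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' \
            > "$TESSDATA_PREFIX/configs/hocr"
    fi
}

# Usage (the paths and the input file name are placeholders):
#   ensure_hocr_config
#   java -jar tika-app-*.jar --config=/path/to/tika-config.xml --html scanned.pdf > scanned.html
```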
In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
cascade-making so that it puts the "configs" subfolder inside the installed Tesseract's
tessdata folder. The source version of tesseract has this folder, but it wasn't getting
included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.