Aim: a tutorial on using UnknownConverterPlugin + Tika (the default Apache tika-app jar) + Tesseract
to let users OCR their PDFs.

Tika already works with UnknownConverterPlugin,
but it needs OCR-ing abilities.
Tika is supposed to work well with Tesseract (OCR),
so I wanted to set up Tesseract.

I tried to compile things up locally, but ended up needing
libz, libpng, libjpg and libtif, which imagemagick already has (and, as it turned out, libgif too, which it does not).
So I ended up setting up Tesseract with Dr Bainbridge's Cascade-Make way of doing things,
since that would ultimately need to happen anyway if my attempts with Tesseract + Tika are
successful. With Cascade-Make I finally got a working tesseract
installed.

--------------------------------------------------------------------------------------------------
LINKS: BACKGROUND READING ON TIKA WITH OCR USING TESSERACT, COMPILING TESSERACT ON LINUX, ETC
--------------------------------------------------------------------------------------------------
https://www.linux.com/news/googles-tesseract-ocr-engine-quantum-leap-forward/
Google's Tesseract OCR engine is a quantum leap forward
September 28, 2006

https://sourceforge.net/projects/tesseract-ocr/
A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005. (NOTE: We're migrating to code.google.com. Please see the forums.)

https://github.com/tesseract-ocr/tesseract/wiki/Downloads
https://github.com/tesseract-ocr/tessdoc
https://tesseract-ocr.github.io/tessdoc/Downloads
https://github.com/tesseract-ocr/tesseract/wiki#running-tesseract

https://github.com/tesseract-ocr/tesseract/releases/tag/3.02.02 (source code tarball)

https://stackoverflow.com/questions/29603749/how-to-integrate-tesseract-ocr-with-tika
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
Windows: https://github.com/UB-Mannheim/tesseract/wiki

https://issues.apache.org/jira/browse/TIKA-3035
The discussion there has comments from 2020, which suggests that tika-app versions 1.23 and 1.24 are indeed the latest.
It also seems to indicate that tesseract will work with tika-app too, not just tika-server?

https://www.howtoforge.com/tutorial/tesseract-ocr-installation-and-usage-on-ubuntu-16-04/
https://www.linux.com/training-tutorials/using-tesseract-ubuntu/
(Compiling tesseract on Ubuntu too)
https://asahinow.blogspot.com/2019/04/how-to-compile-tesseract-40-in-ubuntu.html
(Easier-looking instructions for compiling tesseract on Ubuntu, although they install into a system location)

--------------------------------------------
COMPILING FROM SOURCE
--------------------------------------------
To compile tesseract from source,
I attempted to follow the instructions at https://asahinow.blogspot.com/2019/04/how-to-compile-tesseract-40-in-ubuntu.html

1. cd /Scratch/ak19/sources
wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
tar -xvzf leptonica-1.79.0.tar.gz
mkdir -p /Scratch/ak19/packages/leptonica
cd leptonica-1.79.0
./configure --help
./configure --prefix=/Scratch/ak19/packages/leptonica --exec-prefix=/Scratch/ak19/packages/leptonica/
make && make install

2. When running autogen.sh in tesseract, I found I needed libtool/glibtool, going by an error message much like the one described in https://stackoverflow.com/questions/14841946/trouble-when-running-autogen-sh

http://www.gnu.org/software/libtool/
# X (crossed out, not used): git clone git://git.savannah.gnu.org/libtool.git
wget http://ftpmirror.gnu.org/libtool/libtool-2.4.6.tar.gz
tar -xvzf libtool-2.4.6.tar.gz
cd libtool-2.4.6
./configure --prefix=/Scratch/ak19/packages/libtool
make
make install

3. cd /Scratch/ak19/sources
git clone https://github.com/tesseract-ocr/tesseract.git
cd tesseract
export PATH=/Scratch/ak19/packages/libtool/bin:$PATH
# When I ran sh autogen.sh, I saw this error:
# https://stackoverflow.com/questions/18978252/error-libtool-library-used-but-libtool-is-undefined
# and followed the solution there:
libtoolize
aclocal
autoheader
sh autogen.sh

cd /Scratch/ak19/packages
mkdir tesseract
mkdir -p tesseract/lib
mkdir -p tesseract/include

cd /Scratch/ak19/sources/tesseract
./configure --help | less
# need leptonica on PATH
export PATH=/Scratch/ak19/packages/leptonica/bin:$PATH

# Configure at this stage will fail with the errors described in https://github.com/DanBloomberg/leptonica/issues/410
# so point pkg-config at leptonica first:
export PKG_CONFIG_PATH=/Scratch/ak19/packages/leptonica/lib/pkgconfig

./configure --prefix=/Scratch/ak19/packages/tesseract
# X (crossed out, not used): LDFLAGS="-L/Scratch/ak19/packages/tesseract/lib" CFLAGS="-I/Scratch/ak19/packages/tesseract/include" make
LDFLAGS=<leptonica stuff!> CFLAGS=<leptonica stuff!> make
make install

(The above looked like it compiled successfully, but it failed to OCR sample.tif.)

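A quick way to check the PKG_CONFIG_PATH step before re-running tesseract's configure is sketched here. It assumes leptonica installed its pkg-config file as lept.pc under <prefix>/lib/pkgconfig (which I believe is the module name tesseract's configure asks pkg-config for). The helper name has_lept_pc and the LEPT_PREFIX variable are made up for illustration.

```shell
# Hypothetical helper (not part of leptonica or tesseract): succeeds only if
# the given install prefix contains leptonica's pkg-config file.
# Assumption: the file is called lept.pc.
has_lept_pc() {
    [ -f "$1/lib/pkgconfig/lept.pc" ]
}

# Assumed location, taken from the notes above; override by exporting LEPT_PREFIX.
LEPT_PREFIX="${LEPT_PREFIX:-/Scratch/ak19/packages/leptonica}"

if has_lept_pc "$LEPT_PREFIX"; then
    # Prepend leptonica's folder, keeping any existing PKG_CONFIG_PATH entries.
    export PKG_CONFIG_PATH="$LEPT_PREFIX/lib/pkgconfig${PKG_CONFIG_PATH:+:$PKG_CONFIG_PATH}"
    echo "found lept.pc; PKG_CONFIG_PATH=$PKG_CONFIG_PATH"
else
    echo "no lept.pc under $LEPT_PREFIX/lib/pkgconfig (was leptonica installed there?)" >&2
fi
```

If the file is missing, leptonica's own "make install" probably did not finish, and tesseract's configure would not be able to find leptonica whatever PKG_CONFIG_PATH says.
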
---------------------------------------------------------------
TRYING TO RUN MY (POORLY) COMPILED TESSERACT INSTALLATION
---------------------------------------------------------------
Language files for tesseract, as explained at
https://stackoverflow.com/questions/14800730/tesseract-running-error:


You can grab eng.traineddata from GitHub:

wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.

When you grab the file(s), move them to the /usr/local/share/tessdata folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata instead.

# If you got the data from Google, unzip it first!
gunzip eng.traineddata.gz
# Move the data
sudo mv -v eng.traineddata /usr/local/share/tessdata/

(1) cd /Scratch/ak19/packages/tesseract
mkdir tessdata
cd tessdata

(2) Install all the language files you want from https://github.com/tesseract-ocr/tessdata (via https://github.com/tesseract-ocr/)
(If you installed tesseract with a package manager, then you're advised to install language packs via the package manager too.
How to do this is explained at https://stackoverflow.com/questions/14800730/tesseract-running-error.)
Since we installed tesseract from source, we can download the language files directly too:

wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata


(3) When done, export the environment variable pointing to the language files so that tesseract can find them:
export TESSDATA_PREFIX='/Scratch/ak19/packages/tesseract/tessdata'

(4) Put tesseract on the PATH to run the OCR:
export PATH=/Scratch/ak19/packages/tesseract/bin:$PATH

(5) Test the languages now available
> tesseract --list-langs

Error in pixReadMemTiff: function not present
Error in pixReadMem: tiff: no pix returned
Error in pixaGenerateFontFromString: pix not made
Error in bmfCreate: font pixa not made
List of available languages (1):
eng

(At least English is installed now)

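To see which language files tesseract should be able to pick up without running tesseract itself, the folder can simply be listed. This is a minimal sketch. The helper name list_traineddata is made up, and the default path is just the one from step (3).

```shell
# Hypothetical helper: print the language codes for which a .traineddata file
# exists in the given folder, one per line.
list_traineddata() {
    for f in "$1"/*.traineddata; do
        [ -e "$f" ] || continue        # the glob matched nothing
        basename "$f" .traineddata
    done
}

# Assumed default taken from step (3); override by exporting TESSDATA_PREFIX first.
TESSDATA_PREFIX="${TESSDATA_PREFIX:-/Scratch/ak19/packages/tesseract/tessdata}"
export TESSDATA_PREFIX

if [ -z "$(list_traineddata "$TESSDATA_PREFIX")" ]; then
    echo "no .traineddata files in $TESSDATA_PREFIX" >&2
else
    list_traineddata "$TESSDATA_PREFIX"
fi
```

The codes printed here should match what "tesseract --list-langs" reports once tesseract itself is working.
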
(6) The above errors are described in https://stackoverflow.com/questions/33659458/tesseract-image-issue

Answer by BigBen (Nov 15 '15):
Step 1: Install libjpeg, libtiff, libpng. Step 2: Recompile and install leptonica.

Another answer there:
The default image format for the first tesseract versions was .tif or .tiff. In newer versions you should install the following format packages (libgif libjpeg libpng libtiff zlib). Leptonica uses these packages to read images, and tesseract uses leptonica to analyse images.

libgif libjpeg libpng libtiff zlib

Finally, recompile and install leptonica as in @BigBen's answer.

We have all but libgif in imagemagick:
/Scratch/ak19/GS3bin_04June2020/gs2build/bin/linux/imagemagick>

export MAGICK_HOME=/Scratch/ak19/GS3bin_04June2020/gs2build/bin/linux/imagemagick

RECOMPILE leptonica:
rm -rf /Scratch/ak19/packages/leptonica/
cd /Scratch/ak19/sources/leptonica-1.79.0
./configure --prefix=/Scratch/ak19/packages/leptonica
# DO I NEED CFLAGS? I have no $MAGICK_HOME/include folder, so I would have to recompile imagemagick first...
LDFLAGS="-L$MAGICK_HOME/lib" make
make install

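One way to tell whether the recompiled leptonica actually picked up the image libraries is to look at what its shared library links against. This is a rough sketch, not a definitive check. It assumes a dynamically linked build whose library is named liblept.so (my understanding for leptonica 1.79 built with configure/make). The helper name missing_image_libs and the LEPT_LIB variable are made up. Statically linked libraries would not show up in ldd at all.

```shell
# Hypothetical helper: read `ldd` output on stdin and print each image library
# from the list above that is NOT mentioned in it.
missing_image_libs() {
    ldd_out=$(cat)
    for lib in libtiff libpng libjpeg libgif libz; do
        case "$ldd_out" in
            # Name must be followed by "." or a digit (libpng16.so, libz.so.1),
            # so that e.g. libzstd is not mistaken for libz.
            *"$lib"[.0-9]*) ;;
            *) echo "$lib" ;;
        esac
    done
}

# Assumed install path from the notes above; adjust to your own prefix.
LEPT_LIB="${LEPT_LIB:-/Scratch/ak19/packages/leptonica/lib/liblept.so}"
if [ -e "$LEPT_LIB" ] && command -v ldd >/dev/null 2>&1; then
    ldd "$LEPT_LIB" | missing_image_libs
fi
```

If libtiff is printed, that would be consistent with the "pixReadMemTiff: function not present" error seen in step (5).
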
(7) Run on a sample tiff file (containing a line or 2 of text) obtained from https://alternatiff.com/testpage.html
Command from https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
export TESSDATA_PREFIX='/Scratch/ak19/packages/tesseract/tessdata' \
&& export PATH=/Scratch/ak19/packages/tesseract/bin:$PATH
# tesseract appends .txt to the output base name, so "out" produces out.txt
tesseract -psm 3 /Scratch/ak19/sample.tif out > bla.txt 2>&1

(My terminal gets destroyed by some garbled encoding/charset scheme)

cat out.txt

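A note on the -psm option: as far as I know, tesseract 4 and later spell the page segmentation option --psm, while the -psm form comes from the 3.x era that the Tika wiki command was written for. The sketch below picks the spelling from the version string. The helper name psm_flag is made up, and unrecognised version strings are simply assumed to be modern.

```shell
# Hypothetical helper: given the first line of `tesseract --version`
# (e.g. "tesseract 4.1.1"), print the page-segmentation flag to use.
psm_flag() {
    # Extract the major version number; empty if the line is not recognised.
    major=$(printf '%s\n' "$1" | sed -n 's/^tesseract[^0-9]*\([0-9][0-9]*\)\..*/\1/p')
    if [ -n "$major" ] && [ "$major" -lt 4 ]; then
        echo "-psm"
    else
        echo "--psm"     # 4.x and later, or unrecognised (assumed modern)
    fi
}

# Example call, guarded so it does nothing where tesseract or the image is absent.
SAMPLE="${SAMPLE:-/Scratch/ak19/sample.tif}"
if command -v tesseract >/dev/null 2>&1 && [ -f "$SAMPLE" ]; then
    # 3.x prints its version on stderr, hence the 2>&1.
    version_line=$(tesseract --version 2>&1 | head -n 1)
    tesseract "$SAMPLE" out "$(psm_flag "$version_line")" 3 > bla.txt 2>&1
fi
```

Since the source here was cloned from the tesseract git repository rather than the 3.02.02 tarball, the resulting build would most likely want --psm.
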
------------------------------------------------------------------------
WENT THE CASCADE-MAKE ROUTE TO COMPILE TESSERACT INSTEAD
------------------------------------------------------------------------
After lots of hard work, I've now got CASCADE-MAKE working to compile up
tesseract and its dependencies. Once it was compiled up and installed, and
before committing my cascade-make stuff for tesseract, I needed to do the
following to test that tesseract actually worked and could OCR sample.tif at last.



cd <any GS3 installation>
source ./gs3-setup.sh (to get GSDLOS set)
(Now cd into the tesseract/linux folder containing bin, lib, include, tessdata etc)
source ./setup.bash
(This will set $GEXTTESS_INSTALLED to point to the tesseract/linux folder)
> tesseract --list-langs
> tesseract /Scratch/ak19/sample.tif out
(generates out.txt containing the OCR-ed content)
then:
cat out.txt
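
The manual test above could be wrapped up as a small smoke test, for example to re-run after each rebuild. This is only a sketch. The helper name ocr_smoke_test and the SAMPLE variable are made up, and it assumes the environment has already been set up with the two source commands above so that tesseract is on the PATH.

```shell
# Hypothetical helper: OCR one image and succeed only if the resulting
# text file exists and is non-empty.
ocr_smoke_test() {
    image=$1
    outbase=$2      # tesseract writes its text to "$outbase.txt"
    tesseract "$image" "$outbase" >/dev/null 2>&1 || return 1
    [ -s "$outbase.txt" ]
}

# Example call, guarded so it does nothing where tesseract or the image is absent.
SAMPLE="${SAMPLE:-/Scratch/ak19/sample.tif}"
if command -v tesseract >/dev/null 2>&1 && [ -f "$SAMPLE" ]; then
    if ocr_smoke_test "$SAMPLE" out; then
        echo "OCR produced text in out.txt"
    else
        echo "OCR failed or produced no text" >&2
    fi
fi
```

A non-empty out.txt does not prove the text is right, so it is still worth looking at the file with cat the first time.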