source: gs2-extensions/tesseract/trunk/src/LinksAndNotesOnCompilingManually.txt@ 34180

Last change on this file since 34180 was 34180, checked in by ak19, 4 years ago

Gnome-lib has setup.bash_old and setup.bat_old, but imagemagick and pdf-box (and now gstika) have setup.bash and setup.bat. I think this is because gnome-lib is needed for compiling Greenstone but doesn't get loaded as a regular extension when running GS3 or building collections, whereas the other gs2-extensions do (so GEXT is set for pdfbox and now gstika etc.). I previously followed the gnome-lib pattern for the tesseract gs2-extension, and sourcing gs3-setup didn't detect it until tesseract's setup.bash_old was renamed to setup.bash. So I think renaming that and setup.bat_old on svn is the way forward. I think.

Aim: tutorial on using UnknownConverterPlugin + Tika (default apache tika-app jar) + Tesseract
to get users to OCR their PDFs.

Tika already works with UnknownConverterPlugin,
but we need OCR-ing abilities.
Tika is supposed to work well with Tesseract (OCR),
so I wanted to set up Tesseract.

I tried to compile things up locally, but ended up needing
libz, libpng, libjpeg and libtiff, which imagemagick already has (and libgif too actually).
So I ended up setting up Tesseract with Dr Bainbridge's Cascade-Make way of doing things,
since that would ultimately need to happen anyway if my attempts with Tesseract + Tika are
successful. With Cascade-Make I was at last successful in getting a working tesseract
installed.

--------------------------------------------------------------------------------------------------
LINKS: BACKGROUND READING ON TIKA WITH OCR USING TESSERACT, COMPILING TESSERACT ON LINUX, ETC
--------------------------------------------------------------------------------------------------
https://www.linux.com/news/googles-tesseract-ocr-engine-quantum-leap-forward/
Google's Tesseract OCR engine is a quantum leap forward
September 28, 2006

https://sourceforge.net/projects/tesseract-ocr/
A commercial quality OCR engine originally developed at HP between 1985 and 1995. In 1995, this engine was among the top 3 evaluated by UNLV. It was open-sourced by HP and UNLV in 2005. (NOTE: We're migrating to code.google.com. Please see the forums.)

https://github.com/tesseract-ocr/tesseract/wiki/Downloads
    https://github.com/tesseract-ocr/tessdoc
    https://tesseract-ocr.github.io/tessdoc/Downloads
    https://github.com/tesseract-ocr/tesseract/wiki#running-tesseract

https://github.com/tesseract-ocr/tesseract/releases/tag/3.02.02 (source code tarball)

https://stackoverflow.com/questions/29603749/how-to-integrate-tesseract-ocr-with-tika
https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
    Windows: https://github.com/UB-Mannheim/tesseract/wiki

https://issues.apache.org/jira/browse/TIKA-3035
    indicates that tika-app versions 1.23 and 1.24 are indeed the latest, as the discussion has comments from 2020
    indicates that tesseract will work with tika-app too, not just tika-server?

https://www.howtoforge.com/tutorial/tesseract-ocr-installation-and-usage-on-ubuntu-16-04/
https://www.linux.com/training-tutorials/using-tesseract-ubuntu/
(Compiling tesseract on Ubuntu too)
https://asahinow.blogspot.com/2019/04/how-to-compile-tesseract-40-in-ubuntu.html
(easier-looking instructions for compiling tesseract on Ubuntu, although they install it in a system location)

--------------------------------------------
COMPILING FROM SOURCE
--------------------------------------------
To compile tesseract from source, I attempted to follow the instructions at
https://asahinow.blogspot.com/2019/04/how-to-compile-tesseract-40-in-ubuntu.html

1. cd /Scratch/ak19/sources
   wget http://www.leptonica.org/source/leptonica-1.79.0.tar.gz
   tar -xvzf leptonica-1.79.0.tar.gz
   mkdir -p /Scratch/ak19/packages/leptonica
   cd leptonica-1.79.0
   ./configure --help
   ./configure --prefix=/Scratch/ak19/packages/leptonica --exec-prefix=/Scratch/ak19/packages/leptonica/
   make && make install

2. When running autogen.sh in tesseract, I found I needed libtool/glibtool, going by an error message
   similar to the one described in https://stackoverflow.com/questions/14841946/trouble-when-running-autogen-sh

   http://www.gnu.org/software/libtool/
   # (crossed out, not used) git clone git://git.savannah.gnu.org/libtool.git
   wget http://ftpmirror.gnu.org/libtool/libtool-2.4.6.tar.gz
   tar -xvzf libtool-2.4.6.tar.gz
   cd libtool-2.4.6
   ./configure --prefix=/Scratch/ak19/packages/libtool
   make
   make install

3. cd /Scratch/ak19/sources
   git clone https://github.com/tesseract-ocr/tesseract.git
   cd tesseract
   export PATH=/Scratch/ak19/packages/libtool/bin:$PATH
   # When I ran sh autogen.sh, I saw this error:
   # https://stackoverflow.com/questions/18978252/error-libtool-library-used-but-libtool-is-undefined
   # and followed the solution there:
   libtoolize
   aclocal
   autoheader
   sh autogen.sh

   cd /Scratch/ak19/packages
   mkdir tesseract
   mkdir -p tesseract/lib
   mkdir -p tesseract/include

   cd /Scratch/ak19/sources/tesseract
   ./configure --help | less
   # need leptonica on PATH
   export PATH=/Scratch/ak19/packages/leptonica/bin:$PATH

   # Without PKG_CONFIG_PATH set, configure at this stage fails with the errors described in
   # https://github.com/DanBloomberg/leptonica/issues/410
   export PKG_CONFIG_PATH=/Scratch/ak19/packages/leptonica/lib/pkgconfig

   ./configure --prefix=/Scratch/ak19/packages/tesseract
   # (crossed out, not used) LDFLAGS="-L/Scratch/ak19/packages/tesseract/lib" CFLAGS="-I/Scratch/ak19/packages/tesseract/include" make
   LDFLAGS=<leptonica stuff!> CFLAGS=<leptonica stuff!> make
   make install

(The above looked like it compiled successfully, but it failed to OCR sample.tif.)
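
The PKG_CONFIG_PATH export above replaces any value that was already set. A minimal sketch of
prepending the leptonica directory instead, assuming a POSIX shell. prepend_dir is a made-up
helper name, not part of any tool, and the commented-out pkg-config check assumes leptonica
installs its pkg-config file under the module name "lept".

```shell
# Hypothetical helper: print DIR followed by the existing colon-separated
# value (if any), so an already-set search path is kept rather than replaced.
prepend_dir() {
    dir=$1
    current=$2
    if [ -z "$current" ]; then
        printf '%s\n' "$dir"
    else
        printf '%s:%s\n' "$dir" "$current"
    fi
}

# Prefix taken from the notes above.
LEPT_PREFIX=/Scratch/ak19/packages/leptonica

PKG_CONFIG_PATH=$(prepend_dir "$LEPT_PREFIX/lib/pkgconfig" "${PKG_CONFIG_PATH:-}")
export PKG_CONFIG_PATH

# Assumption: the module is called "lept". If leptonica is visible to
# pkg-config, this prints its version; run it before tesseract's ./configure.
# pkg-config --modversion lept
```

If the pkg-config check prints nothing or errors out, tesseract's configure is unlikely to find leptonica either.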

---------------------------------------------------------------
TRYING TO RUN MY (POORLY) COMPILED TESSERACT INSTALLATION
---------------------------------------------------------------
Language files for tesseract, quoting from
https://stackoverflow.com/questions/14800730/tesseract-running-error

    You can grab eng.traineddata from GitHub:

    wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

    Check https://github.com/tesseract-ocr/tessdata for a full list of trained language data.

    When you grab the file(s), move them to the /usr/local/share/tessdata folder. Warning: some Linux distributions (such as openSUSE and Ubuntu) may be expecting it in /usr/share/tessdata instead.

    # If you got the data from Google, unzip it first!
    gunzip eng.traineddata.gz
    # Move the data
    sudo mv -v eng.traineddata /usr/local/share/tessdata/
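
Whichever tessdata folder ends up being used, it can help to confirm which language files are
actually sitting in it before involving tesseract at all. A minimal sketch, assuming a POSIX
shell; list_traineddata is a made-up helper name, and it only looks at file names, not at
whether the files are valid.

```shell
# Hypothetical helper: print the language code of every *.traineddata file
# in the given folder, one per line. Prints nothing if the folder is empty
# or does not exist.
list_traineddata() {
    dir=$1
    for f in "$dir"/*.traineddata; do
        # An unmatched glob stays as the literal pattern, so skip it.
        [ -e "$f" ] || continue
        basename "$f" .traineddata
    done
}

# Example use (commented out so that nothing runs on sourcing):
# list_traineddata /usr/local/share/tessdata
# list_traineddata "$TESSDATA_PREFIX"
```

An empty listing means the folder holds no language files, so tesseract cannot load any language from it.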

(1) cd /Scratch/ak19/packages/tesseract
    mkdir tessdata
    cd tessdata

(2) Install all the language files you want from https://github.com/tesseract-ocr/tessdata (via https://github.com/tesseract-ocr/).
(If you installed tesseract with a package manager, then you're advised to install language packs via the package manager too.
How to do this is explained at https://stackoverflow.com/questions/14800730/tesseract-running-error.)
Since we installed tesseract from source, we can download the language files directly too:

    wget https://github.com/tesseract-ocr/tessdata/raw/master/eng.traineddata

(3) When done, export the environment variable pointing to the language files for tesseract to find:
    export TESSDATA_PREFIX='/Scratch/ak19/packages/tesseract/tessdata'

(4) Put tesseract on the PATH to run the OCR:
    export PATH=/Scratch/ak19/packages/tesseract/bin:$PATH

(5) Test the languages now available:
    > tesseract --list-langs

    Error in pixReadMemTiff: function not present
    Error in pixReadMem: tiff: no pix returned
    Error in pixaGenerateFontFromString: pix not made
    Error in bmfCreate: font pixa not made
    List of available languages (1):
    eng

(At least English is installed now)

(6) The above errors are described in https://stackoverflow.com/questions/33659458/tesseract-image-issue

    Answer by BigBen (Nov 15 '15):
    Step 1: Install libjpeg, libtiff, libpng. Step 2: Recompile and install the leptonica.

    Another answer:
    The default image format for early tesseract versions was .tif or .tiff. In newer versions you should install the following format packages (libgif libjpeg libpng libtiff zlib). Leptonica uses these packages to read images and tesseract uses leptonica to analyse images.

    libgif libjpeg libpng libtiff zlib

    Finally recompile and install leptonica as in @BigBen's answer.

We have all but libgif in imagemagick:
    /Scratch/ak19/GS3bin_04June2020/gs2build/bin/linux/imagemagick>

    export MAGICK_HOME=/Scratch/ak19/GS3bin_04June2020/gs2build/bin/linux/imagemagick

    RECOMPILE leptonica:
    rm -rf /Scratch/ak19/packages/leptonica/
    cd /Scratch/ak19/sources/leptonica-1.79.0
    ./configure --prefix=/Scratch/ak19/packages/leptonica
    # DO I NEED CFLAGS? But I have no $MAGICK_HOME/include folder, so would have to recompile imagemagick first...
    # (configure is what probes for libtiff etc., so the library flags probably need to be given to ./configure too, not only to make)
    LDFLAGS="-L$MAGICK_HOME/lib" make
    make install

(7) Run on sample tiff file (containing a line or 2 of text) obtained from https://alternatiff.com/testpage.html
Command from https://cwiki.apache.org/confluence/display/TIKA/TikaOCR
    export TESSDATA_PREFIX='/Scratch/ak19/packages/tesseract/tessdata' \
      && export PATH=/Scratch/ak19/packages/tesseract/bin:$PATH
    tesseract -psm 3 /Scratch/ak19/sample.tif out.txt > bla.txt 2>&1

    (My terminal is destroyed by some garbled encoding/charset scheme)

    cat out.txt
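
The "function not present" lines in step (5) are the sign that leptonica was built without an
image library. A minimal sketch for pulling the affected formats out of such output, assuming a
POSIX shell with a sed that supports -E. missing_image_formats is a made-up helper name, and the
pattern is modelled only on the pixReadMemTiff line seen above, so other formats may be worded
differently.

```shell
# Hypothetical helper: read tesseract/leptonica output on stdin and print
# the image formats whose reader is reported as "function not present",
# e.g. "Tiff" for "Error in pixReadMemTiff: function not present".
missing_image_formats() {
    sed -nE 's/^[[:space:]]*Error in pixRead(Mem|Stream)?([A-Za-z0-9]+): function not present.*$/\2/p' \
        | sort -u
}

# Example use (commented out, needs tesseract on the PATH):
# tesseract --list-langs 2>&1 | missing_image_formats
```

No output from the helper means no such error lines were seen. That is not proof that every image format is supported.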

------------------------------------------------------------------------
WENT THE CASCADE-MAKE ROUTE TO COMPILE TESSERACT INSTEAD
------------------------------------------------------------------------
After lots of hard work, I've now got CASCADE-MAKE working to compile up
tesseract and its dependencies. Once compiled up and installed, and
before committing my cascade-make stuff for tesseract, I needed to do the
following to test that tesseract actually worked and could OCR sample.tif at last.

    cd <any GS3 installation>
    source ./gs3-setup.sh (to get GSDLOS set)
    (Now cd into the tesseract/linux folder containing bin, lib, include, tessdata etc)
    source ./setup.bash
    (This will set $GEXTTESS_INSTALLED to point to the tesseract/linux folder)
    > tesseract --list-langs
    > tesseract /Scratch/ak19/sample.tif out
    (generates out.txt containing the OCR-ed content)
    then:
    cat out.txt
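
The final check above can be made a little stricter than just running cat. A minimal sketch,
assuming a POSIX shell; ocr_output_ok is a made-up helper name. It relies on tesseract writing
its text output to <output base>.txt, as in the run above, and only tests that the file exists
and is non-empty, not that the text is correct.

```shell
# Hypothetical helper: succeed only if BASE.txt exists and is non-empty,
# where BASE is the output base name that was given to tesseract.
ocr_output_ok() {
    base=$1
    [ -s "${base}.txt" ]
}

# Example use (commented out, needs the tesseract set up by setup.bash):
# tesseract /Scratch/ak19/sample.tif out && ocr_output_ok out && cat out.txt
```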