
TESSERACT.sh@ 34190

Last change on this file since 34190 was 34190, checked in by ak19, 4 years ago
The tessdata folder was being created when compiling tesseract, and needn't be created and populated manually (except for the lang files), so there's less work for CASCADE-MAKE/TESSERACT.sh to do. However, the tessdata folder was being created in the linux/share folder. 'share' is probably a place where people expect tesseract's tessdata to be by default, so am updating the setup scripts to work with that, as I've done with CASCADE-MAKE/TESSERACT.sh. 2. Adding useful instructions for users on getting more OCR language scripts' support in new file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt, now included in the tesseract binary tarball too. Adjusted the README for us. 3. Removing the sample.jpg, converted from sample.tif, which I'd downloaded from online and for which I don't know the copyright. Replacing with sample.tif, a 96 DPI TIF file at 1870x2420 resolution produced from the first page of pdf05-notext.pdf by www.sejda.com/pdf-to-jpg. Moreover, this sample file contains lots of text, in 2 columns, not just 4 words like the original sample file. Good for testing a tesseract built from CASCADE-MAKE on. Also including the pdf05-notext-ocr-with-tikaTesseract.pdf itself from the tutorial sample files, but only Tika with Tesseract can work on PDFs and not Tesseract by itself, as indicated in the filename.
Property svn:executable set to *
File size: 2.7 KB

#!/bin/bash

package=tesseract
version=-5.0.0

progname=$0

source ../cascade-make/lib/cascade-lib.bash GEXTTESS ../.. $*

prefix=$GEXTTESS_INSTALLED

# See imagemagick ext
if [ "x$CROSSCONFIGURE_ARGS" != "x" ] ; then
    echo "WARNING: Crossconfiguring not supported yet"
fi

export CFLAGS="$CFLAGS -I$GEXTTESS_INSTALLED/include"
export CPPFLAGS="$CPPFLAGS -I$GEXTTESS_INSTALLED/include"
export CXXFLAGS="$CXXFLAGS -I$GEXTTESS_INSTALLED/include"
export LDFLAGS="$LDFLAGS -L$GEXTTESS_INSTALLED/lib"
export LD_LIBRARY_PATH="$GEXTTESS_INSTALLED/lib"
# Need PKG_CONFIG_PATH set to leptonica's lib/pkgconfig folder (containing lept.pc file)
export PKG_CONFIG_PATH=$GEXTTESS_INSTALLED/lib/pkgconfig

opt_run_untar $force_untar $auto_untar $package $version

# Need to do this for TESSERACT, before we can do configure->make->make install
pushd $package$version
libtoolize
#aclocal
#autoheader
sh autogen.sh
popd

opt_run_configure $force_config $auto_config $package $version $prefix \
    --disable-shared --enable-static

opt_run_make $compile $package $version
opt_run_make $install $package $version "install"
opt_run_make $clean $package $version "clean"
opt_run_make $distclean $package $version "distclean"

opt_run_tarclean $tarclean $package $version


echo "Installing basic tesseract languages support (tessdata)"
# Untar OCR language support tarball one level above TESSDATA_PREFIX ($GEXTTESS_INSTALLED/shared),
# then go into that folder to finish setting up language files.
cp "$GEXTTESS_DEVEL/packages/tessdata-langs.tar.gz" "$TESSDATA_PREFIX/../."
pushd "$TESSDATA_PREFIX/.."
tar -xvzf tessdata-langs.tar.gz
# Above creates linux/shared/tessdata-langs folder - move files there into
# linux/shared/tessdata (i.e. TESSDATA_PREFIX) and delete both tarball and temporary
# tessdata-langs folder created at current location of one level up from TESSDATA_PREFIX
mv tessdata-langs/*.traineddata "$TESSDATA_PREFIX/."
rm tessdata-langs.tar.gz
rm -rf tessdata-langs
popd


echo "Done installing basic tesseract languages for OCR (Optical Character Recognition, to recognise text from images)."
echo "Visit https://github.com/tesseract-ocr/tessdata for a full list of trained language data for OCR."
echo "To download OCR support for any specific language(s), note the 3-letter code of that language."
echo "Go into your $TESSDATA_PREFIX folder and for each language you want OCR abilities for, run: "
echo " wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-lang-code>.traineddata"
echo "To get all languages currently supported by Tesseract (beware, this may be a few Gigabytes), delete"
echo "$TESSDATA_PREFIX"
echo "and in $GEXTTESS_INSTALLED/shared run:"
echo " git clone https://github.com/tesseract-ocr/tessdata"
echo ""
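
The closing messages of the script tell users to fetch extra language files one at a time with wget. A minimal sketch of how that manual step could be wrapped in a helper is given here. It is not part of TESSERACT.sh: the function names tessdata_url and fetch_langs and the DRY_RUN switch are hypothetical, and accepting underscores in language codes is an assumption. Only the download URL pattern comes from the script itself.

```shell
#!/bin/bash
# Hypothetical helper, not part of TESSERACT.sh.
# The base URL is the one printed by the script's closing messages.
tessdata_base_url="https://github.com/tesseract-ocr/tessdata/raw/master"

# Print the download URL for one language code, e.g. "eng".
tessdata_url() {
    local code=$1
    # Assumption: codes consist of letters, digits and underscores only.
    case "$code" in
        ""|*[!A-Za-z0-9_]*)
            echo "invalid language code: '$code'" >&2
            return 1
            ;;
    esac
    echo "$tessdata_base_url/$code.traineddata"
}

# Download each requested language file into the given tessdata folder.
# With DRY_RUN=1 the wget commands are only printed, not executed.
fetch_langs() {
    local dest=$1
    shift
    local code url
    for code in "$@"; do
        url=$(tessdata_url "$code") || return 1
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "wget -P $dest $url"
        else
            wget -P "$dest" "$url" || return 1
        fi
    done
}
```

After sourcing these functions, a call such as fetch_langs "$TESSDATA_PREFIX" eng fra would place the two files in the tessdata folder. Setting DRY_RUN=1 first shows the commands without touching the network.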

source: gs2-extensions/tesseract/trunk/src/packages/CASCADE-MAKE/TESSERACT.sh@ 34190