README.txt@ 34199

Last change on this file since 34199 was 34199, checked in by ak19, 4 years ago

A makedists.sh script for gstika to make the cutdown zip and tarball. Updated the one for tesseract. I'm not sure what the CASCADE-MAKE makedist includes (yet), and it may be different for gnome-lib. But I have so far followed the manual steps for Imagemagick in creating the cutdown binary-only distribution tarballs and carefully controlling what goes in there (which is different for each of these gs extensions).

TODO:
Since zlib, libpng, tif, jpg, jpeg2000 are all from imagemagick, maybe use svn:externals
to bring them into packages?
svn:externals on individual files is possible, see
https://stackoverflow.com/questions/1355956/can-we-set-a-single-file-as-external-in-subversion

-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:

A. COMPILING TESSERACT GS2-EXTENSION
& CREATING THE CUT-DOWN BINARY-ONLY TARBALL

B. GETTING TIKA AND TESSERACT TO OCR A PDF


-------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION
& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------

To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run
Tesseract, we follow an equivalent version of the instructions for the imagemagick gs2-extension
at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README


1. Find a location on your machine


2. Check out the tesseract extension from gs2-extensions (use the svn repository URL, not
the Trac browser URL, which is only a web view of the repository):
svn co http://svn.greenstone.org/gs2-extensions/tesseract/trunk tesseract


3. Compile it all up (tesseract and dependencies):
cd tesseract
./CASCADE-MAKE.sh


4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:

cd src
source ./setup.bash

This should have set up env vars like GEXTTESS, GEXTTESS_INSTALLED, and TESSDATA_PREFIX,
which Tesseract needs to have set.

tesseract --list-langs
tesseract sample.tif out

This OCRs sample.tif and generates out.txt from it.

cat out.txt

If you run Tesseract with the hocr config file (passed as the last argument), you can get
the OCR output in nicely formatted html more representative of the input structure:

tesseract sample.tif hocrtest hocr

The OCR output in html format will be in hocrtest.hocr:

cat hocrtest.hocr


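A quick way to confirm that setup.bash did its job before running tesseract is sketched
below. This is only an illustration: the function name check_tess_env is made up for this
README, and the assumption that TESSDATA_PREFIX should point at an existing folder comes
from the description of it in section B.

```shell
# Hypothetical helper (not part of the extension): report which of the
# env vars that setup.bash is expected to export are still missing.
check_tess_env() {
  status=0
  for name in GEXTTESS GEXTTESS_INSTALLED TESSDATA_PREFIX; do
    # Indirectly read the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "not set: $name" >&2
      status=1
    fi
  done
  # Assumption: TESSDATA_PREFIX should name an existing tessdata folder.
  if [ -n "${TESSDATA_PREFIX:-}" ] && [ ! -d "$TESSDATA_PREFIX" ]; then
    echo "TESSDATA_PREFIX is not a directory: $TESSDATA_PREFIX" >&2
    status=1
  fi
  return $status
}

# Usage, after "source ./setup.bash":
#   check_tess_env && tesseract --list-langs
```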
5. If successful, create the cut down tesseract binary zip and tarball by running the following
at the toplevel of the extension checkout:

./makedists.sh <linux-x64|linux>


If creating the cut-down tesseract zip and tarball manually instead, then:
a. Create a folder called tesseract at the same level as src:
cd src
cd ..
mkdir tesseract

b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:

cp src/setup.ba* tesseract/.
mv src/linux tesseract/.

c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note that the file
names use the American spelling, LICENSE) from src/packages into the cut-down tesseract/linux:

cp src/packages/*LICENSE.txt tesseract/linux/.

d. COPY the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cut-down tesseract:
cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

e. REMOVE the folder "man" from tesseract/linux:
rm -rf tesseract/linux/man

f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
(The other things in that location are either unnecessary or created by tesseract's dependencies).


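Step f above has no command listed. One possible way to do it is sketched below. The
function name prune_share is made up for this README, and the -mindepth/-maxdepth options
assume GNU or BSD find. It refuses to delete anything if there is no tessdata folder, as
a guard against being run in the wrong directory.

```shell
# Hypothetical helper: delete every entry directly under the given share
# folder except "tessdata".
prune_share() {
  share_dir=$1
  if [ ! -d "$share_dir/tessdata" ]; then
    echo "no tessdata folder in: $share_dir" >&2
    return 1
  fi
  # Only look at direct children of share_dir; keep the one named tessdata.
  find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}

# Usage, from the toplevel of the extension checkout:
#   prune_share tesseract/linux/share
```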
6. Create a tarball of the cut down tesseract folder named tesseract-<os>-<arch>.tar.gz:
tar -cvzf tesseract-linux-x64.tar.gz tesseract


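Before committing, it can be worth listing the tarball to confirm that the cut-down steps
took effect. The sketch below is one way to do that. The function name
check_cutdown_tarball is made up for this README, and the two checks (no linux/man, a
linux/share/tessdata present) are simply steps e and f restated, not an exhaustive test.

```shell
# Hypothetical helper: sanity-check the contents of a cut-down tarball.
check_cutdown_tarball() {
  tarball=$1
  # List the archive members without extracting anything.
  listing=$(tar -tzf "$tarball") || return 1
  if printf '%s\n' "$listing" | grep -q '^tesseract/linux/man/'; then
    echo "tarball still contains tesseract/linux/man" >&2
    return 1
  fi
  if ! printf '%s\n' "$listing" | grep -q '^tesseract/linux/share/tessdata/'; then
    echo "tarball has no tesseract/linux/share/tessdata" >&2
    return 1
  fi
  return 0
}

# Usage:
#   check_cutdown_tarball tesseract-linux-x64.tar.gz
```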
7. Run svn up, add the tarball if it is new, and commit it to svn:
svn up
svn add tesseract-linux-x64.tar.gz
(or, if an earlier version was already committed, svn diff tesseract-linux-x64.tar.gz to
confirm that it is modified)
svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
Trying to do so, you'll see:
tesseract pdf05-notext.pdf notext
Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
Error in pixReadStream: Pdf reading is not supported
Error in pixRead: pix not read
Error during processing.

Tesseract can OCR the individual images that make up a page of a PDF. To OCR a whole PDF
with Tesseract, however, you need an additional tool that splits the PDF into its pages,
extracts the images from them, feeds each page's image into Tesseract to get it OCR-ed, and
then creates an html or txt file collating all the individual OCR-ed page content.

Tika does this.

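To make those steps concrete, here is a rough manual sketch of the same pipeline. It is
NOT how Tika works internally and is not used by this extension. It assumes that pdftoppm
(from poppler-utils, which this README does not otherwise mention) and tesseract are both
on the PATH, and the function name ocr_pdf_by_pages and the 300 dpi resolution are
arbitrary choices for illustration.

```shell
# Hypothetical helper: OCR a PDF page by page and collate the text.
# Assumes pdftoppm (poppler-utils) and tesseract are on the PATH.
ocr_pdf_by_pages() {
  pdf=$1
  out=$2
  work=$(mktemp -d) || return 1
  # 1. Split the PDF into one PNG image per page: page-1.png, page-2.png, ...
  if ! pdftoppm -r 300 -png "$pdf" "$work/page"; then
    rm -rf "$work"
    return 1
  fi
  : > "$out"
  # 2. OCR each page image; tesseract writes <outputbase>.txt.
  for img in "$work"/page-*.png; do
    [ -e "$img" ] || continue
    if ! tesseract "$img" "${img%.png}"; then
      rm -rf "$work"
      return 1
    fi
    # 3. Collate the per-page text into a single file, in page order.
    cat "${img%.png}.txt" >> "$out"
  done
  rm -rf "$work"
}

# Usage:
#   ocr_pdf_by_pages pdf05-notext.pdf notext.txt
```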
By default, if Tika can find Tesseract in the environment and TESSDATA_PREFIX is set to the
tessdata folder containing the language files, Tika is able to get Tesseract to OCR images
out of the box, but not PDFs. For PDFs, Tika extracts only the already extractable text, and
its (x)html/txt output contains no OCR until the following is set up correctly.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, we needed to do 2 more
things:
1. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml
file is configured correctly for the TesseractOCRParser and PDFParser.
2. In the tika-config.xml file passed to tika-app-*.jar, configure the "outputType" param
of the TesseractOCRParser as follows:
a. Set the "outputType" param to "txt" so that Tesseract produces its OCR output in .txt
format, which Tika will intercept and process,
b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set
the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
$TESSDATA_PREFIX/configs containing the following (taken from
https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
tessedit_create_hocr 1
hocr_font_info 0



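As a starting point for points 1 and 2 above, the sketch below writes out a small
tika-config.xml. Treat it as an untested illustration and check it against the Tika
documentation for your version. Only the "outputType" param comes from this README. The
function name write_tika_config, the "ocrStrategy" setting for the PDFParser, the use of
parser-exclude on the DefaultParser, and the -t (plain text output) option in the usage
line are all assumptions about what "configured correctly" could look like.

```shell
# Hypothetical helper: write a minimal tika-config.xml.
#   $1 = path of the config file to write
#   $2 = outputType for the TesseractOCRParser: "txt" (default) or "hocr"
write_tika_config() {
  config_file=$1
  output_type=${2:-txt}
  case "$output_type" in
    txt|hocr) ;;
    *) echo "outputType must be txt or hocr" >&2; return 1 ;;
  esac
  cat > "$config_file" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Everything except the two parsers configured explicitly below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">$output_type</param>
      </params>
    </parser>
    <!-- ASSUMPTION: ocrStrategy is not taken from this README. -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}

# Usage (paths are placeholders):
#   write_tika_config /path/to/a/tika-config.xml txt
#   java -jar tika-app-*.jar --config=/path/to/a/tika-config.xml -t pdf05-notext.pdf
```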
In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
cascade-making to get it to put the "configs" subfolder inside the installed Tesseract's
tessdata folder. The source version of tesseract has this folder, but it wasn't getting
included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.

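If an installed Tesseract turns out to be missing configs/hocr, a stopgap is to create the
file by hand with the two lines quoted above. The sketch below does that. The function
name ensure_hocr_config is made up for this README, and it assumes TESSDATA_PREFIX points
at the tessdata folder itself, as described earlier in this section.

```shell
# Hypothetical helper: make sure <tessdata>/configs/hocr exists.
#   $1 = tessdata folder (defaults to $TESSDATA_PREFIX)
ensure_hocr_config() {
  tessdata_dir=${1:-${TESSDATA_PREFIX:-}}
  if [ -z "$tessdata_dir" ]; then
    echo "no tessdata folder given and TESSDATA_PREFIX is not set" >&2
    return 1
  fi
  config="$tessdata_dir/configs/hocr"
  if [ -f "$config" ]; then
    echo "found $config"
    return 0
  fi
  mkdir -p "$tessdata_dir/configs" || return 1
  # Same two settings as the upstream tessconfigs hocr file quoted above.
  printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' > "$config" || return 1
  echo "created $config"
}

# Usage, after "source ./setup.bash":
#   ensure_hocr_config
```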
