README.txt@ 34204

Last change on this file since 34204 was 34204, checked in by ak19, 4 years ago
Another todo.
File size: 6.4 KB

TODO:
- It seems that the linux/lib/*.a files for libz, libpng, tiff, jpeg and jasper don't need to be
  present in the cut-down binary version for leptonica to work and for tesseract to use it to
  OCR images successfully. Since leptonica.a/lept.a appears self-contained (it was not generated
  as a shared library), these dependency libraries (zlib, png, etc.) can be removed before
  creating the tesseract tarball. (Saves 2.4 MB)
- Also turn the CASCADE-MAKE/*.sh files shared with imagemagick (ZLIB, LIBPNG, TIFF, JPEG,
  JPEG2000) into svn:externals.
+ DONE: Since zlib, libpng, tiff, jpeg and jpeg2000 all come from imagemagick, maybe use
  svn:externals to bring them into packages?
  svn:externals on individual files is possible, see
  https://stackoverflow.com/questions/1355956/can-we-set-a-single-file-as-external-in-subversion

-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:

A. COMPILING TESSERACT GS2-EXTENSION
   & CREATING THE CUT-DOWN BINARY-ONLY TARBALL

B. GETTING TIKA AND TESSERACT TO OCR A PDF


-------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION
   & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------

To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run
Tesseract, follow the equivalent of the instructions for the imagemagick gs2-extension
at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README


1. Choose a working directory on your machine for the checkout.


2. Check out the tesseract extension from gs2-extensions:
    svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract


3. Compile it all up (tesseract and its dependencies):
    cd tesseract
    ./CASCADE-MAKE.sh


4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:

    cd src
    source ./setup.bash

   This should have set up env vars like GEXTTESS, GEXTTESS_INSTALLED and TESSDATA_PREFIX,
   which Tesseract needs to have set.

    tesseract --list-langs
    tesseract sample.tif out

   This OCRs sample.tif and generates out.txt from it:

    cat out.txt

   If you run Tesseract with the hocr config file, you get the OCR output as nicely
   formatted html that is more representative of the input structure:

    tesseract sample.tif hocrtest hocr

   The OCR output in html format will be in hocrtest.hocr:

    cat hocrtest.hocr


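The environment check in step 4 can be scripted. The following is a minimal sketch, not part
of the extension: check_tess_env is a hypothetical helper name, and the only things taken
from this README are the three variable names. It reports each variable that is unset or
empty and returns non-zero if any is missing.

```shell
# Sketch only: check_tess_env is a hypothetical helper, not shipped with the extension.
# The variable names are the ones setup.bash is expected to export.
check_tess_env() {
    missing=0
    for name in GEXTTESS GEXTTESS_INSTALLED TESSDATA_PREFIX; do
        # Indirect lookup of the variable called $name (POSIX sh has no ${!name}).
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name is not set" >&2
            missing=1
        fi
    done
    return "$missing"
}
```

Usage, after sourcing setup.bash in the same terminal:

    check_tess_env && tesseract --list-langs
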
5. If successful, create the cut-down tesseract binary zip and tarball by running the following
   at the toplevel of the extension checkout:

    ./makedists.sh <linux-x64|linux>


   If creating the cut-down tesseract zip and tarball manually instead:
   a. Create a folder called tesseract at the same level as src:
       cd src
       cd ..
       mkdir tesseract

   b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down
      tesseract folder:

       cp src/setup.ba* tesseract/.
       mv src/linux tesseract/.

   c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note the American
      spelling of LICENSE!) from src/packages into the cut-down tesseract/linux:

       cp src/packages/*LICENSE.txt tesseract/linux/.

   d. COPY the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cut-down tesseract:
       cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

   e. REMOVE the "man" folder from tesseract/linux:
       rm -rf tesseract/linux/man

   f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
      (The other things in that location are either unnecessary or were created by tesseract's
      dependencies.)


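Step 5f gives no command. One possible way to do it is sketched below; keep_only_tessdata is
a hypothetical helper name and is not what makedists.sh necessarily does. It deletes every
top-level entry of the given share folder except tessdata, and refuses to run if there is no
tessdata folder there, as a guard against pointing it at the wrong directory.

```shell
# Sketch only: keep_only_tessdata is a hypothetical helper, not part of makedists.sh.
keep_only_tessdata() {
    share_dir="$1"
    # Guard: only prune a folder that really contains tessdata.
    if [ ! -d "$share_dir/tessdata" ]; then
        echo "no tessdata folder in $share_dir" >&2
        return 1
    fi
    # Remove each direct child of share_dir whose name is not "tessdata".
    find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}
```

Usage, from the toplevel of the extension checkout:

    keep_only_tessdata tesseract/linux/share
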
6. Create a tarball of the cut-down tesseract folder, named tesseract-<os>-<arch>.tar.gz:
    tar -cvzf tesseract-linux-x64.tar.gz tesseract


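Before committing, it may be worth confirming that the tarball has the expected layout. The
sketch below is an illustration under assumptions: check_tarball is a hypothetical helper
name, and the two checks simply mirror steps 5e and 5f (tessdata kept, man removed).

```shell
# Sketch only: check_tarball is a hypothetical helper, not part of the extension.
check_tarball() {
    # List the archive contents without extracting; fail if tar cannot read it.
    listing=$(tar -tzf "$1") || return 1
    # The language data must have been kept...
    echo "$listing" | grep -q '^tesseract/linux/share/tessdata/' || return 1
    # ...and the man folder must have been removed.
    if echo "$listing" | grep -q '^tesseract/linux/man/'; then
        return 1
    fi
    return 0
}
```

Usage:

    check_tarball tesseract-linux-x64.tar.gz && echo "tarball layout looks OK"
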
7. (svn add/svn up and) commit that to svn:
    svn up
    svn add tesseract-linux-x64.tar.gz
   (or, if there was an earlier version, run svn diff tesseract-linux-x64.tar.gz to confirm
   it was modified)
    svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
If you try, you'll see:
    tesseract pdf05-notext.pdf notext
    Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
    Error in pixReadStream: Pdf reading is not supported
    Error in pixRead: pix not read
    Error during processing.

Tesseract can OCR the individual images that make up a page of a PDF. To OCR whole PDFs
with Tesseract, though, you need an additional tool that splits the PDF into its pages,
extracts the images from them, feeds each page's image into Tesseract to get it OCR-ed and
then creates an html or txt file collating the OCR-ed content of all the pages.

Tika does this.

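To make the split/OCR/collate pipeline concrete, here is a rough manual equivalent. It is a
sketch for illustration only and is not how Tika works internally. It assumes poppler's
pdftoppm is on the PATH (this extension does not provide it), and ocr_pdf_pages is a
hypothetical helper name.

```shell
# Sketch only: ocr_pdf_pages is a hypothetical helper. It assumes poppler's pdftoppm
# is installed separately; tesseract is the one set up by setup.bash.
ocr_pdf_pages() {
    pdf="$1"
    out_txt="$2"
    work_dir=$(mktemp -d) || return 1

    # Render every page of the PDF to its own PNG image at 300 dpi.
    pdftoppm -png -r 300 "$pdf" "$work_dir/page" || return 1

    # OCR each page image and append its text to the collated output file.
    # pdftoppm zero-pads the page numbers, so the sorted glob follows page order.
    : > "$out_txt"
    for img in "$work_dir"/page-*.png; do
        [ -e "$img" ] || continue
        # tesseract writes <outputbase>.txt, here next to the page image.
        tesseract "$img" "${img%.png}" || return 1
        cat "${img%.png}.txt" >> "$out_txt"
    done
    rm -rf "$work_dir"
}
```

Usage:

    ocr_pdf_pages pdf05-notext.pdf notext.txt
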
By default, if Tika is on the environment and TESSDATA_PREFIX is set to the tessdata folder
containing the language files, Tika can get Tesseract to OCR images out of the box, but not
PDFs. Until the following is set up correctly, Tika performs no OCR on PDFs: its (x)html/txt
output contains only the text that was already extractable from the PDF.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, 2 more things were
needed:
1. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml
   file is configured correctly for the TesseractOCRParser and PDFParser.
2. The tika-config.xml file passed to tika-app-*.jar should set the "outputType" param of
   the TesseractOCRParser as follows:
   a. Set the "outputType" param to "txt" so that Tesseract produces its OCR output in .txt
      format, which Tika will intercept and process,
   b. OR, if you want the OCR generated by Tesseract to be in hocr (html ocr) format, set
      the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
      $TESSDATA_PREFIX/configs containing the following (taken from
      https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
          tessedit_create_hocr 1
          hocr_font_info 0



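As an illustration of points 1 and 2, the sketch below writes a tika-config.xml of the
general shape Tika 1.x accepts. Treat it as a starting point under assumptions rather than
the exact file used here: write_tika_config is a hypothetical helper name, the "ocrStrategy"
param on the PDFParser and its value are assumptions not stated in this README, and only the
parser names and the "outputType" param come from the text above.

```shell
# Sketch only: write_tika_config is a hypothetical helper. "outputType" comes from this
# README; "ocrStrategy" and its value are assumptions to be checked against the Tika docs.
write_tika_config() {
    config_path="$1"
    output_type="${2:-txt}"    # "txt" or "hocr"
    cat > "$config_path" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">$output_type</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}
```

Usage (the --text output option is an assumption; pick whichever Tika output option you need):

    write_tika_config tika-config.xml txt
    java -jar tika-app-*.jar --config=tika-config.xml --text pdf05-notext.pdf
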
In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
cascade-making so that it puts the "configs" subfolder inside the installed Tesseract's
tessdata folder. The source version of tesseract has this folder, but it wasn't getting
included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.

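If an installed Tesseract turns out to lack the configs/hocr file, one possible fallback is
to create it by hand. The sketch below assumes TESSDATA_PREFIX points at the tessdata folder
as described in section A; write_hocr_config is a hypothetical helper name, and the two
config lines are the ones quoted from the tessconfigs repository above.

```shell
# Sketch only: write_hocr_config is a hypothetical helper, not part of the extension.
# It assumes TESSDATA_PREFIX points at the tessdata folder.
write_hocr_config() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    mkdir -p "$TESSDATA_PREFIX/configs" || return 1
    # Write the two settings that make up the hocr config file.
    printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' \
        > "$TESSDATA_PREFIX/configs/hocr"
}
```

Usage, after sourcing setup.bash:

    write_hocr_config && cat "$TESSDATA_PREFIX/configs/hocr"
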
