README.txt@ 34190

Last change on this file since 34190 was 34190, checked in by ak19, 4 years ago

1. The tessdata folder was being created when compiling tesseract, and needn't be created and populated manually (except for the lang files), so there's less work for CASCADE-MAKE/TESSERACT.sh to do. However, the tessdata folder was being created in the linux/share folder. 'share' is probably where people expect tesseract's tessdata to be by default, so am updating the setup scripts to work with that, as I've done with CASCADE-MAKE/TESSERACT.sh.
2. Adding useful instructions for users on getting more OCR language scripts' support in new file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt, now included in the tesseract binary tarball too. Adjusted the README for us.
3. Removing the sample.jpg, converted from sample.tif which I'd downloaded from online and whose copyright status I don't know. Replacing with sample.tif, a 96 DPI TIF file at 1870x2420 resolution produced from the first page of pdf05-notext.pdf by www.sejda.com/pdf-to-jpg. Moreover, this sample file contains lots of text, in 2 columns, not just 4 words like the original sample file. Good for testing a tesseract built from CASCADE-MAKE on. Also including the pdf05-notext-ocr-with-tikaTesseract.pdf itself from the tutorial sample files, but only Tika with Tesseract can work on PDFs and not Tesseract by itself, as indicated in the filename.

-------------------------------------------------
CONTENTS
-------------------------------------------------
In this file:

A. COMPILING TESSERACT GS2-EXTENSION
   & CREATING THE CUT-DOWN BINARY-ONLY TARBALL

B. GETTING TIKA AND TESSERACT TO OCR A PDF


-------------------------------------------------
A. COMPILING TESSERACT GS2-EXTENSION
   & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
-------------------------------------------------

To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run
Tesseract, we follow an equivalent version of the instructions for the imagemagick gs2-extension
at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README


1. Choose a working directory on your machine and cd into it.


2. Check out the tesseract extension from gs2-extensions:
    svn co http://svn.greenstone.org/gs2-extensions/tesseract/trunk tesseract


3. Compile it all up (tesseract and dependencies):
    cd tesseract
    ./CASCADE-MAKE.sh


34	4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:
35
36	cd src
37	source ./setup.bash
38
39	This should have set up env vars like GEXTTESS, GEXTTESS_INSTALLED, and TESSDATA_PREFIX
40	which Tesseract needs to have set
41
42	tesseract --list-langs
43	tesseract sample.tif out
44
45	OCRs sample.tif and generates out.txt from it.
46
47	cat out.txt
48
49	If you run Tesseract with the hocr config file, you can get the OCR output in
50	nicely formatted html more representative of the input structure:
51
52	tesseract sample.tif hocrtest
53
54	The OCR output in html format will be in hocrtest.hocr:
55
56	cat hocrtest.hocr
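
The checks in step 4 can be gathered into one small script. This is only a sketch: the
function name check_tess_env is made up for illustration, and it tests just the three
variables named above (setup.bash may well set others). It is meant to be run from src
after sourcing setup.bash.

```shell
# Hypothetical helper (not part of the extension): report which of the
# variables that setup.bash is expected to export are unset or empty.
# Prints one "missing: NAME" line per problem; returns non-zero if any.
check_tess_env() {
    missing=0
    for name in GEXTTESS GEXTTESS_INSTALLED TESSDATA_PREFIX; do
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Run the OCR smoke test only if the environment looks right and both the
# tesseract binary and the sample image are actually available.
if check_tess_env && command -v tesseract >/dev/null 2>&1 && [ -f sample.tif ]; then
    tesseract --list-langs
    tesseract sample.tif out && cat out.txt
fi
```

If a variable is reported as missing, the most likely cause is that setup.bash was
executed rather than sourced, so its exports never reached the current shell.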


5. If successful,

a. From the top level of the checkout (cd .. if you are still in src from step 4),
create a folder called tesseract at the same level as src:
    mkdir tesseract

b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:

    cp src/setup.ba* tesseract/.
    mv src/linux tesseract/.

c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
American spelling!) from src/packages into the cut-down tesseract/linux:

    cp src/packages/*LICENSE.txt tesseract/linux/.

d. COPY the top-level GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt file into the cut-down tesseract:
    cp GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt tesseract/.

e. REMOVE folder "man" from tesseract/linux:
    rm -rf tesseract/linux/man

f. REMOVE everything EXCEPT the "tessdata" folder in tesseract/linux/share.
(The other things in that location are either unnecessary or created by tesseract's dependencies.)
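
Step 5f gives no command, so here is one possible way to do it. This is a sketch, not
the method actually used for the committed tarball: prune_share is a hypothetical
helper name, and it relies on the -mindepth/-maxdepth options of GNU find. It refuses
to delete anything if the tessdata folder is not where it is expected.

```shell
# Hypothetical helper: delete every entry directly under <share_dir>
# except the "tessdata" folder, which is left untouched.
prune_share() {
    share_dir=$1
    if [ ! -d "$share_dir/tessdata" ]; then
        echo "no tessdata folder in $share_dir, refusing to prune" >&2
        return 1
    fi
    # -mindepth 1 skips share_dir itself; -maxdepth 1 stops find from
    # descending into (and matching files inside) tessdata.
    find "$share_dir" -mindepth 1 -maxdepth 1 ! -name tessdata -exec rm -rf {} +
}

# Usage, from the top level of the checkout:
#   prune_share tesseract/linux/share
```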


6. Create a tarball of the cut-down tesseract folder named tesseract-<os>-<arch>.tar.gz:
    tar -cvzf tesseract-linux-x64.tar.gz tesseract
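
If you build on more than one machine, the tarball name can be derived rather than typed.
The sketch below is an assumption-laden illustration: tarball_name is a hypothetical
helper, and the only mapping taken from this README is that a 64-bit Linux build is
called linux-x64. Any other architecture string is passed through unchanged, which may
not match the naming used elsewhere in gs2-extensions.

```shell
# Hypothetical helper: build "tesseract-<os>-<arch>.tar.gz" from the
# values reported by uname -s and uname -m.
tarball_name() {
    os=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    case $2 in
        x86_64|amd64) arch=x64 ;;   # matches the linux-x64 example above
        *)            arch=$2 ;;    # assumption: pass anything else through
    esac
    printf 'tesseract-%s-%s.tar.gz\n' "$os" "$arch"
}

# Usage, from the top level of the checkout:
#   tar -cvzf "$(tarball_name "$(uname -s)" "$(uname -m)")" tesseract
```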


7. (Add/SVN up and) commit that to svn:
    svn up
    svn add tesseract-linux-x64.tar.gz
    (or, if an earlier version was already committed, run
    svn diff tesseract-linux-x64.tar.gz to confirm it is modified)
    svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


-------------------------------------------------
B. GETTING TIKA AND TESSERACT TO OCR A PDF
-------------------------------------------------
Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
If you try, you'll see:
    tesseract pdf05-notext.pdf notext
    Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
    Error in pixReadStream: Pdf reading is not supported
    Error in pixRead: pix not read
    Error during processing.

Tesseract can OCR the individual images that make up the pages of a PDF. But to OCR a PDF
with Tesseract, you need an additional tool to split the PDF into its pages, extract the
images from them, feed each page's image into Tesseract to get it OCR-ed, and then create
an html or txt file collating all the individual OCR-ed page content.

Tika does this.
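
Purely to illustrate the split / OCR / collate sequence described above, here is how it
could be done by hand without Tika. This is not how Tika works internally and is not
part of the extension: ocr_pdf is a hypothetical helper, it assumes that pdftoppm (from
poppler-utils) is installed alongside tesseract, and the 300 DPI rendering resolution is
an arbitrary choice.

```shell
# Hypothetical helper: OCR every page of <pdf> and collate the text
# into <out_txt>. Assumes pdftoppm and tesseract are on the PATH.
ocr_pdf() {
    pdf=$1
    out_txt=$2
    work=$(mktemp -d) || return 1

    # 1. Split the PDF into one PNG image per page (page-1.png, ...;
    #    pdftoppm zero-pads the number so the files sort in page order).
    pdftoppm -r 300 -png "$pdf" "$work/page" || { rm -rf "$work"; return 1; }

    # 2. OCR each page image, 3. appending the results in page order.
    : > "$out_txt"
    for img in "$work"/page-*.png; do
        [ -e "$img" ] || continue
        tesseract "$img" "${img%.png}" >/dev/null 2>&1 || { rm -rf "$work"; return 1; }
        cat "${img%.png}.txt" >> "$out_txt"
    done
    rm -rf "$work"
}

# Usage:
#   ocr_pdf pdf05-notext.pdf notext.txt
```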

By default, if Tika is available in the environment and TESSDATA_PREFIX is set to the tessdata
folder containing the language files, Tika is able to get Tesseract to OCR images out of the box.
But not PDFs: until the following is set up correctly, Tika's (x)html/txt output for a PDF
contains only the PDF's directly extractable text and no OCR content.

To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, we needed to do 2 more
things:
1. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml
file is configured correctly for the TesseractOCRParser and PDFParser.
2. The tika-config.xml file passed to tika-app-*.jar should set the "outputType" param of
the TesseractOCRParser as follows:
a. EITHER set the "outputType" param to "txt", so that Tesseract produces its OCR output
in .txt format, which Tika will intercept and process,
b. OR, if you want the OCR generated by Tesseract to be in hocr (html ocr) format, set
the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
$TESSDATA_PREFIX/configs containing the following (taken from
https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
    tessedit_create_hocr 1
    hocr_font_info 0
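
A tika-config.xml along these lines could be generated as sketched below. Treat it as a
starting point only. The "outputType" param and its "txt"/"hocr" values come from the
notes above; everything else is an assumption that should be checked against the Tika
1.24 documentation, in particular the PDFParser "ocrStrategy" setting, which this README
does not specify. write_tika_config is a hypothetical helper name.

```shell
# Hypothetical helper: write a tika-config.xml to <config_path> with the
# TesseractOCRParser outputType set to <output_type> (default: txt).
write_tika_config() {
    config_path=$1
    output_type=${2:-txt}
    cat > "$config_path" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">$output_type</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- ASSUMPTION: not taken from this README; verify for your Tika version -->
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}

# Usage (the paths are placeholders):
#   write_tika_config /path/to/a/tika-config.xml hocr
#   java -jar tika-app-*.jar --config=/path/to/a/tika-config.xml --text pdf05-notext.pdf
```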



In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
cascade-making so that it puts the "configs" subfolder inside the installed Tesseract's
tessdata folder. The source version of tesseract has this folder, but it wasn't getting
included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
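
If you are working with a build that still lacks the configs subfolder, the hocr config
can be created by hand. The sketch below assumes, as described above, that
TESSDATA_PREFIX points at the tessdata folder itself; ensure_hocr_config is a
hypothetical helper, and it leaves any existing hocr file untouched.

```shell
# Hypothetical helper: make sure <tessdata>/configs/hocr exists.
# <tessdata> defaults to $TESSDATA_PREFIX when no argument is given.
ensure_hocr_config() {
    tessdata=${1:-${TESSDATA_PREFIX:-}}
    if [ -z "$tessdata" ]; then
        echo "TESSDATA_PREFIX is not set and no folder was given" >&2
        return 1
    fi
    mkdir -p "$tessdata/configs" || return 1
    # Only write the file if it is missing, so a customised copy survives.
    if [ ! -f "$tessdata/configs/hocr" ]; then
        printf 'tessedit_create_hocr 1\nhocr_font_info 0\n' > "$tessdata/configs/hocr"
    fi
}

# Usage, after sourcing setup.bash:
#   ensure_hocr_config
```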