README.txt@ 34186

Last change on this file since 34186 was 34186, checked in by ak19, 4 years ago

In order to get tika + tesseract to OCR PDFs (note that tesseract can't OCR PDFs on its own), need to pass a tika-config.xml file to tika that is configured to use txt OR hocr as outputType, and if outputType=hocr then need to have the tesseract/tessdata/configs folder contain a file called hocr at minimum. Now the build process ensures that the tessdata/configs and other tessdata subfolders in the extracted tesseract source package get copied across into the GEXTTESS_INSTALLED install location. Updating the README with the notes and the tesseract bin tarball.

1	-------------------------------------------------
2	CONTENTS
3	-------------------------------------------------
4	In this file:
5
6	A. COMPILING TESSERACT GS2-EXTENSION
7	& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
8
9	B. GETTING TIKA AND TESSERACT TO OCR A PDF
10
11
12	-------------------------------------------------
13	A. COMPILING TESSERACT GS2-EXTENSION
14	& CREATING THE CUT-DOWN BINARY-ONLY TARBALL
15	-------------------------------------------------
16
17	To compile the Tesseract gs2-extension and then create the "binary" tarball needed to run
18	Tesseract, we follow an equivalent version of the instructions for the imagemagick gs2-extension
19	at http://trac.greenstone.org/browser/gs2-extensions/imagemagick/trunk/README
20
21	1. Choose a working directory on your machine and cd into it
22
23	2. Check out the tesseract extension from gs2-extensions
24	svn co http://trac.greenstone.org/browser/gs2-extensions/tesseract/trunk tesseract
25
26	3. Compile it all up (tesseract and dependencies):
27	cd tesseract
28	./CASCADE-MAKE.sh
29
30	4. Open a fresh terminal and check that the tesseract now installed in src/linux/bin works:
31
32	cd src
33	source ./setup.bash
34
35	This should set up environment variables such as GEXTTESS, GEXTTESS_INSTALLED and TESSDATA_PREFIX,
36	which Tesseract needs in order to run.
37
38	tesseract --list-langs
39	tesseract sample.tif out
40
41	The last command OCRs sample.tif and writes the result to out.txt, which you can inspect with:
42
43	cat out.txt
44
45	5. If successful,
46
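47	a. From the checkout folder (the one that contains src), create a folder called tesseract
48	alongside src. If you are still inside src after step 4, go up one level first:
49	cd ..
50	mkdir tesseract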
51
52	b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
53
54	cp src/setup.ba* tesseract/.
55	mv src/linux tesseract/.
56
57	c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
58	American spelling!) from src/packages into the cut-down tesseract/linux:
59
60	cp src/packages/*LICENSE.txt tesseract/linux/.
61
62	d. REMOVE folder "man" from tesseract/linux:
63	rm -rf tesseract/linux/man
64
65	6. Create a tarball of the cut-down tesseract folder named tesseract-<os>-<arch>.tar.gz:
66	tar -cvzf tesseract-linux-x64.tar.gz tesseract
67
68	7. Update your working copy, add the tarball and commit it to svn:
69	svn up
70	svn add tesseract-linux-x64.tar.gz
71	(or, if an earlier version is already in svn, run svn diff tesseract-linux-x64.tar.gz instead to confirm that it shows as modified)
72	svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz
73
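Steps 5 and 6 above can be collected into one shell function. This is a minimal sketch rather than part of the extension: the function name make_cutdown_tarball and its two arguments are hypothetical, and it assumes the first argument is the checkout folder containing an already compiled src/linux.

```shell
# Sketch of steps 5 and 6. Only the cp/mv/rm/tar commands come from the steps
# above; the function name and arguments are hypothetical.
# The body runs in a subshell so the caller's working directory is unchanged.
make_cutdown_tarball() (
    checkout_dir=$1                           # folder that contains src/
    tarball=${2:-tesseract-linux-x64.tar.gz}  # created inside checkout_dir

    cd "$checkout_dir" || exit 1
    if [ ! -d src/linux ]; then
        echo "src/linux not found: run CASCADE-MAKE.sh first" >&2
        exit 1
    fi
    if [ -e tesseract/linux ]; then
        echo "tesseract/linux already exists: remove it first" >&2
        exit 1
    fi

    mkdir -p tesseract                                        # 5a
    cp src/setup.ba* tesseract/. || exit 1                    # 5b: copy setup files
    mv src/linux tesseract/. || exit 1                        # 5b: move installed folder
    cp src/packages/*LICENSE.txt tesseract/linux/. || exit 1  # 5c
    rm -rf tesseract/linux/man                                # 5d
    tar -czf "$tarball" tesseract                             # 6
)
```

For example, make_cutdown_tarball /path/to/checkout would leave tesseract-linux-x64.tar.gz in /path/to/checkout, ready for step 7.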
74
75	-------------------------------------------------
76	B. GETTING TIKA AND TESSERACT TO OCR A PDF
77	-------------------------------------------------
78	Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
79	Trying to do so, you'll see:
80	tesseract pdf05-notext.pdf notext
81	Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
82	Error in pixReadStream: Pdf reading is not supported
83	Error in pixRead: pix not read
84	Error during processing.
85
86	Tesseract can OCR the individual page images of a PDF, but to OCR a whole PDF with
87	Tesseract you need an additional tool to split the PDF into its pages and extract the images
88	from them, feed each page's image into Tesseract to get it OCR-ed, and then create an html or
89	txt file collating all the individual OCR-ed page content.
90
91	Tika does this.
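For comparison, the manual pipeline that Tika automates might look roughly like the sketch below. It assumes the pdftoppm tool from poppler-utils is installed to render pages as images, which this README does not otherwise use, and the function name ocr_pdf_by_hand is hypothetical.

```shell
# Sketch only: render each PDF page to a PNG, OCR each page image with
# Tesseract, then collate the per-page text into one file.
# Assumes pdftoppm (poppler-utils) and tesseract are both on PATH.
ocr_pdf_by_hand() (
    pdf=$1
    out_txt=$2
    command -v pdftoppm >/dev/null 2>&1 || { echo "pdftoppm not found" >&2; exit 1; }
    command -v tesseract >/dev/null 2>&1 || { echo "tesseract not found" >&2; exit 1; }

    workdir=$(mktemp -d) || exit 1
    # Writes one image per page: $workdir/page-<n>.png
    pdftoppm -r 300 -png "$pdf" "$workdir/page" || exit 1
    : > "$out_txt"
    for img in "$workdir"/page-*.png; do
        tesseract "$img" "${img%.png}" || exit 1   # writes <same base name>.txt
        cat "${img%.png}.txt" >> "$out_txt"
    done
    rm -rf "$workdir"
)
```

With both tools installed, ocr_pdf_by_hand pdf05-notext.pdf notext.txt would produce a single text file for the whole PDF.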
92
93	By default, if Tika is set up in the environment and TESSDATA_PREFIX is set to the tessdata folder
94	containing the language files, Tika is able to get Tesseract to OCR images out of the box,
95	but not PDFs. For a PDF, Tika's (x)html/txt output will contain only the text that is already
96	extractable from the PDF, with no OCR-ed content, until the following is in place.
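Before involving Tika, it can help to confirm that TESSDATA_PREFIX really points at a folder of language files. A minimal sketch, assuming the language files use the usual .traineddata extension; the function name check_tessdata is hypothetical.

```shell
# Sketch: succeed only if TESSDATA_PREFIX is set and the folder it names
# contains at least one *.traineddata language file.
check_tessdata() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set (did you source setup.bash?)" >&2
        return 1
    fi
    for f in "$TESSDATA_PREFIX"/*.traineddata; do
        if [ -f "$f" ]; then
            echo "Language files found in $TESSDATA_PREFIX"
            return 0
        fi
    done
    echo "No .traineddata files in $TESSDATA_PREFIX" >&2
    return 1
}
```

Running check_tessdata after sourcing setup.bash gives a quick yes/no answer without starting Tesseract or Tika.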
97
98	To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, we needed to do the
99	following:
100	1. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the tika-config.xml
101	file is configured correctly for the TesseractOCRParser and PDFParser.
102	2. The tika-config.xml file passed to tika-app-*.jar should set the "outputType"
103	param of the TesseractOCRParser as follows:
104	a. Set the "outputType" param to "txt" so that Tesseract produces its OCR output in .txt
105	format, which Tika will intercept and process,
106	b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set
107	the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
108	$TESSDATA_PREFIX/configs containing the following (taken from
109	https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
110	tessedit_create_hocr 1
111	hocr_font_info 0
112
113	3. In order to have a tessdata/configs/hocr, we needed to correct the Tesseract build we were
114	cascade-making so that it puts the "configs" subfolder inside the installed Tesseract's
115	tessdata folder. The source version of tesseract has this folder, but it wasn't getting
116	included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
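
Whether the installed Tesseract has the hocr config can be checked, and if necessary repaired, with a few lines of shell. A minimal sketch using the two settings quoted above; the function name ensure_hocr_config is hypothetical and TESSDATA_PREFIX is assumed to be set already.

```shell
# Sketch: make sure $TESSDATA_PREFIX/configs/hocr exists, creating it with
# the two settings from the tessconfigs hocr file if it is missing.
ensure_hocr_config() {
    if [ -z "${TESSDATA_PREFIX:-}" ]; then
        echo "TESSDATA_PREFIX is not set" >&2
        return 1
    fi
    hocr_file="$TESSDATA_PREFIX/configs/hocr"
    if [ -f "$hocr_file" ]; then
        echo "Found $hocr_file"
        return 0
    fi
    mkdir -p "$TESSDATA_PREFIX/configs" || return 1
    printf '%s\n' 'tessedit_create_hocr 1' 'hocr_font_info 0' > "$hocr_file" || return 1
    echo "Created $hocr_file"
}
```

An existing hocr file is left untouched, so the function is safe to run repeatedly.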
117
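A tika-config.xml along the lines described above might look like the sketch below, written out from the shell. The parser class names and the "outputType" param are as we understand them for Tika 1.24. The "ocrStrategy" param and its "ocr_only" value are an assumption about the PDFParser settings and should be checked against the Tika documentation for your version. The function name write_tika_config is hypothetical.

```shell
# Sketch: write a tika-config.xml that configures the TesseractOCRParser and
# the PDFParser. The PDFParser ocrStrategy setting is an ASSUMPTION; verify it
# for your Tika version.
write_tika_config() {
    config_path=$1
    output_type=${2:-txt}      # "txt" or "hocr"
    case $output_type in
        txt|hocr) ;;
        *) echo "outputType must be txt or hocr" >&2; return 1 ;;
    esac
    cat > "$config_path" <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <param name="outputType" type="string">$output_type</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF
}
```

After write_tika_config ./tika-config.xml hocr, the file would be passed to Tika with something like java -jar tika-app-*.jar --config=./tika-config.xml -t pdf05-notext.pdf, where -t asks tika-app for plain-text output.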
