------------------------------------------------------------------------
README FOR Greenstone USERS: TO SUPPORT ADDITIONAL LANGUAGES FOR OCR
------------------------------------------------------------------------
Greenstone can be configured to use Tesseract to OCR images, and to use Tika
in combination with Tesseract to OCR PDFs.

By default, the Greenstone Tesseract extension only comes with support for OCR-ing
English, plus Tesseract's orientation and script detection (OSD) data, as including
more languages would make the extension too large.

Tesseract supports OCR for many languages and their scripts.
The supported languages are listed at https://github.com/tesseract-ocr/tessdata,
where each is identified by its 3-letter language code.
(You can search online for the 3-letter code of each language you are interested in.)

To obtain support for other languages, you can do one of the following:

a. Manually download the <3-letter-langcode>.traineddata files for the languages
   you want from https://github.com/tesseract-ocr/tessdata and place them in the
   extension's tessdata folder (the one used in option b).

b. Run the following from the toplevel of your GS3 installation:
       source ./gs3-setup.sh
       cd gs2build/ext/tesseract/linux/share/tessdata

   Then, for each language code, run the following with <3-letter-langcode>
   replaced accordingly:
       wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-langcode>.traineddata

c. If you have git installed, you can download all the supported languages in
   one step. First move (or remove) the existing "tessdata" folder, then run
   git clone to get every language that has OCR support:

       cd gs2build/ext/tesseract/linux/share
       #rm -rf tessdata
       mv tessdata tessdata.basic
       git clone https://github.com/tesseract-ocr/tessdata

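When several languages are needed, the per-language wget step of option b can be
scripted. The following is only a sketch, not part of Greenstone: the function
names traineddata_url and fetch_langs are made up for illustration, and the
language codes in the usage comment are just examples. It assumes wget is
installed.

```shell
# Base URL taken from option b above.
TESSDATA_URL="https://github.com/tesseract-ocr/tessdata/raw/master"

# Print the download URL for one language code (hypothetical helper).
traineddata_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_URL" "$1"
}

# Download each listed language into the given tessdata folder,
# skipping any language whose file is already present (hypothetical helper).
fetch_langs() {
    dir=$1
    shift
    for lang in "$@"; do
        target="$dir/$lang.traineddata"
        if [ -f "$target" ]; then
            echo "skipping $lang (already present)"
        elif wget -O "$target.part" "$(traineddata_url "$lang")"; then
            # Only keep the file once the download has completed.
            mv "$target.part" "$target"
        else
            rm -f "$target.part"
            echo "failed to download $lang" >&2
        fi
    done
}

# Example, run from the toplevel of the GS3 installation:
# fetch_langs gs2build/ext/tesseract/linux/share/tessdata fra deu
```

Downloading to a temporary ".part" file first means an interrupted download is
not mistaken for an installed language on the next run.
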
------------------------------
Background Information:
------------------------------
Greenstone can only index documents that contain extractable text. It cannot
index documents that only contain images of text ("photos" of text have no
selectable text).

OCR (Optical Character Recognition) is a process that recognises the individual
characters of text shown in an image, and so produces text from images that
otherwise have no extractable text.

Tesseract is OCR software licensed under the Apache 2.0 License. Greenstone can
use Tesseract to OCR images, and can then index the resulting text for full-text
searching of those image documents.

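To see what text Tesseract extracts from a given image, the tesseract
command-line tool can be run directly. This is a minimal sketch rather than how
Greenstone itself invokes Tesseract: ocr_image is a made-up helper name,
scan.png is a placeholder file, and it assumes the tesseract program is on your
PATH.

```shell
# OCR one image and print the recognised text (hypothetical helper).
ocr_image() {
    image=$1
    lang=${2:-eng}    # language code; defaults to English
    # Giving "stdout" as the output base makes tesseract print the text
    # instead of writing it to a .txt file.
    tesseract "$image" stdout -l "$lang"
}

# Examples:
# ocr_image scan.png        # English
# ocr_image scan.png fra    # French, once fra.traineddata is installed
```
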
Tesseract cannot OCR PDFs, only images. However, Apache Tika (also licensed under
the Apache 2.0 License) can work with Tesseract to OCR PDFs whose pages are only
images of text rather than actual extractable text.

Greenstone can use the combination of Apache Tika and Tesseract to process such
image-only PDFs too. The OCR process produces text that Greenstone can index,
enabling full-text searching of the original document (which otherwise
contained no extractable text, only images of text).

Important Notes:

a. Wherever OCR is involved, the quality of the OCR-ed text depends heavily on the
quality of the input image files. The higher the DPI (dots per inch) of the images
and the more legible the text in them, the more sensible and accurate the
resulting OCR-ed text. Poor-quality images produce gibberish. With average-quality
images, the OCR-ed text is mostly accurate to the original, interspersed
occasionally with strange characters.

b. OCR recognises the characters that make up text in images. Characters belong to
scripts, and the world's languages use many different scripts. For OCR to recognise
the characters of the language your document is written in, the OCR software used
(in this case Tesseract) needs to support that language's script.

The languages and scripts that Tesseract supports (identified by their 3-letter
language codes) are listed at https://github.com/tesseract-ocr/tessdata

By default, the Greenstone Tesseract extension only comes with support for OCR-ing
English, plus Tesseract's orientation and script detection (OSD) data, as including
more languages would make the extension too large.

To allow the Greenstone Tesseract extension to OCR further languages that
Tesseract already supports, read the section "TO SUPPORT ADDITIONAL LANGUAGES FOR OCR".

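To check which languages an installation can currently OCR, you can list the
.traineddata files in its tessdata folder. The sketch below does this with a
made-up helper named installed_langs; the folder path in the usage comment is
the Linux path used earlier in this README.

```shell
# Print the language code of every .traineddata file in a tessdata folder
# (hypothetical helper).
installed_langs() {
    dir=$1
    for f in "$dir"/*.traineddata; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        basename "$f" .traineddata
    done
}

# Example, run from the toplevel of the GS3 installation:
# installed_langs gs2build/ext/tesseract/linux/share/tessdata
```

If the tesseract program is on your PATH, running "tesseract --list-langs"
should report the same information for the tessdata folder it is configured to
use.
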
------------------------------------------------------------------------