source: gs2-extensions/tesseract/trunk/GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt@ 34190

Last change on this file since 34190 was 34190, checked in by ak19, 4 years ago
  1. The tessdata folder was being created when compiling tesseract, and needn't be created and populated manually (except for the lang files), so there's less work for CASCADE-MAKE/TESSERACT.sh to do. However, the tessdata folder was being created in the linux/share folder. 'share' is probably a place where people expect tesseract's tessdata to be by default, so am updating the setup scripts to work with that, as I've done with CASCADE-MAKE/TESSERACT.sh. 2. Adding useful instructions for users on getting more OCR language scripts' support in new file GETTING-OCR-SUPPORT-FOR-MORE-LANGS.txt, now included in the tesseract binary tarball too. Adjusted the README for us. 3. Removing the sample.jpg, converted from sample.tif which I'd downloaded from online and whose copyright status I don't know. Replacing with sample.tif, a 96 DPI TIF file at 1870x2420 resolution produced from the first page of pdf05-notext.pdf by www.sejda.com/pdf-to-jpg. Moreover, this sample file contains lots of text, in 2 columns, not just 4 words like the original sample file. Good for testing a tesseract built from CASCADE-MAKE on. Also including the pdf05-notext-ocr-with-tikaTesseract.pdf itself from the tutorial sample files, but only Tika with Tesseract can work on PDFs and not Tesseract by itself, as indicated in the filename.
------------------------------------------------------------------------
README FOR Greenstone USERS: TO SUPPORT ADDITIONAL LANGUAGES FOR OCR
------------------------------------------------------------------------
Greenstone can be configured to use Tesseract to OCR images, and use Tika
in combination with Tesseract to OCR PDFs.

By default, the Greenstone Tesseract extension only comes with support for OCR-ing
English, plus Tesseract's orientation and script detection (OSD) data, as the
extension would otherwise become too large.

Tesseract supports OCR for many languages (more precisely, for the scripts of
many languages). The supported languages are listed at
https://github.com/tesseract-ocr/tessdata, where each is identified by its
3-letter language code. (Search online to find the 3-letter code for each
language you are interested in.)

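To obtain support for other languages, do one of the following:

a. Manually download the <3-letter-langcode>.traineddata files for the languages
you want from https://github.com/tesseract-ocr/tessdata and place them in the
gs2build/ext/tesseract/linux/share/tessdata folder of your GS3 installation.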

b. Run the following from the top level of your GS3 installation:
    source ./gs3-setup.sh
    cd gs2build/ext/tesseract/linux/share/tessdata

Then, for each language you want, run the following, replacing
<3-letter-langcode> with that language's code:
    wget https://github.com/tesseract-ocr/tessdata/raw/master/<3-letter-langcode>.traineddata
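
The per-language wget step in option b can be wrapped in a small loop. The
following is only a sketch and not part of Greenstone: the function names
traineddata_url and fetch_langs are made up for illustration, and fra, deu and
mri are merely example language codes. It assumes wget is installed and that
you run fetch_langs from inside the tessdata folder.

```shell
#!/bin/sh
# Base URL taken from the wget command in option b.
TESSDATA_URL="https://github.com/tesseract-ocr/tessdata/raw/master"

# Hypothetical helper: print the download URL for one language code.
traineddata_url() {
    printf '%s/%s.traineddata\n' "$TESSDATA_URL" "$1"
}

# Hypothetical helper: download every language code given as an argument
# into the current directory, skipping files that are already present.
fetch_langs() {
    for lang in "$@"; do
        if [ -f "$lang.traineddata" ]; then
            echo "Skipping $lang: $lang.traineddata already exists"
        else
            wget "$(traineddata_url "$lang")" || echo "Failed to download $lang" >&2
        fi
    done
}

# Example, run from inside the tessdata folder:
#   fetch_langs fra deu mri
```

Skipping files that already exist avoids wget saving a second copy under a
name such as fra.traineddata.1.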

c. If you have git installed, you can download all the supported languages in
one step. First move (or remove) the existing "tessdata" folder, then run
git clone to get the data for every language that has OCR support:

    cd gs2build/ext/tesseract/linux/share
    #rm -rf tessdata
    mv tessdata tessdata.basic
    git clone https://github.com/tesseract-ocr/tessdata
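
Whichever option you use, it can help to confirm which language files ended up
in the tessdata folder. The following is a minimal sketch under the assumption
that every installed language is a <3-letter-langcode>.traineddata file
directly inside that folder. The function name list_installed_langs is
hypothetical and not part of Greenstone or Tesseract.

```shell
#!/bin/sh
# Hypothetical helper: print the language codes of all *.traineddata files
# in a folder (default: the current directory), one code per line.
list_installed_langs() {
    dir="${1:-.}"
    for f in "$dir"/*.traineddata; do
        # If nothing matches, the pattern is left unexpanded; skip it.
        [ -e "$f" ] || continue
        basename "$f" .traineddata
    done
}

# Example, from the top level of the GS3 installation:
#   list_installed_langs gs2build/ext/tesseract/linux/share/tessdata
```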


------------------------------
Background Information:
------------------------------
Greenstone can only index documents that contain extractable text. It cannot
index documents that consist only of images of text, since "photos" of text
contain no selectable text.

OCR (Optical Character Recognition) is a process that recognises the individual
characters making up the text shown in an image, and thereby produces text from
images that otherwise have no extractable text.

Tesseract is OCR software licensed under the Apache 2.0 License. Greenstone can
use Tesseract to OCR images, producing text that Greenstone can then index for
full text searching on that image document.

Tesseract cannot OCR PDFs, only images. However, Apache Tika can work with Tesseract
(both licensed under the Apache 2.0 License) to OCR PDFs that contain pages
which are only images of text rather than actual extractable text.

Greenstone can use the combination of Apache Tika and Tesseract to process such
PDFs too. The OCR process produces text that Greenstone can index to enable
full text searching on the original document (which otherwise contained no
extractable text, only images of text).

Important Notes:

a. Wherever OCR is involved, the quality of the OCR-ed text depends heavily on
the quality of the image files that went into the process. The higher the DPI
(dots per inch) of the images and the more legible the text in them, the more
sensible and accurate the resulting OCR-ed text. Poor quality images produce
gibberish. With average-quality input images, the OCR-ed text is mostly
accurate to the original but occasionally interspersed with strange characters.

b. OCR recognises the characters that make up text in images. Characters are
components of scripts, and there are many language scripts in the world. As a result,
for OCR to recognise the characters of the script your document's language is
written in, the OCR software used (in this case Tesseract) needs to support
that language's script.

The languages' scripts that Tesseract supports (indicated by their 3-letter language
codes) are listed at https://github.com/tesseract-ocr/tessdata

By default, the Greenstone Tesseract extension only comes with support for OCR-ing
English, plus Tesseract's orientation and script detection (OSD) data, as the
extension would otherwise become too large.

To allow the Greenstone Tesseract extension to OCR further languages that
Tesseract already supports, read the section "TO SUPPORT ADDITIONAL LANGUAGES FOR OCR".
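
To illustrate note b, the sketch below runs Tesseract directly on one image
from the command line, but only after checking that the data file for the
requested language is present. It is not how Greenstone itself invokes
Tesseract. The function name ocr_image is hypothetical, and it is an assumption
that the tesseract binary is on your PATH (for example after sourcing
gs3-setup.sh).

```shell
#!/bin/sh
# Hypothetical helper: OCR one image and print the recognised text.
# Arguments: image file, 3-letter language code, tessdata folder.
ocr_image() {
    image="$1"
    lang="$2"
    tessdata="$3"
    # Refuse to run if the language's data file is missing.
    if [ ! -f "$tessdata/$lang.traineddata" ]; then
        echo "No OCR support for '$lang': $tessdata/$lang.traineddata not found" >&2
        return 1
    fi
    # "stdout" as the output base makes tesseract print the text
    # instead of writing it to a file.
    tesseract "$image" stdout -l "$lang" --tessdata-dir "$tessdata"
}

# Example, from the top level of the GS3 installation:
#   ocr_image sample.tif eng gs2build/ext/tesseract/linux/share/tessdata
```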

------------------------------------------------------------------------