Changeset 34186 for gs2-extensions/tesseract/trunk/README.txt
- Timestamp: 2020-06-16T15:00:39+12:00
- File: 1 edited
gs2-extensions/tesseract/trunk/README.txt
--- README.txt (r34184)
+++ README.txt (r34186)
 -------------------------------------------------
-COMPILING TESSERACT GS2-EXTENSION
+CONTENTS
+-------------------------------------------------
+In this file:
+
+A. COMPILING TESSERACT GS2-EXTENSION
+   & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
+
+B. GETTING TIKA AND TESSERACT TO OCR A PDF
+
+
+-------------------------------------------------
+A. COMPILING TESSERACT GS2-EXTENSION
 & CREATING THE CUT-DOWN BINARY-ONLY TARBALL
 -------------------------------------------------
…
     cat out.txt

-5. If successful, create a folder at the same level as src alled tesseract
+5. If successful,
+
+   a. create a folder at the same level as src called tesseract
     cd src
     cd ..
     mkdir tesseract

-COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:
+   b. COPY the setup files and MOVE the installed folder (src/linux) into the new cut-down tesseract folder:

     cp src/setup.ba* tesseract/.
     mv src/linux tesseract/.

-COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
-Amer can spelling!) from src/packages into the cut-down tesseract/linux:
+   c. COPY the TESSERACT-APACHE-LICENSE and LEPTONICA-LICENSE txt files (note it uses
+   American spelling!) from src/packages into the cut-down tesseract/linux:

     cp src/packages/*LICENSE.txt tesseract/linux/.

+   d. REMOVE folder "man" from tesseract/linux:
+    rm -rf tesseract/linux/man

 6. Create a tarball of the cut down tesseract folder named tesseract-<os>-<arch>.tar.gz:
…
     svn commit -m "MESSAGE" tesseract-linux-x64.tar.gz


 -------------------------------------------------
+B. GETTING TIKA AND TESSERACT TO OCR A PDF
+-------------------------------------------------
+Tesseract does not OCR PDFs (https://github.com/tesseract-ocr/tesseract/issues/1476).
+Trying to do so, you'll see:
+    tesseract pdf05-notext.pdf notext
+    Tesseract Open Source OCR Engine v5.0.0-alpha-694-g6ee3 with Leptonica
+    Error in pixReadStream: Pdf reading is not supported
+    Error in pixRead: pix not read
+    Error during processing.
+
+Tesseract can OCR the individual images constituting a page of the PDF, but to OCR PDFs
+with Tesseract, you need an additional tool to split a PDF into its pages and extract images
+from them, feed each page's image into Tesseract to get it OCR-ed and then create an html or
+txt file collating all the individual OCR-ed page content.
+
+Tika does this.
+
+By default, if Tika is available in the environment and TESSDATA_PREFIX is set to the tessdata
+folder containing the language files, Tika can get Tesseract to OCR images out of the box,
+but not PDFs. Until the setup below is in place, Tika outputs no OCR content in its
+(x)html/txt output for PDFs, only their directly extractable text.
+
+To get Tika (app v1.24.x) and Tesseract (v5.0.0) set up to OCR PDFs, we needed to do 2 more
+things:
+1. Run the tika-app-*.jar with --config=/path/to/a/tika-config.xml, where the file is
+   configured correctly for the TesseractOCRParser and PDFParser.
+2. The <tika-config.xml> file passed to tika-app-*.jar should configure the "outputType"
+   param of the TesseractOCRParser as follows:
+   a. Set the "outputType" param to "txt" so Tesseract produces its OCR output in .txt
+      format, which Tika will intercept and process,
+   b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set
+      the "outputType" param value to "hocr" AND ensure a config file also called hocr exists in
+      $TESSDATA_PREFIX/configs containing the following (taken from
+      https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
+          tessedit_create_hocr 1
+          hocr_font_info 0
+
+2. In order to have a tessdata/configs/hocr, we needed to correct the Tesseract we were
+cascade-making to get it to put the "configs" subfolder inside the installed Tesseract's
+tessdata folder. The source version of tesseract has this folder, but it wasn't getting
+included in the built version despite us exporting TESSDATA_PREFIX before CASCADE-MAKing.
+
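The setup described in section B can be sketched as a small shell script. This is an illustration under stated assumptions, not the configuration actually committed: the parser class names, the <parser-exclude> entries and the PDFParser "ocrStrategy" param (with value "ocr_only") reflect how Tika 1.x config files are commonly written and are not taken from this README. OUT_DIR, TIKA_CONFIG and scanned.pdf are hypothetical names. Only the "outputType" param and the two hocr config lines come from the text above.

```shell
#!/bin/sh
# Sketch only: writes the hocr config and a candidate tika-config.xml.
set -eu

# Hypothetical scratch location so the sketch can be tried without an install.
OUT_DIR="${OUT_DIR:-$(mktemp -d)}"
# TESSDATA_PREFIX should name the real tessdata folder; fall back to OUT_DIR.
TESSDATA_PREFIX="${TESSDATA_PREFIX:-$OUT_DIR/tessdata}"
TIKA_CONFIG="${TIKA_CONFIG:-$OUT_DIR/tika-config.xml}"

# Only needed when outputType is "hocr": the config file named hocr.
mkdir -p "$TESSDATA_PREFIX/configs"
cat > "$TESSDATA_PREFIX/configs/hocr" <<'EOF'
tessedit_create_hocr 1
hocr_font_info 0
EOF

# Candidate Tika config. Everything except "outputType" is an assumption.
cat > "$TIKA_CONFIG" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- "txt", or "hocr" if configs/hocr exists -->
        <param name="outputType" type="string">txt</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- assumed param and value: OCR each page instead of extracting text -->
        <param name="ocrStrategy" type="string">ocr_only</param>
      </params>
    </parser>
  </parsers>
</properties>
EOF

# Print (do not run) the invocation; -t asks tika-app for plain-text output.
echo "java -jar tika-app-*.jar --config=$TIKA_CONFIG -t scanned.pdf"
```

The script only prints the java command, so it can be checked before a Tika jar or a Tesseract build is present. Set TESSDATA_PREFIX and TIKA_CONFIG in the environment to write to the real locations.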