Changeset 34187 for gs2-extensions/gstika/trunk/GS_TIKA_README.txt
Timestamp: 2020-06-16T15:22:04+12:00
Files: 1 edited
gs2-extensions/gstika/trunk/GS_TIKA_README.txt
r34177 r34187

--------------------------------------------------------------
CONTENTS:
--------------------------------------------------------------

A. Some background information on Apache Tika and related:
B. Here are some examples of running Tika on the command line:
C. COMPARE OUTPUT - IMG EXTRACTION vs TEXT:
D. THE --encoding= FLAG TO TIKA
E. WRITING A CUSTOMISED TIKA-CLI TO OUTPUT HTML-WITH-IMAGES
F. COMPILING TIKA FROM SOURCE
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)

--------------------------------------------------------------
A. Some background information on Apache Tika and related:
…

--------------------------------------------------------------
G. GETTING TIKA TO WORK WITH TESSERACT TO OCR PDFs (tika-config.xml)
--------------------------------------------------------------

If you have Tesseract installed correctly, with its bin folder on PATH and the TESSDATA_PREFIX
environment variable set, the current version of Tika (tika-app-1.24.x.jar) will
turn on Tesseract OCR automatically for images.

But Tika is not configured out of the box to work with Tesseract to OCR PDFs (Tesseract
on its own does not OCR PDFs, only images).

To get Tika to work with Tesseract to OCR PDFs:
1. You must pass a config file (tika-config.xml) to Tika, in which the TesseractOCRParser
   and PDFParser are configured correctly. Run as:
      tika-app-*.jar --config=<tika-config.xml>

2. The "outputType" param of the TesseractOCRParser in this config file must have one of
   these 2 values:
   a. "txt" - which requests Tesseract to output the OCR as text
   b. "hocr" - which asks Tesseract to output the OCR as html (hence the format name hocr)

   For the hocr value to have any effect (else the PDF pages will not be OCR-ed), on the
   tesseract end, the $TESSDATA_PREFIX/configs/hocr file must exist and contain
   these values (given at
   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
      tessedit_create_hocr 1
      hocr_font_info 0

   The latest Tesseract tarball should now contain this $TESSDATA_PREFIX/configs/hocr file.


I'm committing an appropriate tika-config.xml file (based on https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/) for GSTika, containing:

*************************************************************
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!--
   (XML comments are only allowed after the XML declaration.)

   https://opensourceconnections.com/blog/2019/11/26/tika-and-tesseract-outside-of-solr/
   which links to their sample tika-config.xml (copied below), which configures the PDF and OCR Parsers to behave just as the old PDFParser.props and OCR Parser properties files did.

   - new way of one tika-config.xml: https://github.com/o19s/pdf-discovery-demo/blob/crazy_tika_tesseract_inside_of_solr/ocr/tika-config.xml
   - old way of 2 props files: https://github.com/o19s/pdf-discovery-demo/tree/6f5b37305dd863a73af4617db64cbe853c5ecd2a/ocr/tika-properties/org/apache/tika/parser

   https://tika.apache.org/1.16/configuring.html
   https://issues.apache.org/jira/browse/TIKA-2624
-->
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
      <params>
        <!-- Setting the following 2 params is unnecessary, since sourcing Greenstone puts the Tesseract binary
             on the path AND also sets the TESSDATA_PREFIX env var needed by Tesseract -->
        <!--
        <param name="tesseractPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/bin</param>
        <param name="tessdataPath" type="string">/path/to/GS3/gs2build/ext/tesseract/linux/tessdata</param>
        -->

        <!-- IMPORTANT!! -->
        <param name="outputType" type="string">hocr</param><!-- Choose one of: hocr, txt -->
        <!-- hocr is preferred as Tesseract produces nicely formatted html that better reflects
             the placement of the original text in the scanned page. (Can compare running with hocr vs txt)

             However, initially, the above value had to be fixed as "txt", as outputType value = hocr prevented
             Tika+Tesseract from OCR-ing pdfs (no OCR output).
             That remained the case until $GEXT_INSTALLED/tessdata/configs/hocr (tesseract config file) was
             created, containing the property values listed in point 2b below.

             Note: a double hyphen is not allowed inside an XML comment, so the double-hyphen command line
             options below are written with a double underscore (__). Use a double hyphen on the command line.

             To get Tika to work with Tesseract to OCR pages of a scanned PDF:
             1. always pass in this file as __config=/path/to/tika-config.xml to the tika-app-*.jar cmd,
             2. AND do one of the following:
                a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
                b. OR if you want the OCR generated by Tesseract to be in hocr (html ocr) format, then set the outputType above
                   to "hocr" AND ensure a config file also called hocr exists in $TESSDATA_PREFIX/configs containing the following
                   (taken from
                   https://github.com/tesseract-ocr/tessconfigs/blob/3decf1c8252ba6dbeef0bf908f4b0aab7f18d113/configs/hocr):
                      tessedit_create_hocr 1
                      hocr_font_info 0

             More information about tesseract config options is available by running:
                tesseract __print-parameters
        -->
        <param name="language" type="string">eng</param>
        <param name="pageSegMode" type="string">1</param>
      </params>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>

  </parsers>
</properties>
*************************************************************


--------------------------------------------------------------
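
The hocr prerequisite described in section G can be checked before running Tika.
The following is only a minimal sketch in Python, not part of GSTika: the function
names and the jar, config and PDF file names are hypothetical placeholders. It
assumes the $TESSDATA_PREFIX/configs/hocr file holds one "name value" pair per line,
as in the two lines quoted above, and it only builds the Tika command line (using
tika-app's --config and --html options) without running it.

```python
import os
import shlex
from pathlib import Path

# The two settings section G says the Tesseract "hocr" config file must contain.
REQUIRED_HOCR_SETTINGS = {"tessedit_create_hocr": "1", "hocr_font_info": "0"}


def parse_tess_config(text):
    """Parse config text with one "name value" pair per line into a dict."""
    settings = {}
    for line in text.splitlines():
        parts = line.split()
        # Skip blank or incomplete lines and lines starting with '#'.
        if len(parts) >= 2 and not parts[0].startswith("#"):
            settings[parts[0]] = parts[1]
    return settings


def missing_hocr_settings(tessdata_prefix):
    """Return the required settings that are absent or different in
    <tessdata_prefix>/configs/hocr (all of them if the file does not exist)."""
    hocr_file = Path(tessdata_prefix) / "configs" / "hocr"
    found = parse_tess_config(hocr_file.read_text()) if hocr_file.is_file() else {}
    return {k: v for k, v in REQUIRED_HOCR_SETTINGS.items() if found.get(k) != v}


def tika_command(tika_jar, tika_config, pdf):
    """Build (but do not run) a Tika command that passes the config file."""
    return ["java", "-jar", str(tika_jar), f"--config={tika_config}", "--html", str(pdf)]


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX", "")
    missing = missing_hocr_settings(prefix) if prefix else dict(REQUIRED_HOCR_SETTINGS)
    if missing:
        print("outputType=hocr is unlikely to work; missing settings:", missing)
    else:
        # Placeholder file names; substitute your own jar, config and PDF.
        cmd = tika_command("tika-app-1.24.jar", "tika-config.xml", "scanned.pdf")
        print("Run:", shlex.join(cmd))
```

If the check reports missing settings, either create the hocr config file as
described in point 2 above or set outputType to "txt" in tika-config.xml.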