Changeset 34195

Timestamp: 16.06.2020 18:05:13
Author: ak19
Message:

Renaming config files so that one is configured for OCR-ing PDFs and the other for turning off OCR when Tesseract is installed. Without the latter, Tika autodetects whether OCR applies whenever Tesseract is installed, so passing a no-ocr-config.xml may save some minor overhead. With no config flag passed to Tika, it will by default perform OCR only where it applies and only if Tesseract is installed.
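
The three modes described in the message could be invoked roughly as sketched below. This is only an illustration: `TIKA_JAR`, `CONFIG_DIR`, the `tika_cmd` helper and the input file name are hypothetical placeholders rather than anything in this changeset, and the helper only prints the command line instead of running Tika. The `--config` and `--text` options are standard tika-app flags.

```shell
#!/bin/sh
# Sketch only: TIKA_JAR and CONFIG_DIR are placeholder locations,
# not paths taken from the changeset.
TIKA_JAR="${TIKA_JAR:-tika-app.jar}"
CONFIG_DIR="${CONFIG_DIR:-.}"

# Print the tika-app command line for one of the three modes:
#   ocr     -> pass ocr-pdfs-config.xml so scanned PDFs get OCR-ed
#   no-ocr  -> pass no-ocr-config.xml so OCR is switched off
#   other   -> no --config flag; Tika decides for itself whether OCR applies
tika_cmd() {
  mode="$1"
  input="$2"
  case "$mode" in
    ocr)    echo "java -jar $TIKA_JAR --config=$CONFIG_DIR/ocr-pdfs-config.xml --text $input" ;;
    no-ocr) echo "java -jar $TIKA_JAR --config=$CONFIG_DIR/no-ocr-config.xml --text $input" ;;
    *)      echo "java -jar $TIKA_JAR --text $input" ;;
  esac
}

# Show the command that would OCR a (hypothetical) scanned PDF.
tika_cmd ocr scanned.pdf
```

To actually run a mode rather than print it, the printed line can be pasted into a shell once the placeholders point at a real tika-app jar and config directory.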

Location: gs2-extensions/gstika/trunk/java
Files: 1 added, 1 moved

  • gs2-extensions/gstika/trunk/java/ocr-pdfs-config.xml

    r34193  r34195
    41      41
    42      42               To get Tika to work with Tesseract to OCR pages of a scanned PDF:
    43          -            1. always pass in this file as __config=/path/to/tika-config.xml to tika-app-*.jar cmd,
            43  +            1. always pass in this file as --config=/path/to/tika-config.xml to tika-app-*.jar cmd,
    44      44               2. AND do one of the following:
    45      45                  a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,

    52      52
    53      53              More information about tesseract config options by running:
    54          -               tesseract __print-parameters
            54  +               tesseract --print-parameters
    55      55          -->
    56      56              <param name="language" type="string">eng</param>