Changeset 34195


Timestamp:
2020-06-16T18:05:13+12:00
Author:
ak19
Message:

Renaming the config files so that one is configured for OCR-ing PDFs and the other for turning OCR off when Tesseract is installed. Without the latter, Tika autodetects whether OCR applies whenever Tesseract is installed (maybe there are some minor savings in overhead with a no-ocr-config.xml?). With no config flag passed to Tika, it will by default perform OCR only where it applies and only if Tesseract is installed.

Location:
gs2-extensions/gstika/trunk/java
Files:
1 added
1 moved

  • gs2-extensions/gstika/trunk/java/ocr-pdfs-config.xml

    r34193 r34195
    @@ -41,5 +41,5 @@
     
              To get Tika to work with Tesseract to OCR pages of a scanned PDF:
    -         1. always pass in this file as __config=/path/to/tika-config.xml to tika-app-*.jar cmd,
    +         1. always pass in this file as --config=/path/to/tika-config.xml to tika-app-*.jar cmd,
              2. AND do one of the following:
                 a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
    @@ -52,5 +52,5 @@
     
             More information about tesseract config options by running:
    -           tesseract __print-parameters
    +           tesseract --print-parameters
         -->
             <param name="language" type="string">eng</param>
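
The three modes described in the commit message (force OCR of PDFs, turn OCR off, or let Tika autodetect) could be wired up roughly as follows. This is only a sketch under stated assumptions, not code from the gstika extension: the function name, the mode names and the config_dir parameter are hypothetical, and the jar path is left to the caller because the changeset only refers to it as tika-app-*.jar. The --config= option comes from the diff above; --text is tika-app's plain-text output option and is assumed here as the desired output format.

```python
import posixpath
from typing import Dict, List, Optional

# Config file names come from the commit message; mapping them to
# these mode names is an assumption made for illustration.
CONFIG_FOR_MODE: Dict[str, Optional[str]] = {
    "ocr-pdfs": "ocr-pdfs-config.xml",  # OCR the pages of scanned PDFs
    "no-ocr": "no-ocr-config.xml",      # turn OCR off even if Tesseract is installed
    "auto": None,                       # no config flag: Tika decides by itself
}


def build_tika_command(tika_jar: str, input_file: str,
                       mode: str = "auto", config_dir: str = ".") -> List[str]:
    """Return the argument list for one tika-app run (hypothetical helper)."""
    if mode not in CONFIG_FOR_MODE:
        raise ValueError(f"unknown OCR mode: {mode!r}")

    cmd = ["java", "-jar", tika_jar]

    config_name = CONFIG_FOR_MODE[mode]
    if config_name is not None:
        # Note the double dash: --config=..., not __config=...
        cmd.append("--config=" + posixpath.join(config_dir, config_name))

    # Ask for plain-text output (assumed; other output options exist).
    cmd += ["--text", input_file]
    return cmd
```

The returned list can be handed to subprocess.run as-is. Keeping "auto" as the default mirrors the behaviour described in the commit message, where no config flag means Tika performs OCR only where it applies and only if Tesseract is installed.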