Changeset 34195

Timestamp:

2020-06-16T18:05:13+12:00

Author:

ak19

Message:

Renaming config files so that one is configured for OCR-ing PDFs and the other for turning off OCR when Tesseract is installed. (Otherwise Tika will autodetect whether OCR-ing applies when Tesseract is installed. Maybe there are some minor savings in overhead with a no-ocr-config.xml?) With no config flag passed to Tika, it will by default perform OCR only where it applies and only if Tesseract is installed.
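The three cases described in the message can be sketched as tika-app invocations. This is only a minimal sketch and not part of the changeset: the helper name `tika_cmd`, the placeholder paths, and the choice of `--text` as the output format are assumptions made for illustration. Only the `--config=` flag and the two config file names come from the changeset itself. The helper prints each command line rather than running it.

```shell
#!/bin/sh
# Placeholder locations (assumptions, not taken from the changeset).
TIKA_JAR="/path/to/tika-app.jar"
CONFIG_DIR="/path/to/gstika/java"

# Hypothetical helper: print the tika-app command line for a given mode.
#   ocr    -> pass the config that OCRs scanned PDFs
#   no-ocr -> pass the config that turns OCR off even if Tesseract is installed
#   auto   -> pass no config; Tika decides whether OCR applies
tika_cmd() {
    mode="$1"
    input="$2"
    case "$mode" in
        ocr)
            printf '%s\n' "java -jar $TIKA_JAR --config=$CONFIG_DIR/ocr-pdfs-config.xml --text $input"
            ;;
        no-ocr)
            printf '%s\n' "java -jar $TIKA_JAR --config=$CONFIG_DIR/no-ocr-config.xml --text $input"
            ;;
        auto)
            printf '%s\n' "java -jar $TIKA_JAR --text $input"
            ;;
        *)
            printf '%s\n' "unknown mode: $mode" >&2
            return 1
            ;;
    esac
}
```

For example, `tika_cmd ocr scanned.pdf` prints the command that would pass ocr-pdfs-config.xml to Tika. Paths containing spaces would need quoting before the printed command is actually run.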

Location:

gs2-extensions/gstika/trunk/java

Files:

1 added
1 moved

no-ocr-config.xml (added)
ocr-pdfs-config.xml (moved from gs2-extensions/gstika/trunk/java/tika-config.xml) (2 diffs)

gs2-extensions/gstika/trunk/java/ocr-pdfs-config.xml

-              r34193
+              r34195
          To get Tika to work with Tesseract to OCR pages of a scanned PDF:
-. always pass in this file as __config=/path/to/tika-config.xml to tika-app-*.jar cmd,
+. always pass in this file as &#45;&#45;config=/path/to/tika-config.xml to tika-app-*.jar cmd,
 . AND do one of the following:
             a. Set the above outputType param to "txt" so Tesseract produces the OCR in .txt format, and things should work,
 …
         More information about tesseract config options by running:
-           tesseract __print-parameters
+           tesseract &#45;&#45;print-parameters
     -->
         <param name="language" type="string">eng</param>
