Timestamp:
2020-06-16T18:13:13+12:00
Author:
ak19
Message:

Updating the gstika tarballs too with the latest changes to the Tika config files. The config files are renamed so that one is configured for OCR-ing PDFs and the other turns OCR off when Tesseract is installed. (Without the latter, Tika autodetects whether OCR applies whenever Tesseract is installed. Maybe there are some minor savings in overhead with a no-ocr-config.xml?)

With no config flag passed to Tika, it will by default perform OCR only where it applies and only if Tesseract is installed. By default Tika extracts text only and does not extract images: image extraction has to be turned on expressly with -z/--extract. So in the default case there is no such OCR overhead, except maybe for PDFs where each page is an image.

In gstika, however, the GS-specific custom flags that were introduced (html-with-imgs and xhtml-with-imgs) do extract text and images simultaneously, and so may need no-ocr-config.xml to shave off this overhead when no automatic OCR-ing of docs is needed.
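As a rough illustration of the cases described above, here is a minimal sketch that only assembles a command line for the stock Tika app. It assumes the standard tika-app options `--config=<file>`, `--text` and `-z`/`--extract`. The jar path `tika-app.jar` and the helper name `build_tika_command` are hypothetical, and the gstika-specific html-with-imgs/xhtml-with-imgs flags are left out because their exact command-line spelling is not given here.

```python
from typing import List, Optional


def build_tika_command(
    input_path: str,
    tika_jar: str = "tika-app.jar",  # hypothetical jar location (assumption)
    config: Optional[str] = None,    # e.g. "no-ocr-config.xml" to turn OCR off
    extract_images: bool = False,    # True adds --extract (same as -z)
) -> List[str]:
    """Build (but do not run) an argument list for the Tika app."""
    cmd = ["java", "-jar", tika_jar]
    if config is not None:
        # Put the config option first, ahead of the processing options.
        cmd.append("--config=" + config)
    if extract_images:
        # Embedded files such as images are only extracted on request.
        cmd.append("--extract")
    else:
        # Default case: plain text output, no image extraction.
        cmd.append("--text")
    cmd.append(input_path)
    return cmd


if __name__ == "__main__":
    # No config flag: Tika decides by itself whether OCR applies.
    print(" ".join(build_tika_command("doc.pdf")))
    # Image extraction with OCR switched off through the config file.
    print(" ".join(build_tika_command("doc.pdf",
                                      config="no-ocr-config.xml",
                                      extract_images=True)))
```

The resulting list could be handed to `subprocess.run`. Whether no-ocr-config.xml gives a measurable saving would still need to be timed on real documents.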

File:
1 edited
