Changeset 34187

Timestamp:

2020-06-16T15:22:04+12:00

Author:

ak19

Message:

Committing the tika-config.xml that sets up Tika's PDFParser and TesseractOCRParser to OCR PDFs. Without this, despite Tika detecting Tesseract, PDFs weren't getting OCR-ed. This problem wasn't documented anywhere either, and only by chance did I find what was needed: that a correctly configured tika-config.xml was compulsory to get PDFs OCR-ed by Tika+Tesseract, and that the Tesseract installation I created had been missing TESSDATA_PREFIX/configs/hocr.
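
The second half of the message, the missing TESSDATA_PREFIX/configs/hocr file, is easy to check for before running Tika. The following is a minimal sketch, not part of this changeset: the function name find_hocr_config is hypothetical, and it assumes TESSDATA_PREFIX points at the directory that contains the configs folder, as the path in the message suggests.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_hocr_config(env: Optional[Mapping[str, str]] = None) -> Optional[Path]:
    """Return TESSDATA_PREFIX/configs/hocr if that file exists, else None.

    Hypothetical helper: `env` defaults to the process environment and can be
    replaced by a plain dict for testing.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None  # variable unset or empty, so there is nothing to check
    candidate = Path(prefix) / "configs" / "hocr"
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    found = find_hocr_config()
    if found is None:
        print("configs/hocr not found under TESSDATA_PREFIX")
    else:
        print(f"found hocr config: {found}")
```

A check like this only covers the Tesseract installation; it says nothing about whether tika-config.xml itself is set up correctly.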

Location:

gs2-extensions/gstika/trunk

Files:

1 added
3 edited

GS_TIKA_README.txt (modified)
gstika.tar.gz (modified)
gstika.zip (modified)
java/tika-config.xml (added)

Changeset view not shown, since the total size (253.4 MB) exceeds 9.5 MB