Changeset 34186

Timestamp:
16.06.2020 15:00:39
Author:
ak19
Message:

For tika + tesseract to OCR PDFs (note that tesseract can't OCR PDFs on its own), tika needs to be passed a tika-config.xml file that is configured to use txt OR hocr as outputType. If outputType=hocr, then the tesseract/tessdata/configs folder must contain, at minimum, a file called hocr. The build process now ensures that tessdata/configs and the other tessdata subfolders in the extracted tesseract source package get copied across into the GEXTTESS_INSTALLED install location. Also updating the README with these notes and the tesseract bin tarball.
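
The copy step described above could be sketched roughly as follows. This is an illustrative Python sketch only, not the extension's actual build script: the function names copy_tessdata_subfolders and has_hocr_config are hypothetical, and the exact source and destination paths (including where tessdata sits under GEXTTESS_INSTALLED) are assumptions that the caller has to supply.

```python
import shutil
from pathlib import Path
from typing import List


def copy_tessdata_subfolders(src_tessdata: Path, dest_tessdata: Path) -> List[str]:
    """Copy every subfolder of src_tessdata into dest_tessdata.

    Hypothetical helper: src_tessdata stands for the tessdata folder of the
    extracted tesseract source package, dest_tessdata for the tessdata folder
    in the install location. Returns the names of the subfolders copied.
    """
    dest_tessdata.mkdir(parents=True, exist_ok=True)
    copied = []
    for entry in sorted(src_tessdata.iterdir()):
        if entry.is_dir():
            # dirs_exist_ok (Python 3.8+) lets a rebuild overwrite an earlier copy.
            shutil.copytree(entry, dest_tessdata / entry.name, dirs_exist_ok=True)
            copied.append(entry.name)
    return copied


def has_hocr_config(tessdata: Path) -> bool:
    """Return True if tessdata/configs contains a file called hocr,
    which the message above says is needed when outputType=hocr."""
    return (tessdata / "configs" / "hocr").is_file()
```

A build script could call copy_tessdata_subfolders once after extracting the source package and then fail early if has_hocr_config returns False for the installed tessdata folder.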

Location:
gs2-extensions/tesseract/trunk
Files:
4 modified