Changeset 34186

Timestamp:

2020-06-16T15:00:39+12:00

Author:

ak19

Message:

To get tika + tesseract to OCR PDFs (tesseract can't OCR PDFs on its own), tika must be passed a tika-config.xml file that sets outputType to either txt or hocr. If outputType=hocr, the tesseract/tessdata/configs folder must contain, at minimum, a file called hocr. The build process now ensures that tessdata/configs and the other tessdata subfolders in the extracted tesseract source package get copied across into the GEXTTESS_INSTALLED install location. Also updating the README with these notes, and updating the tesseract bin tarball.
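
The copy step described in the message can be sketched roughly as follows. This is an illustrative sketch only, not the actual contents of TESSERACT.sh: the function names `copy_tessdata_subdirs` and `has_hocr_config` are hypothetical, and the exact location of tessdata under GEXTTESS_INSTALLED is an assumption, so both paths are passed in as arguments.

```shell
#!/bin/sh
# Hypothetical helpers (not taken from the real build scripts).

# Copy every subfolder of the source tessdata directory (configs and
# any others) into the installed tessdata directory.
# $1 = tessdata dir in the extracted tesseract source package
# $2 = tessdata dir in the install location (layout is an assumption)
copy_tessdata_subdirs() {
    src_tessdata=$1
    dest_tessdata=$2

    mkdir -p "$dest_tessdata" || return 1

    for dir in "$src_tessdata"/*/; do
        # If there are no subfolders the glob stays unexpanded; skip it.
        [ -d "$dir" ] || continue
        # Drop the trailing slash so cp copies the folder itself,
        # not just its contents, on both GNU and BSD cp.
        dir=${dir%/}
        cp -R "$dir" "$dest_tessdata"/ || return 1
    done
    return 0
}

# Succeed only if the tessdata directory has the configs/hocr file
# that outputType=hocr needs.
# $1 = tessdata dir to check
has_hocr_config() {
    [ -f "$1/configs/hocr" ]
}
```

A build script could call `copy_tessdata_subdirs` with the source and installed tessdata directories, then use `has_hocr_config` on the installed directory to fail early if the hocr config file did not make it across.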

Location:

gs2-extensions/tesseract/trunk

Files:

4 edited

README.txt (modified)
src/CASCADE-MAKE.sh (modified)
src/packages/CASCADE-MAKE/TESSERACT.sh (modified)
tesseract-linux-x64.tar.gz (modified)

Changeset view not shown, since the total size (67.0 MB) exceeds 9.5 MB
