Ticket #390 (new defect)

Opened 9 years ago

Last modified 6 years ago

pdf conversion to text

Reported by: kjdon
Owned by: nobody
Priority: moderate
Milestone: 2.87 Release
Component: Collection Building: Plugins
Severity: minor
Keywords:
Cc:

Description

If you set convert_to to text for PDFPlugin, it tries to run pdftotext. We don't supply that program, so the conversion fails.

Should we supply it?

Should we try a different format?
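
One way to make the failure less surprising would be to check for the external program before attempting the conversion. The following is only a minimal sketch of that idea in Python, not Greenstone's actual plugin code (which is Perl); the function names are hypothetical, and the only assumption about the environment is that pdftotext, when installed, is reachable through PATH.

```python
import shutil


def locate_converter(program="pdftotext", which=shutil.which):
    """Return the full path of `program` if it is on PATH, else None.

    `which` is injectable so the lookup can be exercised without the
    real binary being installed.
    """
    return which(program)


def check_text_conversion(program="pdftotext", which=shutil.which):
    """Report whether PDF-to-text conversion can be attempted.

    Returns a (can_convert, message) tuple instead of failing
    part-way through a collection build.
    """
    path = locate_converter(program, which)
    if path is None:
        return (False, f"{program} not found on PATH; PDF to text conversion would fail")
    return (True, f"using {path}")


if __name__ == "__main__":
    ok, message = check_text_conversion()
    print(message)
```

With a check like this, the plugin could warn up front or fall back to another output format rather than failing during the build.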

Change History

Changed 9 years ago by davidb

  • milestone changed from Release 2.81 to Release 2.82

The original problem was detected on Linux. It probably applies to Windows and Mac as well. Could we use Ghostscript instead?

Changed 8 years ago by kjdon

  • milestone changed from Release 2.82 to Release 2.83

Changed 8 years ago by kjdon

  • component changed from Collection Building to Collection Building: Plugins
  • milestone changed from Greenstone 2 wishlist to 2.84 Release

Changed 7 years ago by kjdon

  • milestone changed from 2.84 Release to 2.85 Release

Changed 7 years ago by kjdon

If you are using the new PDFBox extension, it can convert to both HTML and text.
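
For illustration, here is a rough sketch of how the two output formats differ when PDFBox is driven from the command line. It assumes the standalone PDFBox 1.x/2.x command-line app and its ExtractText tool with the -html switch; the Greenstone extension may well invoke PDFBox differently, and the jar and file paths are hypothetical placeholders.

```python
def pdfbox_extract_command(jar_path, pdf_path, out_path, as_html=False):
    """Build the argument list for PDFBox's ExtractText tool.

    Assumes the PDFBox 1.x/2.x command-line app. `jar_path` is a
    hypothetical location of the pdfbox-app jar, not a path taken
    from the Greenstone extension.
    """
    cmd = ["java", "-jar", jar_path, "ExtractText"]
    if as_html:
        # -html asks ExtractText for HTML output instead of raw text
        cmd.append("-html")
    cmd += [pdf_path, out_path]
    return cmd


if __name__ == "__main__":
    # Print the two command lines rather than running them.
    print(" ".join(pdfbox_extract_command("pdfbox-app.jar", "in.pdf", "out.txt")))
    print(" ".join(pdfbox_extract_command("pdfbox-app.jar", "in.pdf", "out.html", as_html=True)))
```

The returned list could be passed to subprocess.run once the jar location is known.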

Changed 6 years ago by sjm84

  • milestone changed from 2.85 Release to 2.86 Release

Changed 6 years ago by ak19

Just committed some minor changes (r24199 and r24200) that allow PDFBox to convert to text.

The following Perl module is described as being capable of PDF-to-text conversion:

 http://search.cpan.org/~cdolan/CAM-PDF-1.55/lib/CAM/PDF.pm

I don't know yet how it copes with the latest PDF version. See also:

 http://search.cpan.org/~antro/PDF-111/PDF.pm
