New pdftohtml with Xpdf tools - works with newer PDFs too
Reported by: ak19 | Owned by: ak19
Kathy found that users on the mailing list wanted more HTML output options with PDFPlugin. PDFBox's pagedimg output option was modified to produce img+text, but Kathy was hoping there were more possibilities for actual PDF to HTML support out there.
Dr Bainbridge first found PDFtoDOM, which is based on PDFBox. But this produced unsatisfactory HTML: sometimes fonts weren't extracted, often the fonts made the display hard to read due to overlapping characters, and there was a <div> element around every word rather than around every line.
Then Dr Bainbridge found Xpdf tools, which contains a new pdftohtml that produced results we liked. Its pdftohtml tool outputs an image of each PDF page's background with the text overlaid, all as HTML. One HTML doc is produced per page, and we'd manipulate these into a single sectionalised HTML doc.
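As a rough illustration of the "one HTML doc per page into a single sectionalised doc" step, here is a minimal Python sketch. It is not the code that was committed (that work is done in Perl with Mojo::DOM); the function names, the `pdf-page` class, the `pageN` ids and the "Page N" headings are all hypothetical, and a regular expression stands in for a real DOM parser. It only carries over each page's <body> content, so any per-page styles in <head> would need separate handling.

```python
import re

# Crude stand-in for a DOM parser: grab everything between <body ...> and </body>.
BODY_RE = re.compile(r"<body[^>]*>(.*?)</body>", re.IGNORECASE | re.DOTALL)


def extract_body(page_html):
    """Return the inner HTML of <body>, or the whole input if no body tag is found."""
    match = BODY_RE.search(page_html)
    return match.group(1).strip() if match else page_html.strip()


def merge_pages(page_htmls, title="Converted PDF"):
    """Wrap each page's body in its own section and join them into one HTML doc.

    page_htmls is a list of HTML strings, one per PDF page, in page order.
    The section markup (div class and ids) is an assumption for illustration.
    """
    sections = []
    for number, page_html in enumerate(page_htmls, start=1):
        sections.append(
            '<div class="pdf-page" id="page{0}">\n<h2>Page {0}</h2>\n{1}\n</div>'.format(
                number, extract_body(page_html)
            )
        )
    return (
        "<html>\n<head><title>{}</title></head>\n<body>\n{}\n</body>\n</html>\n".format(
            title, "\n".join(sections)
        )
    )
```

A caller would read the per-page files in page order, pass their contents to `merge_pages` as a list of strings, and write the returned string out as the single combined document.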
To get Xpdf tools to work with GS in this way:
- Downloaded the Xpdf tools binaries for Linux, Windows and Mac; eventually the tools are to be compiled from source for Linux and Mac.
- To manipulate the HTML DOM produced, Dr Bainbridge found the Perl module Mojo::DOM, which he compiled up.
- Then the code was modified to make use of these. The list of commit revisions so far follows below.
- This led to thinking that PDFPlugin needs to be restructured. Its configuration options were already complicated and mutually contradictory once pdfbox_conversion was included, and they would become more complicated and contradictory with the inclusion of Xpdf tools.
The commit revisions so far that make use of Xpdf tools' pdftohtml, and of its pdftotext to finally support PDF to text conversion on Windows, are as follows. None of these commits restructures PDFPlugin as yet.
Note that the Xpdf tools binaries for Mac have been committed to an svn-ignored folder and are not yet automatically checked out. Either we get Xpdf tools to compile from source (if we can get past the fact that Xpdf tools uses CMake to configure and build, rather than the autotools configure script that we're used to), or we find a better SVN location for the Mac binaries of Xpdf tools.