source: main/trunk/greenstone2/bin/linux/xpdf-tools/doc/pdftohtml.1@ 32205

Last change on this file since 32205 was 32205, checked in by ak19, 6 years ago

First set of commits implementing the new 'paged_html' output option of PDFPlugin, which uses xpdftools' new pdftohtml. So far tested only on Linux (64 bit), but things work there, so I'm optimistically committing the changes. 1. Committing the pre-built Linux binaries of XPDFtools for both 32 and 64 bit, built by the XPDF group. 2. To use the correct bitness variant of xpdftools, setup.bash now exports the BITNESS env var, consulted by gsConvert.pl. 3. All the perl code changes to do with using xpdftools' pdftohtml to generate paged_html and feed it in the desired form into GS(3): gsConvert.pl, PDFPlugin.pm and its parent ConvertBinaryPFile.pm have been modified to make it all work. xpdftools' pdftohtml generates a folder containing an html file and a screenshot for each page in a PDF (as well as an index.html linking to each page's html). However, we want a single html file that contains each individual 'page' html's content in a div, and we need to do some further HTML style, attribute and structure modifications to massage the xpdftools output into what we want for GS. To parse and manipulate the HTML 'DOM' for this, we're using the Mojo::DOM package that Dr Bainbridge found and compiled. Mojo::DOM is therefore also committed in this revision. Some further changes and some display fixes are required, but I need to check with the others about that.

.\" Copyright 1997-2017 Glyph & Cog, LLC
.TH pdftohtml 1 "10 Aug 2017"
.SH NAME
pdftohtml \- Portable Document Format (PDF) to HTML converter
(version 4.00)
.SH SYNOPSIS
.B pdftohtml
[options]
.I PDF-file
.I HTML-dir
.SH DESCRIPTION
.B Pdftohtml
converts Portable Document Format (PDF) files to HTML.
.PP
Pdftohtml reads the PDF file,
.IR PDF-file ,
and places an HTML file for each page, along with auxiliary images
in the directory,
.IR HTML-dir .
The HTML directory will be created; if it already exists, pdftohtml
will report an error.
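.PP
As an illustration of the invocation described above, here is a minimal
Python sketch that runs pdftohtml on one file. The helper names
build_command and convert are hypothetical, and the sketch assumes a
pdftohtml executable is on the PATH. The early check only mirrors the
documented rule that the HTML directory must not already exist.

```python
import subprocess
from pathlib import Path


def build_command(pdf_file, html_dir, exe="pdftohtml"):
    """Return the argv list for: pdftohtml PDF-file HTML-dir."""
    return [exe, str(pdf_file), str(html_dir)]


def convert(pdf_file, html_dir, exe="pdftohtml"):
    """Run pdftohtml, refusing early if the output directory exists."""
    if Path(html_dir).exists():
        # pdftohtml itself reports an error in this case; checking first
        # simply gives the caller a clearer Python exception.
        raise FileExistsError(f"{html_dir} already exists")
    # The return value is the process exit code (0 means no error).
    return subprocess.run(build_command(pdf_file, html_dir, exe)).returncode
```

A caller would use something like convert("report.pdf", "report_html"),
where both names are placeholders.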
.SH CONFIGURATION FILE
Pdftohtml reads a configuration file at startup. It first tries to
find the user's private config file, ~/.xpdfrc. If that doesn't
exist, it looks for a system-wide config file, typically
/usr/local/etc/xpdfrc (but this location can be changed when pdftohtml
is built). See the
.BR xpdfrc (5)
man page for details.
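.PP
The lookup order described above can be sketched as follows. This is
an illustration only, not pdftohtml's own code: the function name
find_config is hypothetical, and the system-wide path is the typical
default, which a given build may have changed.

```python
from pathlib import Path

# Typical system-wide location; the real path is fixed at build time.
SYSTEM_CONFIG = Path("/usr/local/etc/xpdfrc")


def find_config(home=None, system_config=SYSTEM_CONFIG):
    """Return the config file that would be read, or None if neither exists."""
    home = Path(home) if home is not None else Path.home()
    user_config = home / ".xpdfrc"
    if user_config.exists():
        # The user's private config file takes precedence.
        return user_config
    if Path(system_config).exists():
        # Fall back to the system-wide config file.
        return Path(system_config)
    return None
```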
.SH OPTIONS
Many of the following options can be set with configuration file
commands. These are listed in square brackets with the description of
the corresponding command line option.
.TP
.BI \-f " number"
Specifies the first page to convert.
.TP
.BI \-l " number"
Specifies the last page to convert.
.TP
.BI \-z " number"
Specifies the initial zoom level. The default is 1.0, which means
72dpi, i.e., 1 point in the PDF file will be 1 pixel in the HTML.
Using \'-z 1.5', for example, will make the initial view 50% larger.
.TP
.BI \-r " number"
Specifies the resolution, in DPI, for background images. This
controls the pixel size of the background image files. The initial
zoom level is controlled by the \'-z' option. Specifying a larger
\'-r' value will allow the viewer to zoom in farther without upscaling
artifacts in the background.
.TP
.B \-skipinvisible
Don't draw invisible text. By default, invisible text (commonly used
in OCR'ed PDF files) is drawn as transparent (alpha=0) HTML text.
This option tells pdftohtml to discard invisible text entirely.
.TP
.B \-allinvisible
Treat all text as invisible. By default, regular (non-invisible) text
is not drawn in the background image, and is instead drawn with HTML
on top of the image. This option tells pdftohtml to include the
regular text in the background image, and then draw it as transparent
(alpha=0) HTML text.
.TP
.BI \-opw " password"
Specify the owner password for the PDF file. Providing this will
bypass all security restrictions.
.TP
.BI \-upw " password"
Specify the user password for the PDF file.
.TP
.B \-q
Don't print any messages or errors.
.RB "[config file: " errQuiet ]
.TP
.BI \-cfg " config-file"
Read
.I config-file
in place of ~/.xpdfrc or the system-wide config file.
.TP
.B \-v
Print copyright and version information.
.TP
.B \-h
Print usage information.
.RB ( \-help
and
.B \-\-help
are equivalent.)
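.PP
The relationship between the zoom and resolution options above can be
worked through with a little arithmetic. The sketch below is an
illustration, not pdftohtml's own code: the function names are
hypothetical, the rounding is an assumption, and the last function
relies on the inference that upscaling begins once the displayed
resolution (72 times the zoom) exceeds the background resolution.

```python
POINTS_PER_INCH = 72  # zoom 1.0 means 72 dpi: 1 PDF point is 1 pixel


def view_size_px(width_pt, height_pt, zoom=1.0):
    """Pixel size of the initial view for a page measured in points."""
    return round(width_pt * zoom), round(height_pt * zoom)


def background_size_px(width_pt, height_pt, resolution_dpi):
    """Pixel size of the background image at the given resolution."""
    scale = resolution_dpi / POINTS_PER_INCH
    return round(width_pt * scale), round(height_pt * scale)


def max_zoom_without_upscaling(resolution_dpi):
    """Largest zoom at which the background is not upscaled (inferred)."""
    return resolution_dpi / POINTS_PER_INCH


# US Letter is 612 x 792 points.
print(view_size_px(612, 792, 1.5))         # (918, 1188)
print(background_size_px(612, 792, 150))   # (1275, 1650)
print(max_zoom_without_upscaling(144))     # 2.0
```

With a zoom of 1.5, a 612 x 792 point page opens at 918 x 1188 pixels,
which is the 50% enlargement mentioned for the zoom option.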
.SH BUGS
Some PDF files contain fonts whose encodings have been mangled beyond
recognition. There is no way (short of OCR) to extract text from
these files.
.SH EXIT CODES
The Xpdf tools use the following exit codes:
.TP
0
No error.
.TP
1
Error opening a PDF file.
.TP
2
Error opening an output file.
.TP
3
Error related to PDF permissions.
.TP
99
Other error.
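.PP
A calling script may want to turn the exit codes listed above into
messages. The following Python sketch shows one way to do that. The
names EXIT_CODES and describe_exit_code are hypothetical, and the
fallback text for unlisted codes is an addition of this sketch.

```python
# Exit codes as documented for the Xpdf tools.
EXIT_CODES = {
    0: "No error.",
    1: "Error opening a PDF file.",
    2: "Error opening an output file.",
    3: "Error related to PDF permissions.",
    99: "Other error.",
}


def describe_exit_code(code):
    """Map a process exit code to its documented meaning."""
    # Codes not in the table get a generic fallback message.
    return EXIT_CODES.get(code, f"Unrecognized exit code {code}.")
```

The code passed in would typically be the returncode attribute of the
result of subprocess.run.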
.SH AUTHOR
The pdftohtml software and documentation are copyright 1996-2017 Glyph
& Cog, LLC.
.SH "SEE ALSO"
.BR xpdf (1),
.BR pdftops (1),
.BR pdftotext (1),
.BR pdfinfo (1),
.BR pdffonts (1),
.BR pdfdetach (1),
.BR pdftoppm (1),
.BR pdftopng (1),
.BR pdfimages (1),
.BR xpdfrc (5)
.br
.B http://www.xpdfreader.com/