.\" Copyright 1997-2017 Glyph & Cog, LLC
.TH pdftotext 1 "10 Aug 2017"
.SH NAME
pdftotext \- Portable Document Format (PDF) to text converter
(version 4.00)
.SH SYNOPSIS
.B pdftotext
[options]
.RI [ PDF-file
.RI [ text-file ]]
.SH DESCRIPTION
.B Pdftotext
converts Portable Document Format (PDF) files to plain text.
.PP
Pdftotext reads the PDF file,
.IR PDF-file ,
and writes a text file,
.IR text-file .
If
.I text-file
is not specified, pdftotext converts
.I file.pdf
to
.IR file.txt .
If
.I text-file
is \'-', the text is sent to stdout.
.SH CONFIGURATION FILE
Pdftotext reads a configuration file at startup.  It first tries to
find the user's private config file, ~/.xpdfrc.  If that doesn't
exist, it looks for a system-wide config file, typically
/usr/local/etc/xpdfrc (but this location can be changed when pdftotext
is built).  See the
.BR xpdfrc (5)
man page for details.
.SH OPTIONS
Many of the following options can be set with configuration file
commands.  These are listed in square brackets with the description of
the corresponding command line option.
.TP
.BI \-f " number"
Specifies the first page to convert.
.TP
.BI \-l " number"
Specifies the last page to convert.
.TP
.B \-layout
Maintain (as best as possible) the original physical layout of the
text.  The default is to \'undo' physical layout (columns,
hyphenation, etc.) and output the text in reading order.  If the
.B \-fixed
option is given, character spacing within each line will be determined
by the specified character pitch.
.TP
.B \-simple
Similar to
.BR \-layout ,
but optimized for simple one-column pages.  This mode will do a better
job of maintaining horizontal spacing, but it will only work properly
with a single column of text.
.TP
.B \-table
Table mode is similar to physical layout mode, but optimized for
tabular data, with the goal of keeping rows and columns aligned (at
the expense of inserting extra whitespace).  If the
.B \-fixed
option is given, character spacing within each line will be determined
by the specified character pitch.
.TP
.B \-lineprinter
Line printer mode uses a strict fixed-character-pitch and -height
layout.  That is, the page is broken into a grid, and characters are
placed into that grid.  If the grid spacing is too small for the
actual characters, the result is extra whitespace.  If the grid
spacing is too large, the result is missing whitespace.  The grid
spacing can be specified using the
.B \-fixed
and
.B \-linespacing
options.
If one or both are not given on the command line, pdftotext will
attempt to compute appropriate value(s).
.TP
.B \-raw
Keep the text in content stream order.  Depending on how the PDF file
was generated, this may or may not be useful.
.TP
.BI \-fixed " number"
Specify the character pitch (character width), in points, for physical
layout, table, or line printer mode.  This is ignored in all other
modes.
.TP
.BI \-linespacing " number"
Specify the line spacing, in points, for line printer mode.  This is
ignored in all other modes.
.TP
.B \-clip
Text which is hidden because of clipping is removed before doing
layout, and then added back in.  This can be helpful for tables where
clipped (invisible) text would overlap the next column.
.TP
.B \-nodiag
Diagonal text, i.e., text that is not close to one of the 0, 90, 180,
or 270 degree axes, is discarded.  This is useful to skip watermarks
drawn on top of body text, etc.
106 | .TP
|
---|
107 | .BI \-enc " encoding-name"
|
---|
108 | Sets the encoding to use for text output. The
|
---|
109 | .I encoding\-name
|
---|
110 | must be defined with the unicodeMap command (see
|
---|
111 | .BR xpdfrc (5)).
|
---|
112 | The encoding name is case-sensitive. This defaults to "Latin1" (which
|
---|
113 | is a built-in encoding).
|
---|
114 | .RB "[config file: " textEncoding ]
|
---|
115 | .TP
|
---|
116 | .BI \-eol " unix | dos | mac"
|
---|
117 | Sets the end-of-line convention to use for text output.
|
---|
118 | .RB "[config file: " textEOL ]
|
---|
119 | .TP
|
---|
120 | .B \-nopgbrk
|
---|
121 | Don't insert page breaks (form feed characters) between pages.
|
---|
122 | .RB "[config file: " textPageBreaks ]
|
---|
123 | .TP
|
---|
124 | .B \-bom
|
---|
125 | Insert a Unicode byte order marker (BOM) at the start of the text
|
---|
126 | output.
|
---|
.TP
.BI \-opw " password"
Specify the owner password for the PDF file.  Providing this will
bypass all security restrictions.
.TP
.BI \-upw " password"
Specify the user password for the PDF file.
.TP
.B \-q
Don't print any messages or errors.
.RB "[config file: " errQuiet ]
.TP
.BI \-cfg " config-file"
Read
.I config-file
in place of ~/.xpdfrc or the system-wide config file.
.TP
.B \-v
Print copyright and version information.
.TP
.B \-h
Print usage information.
.RB ( \-help
and
.B \-\-help
are equivalent.)
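.SH EXAMPLES
The invocations below are illustrative sketches that only combine the
options described above.  The file names
.IR report.pdf ,
.IR report.txt ,
and
.I my-xpdfrc
are hypothetical, and the numeric values are arbitrary.
.PP
Convert pages 2 through 5, keeping the physical layout:
.PP
.RS
.nf
pdftotext \-f 2 \-l 5 \-layout report.pdf report.txt
.fi
.RE
.PP
Send the text to stdout in content stream order, without form feeds
between pages:
.PP
.RS
.nf
pdftotext \-raw \-nopgbrk report.pdf \-
.fi
.RE
.PP
Use line printer mode with an explicit grid (character pitch and line
spacing are given in points), writing
.I report.txt
by default:
.PP
.RS
.nf
pdftotext \-lineprinter \-fixed 6 \-linespacing 12 report.pdf
.fi
.RE
.PP
The bracketed configuration commands can replace the corresponding
command line options.  Assuming the value syntax documented in
.BR xpdfrc (5),
a private configuration file might contain:
.PP
.RS
.nf
textEncoding    Latin1
textEOL         unix
textPageBreaks  no
errQuiet        yes
.fi
.RE
.PP
and be selected with:
.PP
.RS
.nf
pdftotext \-cfg my-xpdfrc report.pdf
.fi
.RE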
.SH BUGS
Some PDF files contain fonts whose encodings have been mangled beyond
recognition.  There is no way (short of OCR) to extract text from
these files.
.SH EXIT CODES
The Xpdf tools use the following exit codes:
.TP
0
No error.
.TP
1
Error opening a PDF file.
.TP
2
Error opening an output file.
.TP
3
Error related to PDF permissions.
.TP
99
Other error.
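.PP
As a minimal sketch (not part of pdftotext itself), a POSIX shell
script might act on these codes as follows.  The file names
.I input.pdf
and
.I output.txt
and the message texts are hypothetical:
.PP
.RS
.nf
pdftotext \-q input.pdf output.txt
case $? in
  0) ;;
  1) echo "cannot open PDF file" >&2 ;;
  2) echo "cannot open output file" >&2 ;;
  3) echo "PDF permissions error" >&2 ;;
  *) echo "other error" >&2 ;;
esac
.fi
.RE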
.SH AUTHOR
The pdftotext software and documentation are copyright 1996-2017 Glyph
& Cog, LLC.
.SH "SEE ALSO"
.BR xpdf (1),
.BR pdftops (1),
.BR pdftohtml (1),
.BR pdfinfo (1),
.BR pdffonts (1),
.BR pdfdetach (1),
.BR pdftoppm (1),
.BR pdftopng (1),
.BR pdfimages (1),
.BR xpdfrc (5)
.br
.B http://www.xpdfreader.com/