=head1 NAME

lwpcook - libwww-perl cookbook

=head1 DESCRIPTION

This document contains some examples that show typical usage of the
libwww-perl library.  You should consult the documentation for the
individual modules for more detail.

All examples should be runnable programs.  You can, in most cases, test
the code sections by piping the program text directly to perl.


=head1 GET

It is very easy to use this library to just fetch documents from the
net.  The LWP::Simple module provides the get() function, which returns
the document specified by its URL argument (or undef if the request
fails):

  use LWP::Simple;
  my $doc = get 'http://www.sn.no/libwww-perl/';

or, as a perl one-liner using the getprint() function:

  perl -MLWP::Simple -e 'getprint "http://www.sn.no/libwww-perl/"'

or, how about fetching the latest perl by running this command:

  perl -MLWP::Simple -e '
    getstore "ftp://ftp.sunet.se/pub/lang/perl/CPAN/src/latest.tar.gz",
             "perl.tar.gz"'

You will probably first want to find a CPAN site closer to you by
running something like the following command:

  perl -MLWP::Simple -e 'getprint "http://www.perl.com/perl/CPAN/CPAN.html"'

Enough of this simple stuff!  The LWP object-oriented interface gives
you full control over the headers sent to the server and over how you
handle the response returned:

  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;
  $ua->agent("$0/0.1 " . $ua->agent);
  # $ua->agent("Mozilla/8.0");  # pretend we are a very capable browser

  my $req = HTTP::Request->new(GET => 'http://www.sn.no/libwww-perl');
  $req->header('Accept' => 'text/html');

  # send request
  my $res = $ua->request($req);

  # check the outcome
  if ($res->is_success) {
      print $res->content;
  } else {
      print "Error: " . $res->status_line . "\n";
  }

The lwp-request program (alias GET) that is distributed with the
library can also be used to fetch documents from WWW servers.

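To make the "full control" claim concrete, here is a minimal sketch that
builds a request and inspects it without sending anything.  The agent
name C<lwpcook-example/0.1> and the URL C<http://www.example.com/> are
placeholders invented for this illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Placeholder agent name and URL; substitute your own.
my $ua  = LWP::UserAgent->new(agent => 'lwpcook-example/0.1');
my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->header(Accept => 'text/html');

# Inspect the request before it goes anywhere.
print $req->method, " ", $req->uri, "\n";
print "Accept: ", $req->header('Accept'), "\n";
print "Agent:  ", $ua->agent, "\n";

# Sending it is then a single call (needs network access):
#   my $res = $ua->request($req);
#   print $res->is_success ? $res->content : $res->status_line, "\n";
```

Printing the request first is a cheap way to confirm which headers you
are about to send.
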
=head1 HEAD

If you just want to check whether a document is present (i.e. the URL
is valid), try code that looks like this:

  use LWP::Simple;

  if (head($url)) {
      # ok, document exists
  }

The head() function really returns a list of meta-information about
the document.  The first three values of the list returned are the
document type, the size of the document, and the time the document
was last modified.

More control over the request, or access to all header values returned,
requires that you use the object-oriented interface described for GET
above.  Just s/GET/HEAD/g.


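The s/GET/HEAD/g advice can be sketched as follows.  To keep the
example runnable without a network connection, it assumes a temporary
local file addressed through a C<file:> URL; with a real server you
would pass an C<http:> URL instead.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use URI::file;
use File::Temp qw(tempfile);

# Create a small local file so the example runs without a network.
my ($fh, $path) = tempfile(SUFFIX => '.txt', UNLINK => 1);
binmode $fh;
print {$fh} "hello\n";
close $fh or die "close: $!";

my $ua  = LWP::UserAgent->new;
my $req = HTTP::Request->new(HEAD => URI::file->new_abs($path));
my $res = $ua->request($req);

if ($res->is_success) {
    # A HEAD response carries headers only, no document body.
    print "Type:   ", ($res->content_type   // 'unknown'), "\n";
    print "Length: ", ($res->content_length // 'unknown'), "\n";
} else {
    print "Error: ", $res->status_line, "\n";
}
```

All other response headers are available through C<< $res->header(...) >>
in the same way.
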
=head1 POST

There is no simple procedural interface for posting data to a WWW
server.  You must use the object-oriented interface for this.  The most
common POST operation is to access a WWW form application:

  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;

  my $req = HTTP::Request->new(POST => 'http://www.perl.com/cgi-bin/BugGlimpse');
  $req->content_type('application/x-www-form-urlencoded');
  $req->content('match=www&errors=0');

  my $res = $ua->request($req);
  print $res->as_string;

Lazy people use the HTTP::Request::Common module to set up a suitable
POST request message.  It handles all the escaping issues and provides
a suitable default for the content_type:

  use HTTP::Request::Common qw(POST);
  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;

  my $req = POST 'http://www.perl.com/cgi-bin/BugGlimpse',
                [ match => 'www', errors => 0 ];

  print $ua->request($req)->as_string;

The lwp-request program (alias POST) that is distributed with the
library can also be used for posting data.


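To see what "handles all the escaping issues" means in practice, the
following sketch builds a POST request whose field value contains a
space and an ampersand, then prints the encoded body instead of sending
it.  The URL and the field values are hypothetical.

```perl
use strict;
use warnings;
use HTTP::Request::Common qw(POST);

# Hypothetical form URL and field values, chosen to need escaping.
my $req = POST 'http://www.example.com/cgi-bin/search',
               [ match => 'fish & chips', errors => 0 ];

print $req->method, " ", $req->uri, "\n";
print "Content-Type: ", $req->content_type, "\n";
print $req->content, "\n";
```

Had the same value been pasted into a hand-written content string, the
raw C<&> would have been read by the server as a field separator.
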
=head1 PROXIES

Some sites use proxies to go through firewall machines, or just as a
cache in order to improve performance.  Proxies can also be used for
accessing resources through protocols not supported directly (or
supported badly :-) by the libwww-perl library.

You should initialize your proxy setting before you start sending
requests:

  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;
  $ua->env_proxy; # initialize from environment variables
  # or
  $ua->proxy(ftp  => 'http://proxy.myorg.com');
  $ua->proxy(wais => 'http://proxy.myorg.com');
  $ua->no_proxy(qw(no se fi));

  my $req = HTTP::Request->new(GET => 'wais://xxx.com/');
  print $ua->request($req)->as_string;

The LWP::Simple interface will call env_proxy() for you automatically.
Applications that use the $ua->env_proxy() method will normally not
use the $ua->proxy() and $ua->no_proxy() methods.

Some proxies also require that you send them a username and password
in order to let requests through.  You should be able to add the
required header with something like this:

  use LWP::UserAgent;

  my $ua = LWP::UserAgent->new;
  $ua->proxy(['http', 'ftp'] => 'http://proxy.myorg.com');

  my $req = HTTP::Request->new(GET => 'http://www.perl.com');
  $req->proxy_authorization_basic("proxy_user", "proxy_password");

  my $res = $ua->request($req);
  print $res->content if $res->is_success;

Replace C<proxy.myorg.com>, C<proxy_user> and
C<proxy_password> with something suitable for your site.

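A quick way to check a proxy configuration is to read the settings back
and look at the header that proxy_authorization_basic() produces, all
without sending a request.  This is only a sketch; the host
C<proxy.example.com>, the port and the credentials are placeholders.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# proxy.example.com, the port, and the credentials are placeholders.
my $ua = LWP::UserAgent->new;
$ua->proxy(['http', 'ftp'] => 'http://proxy.example.com:8080/');
$ua->no_proxy('localhost', 'intranet.example.com');

# Called with only a scheme, proxy() returns the current setting.
for my $scheme (qw(http ftp)) {
    print "$scheme via ", $ua->proxy($scheme), "\n";
}

my $req = HTTP::Request->new(GET => 'http://www.example.com/');
$req->proxy_authorization_basic('proxy_user', 'proxy_password');

# The credentials travel base64-encoded (not encrypted) in this header.
print "Proxy-Authorization: ", $req->header('Proxy-Authorization'), "\n";
```

Because basic credentials are only encoded, not encrypted, anyone who
can observe the connection to the proxy can read them.
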
=head1 ACCESS TO PROTECTED DOCUMENTS

Documents protected by basic authorization can easily be accessed
like this:

  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;
  my $req = HTTP::Request->new(GET => 'http://www.sn.no/secret/');
  $req->authorization_basic('aas', 'mypassword');
  print $ua->request($req)->as_string;

The other alternative is to provide a subclass of I<LWP::UserAgent> that
overrides the get_basic_credentials() method.  Study the I<lwp-request>
program for an example of this.

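The subclassing alternative might look roughly like the sketch below.
The class name C<MyAgent>, the realm name C<Secret Area> and the lookup
table are invented for illustration; the method is called directly here
so that the example runs without contacting a server.

```perl
package MyAgent;
use strict;
use warnings;
use parent 'LWP::UserAgent';

# Hypothetical realm-to-credentials table.
my %password_for = (
    'Secret Area' => [ 'aas', 'mypassword' ],
);

# LWP calls this when a server answers 401 (or a proxy answers 407).
# Returning an empty list means "no credentials; give up".
sub get_basic_credentials {
    my ($self, $realm, $uri, $isproxy) = @_;
    return if $isproxy;
    my $cred = $password_for{$realm} or return;
    return @$cred;
}

package main;
use URI;

my $ua = MyAgent->new;
my ($user, $pass) = $ua->get_basic_credentials(
    'Secret Area', URI->new('http://www.sn.no/secret/'), 0);
print "would log in as $user\n";
```

With such a subclass, requests are sent without an Authorization header
first and credentials are supplied only when the server challenges for
them.  LWP::UserAgent also has a credentials() method for registering a
user name and password for a given host and realm without subclassing.
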
=head1 MIRRORING

If you want to mirror documents from a WWW server, then try to run
code similar to this at regular intervals:

  use LWP::Simple;

  my %mirrors = (
     'http://www.sn.no/'             => 'sn.html',
     'http://www.perl.com/'          => 'perl.html',
     'http://www.sn.no/libwww-perl/' => 'lwp.html',
     'gopher://gopher.sn.no/'        => 'gopher.html',
  );

  while (my ($url, $localfile) = each(%mirrors)) {
      mirror($url, $localfile);
  }

Or, as a perl one-liner:

  perl -MLWP::Simple -e 'mirror("http://www.perl.com/", "perl.html")'

The document will not be transferred unless it has been updated.


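The mirror() function returns the response status code, which lets a
script tell a fresh download from an unchanged document.  The sketch
below assumes a local file addressed by a C<file:> URL in place of a
remote server, so that it can be run offline; the file names are made
up.

```perl
use strict;
use warnings;
use LWP::Simple;
use URI::file;
use File::Spec;
use File::Temp qw(tempdir);

# A local source file stands in for the remote document (assumption,
# so that the sketch runs without a network connection).
my $dir  = tempdir(CLEANUP => 1);
my $src  = File::Spec->catfile($dir, 'source.html');
my $copy = File::Spec->catfile($dir, 'mirror.html');

open my $out, '>', $src or die "open $src: $!";
binmode $out;
print {$out} "<html><body>hello</body></html>\n";
close $out or die "close $src: $!";

my $url = URI::file->new_abs($src)->as_string;

for my $attempt (1, 2) {
    my $rc = mirror($url, $copy);
    if ($rc == RC_NOT_MODIFIED) {
        print "attempt $attempt: local copy is up to date\n";
    } elsif (is_success($rc)) {
        print "attempt $attempt: fetched a fresh copy\n";
    } else {
        print "attempt $attempt: failed with status $rc\n";
    }
}
```

Checking the return value this way makes it easy to trigger follow-up
work, such as re-indexing, only when something actually changed.
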
=head1 LARGE DOCUMENTS

If the document you want to fetch is too large to be kept in memory,
then you have two alternatives.  You can instruct the library to write
the document content to a file (the second $ua->request() argument is
a file name):

  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;

  my $req = HTTP::Request->new(GET =>
     'http://www.sn.no/~aas/perl/www/libwww-perl-5.00.tar.gz');
  my $res = $ua->request($req, "libwww-perl.tar.gz");
  if ($res->is_success) {
      print "ok\n";
  } else {
      print $res->status_line, "\n";
  }

Or you can process the document as it arrives (the second
$ua->request() argument is a code reference):

  use LWP::UserAgent;
  my $ua = LWP::UserAgent->new;
  my $URL = 'ftp://ftp.unit.no/pub/rfc/rfc-index.txt';

  my $expected_length;
  my $bytes_received = 0;
  my $res =
     $ua->request(HTTP::Request->new(GET => $URL),
               sub {
                   my($chunk, $res) = @_;
                   $bytes_received += length($chunk);
                   unless (defined $expected_length) {
                      $expected_length = $res->content_length || 0;
                   }
                   if ($expected_length) {
                        printf STDERR "%d%% - ",
                                  100 * $bytes_received / $expected_length;
                   }
                   print STDERR "$bytes_received bytes received\n";

                   # XXX Should really do something with the chunk itself
                   # print $chunk;
               });
  print $res->status_line, "\n";


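As one possible answer to the "do something with the chunk" remark, the
sketch below feeds each chunk into an MD5 digest, so the whole document
never has to sit in memory.  It assumes a temporary local file and a
C<file:> URL in place of a large remote document, and the 4096-byte
size hint is an arbitrary choice.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;
use URI::file;
use File::Temp qw(tempfile);
use Digest::MD5;

# Assumption: a local 10,000-byte file stands in for a large remote
# document, so the sketch runs without a network connection.
my $data = 'x' x 10_000;
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $data;
close $fh or die "close: $!";

my $ua     = LWP::UserAgent->new;
my $req    = HTTP::Request->new(GET => URI::file->new_abs($path));
my $md5    = Digest::MD5->new;
my $chunks = 0;

# The optional third argument is a hint for the preferred chunk size.
my $res = $ua->request($req, sub {
    my ($chunk, $response) = @_;
    $md5->add($chunk);     # consume the chunk, then let it go
    $chunks++;
}, 4096);

die "Fetch failed: ", $res->status_line, "\n" unless $res->is_success;
my $digest = $md5->hexdigest;
print "MD5 $digest computed from $chunks chunk(s)\n";
```

When a callback is given, the content is handed to the callback and is
not stored in the response object, so C<< $res->content >> stays empty.
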
=head1 HTML FORMATTING

It is easy to convert HTML code to "readable" text.

  use LWP::Simple;
  use HTML::Parse;
  print parse_html(get 'http://www.sn.no/libwww-perl/')->format;


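The HTML::Parse documentation describes that module as a
backwards-compatibility interface and points new code at
HTML::TreeBuilder.  A hedged equivalent of the one-liner above, using
HTML::TreeBuilder and HTML::FormatText on a literal HTML string rather
than a fetched page, might look like this (the sample markup and the
margin values are made up):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# A literal HTML string stands in for a fetched document.
my $html = '<html><body><h1>Hello</h1>'
         . '<p>libwww-perl <b>cookbook</b></p></body></html>';

my $tree = HTML::TreeBuilder->new_from_content($html);
my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 60)
                           ->format($tree);
$tree->delete;   # release the parse tree

print $text;
```

To format a live page, pass the result of LWP::Simple's get() to
new_from_content() after checking that it is defined.
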
=head1 PARSE URLS

To access individual elements of a URL, try this:

  use URI::URL;
  my $host = url("http://www.sn.no/")->host;

or

  use URI::URL;
  my $u = url("ftp://ftp.sn.no/test/aas;type=i");
  print "Protocol scheme is ", $u->scheme, "\n";
  print "Host is ", $u->host, " at port ", $u->port, "\n";

or even

  use URI::URL;
  my($host,$port) = (url("ftp://ftp.sn.no/test/aas;type=i")->crack)[3,4];

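The URI::URL documentation describes that class as a compatibility
layer over the URI module.  If you prefer to use URI directly, the same
accessors are available; the following is a small sketch using the FTP
URL from the example above.

```perl
use strict;
use warnings;
use URI;

my $u = URI->new('ftp://ftp.sn.no/test/aas;type=i');

print "Scheme: ", $u->scheme, "\n";
print "Host:   ", $u->host,   "\n";
print "Port:   ", $u->port,   "\n";   # default port for the scheme
print "Path:   ", $u->path,   "\n";
```

Since the URL does not name a port, port() falls back to the default
for the scheme.
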
=head1 EXPAND RELATIVE URLS

This code reads URLs and prints an expanded version of each:

  use URI::URL;
  my $BASE = "http://www.sn.no/some/place?query";
  while (<>) {
      chomp;
      print url($_, $BASE)->abs->as_string, "\n";
  }

We can expand URLs in an HTML document by using the parser to build a
tree that we then traverse:

  my %link_elements =
  (
   'a'    => 'href',
   'img'  => 'src',
   'form' => 'action',
   'link' => 'href',
  );

  use HTML::Parse;
  use URI::URL;

  my $BASE = "http://somewhere/root/";
  my $h = parse_htmlfile("xxx.html");
  $h->traverse(\&expand_urls, 1);

  print $h->as_HTML;

  sub expand_urls
  {
     my($e, $start) = @_;
     return 1 unless $start;
     my $attr = $link_elements{$e->tag};
     return 1 unless defined $attr;
     my $url = $e->attr($attr);
     return 1 unless defined $url;
     $e->attr($attr, url($url, $BASE)->abs->as_string);
     return 1;  # keep traversing
  }


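To show what "expanded" means for different kinds of references, this
sketch resolves a few hand-picked examples against the base URL used
above.  It uses the URI module's new_abs() constructor; the relative
references in the list are invented.

```perl
use strict;
use warnings;
use URI;

my $base = 'http://www.sn.no/some/place?query';

# Invented references: sibling, parent directory, site root, absolute.
my @refs = ('other.html', '../up.html', '/top.html', 'http://www.perl.com/');

my %expanded;
for my $rel (@refs) {
    my $abs = URI->new_abs($rel, $base);
    $expanded{$rel} = "$abs";
    print "$rel => $abs\n";
}
```

Note that the last path segment and the query of the base URL are
dropped during resolution, and that a reference which is already
absolute is returned unchanged.
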
=head1 BASE URL

If you want to resolve relative links in a page, you will have to
determine which base URL to use.  HTTP::Response objects have a
base() method for this:

  my $BASE = $res->base;


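The sketch below shows base() on responses built by hand, so that no
request needs to be sent.  The URLs are hypothetical.  Without a
base-setting header the method falls back to the URL of the request; a
C<Content-Base> header, when present, takes precedence.

```perl
use strict;
use warnings;
use HTTP::Request;
use HTTP::Response;
use URI;

# Build a response by hand (hypothetical URL) instead of fetching one.
my $res = HTTP::Response->new(200, 'OK');
$res->request(HTTP::Request->new(GET => 'http://www.example.com/dir/page.html'));

my $base = $res->base;
my $link = URI->new_abs('img/logo.png', $base);
print "Base: $base\n";
print "Link: $link\n";

# A Content-Base header overrides the request URL.
$res->header('Content-Base' => 'http://mirror.example.com/docs/');
my $other = $res->base;
print "Base: $other\n";
```

Using base() instead of the URL you originally asked for matters after
redirects, since the response then carries the request that was
actually answered.
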
=head1 COPYRIGHT

Copyright 1996-1997, Gisle Aas

This library is free software; you can redistribute it and/or
modify it under the same terms as Perl itself.
