source: gsdl/tags/gsdl-2_71-distribution/gsdl/packages/w3mir/w3mir-1.0.8/w3mir.PL@ 14121

Last change on this file since 14121 was 719, checked in by davidb, 25 years ago

added w3mir package

  • Property svn:keywords set to Author Date Id Revision
File size: 98.1 KB
# -*-perl-*-

use Config;

&read_makefile;
$fullperl = resolve_make_var('FULLPERL') || $Config{'perlpath'};
$islib = resolve_make_var('INSTALLSITELIB');

$name = $0;
$name =~ s~^.*/~~;
$name =~ s~.PL$~~;

open(OUT,"> $name") ||
  die "Could not open $name for writing: $!\n";

print "writing $name\n";

while (<DATA>) {
  if (m~^\#!/.*/perl.*$~o) {
    # This substitutes the path perl was installed at on this system
    # _and_ removes any (-w) options.
    print OUT "#!",$fullperl,"\n";
    next;
  }
  if (/^use lib/o) {
    # This substitutes the actual library install path
    print OUT "use lib '$islib';\n";
    next;
  }
  print OUT;
}

close(OUT);

# Make it executable and writable too
chmod 0755, $name;

#### The library

sub resolve_make_var ($) {

  my($var) = shift @_;
  my($val) = $make{$var};

#  print "Resolving: ",$var,"=",$val,"\n";

  while ($val =~ s~\$\((\S+)\)~$make{$1}~g) {}
#  print "Resolved: $var: $make{$var} -> $val\n";
  $val;
}
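# For illustration (the values here are hypothetical): given a Makefile
# containing
#   PREFIX = /usr/local
#   INSTALLSITELIB = $(PREFIX)/lib/perl5/site_perl
# resolve_make_var('INSTALLSITELIB') first looks up the raw value, then the
# while-loop above repeatedly substitutes every $(VAR) reference until none
# remain, yielding '/usr/local/lib/perl5/site_perl'.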


sub read_makefile {

  open(MAKEFILE, 'Makefile') ||
    die "Could not open Makefile for reading: $!\n";

  while (<MAKEFILE>) {
    chomp;
    next unless m/^([A-Z]+)\s*=\s*(\S+)$/;
    $make{$1}=$2;
#    print "Makevar: $1 = $2\n";
  }

  close(MAKEFILE)
}
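# Note (illustrative): the pattern above only matches simple assignments of
# the form 'VARNAME = value' where the name is all capitals and the value
# contains no whitespace, e.g.
#   FULLPERL = /usr/bin/perl
# Variable names containing underscores, or values with embedded spaces,
# are silently skipped.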

__END__
#!/local/bin/perl5 -w
# Perl 5.002 or later. w3mir is mostly tested with perl 5.004
#
# You might want to change or comment out this:
use lib '/hom/janl/lib/perl';
#
# Once upon a long time ago this was Oscar Nierstrasz's
# <[email protected]> htget script.
#
# Retrieves HTML pages, creating local copies in the _current_
# directory. The script will check for the last-modified stamp on the
# document, and will not fetch it if the document isn't changed.
#
# Bug list is in w3mir-README.
#
# Test cases for janl to use:
#   w3mir -r -fs http://www.eff.org/ - infinite recursion!
#   --- but cursory examination seems to indicate confused server...
#   http://java.sun.com/progGuide/index.html check out the img things.
#
# Copyright Holders:
#   Nicolai Langfeldt, [email protected]
#   Gorm Haug Eriksen, [email protected]
#   Chris Szurgot, [email protected]
#   Ed Jordan, [email protected]
#   Alex Knowles, [email protected] aka ark.
# Copying and modification is governed by the "Artistic License" enclosed in
# the w3mir distribution
#
# History (European format date: dd/mm/yy):
# oscar 25/03/94 -- added -s option to send output to stdout
# oscar 28/03/94 -- made HTTP 1.0 the default
# oscar 30/05/94 -- special handling of directory URLs missing a trailing "/"
# gorm  20/02/95 -- added mirror capacity + fixed a couple of bugs
# janl  28/03/95 -- added a working commandline parser.
# janl  18/09/95 -- Changed to use a net http library. Removed dependency on
#                   url.pl.
# janl  19/09/95 -- Extensive rewrite. Simplified a lot, works better.
#                   HTML files are now saved in a new and improved manner,
#                   which means they can be recognized as such w/o fancy
#                   filename extension type rules.
# szurgot 27/01/96 -- Added "Plaintextmode" wrapper to binmode PAGE.
#                   binmode page is required under Win32, but broke modified
#                   checking
#                 -- Minor change: added ; to "# '" strings for Emacs cperl-mode
# szurgot 07/02/96 -- When reading in local file for checking of URLs changed
#                   local ($/) =0; to equal undef;
# janl  08/02/96 -- Added szurgot's changes and changed them :-)
# szurgot 09/02/96 -- Added code to strip /#.*$/ from urls when reading from
#                   local file
#                 -- Added hasAlarm variable to w3http.pl. Set to 1 if you have
#                   alarm(). 0 otherwise.
#                 -- Moved code setting up the valid extensions list into the
#                   args processing where it belonged
# janl  20/02/96 -- Added szurgot changes again.
#                -- Make timeout code work.
#                -- and made another win32 test.
# janl  19/03/96 -- Worked through the code for handling not-modified
#                   documents, it was a bit shabby after htmlop was intro'ed.
# janl  20/03/96 -- -l fix
# janl  23/04/96 -- Added -fs by request (by Rik Faith)
# janl  16/05/96 -- Made -R mandatory, added use and support for
#                   w3http::SAVEBIN
# szurgot 19/05/96 -- Win95 adaptations.
# janl  19/05/96 -- -C did not exactly work as expected. Thanks to Petr
#                   Novak for bug descriptions.
# janl  19/05/96 -- Changed logic for @didntget, @got and so on to use
#                   @queue and %urlstat.
# janl  09/09/96 -- Removed -R switch.
# janl  14/09/96 -- Added ir (initial referer) switch
# janl  21/09/96 -- Made retry code saner. There probably needs to be a
#                   'sleep before retry commences' switch. When no tty is
#                   present it should be fairly long.
# gorm  15/09/96 -- Added cr (check robot) switch. Defaults to 1 (on)
# janl  22/09/96 -- Modified gorm's patch to use WWW::RobotRules. Changed
#                   robot switch to be consistent with current w3mir
#                   practice.
# janl  27/09/96 -- Spelling corrections from [email protected]
#                -- Folded in manual diffs from ark.
# ark   24/09/96 -- Simple facilities to edit the incoming file(s)
# janl  27/09/96 -- Added switch to enable <!--NOMIRROR--> editing and
#                   foolproofed ark's patch a bit.
# janl  02/10/96 -- Added -umask switch.
#                -- Redirected documents did not have a meaningful referer
#                   value (it was undefined).
#                -- Got w3mir into strict discipline, found some typos...
# janl  20/10/96 -- Mtime is preserved
# janl  21/10/96 -- -lc switch added. Mtime preservation works better.
# janl  06/11/96 -- Treat 301 like 302.
# janl  02/12/96 -- Added config file code, fetch/ignore rules, apply
# janl  04/12/96 -- Better checking of config input.
# janl  06/12/96 -- Putting together the URL selection/editing brains.
# janl  07/12/96 -- Checking out some bugs. Adding multiscope options.
# janl  12/12/96 -- Adding to and defeaturing the multiscope options.
# janl  13/12/96 -- Continuing work on multiscope stuff
#                -- Unreferenced file and empty directory removal works.
# janl  19/02/97 -- Can extract urls from adobe acrobat pdf files :-)
#                   Important: It does _not_ edit urls, so they still
#                   point at the original site(s).
# janl  21/02/97 -- Fix -lc bug related to case and the apply things.
#                -- only use SAVEURL if needed
# janl  11/03/97 -- Finish work on SAVEURL conditional.
#                -- Fixed directory removal code.
#                -- parse_args did not abort when unknown option/argument
#                   was specified.
# janl  12/03/97 -- Made test case for -lc. Didn't work. Fixed it. I think.
#                   Realized we have a bug w.r.t. hostname casing.
# janl  13/03/97 -- All redirected-to URLs within scope are now queued.
#                   That should make the mirror more complete, but it won't
#                   help browsability when it comes to the redirected doc.
#                -- Moved robot retrieval to the inside of the mirror loop
#                   since we now possibly mirror several sites.
#                -- Changed 'fetch-options' to 'options'.
#                -- Added 'proxy-options'/-pflush to control proxy server(s).
# janl  09/04/97 -- Started using URI::URL.
# janl  11/04/97 -- Debugging and using URI::URL more correctly in various
#                   places.
# janl  09/05/97 -- Added --agent switch
# janl  12/05/97 -- Simplified scope checks for root URL, changed URL 'apply'
#                   processing.
#                -- Small output formatting fix in the robot rules code.
#                -- Version is now 0.99
# janl  14/05/97 -- htmlop no longer puts '<!DOCTYPE...' into doc, so check
#                   for '<HTML' instead
# janl  11/06/97 -- Made :port optional in server part of auth-domain.
#                   Always removing :80 from server part to match netloc.
# janl  22/07/97 -- More debugging of rewrite for new features -B, -I.
# janl  01/08/97 -- Fixed bug in RE quoting for Ignore/Fetch
# janl  04/08/97 -- s/writepage/write_page/g
# janl  07/09/97 -- 0.99b1 is released
# janl  19/09/97 -- Kaj Hejer discovers omissions in non-html-url-mining code.
#                -- 0.99b2 is released
# janl  24/09/97 -- Matt Chapman found bug in realm-name extraction.
# janl  10/10/97 -- Referer: header suppression suppressed User: header instead
#                -- Added fixup handling, writes .redirs and .referers
#                   (no dot in win32)
#                -- Read .w3mirc (w3mir.ini on win32) if present
#                -- Stop file removal code from removing these files
# janl  16/10/97 -- process_tag was mangling url attributes in tags with more
#                   than one of them. Problem found by Robert L. Binkley
# janl  04/12/97 -- Fixed problem with authentication, misplaced +
#                -- default inter-document pause is 0. I figure it's better
#                   to keep one httpd occupied in a steady stream than to
#                   wait for it to die before we talk to it again.
# janl  13/12/97 -- The code handling arguments to index.html in the form of
#                   index.html/foo was incomplete. To make it complete would
#                   have been hard, so it was removed.
#                -- If a URL changes from file to directory or vice versa
#                   this is now handled.
# janl  11/01/98 -- PDF files with no URLs do not cause warnings now.
#                -- Close REFERERS and REDIRECTS before calling w3mfix
# janl  22/01/98 -- Proxy authentication as outlined by Christian Geuer
# janl  04/02/98 -- Version 1pre1
# janl  18/02/98 -- Fixed wild_re after tip by Prentiss Riddle.
#                -- Version 1pre2
# janl  20/02/98 -- w3http updated to handle complex content-types.
#                -- Fix wild_re more, bug noted by James Dumser
#                -- 1.0pre3
# janl  18/03/98 -- Version 1.0 is released
# janl  09/04/98 -- Added feature so user can disable newline conversion.
# janl  20/04/98 -- Only convert newlines in HTML files. -> 1.0.2
# janl  09/05/98 -- More careful clean_disk code.
#                -- Check if the redirected URL was a root url, if so
#                   issue a warning and exit.
# janl  12/05/98 -- use ->unix_path instead of ->as_string to derive local
#                   filename.
# janl  25/05/98 -- -B didn't work too well.
# janl  09/07/98 -- Redirect to fragment broke us, less broken now -> 1.0.4
# janl  24/09/98 -- Better error messages on errors -> 1.0.5
# janl  21/11/98 -- Fix error messages better.
# janl  05/01/99 -- Drop 'Referer: (commandline)'
# janl  13/04/99 -- Add initial referer to root urls in batch mode.
#
# Variable name discipline:
#  - remote, unmodified URL.  Variables prefixed 'rum_'
#  - local, filesystem.  Variables prefixed 'lf_'.
# Use these prefixes so we know what we're working with at all times.
# Also, URL objects are postfixed _o
#
# The apply rules and scope rules work this way:
# - First apply the user rules to the remote url.
# - Check if the document is within scope after this.
# - Then apply w3mir's rules to the result. The result is the local
#   filesystem name.
#
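# Illustration (the values here are hypothetical): when mirroring
# http://example.org/, the remote URL $rum_url_o might be
# 'http://example.org/docs/a.html' while the corresponding local name
# $lf_name is 'docs/a.html' under the current directory. The user's apply
# rules run on the former before the scope check, and w3mir's own rules
# then produce the latter.
#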
# We use features introduced in 5.002.
require 5.002;

# win32 and $nulldevice need to be globals, other modules use them.
use vars qw($win32 $nulldevice);

# To figure out what kind of system this is
BEGIN {
  use Config;
  $win32 = ( $Config{'osname'} eq 'MSWin32' );
}
# More ways to die:
use Carp;
# Http module:
use w3http;
# html url extraction and manipulation:
use htmlop;
# Extract urls from adobe acrobat pdf files:
use w3pdfuri;
# Date computer:
use HTTP::Date;
# URLs:
use URI::URL;
# For flush method
use FileHandle;

# Full discipline:
use strict;

# Set params in the http package, HTTP protocol version:
$w3http::version="1.0";

# The defaults should be for a robotic http agent on good behaviour.
my $debug=0;            # Debug level
my $verbose=0;          # Verbosity level, -1 = quiet, 0 = normal, 1...
my $pause=0;            # Pause between http requests
my $retryPause=600;     # Pause between retries. 10 minutes.
my $retry=3;            # Max 3 stabs pr. url.
my $r=0;                # Recurse? no recursion = absolutify links
my $remove=0;           # Remove files that are not there?
my $s=0;                # 0: save on disk 1: stdout 2: just forget 'em
my $useauth=0;          # Use authorization
my %authdata;           # Authorization data
my $check_robottxt = 1; # Check robots.txt
my $do_referer = 1;     # Send referer header
my $do_user = 1;        # Send user header
my $cache_header = '';  # The cache-control/pragma: no-cache header
my $using_proxy = 0;    # Using proxy server or not?
my $batch=0;            # Batch get URLs?
my $read_urls=0;        # Get urls from STDIN?
my $abs=0;              # Absolutify URLs?
my $immediate_redir=0;  # Immediately follow a redirect?
my @root_urls;          # This is where we start, the root documents
my @root_dirs;          # The corresponding directories. For remove
my $chdirto='';         # Place to chdir to after reading config file
my %nodelete=();        # Files that should not be deleted
my $numarg=0;           # Number of arguments accepted.

# Fixup related things
my $fixrc='';           # Name of w3mfix config file
my $fixup=1;            # Do things needed to run fixup
my $runfix=0;           # Run w3mfix for user?
my $fixopen=0;          # Fixup files open?

my $indexname='index.html';

my $VERSION;
$VERSION='1.0.8';
$w3http::agent = my $w3mir_agent = "w3mir/$VERSION-1999-05-28";
my $iref='';            # Initial referer. Must evaluate to false

# Derived settings
my $mine_urls=0;        # Mine URLs from documents?
my $process_urls=0;     # Perform (URL) processing of documents?

# Queue of urls to get.
my @rum_queue = ();
my @urls = ();
# URL status map.
my %rum_urlstat = ();
# Status codes:
my $QUEUED   = 0;       # Queued but not gotten yet.
my $TERROR   = 100;     # Transient error, retry later
my $HLERR    = 101;     # Permanent error, give up
my $GOTIT    = 200;     # Gotten. Note similarity to http result code
my $NOTMOD   = 304;     # Not modified.
# Negative codes for nonexistent files, easier to check.
my $NEVERMIND= -1;      # Don't want it
my $REDIR    = -302;    # Does not exist, redirected
my $ENOTFND  = -404;    # Does not exist.
my $OTHERERR = -600;    # Some other error happened
my $FROBOTS  = -601;    # Forbidden by robots.txt rule

# Directory/files survey:
my %lf_file;            # What files are present in FS? Disposition? One of:
my $FILEDEL=0;          # Delete file
my $FILEHERE=1;         # File present in filesystem only
my $FILETHERE=2;        # File present on server too.
my %lf_dir;             # Number of files/dirs in dir. If 0 dir is
                        # eligible for deletion.

my %fiddled=();         # If a file becomes a directory or a directory
                        # becomes a file it is considered fiddled and
                        # w3mir will not fiddle with it again in this
                        # run.

# Bitbucket device, very OS dependent.
$nulldevice='/dev/null';
$nulldevice='nul:' if ($win32);

# What to get, and not.
# Text of user supplied fetch/ignore rules
my $rule_text=" # User defined fetch/ignore rules\n";
# Code ref to the rule procedure
my $rule_code;

# Code to prefix and postfix the generated code. Prefix should make
# $_ contain the url to match. Postfix should return 1, the default
# is to get the url/file.
my $rule_prefix='$rule_code = sub { local($_) = shift;'."\n";
my $rule_postfix=" return 1;\n}";
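
# For illustration (hypothetical rule): a config line such as
#   Ignore: *.gif
# contributes a line like ' return 0 if m/\.gif$/;' to $rule_text, so the
# eval'ed result is roughly
#   $rule_code = sub { local($_) = shift;
#     return 0 if m/\.gif$/;
#     return 1;
#   }
# i.e. anything not explicitly ignored is fetched.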

# Scope tests generated by URL/Also directives in cfg. The scope code
# is just like the rule code, but used for program generated
# fetch/ignore rules related to multiscope retrieval.
my $scope_fetch=" # Automatic fetch rules for multiscope retrieval\n";
my $scope_ignore=" # Automatic ignore rules for multiscope retrieval\n";
my $scope_code;

my $scope_prefix='$scope_code = sub { local($_) = shift;'."\n";
my $scope_postfix=" return 0;\n}";

# Function to apply to urls, see rule comments.
my $user_apply_code;    # User specified apply code
my $apply_code;         # w3mir's apply code
my $apply_prefix='$apply_code = sub { local($_) = @_;'."\n";
my $apply_lc=' $_ = lc $_; ';
my $apply_postfix=' return $_;'."\n}";
my @user_apply;         # List of the user's apply rules.
my @internal_apply;     # List of w3mir's apply rules.

my $infoloss=0;         # 1 if any URL translations (which cause
                        # information loss) are in effect. If this is
                        # true we use the SAVEURL operation.
my $list;               # List url on STDOUT?
my $edit;               # Edit doc? Remove <!--NOMIRROR-->...<!--/NOMIRROR-->
my $header;             # Text to insert in header
my $lc=0;               # Convert urls/filenames to lowercase?
my $fetch=0;            # What to fetch: -1: Some, 0: not modified 1: all
my $convertnl=1;        # Convert newlines?

# Non text/html formats we can extract urls from. Function must take one
# argument: the filename.
my %knownformats = ( 'application/pdf',   \&w3pdfuri::list,
                     'application/x-pdf', \&w3pdfuri::list,
                   );

# Known 'magic numbers' of the known formats. The value is used as a
# key in %knownformats. The key part is an exact match for the string
# beginning at the first byte of the file.
# This should probably be made more flexible, but not until we need it.

my %knownmagic = ( '%PDF-', 'application/pdf' );
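
# Illustration (hypothetical helper, not part of w3mir): content sniffing
# with %knownmagic amounts to comparing the start of a file against each
# key, roughly:
#   sub sniff_type {
#     my($head)=shift;   # first bytes of a file
#     foreach my $magic (keys %knownmagic) {
#       return $knownmagic{$magic}
#         if substr($head,0,length($magic)) eq $magic;
#     }
#     return undef;
#   }
# so a file starting with '%PDF-' would be treated as application/pdf.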

my $iinline='';         # inline RE code to make RE case-insensitive
my $ipost='';           # RE postfix to make it case-insensitive

usage() unless parse_args(@ARGV);

{
  my $w3mirc='.w3mirc';

  $w3mirc='w3mir.ini' if $win32;

  if (-f $w3mirc) {
    parse_cfg_file($w3mirc);
    $nodelete{$w3mirc}=1;
  }
}

# Check arguments and options
if ($#root_urls>=0) {
  # OK
} else {
  print "URLs: $#rum_queue\n";
  usage("No URLs given");
}

# Are we converting newlines today?
$w3http::convert=0 unless $convertnl;

if ($chdirto) {
  &mkdir($chdirto.'/this-is-not-created-odd-or-what');
  chdir($chdirto) ||
    die "w3mir: Can't change working directory to '$chdirto': $!\n";
}

$SIG{'INT'}=sub { print STDERR "\nCaught SIGINT!\n"; exit 1; };
$SIG{'QUIT'}=sub { print STDERR "\nCaught SIGQUIT!\n"; exit 1; };
$SIG{'HUP'}=sub { print STDERR "\nCaught SIGHUP!\n"; exit 1; };

&open_fixup if $fixup;

# Derive how much document processing we should do.
$mine_urls=( $r || $list );
$process_urls=(!$batch && !$edit && !$header);
# $abs can be set explicitly with -abs, and implicitly if not recursing
$abs = 1 unless $r;
print "Absolute references\n" if $abs && $debug;

# Cache-control specified but proxy not in use?
die "w3mir: If you want to control a cache, use a proxy server!\n"
  if ($cache_header && !$using_proxy);

# Compile the second order code

# - The rum scope tests
my $full_rules=$scope_prefix.$scope_fetch.$scope_ignore.$scope_postfix;
# warn "Scope rules:\n-------------\n$full_rules\n---------------\n";
eval $full_rules;

die "w3mir: Program generated rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:\n----\n".
  $full_rules."\n----\n"
    if !defined($scope_code);

$full_rules=$rule_prefix.$rule_text.$rule_postfix;
# warn "Fetch rules:\n-------------\n$full_rules\n---------------\n";
eval $full_rules;

# - The user specified rum tests
die "w3mir: Ignore/Fetch rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:\n----\n".
  $full_rules."\n----\n"
    if !defined($rule_code);

# - The user specified apply rules

my $full_apply=$apply_prefix.($lc?$apply_lc:'').
  join($ipost.";\n",@user_apply).(($#user_apply>=0)?$ipost:"").";\n".
  $apply_postfix;
eval $full_apply;

die "w3mir: User apply rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:
----
".$full_apply."
----\n" if !defined($apply_code);

$user_apply_code=$apply_code;

# - The w3mir generated apply rules

$full_apply=$apply_prefix.($lc?$apply_lc:'').
  join($ipost.";\n",@internal_apply).(($#internal_apply>=0)?$ipost:"").";\n".
  $apply_postfix;
eval $full_apply;

die "Internal apply rules did not compile. The code is:
----
".$full_apply."
----\n" if !defined($apply_code);

# - Information loss via -lc? There are other sources as well.
$infoloss=1 if $lc;

warn "Infoloss is $infoloss\n" if $debug;

# More setup:

$w3http::debug=$debug;

$w3http::verbose=$verbose;

my %rum_referers=();    # Hash of referer lists, key: rum_url
my $Robot_Blob;         # WWW::RobotRules object, decides if rum_url is
                        # forbidden to access for us.
my $rum_url_o;          # rum url, mostly the current, the one we're getting
my %gotrobots;          # Did I get robots.txt from site? key: url->netloc
my($authuser,$authpass);# Username and password for authentication with server
my @rum_newurls;        # List of rum_urls in document

if ($check_robottxt) {
  # Eval is the only way to defer loading of the module until we know
  # it's needed?
  eval 'use WWW::RobotRules;';

  die "Could not load WWW::RobotRules, try -drr switch\n"
    unless defined(&WWW::RobotRules::parse);

  $Robot_Blob = new WWW::RobotRules $w3mir_agent;
}

# We have several main-modes of operation. Here we select one
if ($r) {

  die "w3mir: No URLs? Try 'w3mir -h' for help.\n"
    if $#root_urls==-1;

  warn "Recursive retrieval commencing\n" if $debug;

  die "w3mir: Sorry, you cannot combine -r/recurse with -I/read_urls\n"
    if $read_urls;

  # Recursive
  my $url;
  foreach $url (@root_urls) {
    warn "Root url dequeued: $url\n" if $debug;
    if (want_this($url)) {
      queue($url);
      &add_referer($url,$iref);
    } else {
      die "w3mir: Inconsistent configuration: Specified $url is not inside retrieval scope\n";
    }
  }
  mirror();

} else {
  if ($batch) {
    warn "Batch retrieval commencing\n" if $debug;
    # Batch get
    if ($read_urls) {
      # Get URLs from <STDIN>
      while (<STDIN>) {
        chomp;
        &add_referer($_,$iref);
        batch_get($_);
      }
    } else {
      # Get URLs from commandline
      my $url;
      foreach $url (@root_urls) {
        &add_referer($url,$iref);
      }
      foreach $url (@root_urls) {
        batch_get($url);
      }
    }
  } else {
    warn "Single url retrieval commencing\n" if $debug;

    # A single URL, with all processing on
    die "w3mir: You specified several URLs and not -B/batch\n"
      if $#root_urls>0;
    queue($root_urls[0]);
    &add_referer($root_urls[0],$iref);
    mirror();
  }
}

&close_fixup if $fixup;

# This should clean up files:
&clean_disk if $remove;

warn "w3mir: That's all (".$w3http::xfbytes.'+'.$w3http::headbytes.
  " bytes of it).\n" unless $verbose<0;

if ($runfix) {
  eval 'use Config;';
  warn "Running w3mfix\n";
  if ($win32) {
    system($Config{'perlpath'}." w3mfix $fixrc");
  } else {
    system("w3mfix $fixrc");
  }
}

exit 0;

sub get_document {
  # Get one document by HTTP ($1/rum_url_o). Save in given filename ($2).
  # Possibly returning references found in the document. Caller must
  # set up referer array, check wantedness and everything else. We
  # handle authentication here though.

  my($rum_url_o)=shift;
  my($lf_url)=shift;
  croak("\$rum_url_o is empty") if !defined($rum_url_o) || !$rum_url_o;
  croak("\$lf_url is empty") if !defined($lf_url) || !$lf_url;

  # Make sure it's an object
  $rum_url_o = url $rum_url_o
    unless ref $rum_url_o;

  # Derive a filename from the url, the filename contains no URL-quoting
  my($lf_name) = (url "file:$lf_url")->unix_path;

  # Make all intermediate directories
  &mkdir($lf_name) if $s==0;

  my($rum_as_string) = $rum_url_o->as_string;

  print STDERR "GET_DOCUMENT: '",$rum_as_string,"' -> '",$lf_name,"'\n"
    if $debug;

  my $hostport;         # host:port of the server
  my $www_auth='';      # Value of the WWW-Authenticate reply header
  my $page_ref;         # Reference to the document text
  my @rum_newurls;      # List of URLs extracted
  my $url_extractor;    # Code ref: URL extractor for the content-type
  my $do_query;         # Do query or not?

  if (defined($rum_urlstat{$rum_as_string}) &&
      $rum_urlstat{$rum_as_string}>0) {
    warn "w3mir: Internal error, ".$rum_as_string.
      " queued several times\n";
    next;
  }

  # Goto here if we want to retry b/c of authentication
 try_again:

  # Start building the extra http::query arguments again
  my @EXTRASTUFF=();

  # We'll start by assuming that we're doing the query.
  $do_query=1;

  # If we're not checking the timestamp, or the file does not exist,
  # then we get the file unconditionally. Otherwise we only want it
  # if it's updated.

  if ($fetch==1) {
    # Nothing to do?
  } else {
    if (-f $lf_name) {
      if ($fetch==-1) {
        print STDERR "w3mir: ".($infoloss?$rum_as_string:$lf_name).
          ", already have it" if $verbose>=0;
        if (!$mine_urls) {
          # If -fs and the file exists and we don't need to mine URLs
          # we're finished!
          warn "Already have it, no mining, returning!\n" if $debug;
          print STDERR "\n" if $verbose>=0;
          return;
        }
        $w3http::result=1304;   # Pretend it was 'not modified'
        $do_query=0;
      } else {
        push(@EXTRASTUFF,$w3http::IFMODF,$lf_name);
      }
    }
  }

  if ($do_query) {

    # Does the server want authorization for this file? $www_auth is
    # only set if authentication was requested the first time around.

    # For testing:
    # $www_auth='Basic realm="foo"';

    if ($www_auth) {
      my($authdata,$method,$realm);

      ($method,$realm)= $www_auth =~ m/^(\S+)\s+realm=\"([^\"]+)\"/i;
      $method=lc $method;
      $realm=lc $realm;
      die "w3mir: '$method' authentication needed, don't know that.\n"
        if ($method ne 'basic');

      $hostport = $rum_url_o->netloc;
      $authdata=$authdata{$hostport}{$realm} || $authdata{$hostport}{'*'} ||
        $authdata{'*'}{$realm} || $authdata{'*'}{'*'};

      if ($authdata) {
        push(@EXTRASTUFF,$w3http::AUTHORIZ,$authdata);
      } else {
        print STDERR "w3mir: No authorization data for $hostport/$realm\n";
        $rum_urlstat{$rum_as_string}=$NEVERMIND;
        next;
      }
    }

    push(@EXTRASTUFF,$w3http::FREEHEAD,$cache_header)
      if ($cache_header);

    # Insert referer header data if at all
    push(@EXTRASTUFF,$w3http::REFERER,$rum_referers{$rum_as_string}[0])
      if ($do_referer && exists($rum_referers{$rum_as_string}));

    push(@EXTRASTUFF,$w3http::NOUSER)
      unless ($do_user);

    # YES, $lf_url is right, w3http::query handles this like an url so
    # the quoting must all be in place.
    my $binfile=$lf_url;
    $binfile='-' if $s==1;
    $binfile=$nulldevice if $s==2;

    if ($pause) {
      print STDERR "w3mir: sleeping\n" if $verbose>0;
      sleep($pause);
    }

    print STDERR "w3mir: ".($infoloss?$rum_as_string:$lf_name)
      unless $verbose<0;
    print STDERR "\nFile: $lf_name\n" if $debug;

    &w3http::query($w3http::GETURL,$rum_as_string,
                   $w3http::SAVEBIN,$binfile,
                   @EXTRASTUFF);

    print STDERR "w3http::result: '",$w3http::result,
      "' doc size: ", length($w3http::document),
      " doc type: -",$w3http::headval{'CONTENT-TYPE'},
      "- plaintexthtml: ",$w3http::plaintexthtml,"\n"
        if $debug;

    print "Result: ",$w3http::result," Recurse: $r, html: ",
      $w3http::plaintexthtml,"\n"
        if $debug;

  } # if $do_query

  if ($w3http::result==200) {   # 200 OK
    $rum_urlstat{$rum_as_string}=$GOTIT;

    if ($mine_urls || $process_urls) {

      if ($w3http::plaintexthtml) {
        # Only do URL manipulations if this is a html document with no
        # special content-encoding. We do not handle encodings, yet.

        my $page;

        print STDERR ($process_urls)?", processing":", url mining"
          if $verbose>0;

        print STDERR "\nurl:'$lf_url'\n"
          if $debug;

        print "\nMining URLs: $mine_urls, Process: $process_urls\n"
          if $debug;

        ($page,@rum_newurls) =
          &htmlop::process($w3http::document,
                           # Only get a new document if wanted
                           $process_urls?():($htmlop::NODOC),
                           $htmlop::CANON,
                           $htmlop::ABS,$rum_url_o,
                           # Only list urls if wanted
                           $mine_urls?($htmlop::LIST):(),

                           # If user wants absolute URLs do not
                           # relativize them

                           $abs?
                           ():
                           (
                            $htmlop::TAGCALLBACK,\&process_tag,$lf_url,
                           )
                          );

#       print "URL: ",join("\nURL: ",@rum_newurls),"\n";

        if ($process_urls) {
          $page_ref=\$page;
          $w3http::document='';
        } else {
          $page_ref=\$w3http::document;
        }

      } elsif ($s == 0 &&
               ($url_extractor =
                $knownformats{$w3http::headval{'CONTENT-TYPE'}})) {

        # The knownformats extractors only work on disk files so write
        # doc to disk if not there already (non-html text will not be)
        write_page($lf_name,$w3http::document,1);

        # Now we try our hand at fetching URIs from non-html files.
        print STDERR ", mining URLs" if $verbose>=1;
        @rum_newurls = &$url_extractor($lf_name);
        # warn "URLs from PDF: ",join(', ',@rum_newurls),"\n";
      }

    } # if ($mine_urls || $process_urls)

#   print "page_ref defined: ",defined($page_ref),"\n";
#   print "plaintext: ",$w3http::plaintext,"\n";

    $page_ref=\$w3http::document
      if !defined($page_ref) && $w3http::plaintexthtml;

    if ($w3http::plaintexthtml) {
      # ark: this is where I want to do my changes to the page: strip
      # out the <!--NOMIRROR-->...<!--/NOMIRROR--> stuff.
      $$page_ref=~ s/<(!--)?\s*NO\s*MIRROR\s*(--)?>[^\000]*?<(!--)?\s*\/NO\s*MIRROR\s*(--)?>//g
        if $edit;

      if ($header) {
        # ark: insert a header string at the start of the page
        my $mirrorstr=$header;
        $mirrorstr =~ s/\$url/$rum_as_string/g;
        insert_at_start( $mirrorstr, $page_ref );
      }
    }

    write_page($lf_name,$page_ref,0);

    # print "New urls: ",join("\n",@rum_newurls),"\n";

    return @rum_newurls;
  }

  if ($w3http::result==304 ||   # 304 Not modified
      $w3http::result==1304) {  # 1304 Have it

    {
      # last = out of nesting

      my $rum_urlstat;
      my $rum_newurls;

      @rum_newurls=();

      print STDERR ", not modified"
        if $verbose>=0 && $w3http::result==304;

      $rum_urlstat{$rum_as_string}=$NOTMOD;

      last unless $mine_urls;

      $rum_newurls=get_references($lf_name);

      # print "New urls: ",ref($rum_newurls),"\n";

      if (!ref($rum_newurls)) {
        last;
      } elsif (ref($rum_newurls) eq 'SCALAR') {
        $page_ref=$rum_newurls;
      } elsif (ref($rum_newurls) eq 'ARRAY') {
        @rum_newurls=@$rum_newurls;
        last;
      } else {
        die "\nw3mir: internal error: Unknown return type from get_references\n";
      }

      # Check if it's a html file. I know this tag is in all html
      # files, because I put it there as I pull them in.
      last unless $$page_ref =~ /<HTML/i;

      warn "$lf_name is a html file\n" if $debug;

      # It's a html document
      print STDERR ", mining URLs" if $verbose>=1;

      # This will give us a list of absolute urls
      (undef,@rum_newurls) =
        &htmlop::process($$page_ref,$htmlop::NODOC,
                         $htmlop::ABS,$rum_as_string,
                         $htmlop::USESAVED,'W3MIR',
                         $htmlop::LIST);
    }

    print STDERR "\n" if $verbose>=0;
    return @rum_newurls;
  }

  if ($w3http::result==302 || $w3http::result==301) { # Redirect
    # Cern and NCSA httpd send 302 'redirect' if an ending / is
    # forgotten on a url. More recent httpds send 301 'permanent
    # redirect' in this case. Here we check if the difference in URLs
    # is just a / and if so push the url again with the / added. This
    # code only works if the http server has the right idea about its
    # own name.
    #
    # 18/3/97: Added code to queue redirected-to-URLs that are within
    # the scope of the retrieval.
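    #
    # Illustration (hypothetical URLs): a request for
    #   http://www.example.org/docs
    # typically draws a 301/302 pointing at
    #   http://www.example.org/docs/
    # which differs only by the trailing slash, so the slashed form is
    # simply re-queued; any other redirect target is queued only if it
    # is within the mirror's scope.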
921 my $new_rum_url;
922
923 $rum_urlstat{$rum_as_string}=$REDIR;
924
925 # Absolutify the new url, it might be relative to the requested
926 # document. That's a ugly wart on some servers/admins.
927 $new_rum_url=url $w3http::headval{'location'};
928 $new_rum_url=$new_rum_url->abs($rum_url_o);
929
930 print REDIRS $rum_as_string,' -> ',$new_rum_url->as_string,"\n"
931 if $fixup;
932
933 if ($immediate_redir) {
934 print STDERR " =>> ",$new_rum_url->as_string,", getting that instead\n";
935 return get_document($new_rum_url,$lf_url);
936 }
937
938 # Some redirect to a fragment of another doc...
939 $new_rum_url->frag(undef);
940 $new_rum_url=$new_rum_url->as_string;
941
942 if ($rum_as_string.'/' eq $new_rum_url) {
943 if (grep { $rum_as_string eq $_; } @root_urls) {
944 print STDERR "\nw3mir: missing / in a start URL detected. Please fix commandline/config file.\n";
945 exit(1);
946 }
947 print STDERR ", missing /\n";
948 queue($new_rum_url);
949 # Initialize referer to something meaningful
950 $rum_referers{$new_rum_url}=$rum_referers{$rum_as_string};
951 } else {
952 print STDERR " =>> $new_rum_url";
953 if (want_this($new_rum_url)) {
954 print STDERR ", getting that\n";
955 queue($new_rum_url);
956 $rum_referers{$new_rum_url}=$rum_referers{$rum_as_string};
957 } else {
958 print STDERR ", don't want it\n";
959 }
960 }
961 return ();
962 }
963
964 if ($w3http::result==403 || # Forbidden
965 $w3http::result==404 || # Not found
966 $w3http::result==406 || # Not Acceptable, hmm, belongs here?
967 $w3http::result==410) { # Gone - no forwarding address known
968
969 $rum_urlstat{$rum_as_string}=$ENOTFND;
970 &handleerror;
971 print STDERR "Was referred from: ",
972 join(',',@{$rum_referers{$rum_as_string}}),
973 "\n" if exists($rum_referers{$rum_as_string});
974 return ();
975 }
976
977 if ($w3http::result==407) {
978 # Proxy authentication requested
979 die "Proxy server requests authentication but failed to return the\n".
980 "REQUIRED Proxy-Authenticate header for this condition\n"
981 unless exists($w3http::headval{'proxy-authenticate'});
982
983 die "Proxy authentication is required for ".$w3http::headval{'proxy-authenticate'}."\n";
984 }
985
986 if ($w3http::result==401) {
987 # A www-authenticate reply header should accompany a 401 message.
988 if (!exists($w3http::headval{'www-authenticate'})) {
989 warn "w3mir: Server indicated authentication failure but gave no www-authenticate reply\n";
990 $rum_urlstat{$rum_as_string}=$NEVERMIND;
991 } else {
992 # Unauthorized
993 if ($www_auth) {
994 # Failed when authorization data was supplied.
995 $rum_urlstat{$rum_as_string}=$NEVERMIND;
996 print STDERR ", authorization failed, data needed for ",
997 $w3http::headval{'www-authenticate'},"\n"
998 if ($verbose>=0);
999 } else {
1000 if ($useauth) {
1001 # First time failure, send back and retry at once with some known
1002 # user/passwd.
1003 $www_auth=$w3http::headval{'www-authenticate'};
1004 print STDERR ", retrying with authorization\n" unless $verbose<0;
1005 goto try_again;
1006 } else {
1007 print STDERR ", authorization needed: ",
1008 $w3http::headval{'www-authenticate'},"\n";
1009 $rum_urlstat{$rum_as_string}=$NEVERMIND;
1010 }
1011 }
1012 }
1013 return ();
1014 }
1015
1016 # Something else.
1017 &handleerror;
1018}
1019
1020
1021sub robot_check {
1022 # Check if URL is allowed by robots.txt, if we respect it at all
1023 # that is. Return 1 if allowed, 0 otherwise.
1024
1025 my($rum_url_o)=shift;
1026 my $hostport;
1027
1028 if ($check_robottxt) {
1029
1030 $hostport = $rum_url_o->netloc;
1031 if (!exists($gotrobots{$hostport})) {
1032 # Get robots.txt from the server
1033 $gotrobots{$hostport}=1;
1034 my $robourl="http://$hostport/robots.txt";
1035 print STDERR "w3mir: $robourl" if ($verbose>=0);
1036 &w3http::query($w3http::GETURL,$robourl);
1037 $w3http::document='' if ($w3http::result != 200);
1038 print STDERR ", processing" if $verbose>=1;
1039 print STDERR "\n" if ($verbose>=0);
1040 $Robot_Blob->parse($robourl,$w3http::document);
1041 }
1042
1043 if (!$Robot_Blob->allowed($rum_url_o->as_string)) {
1044 # It is forbidden
1045 $rum_urlstat{$rum_url_o->as_string}=$FROBOTS;
1046 warn "w3mir: ",$rum_url_o->as_string,": forbidden by robots.txt\n";
1047 return 0;
1048 }
1049 }
1050 return 1;
1051}
1052
1053
1054
1055sub batch_get {
1056 # Batch get _one_ document.
1057 my $rum_url=shift;
1058 my $lf_url;
1059
1060 $rum_url_o = url $rum_url;
1061
1062 return unless robot_check($rum_url_o);
1063
1064 ($lf_url=$rum_url) =~ s~.*/~~;
1065 if (!defined($lf_url) || $lf_url eq '') {
1066 ($lf_url=$rum_url) =~ s~/$~~;
1067 $lf_url =~ s~.*/~~;
1068 $lf_url .= "-$indexname";
1069 }
1070
1071 warn "Batch get: $rum_url -> $lf_url\n" if $debug;
1072
1073 $immediate_redir=1; # Do follow redirects immediately
1074
1075 get_document($rum_url,$lf_url);
1076}
1077
1078
1079
1080sub mirror {
1081 # Mirror (or get) the requested url(s). Possibly recursively.
1082 # Working from whatever cwd is at invocation we'll retrieve all
1083 # files under it in the file hierarchy.
1084
1085 my $rum_url; # URL of the document we're getting now, defined at main level
1086 my $lf_url; # rum_url after the apply rules have been run
1087 my $new_lf_url;
1088 my @new_rum_urls;
1089 my $rum_ref;
1090
1091 while (defined($rum_url = pop(@rum_queue))) {
1092
1093 warn "mirror: Popped $rum_url from queue\n" if $debug;
1094
1095 # Unwanted URLs should not be queued
1096 die "Found url $rum_url that I don't want in queue!\n"
1097 unless defined($lf_url=apply($rum_url));
1098
1099 $rum_url_o = url $rum_url;
1100
1101 next unless robot_check($rum_url_o);
1102
1103 # Figure out the filename for our local filesystem.
1104 $lf_url.=$indexname if $lf_url =~ m~/$~ || $lf_url eq '';
1105
1106 @new_rum_urls = get_document($rum_url_o,$lf_url);
1107
1108 print join("\n",@new_rum_urls),"\n" if ($list);
1109
1110 if ($r) {
1111 foreach $rum_ref (@new_rum_urls) {
1112 # warn "Recursive url: $rum_ref\n";
1113 $new_lf_url=apply($rum_ref);
1114 next unless $new_lf_url;
1115
1116 # warn "Want it\n";
1117 $rum_ref =~ s/\#.*$//s; # Clip off section marks
1118
1119 add_referer($rum_ref,$rum_url_o->as_string);
1120 queue($rum_ref);
1121 }
1122 }
1123
1124 @new_rum_urls=();
1125
1126 # Is the URL queue empty? Are there outstanding retries? Refill
1127 # the queue from the retry list.
1128 if ($#rum_queue<0 && $retry-->0) {
1129 foreach $rum_url_o (keys %rum_urlstat) {
1130 $rum_url_o = url $rum_url_o;
1131 if ($rum_urlstat{$rum_url_o->as_string}==100) {
1132 push(@rum_queue,$rum_url_o->as_string);
1133 $rum_urlstat{$rum_url_o->as_string}=0;
1134 }
1135 }
1136 if ($#rum_queue>=0) {
1137 warn "w3mir: Sleeping before retrying. $retry more times left\n"
1138 if $verbose>=0;
1139 sleep($retryPause);
1140 }
1141 }
1142
1143 }
1144}
1145
1146
1147sub get_references {
1148 # Get references from a non-html-on-disk file. Return references if
1149 # we know how to find them. Return a reference to the complete page
1150 # if it's HTML. Return a numeric 0 if the format is unknown.
1151
1152 my($lf_url)=shift;
1153 my($urlextractor)=shift;
1154
1155 my $read; # Buffer of stuff read from file to check filetype
1156 my $magic;
1157 my $url_extractor;
1158 my $rum_ref;
1159 my $page;
1160
1161 warn "w3mir: Looking at local $lf_url\n" if $debug;
1162
1163 # Open the file and read the first 10 kilobytes for file-type-test
1164 # purposes.
1165 if (!open(TMPF,$lf_url)) {
1166 warn "Unable to open $lf_url for reading: $!\n";
1167 return 0; # 'last' here was a bug: we are not inside a loop
1168 }
1169
1170 $page=' 'x10240;
1171 $read=sysread(TMPF,$page,length($page),0);
1172 close(TMPF);
1173
1174 die "Error reading $lf_url: $!\n" if (!defined($read));
1175
1176 if (!defined($url_extractor)) {
1177 $url_extractor=0;
1178
1179 # Check file against list of magic numbers.
1180 foreach $magic (keys %knownmagic) {
1181 if (substr($page,0,length($magic)) eq $magic) {
1182 $url_extractor = $knownformats{$knownmagic{$magic}};
1183 last;
1184 }
1185 }
1186 }
1187
1188 # Found an extraction method; apply it.
1189 if ($url_extractor) {
1190 print STDERR ", mining URLs" if $verbose>=1;
1191 return [&$url_extractor($lf_url)];
1192 }
1193
1194 if ($page =~ /<HTML/i) {
1195 open(TMPF,$lf_url) ||
1196 die "Could not open $lf_url for reading: $!\n";
1197 # read the whole file.
1198 local($/)=undef;
1199 $page = <TMPF>;
1200 close(TMPF);
1201 return \$page;
1202 }
1203
1204 return 0;
1205}
1206
1207
1208sub open_fixup {
1209 # Open the referers and redirects files
1210
1211 my $reffile='.referers';
1212 my $redirfile='.redirs';
1213
1214 if ($win32) {
1215 $reffile="referers";
1216 $redirfile="redirs";
1217 }
1218
1219 $nodelete{$reffile} = $nodelete{$redirfile} = 1;
1220
1221 open(REDIRS,"> $redirfile") ||
1222 die "Could not open $redirfile for writing: $!\n";
1223
1224 autoflush REDIRS 1;
1225
1226 open(REFERERS,"> $reffile") ||
1227 die "Could not open $reffile for writing: $!\n";
1228
1229 $fixopen=1;
1230 eval 'END { close_fixup; 0; }';
1231}
1232
1233
1234sub close_fixup {
1235 # Close the fixup data files. In the case of the referer file, also
1236 # write out its entire contents.
1237
1238 return unless $fixopen;
1239
1240 my $referer;
1241
1242 foreach $referer (keys %rum_referers) {
1243 print REFERERS $referer," <- ",join(' ',@{$rum_referers{$referer}}),"\n";
1244 }
1245
1246 close(REFERERS) || warn "Error closing referers file: $!\n";
1247 close(REDIRS) || warn "Error closing redirects file: $!\n";
1248 $fixopen=0;
1249}
1250
1251
1252sub clean_disk {
1253 # This procedure removes files that are not present on the server(s)
1254 # anymore.
1255
1256 # - To avoid removing files that were not fetched due to network
1257 # problems we only do blanket removal IFF all documents were
1258 # fetched w/o problems, eventually.
1259 # - In any case we can remove files the server said were not found
1260
1261 # The strategy has three main parts:
1262 # 1. Find all files we have
1263 # 2. Find what files we ought to have
1264 # 3. Remove the difference
1265
1266 my $complete_retrival=1; # Flag: true iff all documents were fetched
1267 my $urlstat; # Tmp storage
1268 my $rum_url;
1269 my $lf_url;
1270 my $lf_dir;
1271 my $dirs_to_remove;
1272
1273 # For fileremoval code
1274 eval "use File::Find;" unless defined(&find);
1275
1276 die "w3mir: Could not load File::Find module. Don't use -R switch.\n"
1277 unless defined(&find);
1278
1279 # This to shut up -w
1280 $lf_dir=$File::Find::dir;
1281
1282 # ***** 1. Find out what files we have *****
1283 #
1284 # This does two things: For each file or directory found:
1285 # - Increases entry count for the container directory
1286 # - If it's a file: $lf_file{relative_path}=$FILEHERE;
1287
1288 chop(@root_dirs);
1289 print STDERR "Looking in: ",join(", ",@root_dirs),"\n" if $debug;
1290
1291 find(\&find_files,@root_dirs);
1292
1293 # ***** 2. Find out what files we ought to have *****
1294 #
1295 # First we loop over %rum_urlstat to determine what files are not
1296 # present on the server(s).
1297 foreach $rum_url (keys %rum_urlstat) {
1298 # Figure out name of local file from rum_url
1299 next unless defined($lf_url=apply($rum_url));
1300
1301 $lf_url.=$indexname if $lf_url =~ m~/$~ || $lf_url eq '';
1302
1303 # find() prefixes paths with ./, so we must too.
1304 $lf_url="./".$lf_url unless substr($lf_url,0,1) eq '/';
1305
1306 # Ignore if file does not exist here.
1307 next unless exists($lf_file{$lf_url});
1308
1309 # The apply rules can map several remote files to same local
1310 # file. If we decided to keep file already we stay with that.
1311 next if $lf_file{$lf_url}==$FILETHERE;
1312
1313 $urlstat=$rum_urlstat{$rum_url};
1314
1315 # Figure out the status code.
1316 if ($urlstat==$GOTIT || $urlstat==$NOTMOD) {
1317 # Present on server. Keep.
1318 $lf_file{$lf_url}=$FILETHERE;
1319 next;
1320 } elsif ($urlstat==$ENOTFND || $urlstat==$NEVERMIND ) {
1321 # One of: not on server, can't get, don't want, access forbidden:
1322 # Schedule for removal.
1323 $lf_file{$lf_url}=$FILEDEL if exists($lf_file{$lf_url});
1324 next;
1325 } elsif ($urlstat==$OTHERERR || $urlstat==$TERROR) {
1326 # Some error occurred transferring.
1327 $complete_retrival=0; # The retrieval was not complete. Delete less
1328 } elsif ($urlstat==$QUEUED) {
1329 warn "w3mir: Internal inconsistency, $rum_url marked as queued after retrieval terminated\n";
1330 $complete_retrival=0; # Fishy. Be conservative about removing
1331 } else {
1332 $complete_retrival=0;
1333 warn "w3mir: Warning: $rum_url is marked as $urlstat.\n".
1334 "w3mir: Please report to w3mir-core\@usit.uio.no.\n";
1335 }
1336 } # foreach %rum_urlstat
1337
1338 # ***** 3. Remove the difference *****
1339
1340 # Loop over all found files:
1341 # - Should we have this file?
1342 # - If not: Remove file and decrease directory entry count
1343 # Loop as long as there are directories with 0 entry count:
1344 # - Loop over all directories with 0 entry count:
1345 # - Remove directory
1346 # - Decrease entry count of parent
1347
1348 warn "w3mir: Some error occurred, conservative file removal\n"
1349 if !$complete_retrival && $verbose>=0;
1350
1351 # Remove all files we don't want removed from list of files present:
1352 foreach $lf_url (keys %nodelete) {
1353 print STDERR "Not deleting: $lf_url\n" if $verbose>=1;
1354 delete $lf_file{$lf_url} || delete $lf_file{'./'.$lf_url};
1355 }
1356
1357 # Remove files
1358 foreach $lf_url (keys %lf_file) {
1359 if (($complete_retrival && $lf_file{$lf_url}==$FILEHERE) ||
1360 ($lf_file{$lf_url} == $FILEDEL)) {
1361 if (unlink $lf_url) {
1362 ($lf_dir)= $lf_url =~ m~^(.+)/~;
1363 $lf_dir{$lf_dir}--;
1364 $dirs_to_remove=1 if ($lf_dir{$lf_dir}==0);
1365 warn "w3mir: removed file $lf_url\n" if $verbose>=0;
1366 } else {
1367 warn "w3mir: removal of file $lf_url failed: $!\n";
1368 }
1369 }
1370 }
1371
1372 # Remove empty directories
1373 while ($dirs_to_remove) {
1374 $dirs_to_remove=0;
1375 foreach $lf_url (keys %lf_dir) {
1376 next if $lf_url eq '.';
1377 if ($lf_dir{$lf_url}==0) {
1378 if (rmdir($lf_url)) {
1379 warn "w3mir: removed directory $lf_url\n" if $verbose>=0;
1380 delete $lf_dir{$lf_url};
1381 ($lf_dir)= $lf_url =~ m~^(.+)/~;
1382 $lf_dir{$lf_dir}--;
1383 $dirs_to_remove=1 if ($lf_dir{$lf_dir}==0);
1384 } else {
1385 warn "w3mir: removal of directory $lf_url failed: $!\n";
1386 }
1387 }
1388 }
1389 }
1390}
1391
1392
1393sub find_files {
1394 # This is called by the find procedure for every file/dir found.
1395
1396 # This builds two hashes:
1397 # lf_file{<file>}: 1: file exists
1398 # lf_dir{<dir>): Number of files in directory.
1399
1400 lstat($_);
1401
1402 $lf_dir{$File::Find::dir}++;
1403
1404 if (-f _) {
1405 $lf_file{$File::Find::name}=$FILEHERE;
1406 } elsif (-d _) {
1407 # null
1408 # Bug: If an empty directory exists it will not be removed
1409 } else {
1410 warn "w3mir: File $File::Find::name has unknown type. Ignoring.\n";
1411 }
1412 return 0;
1413
1414}
1415
1416
1417sub handleerror {
1418 # Handle error status of last http connection, will set the rum_urlstat
1419 # appropriately and print an error message.
1420
1421 my $msg;
1422
1423 if ($verbose<0) {
1424 $msg="w3mir: ".$rum_url_o->as_string.": ";
1425 } else {
1426 $msg=": ";
1427 }
1428
1429 if ($w3http::result == 98) {
1430 # OS/Network error
1431 $msg .= "$!";
1432 $rum_urlstat{$rum_url_o->as_string}=$OTHERERR;
1433 } elsif ($w3http::result == 100) {
1434 # Some kind of error connecting or sending request
1435 $msg .= $w3http::restext || "Timeout";
1436 $rum_urlstat{$rum_url_o->as_string}=$TERROR;
1437 } else {
1438 # Other HTTP error
1439 $rum_urlstat{$rum_url_o->as_string}=$OTHERERR;
1440 $msg .= " ".$w3http::result." ".$w3http::restext;
1441 $msg .= " =>> ".$w3http::headval{'location'}
1442 if (defined($w3http::headval{'location'}));
1443 }
1444 print STDERR "$msg\n";
1445}
1446
1447
1448sub queue {
1449 # Queue given url if appropriate and create a status entry for it
1450 my($rum_url_o)=url $_[0];
1451
1452 croak("BUG: undefined \$rum_url_o")
1453 if !defined($rum_url_o);
1454
1455 croak("BUG: undefined \$rum_url_o->as_string")
1456 if !defined($rum_url_o->as_string);
1457
1458 croak("BUG: ".$rum_url_o->as_string." (fragment) queued")
1459 if $rum_url_o->as_string =~ /\#/;
1460
1461 return if exists($rum_urlstat{$rum_url_o->as_string});
1462 return unless want_this($rum_url_o->as_string);
1463
1464 warn "QUEUED: ",$rum_url_o->as_string,"\n" if $debug;
1465
1466 # Note lack of scope checks.
1467 $rum_urlstat{$rum_url_o->as_string}=$QUEUED;
1468 push(@rum_queue,$rum_url_o->as_string);
1469}
1470
1471
1472sub root_queue {
1473 # Queue function for root urls and directories. One or the other might
1474 # be boolean false, in that case, don't queue it.
1475
1476 my $root_url_o;
1477
1478 my($root_url)=shift;
1479 my($root_dir)=shift;
1480
1481 die "w3mir: No fragments in start URLs :".$root_url."\n"
1482 if $root_url =~ /\#/;
1483
1484 if ($root_dir) {
1485 print "Root dir: $root_dir\n" if $debug;
1486 $root_dir="./$root_dir" unless substr($root_dir,0,1) eq '/' or
1487 substr($root_dir,0,2) eq './';
1488 push(@root_dirs,$root_dir);
1489 }
1490
1491
1492 if ($root_url) {
1493 $root_url_o=url $root_url;
1494
1495 # URL canonification, or what we do of it at least.
1496 $root_url_o->host($root_url_o->host);
1497
1498 warn "Root queue: ".$root_url_o->as_string."\n" if $debug;
1499
1500 push(@root_urls,$root_url_o->as_string);
1501
1502 return $root_url_o;
1503 }
1504
1505}
1506
1507
1508sub write_page {
1509 # write a retrieved page to wherever it's supposed to be written.
1510 # Added difficulty: all files but plaintext files have already been
1511 # written to disk in w3http.
1512
1513 # $s == 0 save to disk
1514 # $s == 1 dump to stdout
1515 # $s == 2 forget
1516
1517 my($lf_name,$page_ref,$silent) = @_;
1518 my($verb);
1519
1520 if ($silent) {
1521 $verb=-1;
1522 } else {
1523 $verb=$verbose;
1524 }
1525
1526# confess("\n\$page_ref undefined") if !defined($page_ref);
1527
1528 if ($w3http::plaintexthtml) {
1529 # I have it in memory
1530 if ($s==0) {
1531 print STDERR ", saving" if $verb>0;
1532
1533 while (-d $lf_name) {
1534 # This will run once, maybe twice; $fiddled will be changed the
1535 # first time
1536 if (exists($fiddled{$lf_name})) {
1537 warn "Cannot save $lf_name, there is a directory in the way\n";
1538 return;
1539 }
1540
1541 $fiddled{$lf_name}=1;
1542
1543 rm_rf($lf_name);
1544 print STDERR "w3mir: $lf_name" if $verbose>=0;
1545 }
1546
1547 if (!open(PAGE,">$lf_name")) {
1548 warn "\nw3mir: can't open $lf_name for writing: $!\n";
1549 return;
1550 }
1551 if (!$convertnl) {
1552 binmode PAGE;
1553 warn "BINMODE\n" if $debug;
1554 }
1555 if ($$page_ref ne '') {
1556 print PAGE $$page_ref or die "w3mir: Error writing $lf_name: $!\n";
1557 }
1558 close(PAGE) || die "w3mir: Error closing $lf_name: $!\n";
1559 print STDERR ": ", length($$page_ref), " bytes\n"
1560 if $verb>=0;
1561 setmtime($lf_name,$w3http::headval{'last-modified'})
1562 if exists($w3http::headval{'last-modified'});
1563 } elsif ($s==1) {
1564 print $$page_ref ;
1565 } elsif ($s==2) {
1566 print STDERR ", got and forgot it.\n" unless $verb<0;
1567 }
1568 } else {
1569 # Already written by http module, just emit a message if wanted
1570 if ($s==0) {
1571 print STDERR ": ",$w3http::doclen," bytes\n"
1572 if $verb>=0;
1573 setmtime($lf_name,$w3http::headval{'last-modified'})
1574 if exists($w3http::headval{'last-modified'});
1575 } elsif ($s==2) {
1576 print STDERR ", got and forgot it.\n" if $verb>=0;
1577 }
1578 }
1579}
1580
1581
1582sub setmtime {
1583 # Set mtime of the given file
1584 my($file,$time)=@_;
1585 my($tm_sec,$tm_min,$tm_hour,$tm_mday,$tm_mon,$tm_year,$tm_wday,$tm_yday,
1586 $tm_isdst,$tics);
1587
1588 $tm_isdst=0;
1589 $tm_yday=-1;
1590
1591 carp("\$time is undefined"),return if !defined($time);
1592
1593 $tics=str2time($time);
1594 utime(time, $tics, $file) ||
1595 warn "Could not change mtime of $file: $!\n";
1596}
1597
1598
1599sub movefile {
1600 # Rename a file. Note that copy is not a good alternative, since
1601 # copying over NFS is something we want to Avoid.
1602
1603 # Returns 0 on failure and 1 on success.
1604
1605 (my $old,my $new) = @_;
1606
1607 # Remove anything that might have the name already.
1608 if (-d $new) {
1609 print STDERR "\n" if $verbose>=0;
1610 rm_rf($new);
1611 $fiddled{$new}=1;
1612 print STDERR "w3mir: $new" if $verbose>=0;
1613 } elsif (-e $new) {
1614 $fiddled{$new}=1;
1615 if (unlink($new)) {
1616 print STDERR "\nw3mir: removed $new\nw3mir: $new"
1617 if $verbose>=0;
1618 } else {
1619 return 0;
1620 }
1621
1622 }
1623
1624 if ($new ne '-' && $new ne $nulldevice) {
1625 warn "MOVING $old -> $new\n" if $debug;
1626 rename($old,$new) ||
1627 do { warn "Could not rename $old to $new: $!\n"; return 0; };
1628 }
1629 return 1;
1630}
1631
1632
1633sub mkdir {
1634 # Make all intermediate directories needed for a file, the file name
1635 # is expected to be included in the argument!
1636
1637 # Reasons for not using File::Path::mkpath:
1638 # - I already wrote this.
1639 # - I get to produce as good and precise error messages as
1640 # unix and perl will allow me. mkpath will not.
1641 # - It's easier to find out if it worked or not.
1642
1643 my($file) = @_;
1644 my(@dirs) = split("/",$file);
1645 my $path;
1646 my $dir;
1647 my $moved=0;
1648
1649 if (!$dirs[0]) {
1650 shift @dirs;
1651 $path='';
1652 } else {
1653 $path = '.';
1654 }
1655
1656 # This removes the last element of the array, it's meant to shave
1657 # off the file name leaving only the directory name, as a
1658 # convenience, for the caller.
1659 pop @dirs;
1660 foreach $dir (@dirs) {
1661 $path .= "/$dir";
1662 stat($path);
1663 # only make if it isn't already there
1664 next if -d _;
1665
1666 while (!-d _) {
1667 if (exists($fiddled{$path})) {
1668 warn "Cannot make directory $path, there is a file in the way.\n";
1669 return;
1670 }
1671
1672 $fiddled{$path}=1;
1673
1674 if (!-e _) {
1675 mkdir($path,0777);
1676 last;
1677 }
1678
1679 if (unlink($path)) {
1680 warn "w3mir: removed file $path\n" if $verbose>=0;
1681 } else {
1682 warn "Unable to remove $path: $!\n";
1683 next;
1684 }
1685
1686 warn "mkdir $path\n" if $debug;
1687 mkdir($path,0777) ||
1688 warn "Unable to create directory $path: $!\n";
1689
1690 stat($path);
1691 }
1692 }
1693}
1694
1695
1696sub add_referer {
1697 # Add a referer to the list of referers of a document. Unless it's
1698 # already there.
1699 # Don't mail me if you (only) think this is a bit of a tongue twister:
1700
1701 # Don't remember referers if BOTH fixup and referer header is disabled.
1702 return if $fixup==0 && $do_referer==0;
1703
1704 my($rum_referee,$rum_referer) = @_ ;
1705 my $re_rum_referer;
1706
1707 if (exists($rum_referers{$rum_referee})) {
1708 $re_rum_referer=quotemeta $rum_referer;
1709 if (!grep(m/^$re_rum_referer$/,@{$rum_referers{$rum_referee}})) {
1710 push(@{$rum_referers{$rum_referee}},$rum_referer);
1711 # warn "$rum_referee <- $rum_referer pushed\n";
1712 } else {
1713 # warn "$rum_referee <- $rum_referer NOT pushed\n";
1714 }
1715 } else {
1716 $rum_referers{$rum_referee}=[$rum_referer];
1717 # warn "$rum_referee <- $rum_referer pushed\n";
1718 }
1719}
1720
1721
1722sub user_apply {
1723 # Apply the user apply rules
1724
1725 return &$user_apply_code(shift);
1726
1727# Debug version:
1728# my ($foo,$bar);
1729# $foo=shift;
1730# $bar=&$apply_code($foo);
1731# print STDERR "Apply: $foo -> $bar\n";
1732# return $bar;
1733}
1734
1735sub internal_apply {
1736 # Apply the w3mir generated apply rules
1737
1738 return &$apply_code(shift);
1739}
1740
1741
1742sub apply {
1743 # Apply the user apply rules. Then if URL is wanted return result of
1744 # w3mir apply rules. Return the undefined value otherwise.
1745
1746 my $url = user_apply(shift);
1747
1748 return undef unless want_this($url);
1749
1750 internal_apply($url);
1751}
1752
1753
1754sub want_this {
1755 # Find out if we want the url passed. Just pass it on to the
1756 # generated functions.
1757 my($rum_url)=shift;
1758
1759 # What about robot rules?
1760
1761 # Does scope rule want this?
1762 return &$scope_code($rum_url) &&
1763 # Does user rule want this too?
1764 &$rule_code($rum_url)
1765
1766}
1767
1768
1769sub process_tag {
1770 # Process a tag in html file
1771 my $lf_referer = shift; # User argument
1772 my $base_url = shift; # Not used... why not?
1773 my $tag_name = shift;
1774 my $url_attrs = shift;
1775
1776 # Return quickly if there are no URL attributes
1777 return unless defined($url_attrs);
1778
1779 my $attrs = shift;
1780
1781 my $rum_url; # The absolute URL
1782 my $lf_url; # The local filesystem url
1783 my $lf_url_o; # ... and its object
1784 my $key;
1785
1786 print STDERR "\nProcess Tag: $tag_name, URL attributes: ",
1787 join(', ',@{$url_attrs}),"\nbase_url: ",$base_url,"\nlf_referer: ",
1788 $lf_referer,"\n"
1789 if $debug>2;
1790
1791 $lf_referer =~ s~^/~~;
1792 $lf_referer = "file:/$lf_referer";
1793
1794 foreach $key (@{$url_attrs}) {
1795 if (defined($$attrs{$key})) {
1796 $rum_url=$$attrs{$key};
1797 printf STDERR "$key = $rum_url\n" if $debug;
1798 $lf_url=apply($rum_url);
1799 if (defined($lf_url)) {
1800
1801 printf STDERR "Transformed to $lf_url\n" if $debug>2;
1802
1803 $lf_url =~ s~^/~~; # Remove leading / to avoid doubling
1804 $lf_url_o=url "file:/$lf_url";
1805
1806 # Save new value in the hash
1807 $$attrs{$key}=($lf_url_o->rel($lf_referer))->as_string;
1808 print STDERR "New value: ",$$attrs{$key},"\n" if $debug>2;
1809
1810 # If there is potential information loss save the old value too
1811 $$attrs{"W3MIR".$key}=$rum_url if $infoloss;
1812 }
1813 }
1814 }
1815}
1816
1817
1818sub version {
1819 eval 'require LWP;';
1820 print $w3mir_agent,"\n";
1821 print "LWP version ",$LWP::VERSION,"\n" if defined $LWP::VERSION;
1822 print "Perl version: ",$],"\n";
1823 exit(0);
1824}
1825
1826
1827sub parse_args {
1828 my $f;
1829 my $i;
1830
1831 $i=0;
1832
1833 while ($f=shift) {
1834 $i++;
1835 $numarg++;
1836 # This is a demonstration against Getopt::Long.
1837 if ($f =~ s/^-+//) {
1838 $s=1,next if $f eq 's'; # Stdout
1839 $r=1,next if $f eq 'r'; # Recurse
1840 $fetch=1,next if $f eq 'fa'; # Fetch all, no date test
1841 $fetch=-1,next if $f eq 'fs'; # Fetch those we don't already have.
1842 $verbose=-1,next if $f eq 'q'; # Quiet
1843 $verbose=1,next if $f eq 'c'; # Chatty
1844 &version,next if $f eq 'v'; # Version
1845 $pause=shift,next if $f eq 'p'; # Pause between requests
1846 $retryPause=shift,next if $f eq 'rp'; # Pause between retries.
1847 $s=2,$convertnl=0,next if $f eq 'f'; # Forget
1848 $retry=shift,next if $f eq 't'; # reTry
1849 $list=1,next if $f eq 'l'; # List urls
1850 $iref=shift,next if $f eq 'ir'; # Initial referer
1851 $check_robottxt = 0,next if $f eq 'drr'; # Disable robots.txt rules.
1852 umask(oct(shift)),next if $f eq 'umask';
1853 parse_cfg_file(shift),next if $f eq 'cfgfile';
1854 usage(),exit 0 if ($f eq 'help' || $f eq 'h' || $f eq '?');
1855 $remove=1,next if $f eq 'R';
1856 $cache_header = 'Pragma: no-cache',next if $f eq 'pflush';
1857 $w3http::agent=$w3mir_agent=shift,next if $f eq 'agent';
1858 $abs=1,next if $f eq 'abs';
1859 $convertnl=0,$batch=1,next if $f eq 'B';
1860 $read_urls = 1,next if $f eq 'I';
1861 $convertnl=0,next if $f eq 'nnc';
1862
1863 if ($f eq 'lc') {
1864 if ($i == 1) {
1865 $lc=1;
1866 $iinline=($lc?"(?i)":"");
1867 $ipost=($lc?"i":"");
1868 next;
1869 } else {
1870 die "w3mir: -lc must be the first argument on the commandline.\n";
1871 }
1872 }
1873
1874 if ($f eq 'P') { # Proxy
1875 ($w3http::proxyserver,$w3http::proxyport)=
1876 shift =~ /([^:]+):?(\d+)?/;
1877 $w3http::proxyport=80 unless $w3http::proxyport;
1878 $using_proxy=1;
1879 next;
1880 }
1881
1882 if ($f eq 'd') { # Debugging level
1883 $f=shift;
1884 unless (($debug = $f) > 0) {
1885 die "w3mir: debug level must be a number greater than zero.\n";
1886 }
1887 next;
1888 }
1889
1890 # Those were all the options...
1891 warn "w3mir: Unknown option: -$f. Use -h for usage info.\n";
1892 exit(1);
1893
1894 } elsif ($f =~ /^http:/) {
1895 my ($rum_url_o,$rum_reurl,$rum_rebase,$server);
1896
1897 $rum_url_o=root_queue($f,'./');
1898
1899 $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );
1900
1901 push(@internal_apply,"s/^".$rum_rebase."//");
1902 $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";
1903 $scope_ignore.="return 0 if m/^".
1904 quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";
1905
1906 } else {
1907 # If we get this far then the commandline is broken
1908 warn "Unknown commandline argument: $f. Use -h for usage info.\n";
1909 $numarg--;
1910 exit(1);
1911 }
1912 }
1913 return 1;
1914}
1915
1916
1917sub parse_cfg_file {
1918 # Read the configuration file. Aborts on errors. Not good to
1919 # mirror something using the wrong config.
1920
1921 my ( $file ) = @_ ;
1922 my ($key, $value, $authserver,$authrealm,$authuser,$authpasswd);
1923 my $i;
1924
1925 die "w3mir: config file $file is not a file.\n" unless -f $file;
1926 open(CFGF, $file) || die "Could not open config file $file: $!\n";
1927
1928 $i=0;
1929
1930 while (<CFGF>) {
1931 # Trim off various junk
1932 chomp;
1933 s/^#.*//;
1934 s/^\s+|\s$//g;
1935 # Anything left?
1936 next if $_ eq '';
1937 # Examine remains
1938 $i++;
1939 $numarg++;
1940
1941 ($key, $value) = split(/\s*:\s*/,$_,2);
1942 $key = lc $key;
1943
1944 $iref=$value,next if ( $key eq 'initial-referer' );
1945 $header=$value,next if ( $key eq 'header' );
1946 $pause=numeric($value),next if ( $key eq 'pause' );
1947 $retryPause=numeric($value),next if ( $key eq 'retry-pause' );
1948 $debug=numeric($value),next if ( $key eq 'debug' );
1949 $retry=numeric($value),next if ( $key eq 'retries' );
1950 umask(numeric($value)),next if ( $key eq 'umask' );
1951 $check_robottxt=boolean($value),next if ( $key eq 'robot-rules' );
1952 $edit=boolean($value),next if ($key eq 'remove-nomirror');
1953 $indexname=$value,next if ($key eq 'index-name');
1954 $s=nway($value,'save','stdout','forget'),next
1955 if ( $key eq 'file-disposition' );
1956 $verbose=nway($value,'quiet','brief','chatty')-1,next
1957 if ( $key eq 'verbosity' );
1958 $w3http::proxyuser=$value,next if $key eq 'http-proxy-user';
1959 $w3http::proxypasswd=$value,next if $key eq 'http-proxy-passwd';
1960
1961 if ( $key eq 'cd' ) {
1962 $chdirto=$value;
1963 warn "Use of 'cd' is discouraged\n" unless $verbose==-1;
1964 next;
1965 }
1966
1967 if ($key eq 'http-proxy') {
1968 ($w3http::proxyserver,$w3http::proxyport)=
1969 $value =~ /([^:]+):?(\d+)?/;
1970 $w3http::proxyport=80 unless $w3http::proxyport;
1971 $using_proxy=1;
1972 next;
1973 }
1974
1975 if ($key eq 'proxy-options') {
1976 my($val,$nval,@popts,$pragma);
1977 $pragma=1;
1978 foreach $val (split(/\s*,\s*/,lc $value)) { # was /\s*,\*/, which never split on commas
1979 $nval=nway($val,'no-pragma','revalidate','refresh','no-store',);
1980 # Force use of Cache-control: header
1981 $pragma=0 if ($nval==0);
1982 # use to force proxy to revalidate
1983 $pragma=0,push(@popts,'max-age=0') if ($nval==1);
1984 # use to force proxy to refresh
1985 push(@popts,'no-cache') if ($nval==2);
1986 # use if information transfered is sensitive
1987 $pragma=0,push(@popts,'no-store') if ($nval==3);
1988 }
1989 $cache_header=($pragma?'Pragma: ':'Cache-control: ').join(', ',@popts);
1990 next;
1991 }
1992
1993
    if ($key eq 'url') {
      my ($rum_url_o,$lf_dir,$rum_reurl,$rum_rebase);

      # A two argument URL: line?
      if ($value =~ m/^(.+)\s+(.+)/i) {
        # Two arguments.
        # The last is a directory, it must end in /
        $lf_dir=$2;
        $lf_dir.='/' unless $lf_dir =~ m~/$~;

        $rum_url_o=root_queue($1,$lf_dir);

        # The first is a URL, make it more canonical, find the base.
        # The namespace confusion in this section is correct.(??)
        $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );

        # print "URL: ",$rum_url_o->as_string,"\n";
        # print "Base: $rum_rebase\n";

        # Translate from rum space to lf space:
        push(@internal_apply,"s/^".$rum_rebase."/".quotemeta($lf_dir)."/");

        # That translation could lead to information loss.
        $infoloss=1;

        # Fetch rules test the rum_url_o->as_string.  Fetch whatever
        # matches the base.
        $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";

        # Ignore whatever did not match the base.
        $scope_ignore.="return 0 if m/^".
          quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";

      } else {
        $rum_url_o=root_queue($value,'./');

        $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );

        # Translate from rum space to lf space:
        push(@internal_apply,"s/^".$rum_rebase."//");

        $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";
        $scope_ignore.="return 0 if m/^".
          quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";
      }
      next;
    }

    if ($key eq 'also-quene') {
      print STDERR
        "Found 'also-quene' keyword, please replace with 'also-queue'\n";
      $key='also-queue';
    }

    if ($key eq 'also' || $key eq 'also-queue') {
      if ($value =~ m/^(.+)\s+(.+)/i) {
        my ($rum_url_o,$lf_dir,$rum_reurl,$rum_rebase);
        # Two arguments.
        # The last is a directory, it must end in /
        # print STDERR "URL ",$1," DIR ",$2,"\n";
        $lf_dir=$2;
        $lf_dir.='/' unless $lf_dir =~ m~/$~;

        if ($key eq 'also-queue') {
          $rum_url_o=root_queue($1,$lf_dir);
        } else {
          root_queue("",$lf_dir);
          $rum_url_o=url $1;
          $rum_url_o->host(lc $rum_url_o->host);
        }

        # The first is a URL, find the base
        $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );

#       print "URL: $rum_url_o->as_string\n";
#       print "Base: $rum_rebase\n";
#       print "Server: $server\n";

        # Ok, now we can transform and select stuff the right way
        push(@internal_apply,"s/^".$rum_rebase."/".quotemeta($lf_dir)."/");
        $infoloss=1;

        # Fetch rules test the rum_url_o->as_string.  Fetch whatever
        # matches the base.
        $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";

        # Ignore whatever did not match the base.  This cures a problem
        # with '..' from the base in rum space pointing within the
        # scope in lf space.  We introduced an extra level (or more) of
        # directories with the apply above.  Must do the same with
        # 'Also:' directives.
        $scope_ignore.="return 0 if m/^".
          quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";
      } else {
        die "Also: requires 2 arguments\n";
      }
      next;
    }

    if ($key eq 'quene') {
      print STDERR "Found 'quene' keyword, please replace with 'queue'\n";
      $key='queue';
    }

    if ($key eq 'queue') {
      root_queue($value,"");
      next;
    }

    if ($key eq 'ignore-re' || $key eq 'fetch-re') {
      # Check that it's an RE; better to be strict here than to let
      # perl produce compilation errors later.
      unless ($value =~ /^m(.).*\1[gimosx]*$/) {
        print STDERR "w3mir: $value is not a recognized regular expression\n";
        exit 1;
      }
      # Fall through to the next cases!
    }

    if ($key eq 'fetch' || $key eq 'fetch-re') {
      my $expr=$value;
      $expr = wild_re($expr).$ipost if ($key eq 'fetch');
      $rule_text.=' return 1 if '.$expr.";\n";
      next;
    }

    if ($key eq 'ignore' || $key eq 'ignore-re') {
      my $expr=$value;
      $expr = wild_re($expr).$ipost if ($key eq 'ignore');
      # print STDERR "Ignore expression: $expr\n";
      $rule_text.=' return 0 if '.$expr.";\n";
      next;
    }

    if ($key eq 'apply') {
      unless ($value =~ /^s(.).*\1.*\1[gimosxe]*$/) {
        print STDERR
          "w3mir: '$value' is not a recognized regular expression\n";
        exit 1;
      }
      push(@user_apply,$value);
      $infoloss=1;
      next;
    }

    if ($key eq 'agent') {
      $w3http::agent=$w3mir_agent=$value;
      next;
    }

    # The authorization stuff:
    if ($key eq 'auth-domain') {
      $useauth=1;
      ($authserver, $authrealm) = split('/',$value,2);
      die "w3mir: server part of auth-domain has format server[:port]\n"
        unless $authserver =~ /^(\S+(:\d+)?)$|^\*$/;
      $authserver =~ s/:80$//;
      die "w3mir: auth-domain '$value' is not valid\n"
        if !defined($authserver) || !defined($authrealm);
      $authrealm=lc $authrealm;
    }

    $authuser=$value if ($key eq 'auth-user');
    $authpasswd=$value if ($key eq 'auth-passwd');

    # Got a full authentication spec?
    if ($authserver && $authrealm && $authuser && $authpasswd) {
      $authdata{$authserver}{$authrealm}=$authuser.":".$authpasswd;
      print "Authentication for $authserver/$authrealm is ".
        "$authuser/$authpasswd\n" if $verbose>=0;
      # Invalidate tmp vars
      $authserver=$authrealm=$authuser=$authpasswd=undef;
      next;
    }

    next if $key eq 'auth-user' || $key eq 'auth-passwd' ||
      $key eq 'auth-domain';

    if ($key eq 'fetch-options') {
      warn "w3mir: The 'fetch-options' directive has been renamed to 'options'\nw3mir: Please change your configuration file.\n";
      $key='options';
      # Fall through to 'options'!
    }

    if ($key eq 'options') {

      my($val,$nval);
      foreach $val (split(/\s*,\s*/,lc $value)) {
        if ($i==1) {
          $nval=nway($val,'recurse','no-date-check','only-nonexistent',
                     'list-urls','lowercase','remove','batch','read-urls',
                     'abs','no-newline-conv');
          $r=1,next if $nval==0;
          $fetch=1,next if $nval==1;
          $fetch=-1,next if $nval==2;
          $list=1,next if $nval==3;
          if ($nval==4) {
            $lc=1;
            $iinline=($lc?"(?i)":"");
            $ipost=($lc?"i":"");
            next;
          }
          $remove=1,next if $nval==5;
          $convertnl=0,$batch=1,next if $nval==6;
          $read_urls=1,next if $nval==7;
          $abs=1,next if $nval==8;
          $convertnl=0,next if $nval==9;
        } else {
          die "w3mir: options must be the first directive in the config file.\n";
        }
      }
      next;
    }

    if ($key eq 'disable-headers') {
      my($val,$nval);
      foreach $val (split(/\s*,\s*/,lc $value)) {
        $nval=nway($val,'referer','user');
        $do_referer=0,next if $nval==0;
        $do_user=0,next if $nval==1;
      }
      next;
    }

    if ($key eq 'fixup') {

      $fixrc="$file";
      # warn "Fixrc: $fixrc\n";

      my($val,$nval);
      foreach $val (split(/\s*,\s*/,lc $value)) {
        $nval=nway($val,'on','run','noindex','off');
        $runfix=1,next if $nval==1;
        # Disable fixup
        $fixup=0,next if $nval==3;
        # Ignore everything else
      }
      next;
    }

    die "w3mir: Unrecognized directive ('$key') in config file $file at line $.\n";

  }
  close(CFGF);

  if (defined($w3http::proxypasswd) && $w3http::proxyuser) {
    warn "Proxy authentication: ".$w3http::proxyuser.":".
      $w3http::proxypasswd."\n" if $verbose>=0;
  }

}


sub wild_re {
  # Here we translate a unix wildcard subset to perlre
  local($_) = shift;

  # Quote anything that's RE and not wildcard: / ( ) \ | { } + $ ^
  s~([\/\(\)\\\|\{\}\+\$\^])~\\$1~g;
  # . -> \.
  s~\.~\\.~g;
  # * -> .*
  s~\*~\.\*~g;
  # ? -> .
  s~\?~\.~g;

  # print STDERR "wild_re: $_\n";

  return $_ = '/'.$_.'/';
}


sub numeric {
  # Check that the argument is numeric
  my ( $number ) = @_ ;
  return oct($number) if ($number =~ /^\d+$/ || $number =~ /^\d+\.\d+$/);
  die "Expected a number, got \"$number\"\n";
}


sub boolean {
  my ( $boolean ) = @_ ;

  $boolean = lc $boolean;

  return 0 if ($boolean eq 'false' || $boolean eq 'off' || $boolean eq '0');
  return 1 if ($boolean eq 'true' || $boolean eq 'on' || $boolean eq '1');
  die "Expected a boolean, got \"$boolean\"\n";
}


sub nway {
  my ( $value ) = shift;
  my ( @values ) = @_;
  my ( $val ) = 0;

  $value = lc $value;
  while (@_) {
    return $val if $value eq shift;
    $val++;
  }
  die "Expected one of ".join(", ",@values).", got \"$value\"\n";
}


sub insert_at_start {
  # ark: inserts the first arg at the top of the html in the second arg
  # janl: The second arg must be a reference to a scalar.
  my( $str, $text_ref ) = @_;
  my( @possible ) =("<BODY.*?>", "</HEAD.*?>", "</TITLE.*?>", "<HTML.*?>" );
  my( $f, $done );

  $done=0;
  @_=@possible;

  while( $done!=1 && ($f=shift) ){
    # print "Searching for: $f\n";
    if( $$text_ref =~ /$f/i ){
      # print "found it!\n";
      $$text_ref =~ s/($f)/$1\n$str/i;
      $done=1;
    }
  }
}


sub rm_rf {
  # Recursively remove directories and other files.
  # File::Path::rmtree does a similar thing but the messages are wrong

  my($remove)=shift;

  eval "use File::Find;" unless defined(&finddepth);

  die "w3mir: Could not load File::Find module when trying to remove $remove\n"
    unless defined(&finddepth);

  finddepth(\&remove_everything,$remove);

  if (rmdir($remove)) {
    print STDERR "\nw3mir: removed directory $remove\n" if $verbose>=0;
  } else {
    print STDERR "w3mir: could not remove $remove: $!\n";
  }
}


sub remove_everything {
  # This does the removal
  ((-d && rmdir($_)) || unlink($_)) && $verbose>=0 &&
    print STDERR "w3mir: removed $File::Find::name\n";
}


sub usage {
  my($message)=shift @_;

  print STDERR "w3mir: $message\n" if $message;

  die 'w3mir: usage: w3mir [options] <single-http-url>
   or: w3mir -B [-I] [options] [<http-urls>]

 Options :
    -agent <agent>  - Set the agent name.  Default is w3mir
    -abs            - Force all URLs to be absolute.
    -B              - Batch-get documents.
    -I              - The URLs to get are read from standard input.
    -c              - be more Chatty.
    -cfgfile <file> - Read config from file.
    -d <debug-level>- set debug level to 1 or 2
    -drr            - Disable robots.txt rules.
    -f              - Forget all files, nothing is saved to disk.
    -fa             - Fetch All, will not check timestamps.
    -fs             - Fetch Some, do not fetch the files we already have.
    -ir <referer>   - Initial referer.  For picky servers.
    -l              - List URLs in the documents retrieved.
    -lc             - Convert all URLs (and filenames) to lowercase.
                      This does not work reliably.
    -p <n>          - Pause n seconds before retrieving each doc.
    -q              - Quiet, error-messages only
    -rp <n>         - Retry Pause in seconds.
    -P <server:port>- Use host/port for proxy http requests
    -pflush         - Flush proxy server.
    -r              - Recursive mirroring.
    -R              - Remove files not referenced or not present on server.
    -s              - Send output to stdout instead of file
    -t <n>          - How many times to (re)try getting a failed doc?
    -umask <umask>  - Set umask for mirroring, must be usual octal format.
    -nnc            - No Newline Conversion.  Disable newline conversions.
    -v              - Show w3mir version.
';
}
__END__
# -*- perl -*- There must be a blank line here

=head1 NAME

w3mir - all purpose HTTP-copying and mirroring tool

=head1 SYNOPSIS

B<w3mir> [B<options>] [I<HTTP-URL>]

B<w3mir> B<-B> [B<options>] <I<HTTP-URLS>>

B<w3mir> is an all-purpose HTTP copying and mirroring tool.  The main
focus of B<w3mir> is to create and maintain a browsable copy of one,
or several, remote WWW site(s).

Used to the max, B<w3mir> can retrieve the contents of several related
sites and leave the mirror browsable via a local web server, or from a
filesystem, such as directly from a CDROM.

B<w3mir> has options for all operations that are simple enough for
options.  For authentication and passwords, multiple site retrievals
and such you will have to resort to a L</CONFIGURATION-FILE>.  If
browsing from a filesystem, references ending in '/' need to be
rewritten to end in '/index.html', and, in any case, URLs that are
redirected will need to be changed to make the mirror browsable; see
the documentation of B<Fixup> in the L</CONFIGURATION-FILE> section.

B<w3mir>'s default behavior is to do as little as possible and to be
as nice as possible to the server(s) it is getting documents from.
You will need to read through the options list to make B<w3mir> do
more complex, and useful, things.  Most of the things B<w3mir> can do
are also documented in the w3mir-HOWTO which is available at the
B<w3mir> home-page (F<http://www.math.uio.no/~janl/w3mir/>) as well as
in the w3mir distribution bundle.

=head1 DESCRIPTION

You may specify many options and one HTTP-URL on the w3mir command
line.

A single HTTP URL I<must> be specified either on the command line or
in a B<URL> directive in a configuration file.  If the URL refers to a
directory it I<must> end with a "/", otherwise you might get surprised
at what gets retrieved (e.g. rather more than you expect).

Options must be prefixed with at least one - as shown below; you can
use more if you want to.  B<-cfgfile> is equivalent to B<--cfgfile> or
even B<------cfgfile>.  Options cannot be I<clustered>, i.e., B<-r -R>
is not equivalent to B<-rR>.

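For example, a typical recursive mirroring run of a site, with a 5
second pause between documents, could be started like this (the URL
here is made up for illustration):

 w3mir -r -p 5 http://www.foo.org/gazonk/
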
=over 4

=item B<-h> | B<-help> | B<-?>

prints a brief summary of all command line options and exits.

=item B<-cfgfile> F<file>

Makes B<w3mir> read the given configuration file.  See the next
section for how to write such a file.

=item B<-r>

Puts B<w3mir> into recursive mode.  The default is to fetch only one
document and then quit.  'I<Recursive>' mode means that all the
documents the given document links to are fetched, and all the
documents they link to in turn, and so on; but only if they are in the
same directory or under the same directory as the start document.
Any document that is in or under the starting document's directory is
said to be within the I<scope of retrieval>.

=item B<-fa>

Fetch All.  Normally B<w3mir> will only get the document if it has
been updated since the last time it was fetched.  This switch turns
that check off.

=item B<-fs>

Fetch Some.  Not the opposite of B<-fa>, but rather: fetch the ones we
don't have already.  This is handy to restart copying of a site
incompletely copied by earlier, interrupted, runs of B<w3mir>.

=item B<-p> I<n>

Pause for I<n> seconds between getting each document.  The default is
30 seconds.

=item B<-rp> I<n>

Retry Pause, in seconds.  When B<w3mir> fails to get a document for
some technical reason (timeout mainly) the document will be queued for
a later retry.  The retry pause is how long B<w3mir> waits between
finishing a mirror pass and starting a new one to get the still
missing documents.  This should be a long time, so network conditions
have a chance to get better.  The default is 600 seconds (10 minutes),
which might be a bit too short; for batch running B<w3mir> I would
suggest an hour (3600 seconds) or more.

=item B<-t> I<n>

Number of reTries.  If B<w3mir> cannot get all the documents by the
I<n>th retry, B<w3mir> gives up.  The default is 3.

=item B<-drr>

Disable Robot Rules.  The robot exclusion standard is described in
http://info.webcrawler.com/mak/projects/robots/norobots.html.  By
default B<w3mir> honors this standard.  This option causes B<w3mir> to
ignore it.

=item B<-nnc>

No Newline Conversion.  Normally w3mir converts the newline format of
all files that the web server says are text files.  However, not all
web servers are reliable, and so binary files may become corrupted due
to the newline conversion w3mir performs.  Use this option to stop
w3mir from converting newlines.  This also causes the file to be
regarded as binary when written to disk, to disable the implicit
newline conversion when saving text files on most non-Unix systems.

This will probably be on by default in version 1.1 of w3mir, but not
in version 1.0.

=item B<-R>

Remove files.  Normally B<w3mir> will not remove files that are no
longer on the server/part of the retrieved web of files.  When this
option is specified all files no longer needed or found on the servers
will be removed.  If B<w3mir> fails to get a document for I<any> other
reason the file will not be removed.

=item B<-B>

Batch fetch documents whose URLs are given on the commandline.

In combination with the B<-r> and/or B<-l> switch all HTML and PDF
documents will be mined for URLs, but the documents will be saved on
disk unchanged.  When used with the B<-r> switch only one single URL
is allowed.  When not used with the B<-r> switch no HTML/URL
processing will be performed at all.  When the B<-B> switch is used
with B<-r>, w3mir will not do repeated mirrorings reliably since the
changes w3mir needs to make in the documents to work reliably are not
made.  In any case it's best not to use B<-R> in combination with
B<-B> since that can result in deleting rather more documents than
expected.  However, if the person writing the documents being copied
is good about making references relative and placing the <HTML> tag at
the beginning of documents there is a fair chance that things will
work even so.  But I wouldn't bet on it.  It will, however, work
reliably for repeated mirroring if the B<-r> switch is not used.

When the B<-B> switch is specified redirects for a given document will
be followed no matter where they point.  The redirected-to document
will be retrieved in the place of the original document.  This is a
potential weakness, since w3mir can be directed to fetch any document
anywhere on the web.

Unless used with B<-r> all retrieved files will be stored in one
directory using the remote filename as the local filename.  I.e.,
F<http://foo/bar/gazonk.html> will be saved as F<gazonk.html>.
F<http://foo/bar/> will be saved as F<bar-index.html> so as to avoid
name collisions for the common case of URLs ending in /.

=item B<-I>

This switch can only be used with the B<-B> switch, and only after it
on the commandline or in the configuration file.  When given, w3mir
will get URLs from standard input (i.e., w3mir can be used as the end
of a pipe that produces URLs.)  There should only be one URL per line
of input.

=item B<-q>

Quiet.  Turns off all informational messages, only errors will be
output.

=item B<-c>

Chatty.  B<w3mir> will output more progress information.  This can be
used if you're watching B<w3mir> work.

=item B<-v>

Version.  Output B<w3mir>'s version.

=item B<-s>

Copy the given document(s) to STDOUT.

=item B<-f>

Forget.  The retrieved documents are not saved on disk, they are just
forgotten.  This can be used to prime the cache in proxy servers, or
not save documents you just want to list the URLs in (see B<-l>).

=item B<-l>

List the URLs referred to in the retrieved document(s) on STDOUT.

=item B<-umask> I<n>

Sets the umask, i.e., the permission bits of all retrieved files.  The
number is taken as octal unless it starts with a 0x, in which case
it's taken as hexadecimal.  No matter what you set this to, make sure
you get write as well as read access to created files and directories.

Typical values are:

=over 8

=item 022

let everyone read the files (and directories), only you can change
them.

=item 027

you and everyone in the same file-group as you can read, only you can
change them.

=item 077

only you can read the files, only you can change them.

=item 0

everyone can read, write and change everything.

=back

The default is whatever was set when B<w3mir> was invoked.  022 is a
reasonable value.

This option has no meaning, or effect, on Win32 platforms.

=item B<-P> I<server:port>

Use the given server and port as an HTTP proxy server.  If no port is
given, port 80 is assumed (this is the normal HTTP port).  This is
useful if you are inside a firewall, or use a proxy server to save
bandwidth.

=item B<-pflush>

Proxy flush; force the proxy server to flush its cache and re-get the
document from the source.  The I<Pragma: no-cache> HTTP/1.0 header is
used to implement this.

=item B<-ir> I<referrer>

Initial Referrer.  Set the referrer of the first retrieved document.
Some servers are reluctant to serve certain documents unless this is
set right.

=item B<-agent> I<agent>

Set the HTTP User-Agent field's value.  Some servers will serve
different documents according to the WWW browser's capabilities.
B<w3mir> normally has B<w3mir>/I<version> in this header field.
Netscape uses things like B<Mozilla/3.01 (X11; I; Linux 2.0.30 i586)>
and MSIE uses things like B<Mozilla/2.0 (compatible; MSIE 3.02;
Windows NT)> (remember to enclose agent strings containing spaces in
double quotes (")).

=item B<-lc>

Lower Case URLs.  Some OSes, like W95 and NT, are not case sensitive
when it comes to filenames.  Thus web masters using such OSes can case
filenames differently in different places (apps.html, Apps.html,
APPS.HTML).  If you mirror to a Unix machine this can result in one
file on the server becoming many in the mirror.  This option
lowercases all filenames so the mirror corresponds better with the
server.

If given, it must be the first option on the command line.

This option does not work perfectly.  Most especially for mixed case
host-names.

=item B<-d> I<n>

Set the debug level.  A debug level higher than 0 will produce lots of
extra output for debugging purposes.

=item B<-abs>

Force all URLs to be absolute.  If you retrieve
F<http://www.ifi.uio.no/~janl/index.html> and it references foo.html,
the reference is made absolute:
F<http://www.ifi.uio.no/~janl/foo.html>.  In other words, you get
absolute references to the origin site if you use this option.

=back

=head1 CONFIGURATION-FILE

Most things can be mirrored with a (long) command line.  But multi
server mirroring, authentication and some other things are only
available through a configuration file.  A configuration file can be
specified with the B<-cfgfile> switch, but w3mir also looks for
.w3mirc (w3mir.ini on Win32 platforms) in the directory where w3mir is
started from.

The configuration file consists of lines of comments and directives.
A directive consists of a keyword followed by a colon (:) and then one
or several arguments.

 # This is a comment.  And the next line is a directive:
 Options: recurse, remove

A comment can only start at the beginning of a line.  The directive
keywords are not case-sensitive, but the arguments I<might> be.

=over 4

=item Options: I<recurse> | I<no-date-check> | I<only-nonexistent> | I<list-urls> | I<lowercase> | I<remove> | I<batch> | I<input-urls> | I<no-newline-conv>

This must be the first directive in a configuration file.

=over 8

=item I<recurse>

see B<-r> switch.

=item I<no-date-check>

see B<-fa> switch.

=item I<only-nonexistent>

see B<-fs> switch.

=item I<list-urls>

see B<-l> option.

=item I<lowercase>

see B<-lc> option.

=item I<remove>

see B<-R> option.

=item I<batch>

see B<-B> option.

=item I<input-urls>

see B<-I> option.

=item I<no-newline-conv>

see B<-nnc> option.

=back

=item URL: I<HTTP-URL> [I<target-directory>]

The URL directive may only appear once in any configuration file.

Without the optional target directory argument it corresponds directly
to the I<single-HTTP-URL> argument on the command line.

If the optional target directory is given, all documents from under
the given URL will be stored in that directory, and under.  The target
directory is most likely only specified if the B<Also> directive is
also specified.

If the URL given refers to a directory it I<must> end in a "/",
otherwise you might get quite surprised at what gets retrieved.

Either one URL: directive or the single-HTTP-URL at the command-line
I<must> be given.

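As an illustration, a two-argument URL: line could look like this (the
host and directory names are made up):

 URL: http://www.foo.org/gazonk/ gazonk/

All documents under http://www.foo.org/gazonk/ would then be stored
under the local directory gazonk/.
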
=item Also: I<HTTP-URL directory>

This directive is only meaningful if the I<recurse> (or B<-r>) option
is given.

The directive enlarges the scope of a recursive retrieval to contain
the given HTTP-URL and all documents in the same directory or under.
Any documents retrieved because of this directive will be stored in
the given directory of the mirror.

In practice this means that if the documents to be retrieved are
stored on several servers, or in several hierarchies on one server, or
any combination of those, the B<Also> directive ensures that we get
everything into one single mirror.

This also means that if you're retrieving

 URL: http://www.foo.org/gazonk/

but it has inline icons or images stored in http://www.foo.org/icons/
which you will also want to get, then that will be retrieved as well
by entering

 Also: http://www.foo.org/icons/ icons

As with the URL directive, if the URL refers to a directory it I<must>
end in a "/".

Another use for it is when mirroring sites that have several names
that all refer to the same (logical) server:

 URL: http://www.midifest.com/
 Also: http://midifest.com/ .

At this point in time B<w3mir> has no mechanism to easily enlarge the
scope of a mirror after it has been established.  That means that you
should survey the documents you are going to retrieve to find out what
icons, graphics and other things they refer to that you want.  And
what other sites you might like to retrieve.  If you find out that
something is missing you will have to delete the whole mirror, add the
needed B<Also> directives and then reestablish the mirror.  This lack
of flexibility in what to retrieve will be addressed at a later date.

See also the B<Also-queue> directive.

=item Also-queue: I<HTTP-URL directory>

This is like B<Also>, except that the URL itself is also queued.  The
B<Also> directive will not cause any documents to be retrieved UNLESS
they are referenced by some other document w3mir has already
retrieved.

=item Queue: I<HTTP-URL>

This queues the URL for retrieval, but does not enlarge the scope of
the retrieval.  If the URL is outside the scope of retrieval it will
not be retrieved anyway.

The observant reader will see that B<Also-queue> is like B<Also>
combined with B<Queue>.

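For example (the URLs are made up), to enlarge the scope with an icon
directory and also seed one page for retrieval even if nothing links
to it:

 Also-queue: http://www.foo.org/icons/ icons
 Queue: http://www.foo.org/gazonk/lonely.html
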
2831
2832see B<-ir> option.
2833
2834=item Ignore: F<wildcard>
2835
2836=item Fetch: F<wildcard>
2837
2838=item Ignore-RE: F<regular-expression>
2839
2840=item Fetch-RE: F<regular-expression>
2841
2842These four are used to set up rules about which documents, within the
2843scope of retrieval, should be gotten and which not. The default is to
2844get I<anything> that is within the scope of retrieval. That may not
2845be practical though. This goes for CGI scripts, and especially server
2846side image maps and other things that are executed/evaluated on the
2847server. There might be other things you want unfetched as well.
2848
2849B<w3mir> stores the I<Ignore>/I<Fetch> rules in a list. When a
2850document is considered for retrieval the URL is checked against the
2851list in the same order that the rules appeared in the configuration
2852file. If the URL matches any rule the search stops at once. If it
2853matched a I<Ignore> rule the document is not fetched and any URLs in
2854other documents pointing to it will point to the document at the
2855original server (not inside the mirror). If it matched a I<Fetch>
2856rule the document is gotten. If not matched by any ruøes the document
2857is gotten.
2858
2859The F<wildcard>s are a very limited subset of Unix-wildcards.
2860B<w3mir> understands only 'I<?>', 'I<*>', and 'I<[x-y]>' ranges.
2861
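Since the first matching rule decides, a specific I<Fetch> rule can be
placed before a broader I<Ignore> rule.  For instance (the script
names are hypothetical):

 Fetch: */search.cgi
 Ignore: *.cgi

fetches search.cgi but ignores all other CGI scripts within the scope
of retrieval.
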
The F<perl-regular-expression> is perl's superset of the normal Unix
regular expression syntax.  They must be completely specified,
including the prefixed m, a delimiter of your choice (except the
paired delimiters: parenthesis, brackets and braces), and any of the
RE modifiers.  E.g.,

 Ignore-RE: m/.gif$/i

or

 Ignore-RE: m~/.*/.*/.*/~

and so on.  "#" cannot be used as delimiter as it is the comment
character in the configuration file.  This also has the bad
side-effect of making you unable to match fragment names (#foobar)
directly.  Fortunately perl allows writing ``#'' as ``\043''.

You must be very careful when using the RE anchors (``^'' and ``$'')
with the RE versions of these and the I<Apply> directive.  Given the
rules:

 Fetch-RE: m/foobar.cgi$/
 Ignore: *.cgi

all files called ``foobar.cgi'' will be fetched.  However, if the file
is referenced as ``foobar.cgi?query=mp3'' it will I<not> be fetched
since the ``$'' anchor will prevent it from matching the I<Fetch-RE>
directive, and then it will match the I<Ignore> directive instead.  If
you want to match ``foobar.cgi'' but not ``foobar.cgifu'' you can use
perl's ``\b'' character class which matches a word boundary:

 Fetch-RE: m/foobar.cgi\b/
 Ignore: *.cgi

which will get ``foobar.cgi'' as well as ``foobar.cgi?query=mp3'' but
not ``foobar.cgifu''.  BUT, you must keep in mind that a lot of
different characters make a word boundary; maybe something more subtle
is needed.

=item Apply: I<regular-expression>

This is used to change a URL into another URL.  It is a potentially
I<very> powerful feature, and it also provides ample chance for you to
shoot your own foot.  The whole apparatus is somewhat tentative; if
you find there is a need for changes in how Apply rules work please
E-mail.  If you are going to use this feature please read the
documentation for I<Fetch-RE> and I<Ignore-RE> first.

The B<Apply> expressions are applied, in sequence, to the URLs in
their absolute form.  I.e., with the whole
http://host:port/dir/ec/tory/file URL.  It is only after this that
B<w3mir> checks if a document is within the scope of retrieval or not.
That means that B<Apply> rules can be used to change certain URLs to
fall inside the scope of retrieval, and vice versa.

The I<regular-expression> is perl's superset of the usual Unix regular
expressions for substitution.  As with I<Fetch> and I<Ignore> rules it
must be specified fully, with the I<s> and delimiting character.  It
has the same restrictions with regards to delimiters.  E.g.,

 Apply: s~/foo/~/bar/~i

to translate the path element I<foo> to I<bar> in all URLs.

"#" cannot be used as delimiter as it is the comment character in the
configuration file.

Please note that w3mir expects that URLs identifying 'directories'
keep identifying directories after application of Apply rules.  Ditto
for files.

2934=item Agent: I<agent>
2935
2936see B<-agent> option.
2937
2938=item Pause: I<n>
2939
2940see B<-p> option.
2941
2942=item Retry-Pause: I<n>
2943
2944see B<-rp> option.
2945
2946=item Retries: I<n>
2947
2948see B<-t> option.
2949
2950=item debug: I<n>
2951
2952see B<-d> option.
2953
2954=item umask I<n>
2955
2956see B<-umask> option.
2957
2958=item Robot-Rules: I<on> | I<off>
2959
2960Turn robot rules on of off. See B<-drr> option.

=item Remove-Nomirror: I<on> | I<off>

If this is enabled, sections between two consecutive

 <!--NO MIRROR-->

comments in a mirrored document will be removed. This editing is
performed even if batch getting is specified.
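
For example, given this (hypothetical) fragment in a retrieved
document:

 <!--NO MIRROR-->
 <p>Phone the webmaster at extension 1234.</p>
 <!--NO MIRROR-->

the paragraph between the two comments would be absent from the local
copy.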

=item Header: I<html/text>

Insert this I<complete> html/text at the start of the document.
This will be done even if batch is specified.
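
A minimal sketch, assuming the html/text is given on the directive
line itself (the inserted comment is just an illustration):

 Header: <!-- This is a mirror. The master copy lives elsewhere. -->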

=item File-Disposition: I<save> | I<stdout> | I<forget>

What to do with a retrieved file. The I<save> alternative is the
default. The two others correspond to the B<-s> and B<-f> options.
Only one may be specified.

=item Verbosity: I<quiet> | I<brief> | I<chatty>

How much B<w3mir> informs you of its progress. I<Brief> is the
default. The two others correspond to the B<-q> and B<-c> switches.

=item Cd: I<directory>

Change to the given directory before starting work. If it does not
exist it will be quietly created. Using this option breaks the 'fixup'
code, so consider not using it, ever.

=item HTTP-Proxy: I<server:port>

see the B<-P> switch.

=item HTTP-Proxy-user: I<username>

=item HTTP-Proxy-passwd: I<password>

These two are used to activate authentication with the proxy
server. L<w3mir> only supports I<basic> proxy authentication, and is
quite simpleminded about it; if proxy authentication is on, L<w3mir>
will always give it to the proxy. The domain concept is not supported
with proxy-authentication.
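
A sketch of the three proxy directives used together (host, port and
credentials are hypothetical):

 HTTP-Proxy: proxy.example.com:8080
 HTTP-Proxy-user: janedoe
 HTTP-Proxy-passwd: secret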

=item Proxy-Options: I<no-pragma> | I<revalidate> | I<refresh> | I<no-store>

Set proxy options. There are two ways to pass proxy options, HTTP/1.0
compatible and HTTP/1.1 compatible. Newer proxy-servers will
understand the 1.1 way as well as 1.0. With old proxy-servers only
the 1.0 way will work. L<w3mir> will prefer the 1.0 way.

The only 1.0 compatible proxy-option is I<refresh>; it corresponds to
the B<-pflush> option and forces the proxy server to pass the request
to an upstream server to retrieve a I<fresh> copy of the document.

The I<no-pragma> option forces w3mir to use the HTTP/1.1 proxy
control header; use this only with servers you know to be new,
otherwise it won't work at all. Use of any option but I<refresh> will
also cause HTTP/1.1 to be used.

I<revalidate> forces the proxy server to contact the upstream server
to validate that it has a fresh copy of the document. This is nicer
to the net than the I<refresh> option, which forces a re-get of the
document regardless of whether the proxy already has a fresh copy.

I<no-store> forbids the proxy from storing the document in anything
but transient storage. This can be used when transferring sensitive
documents, but is by no means a guarantee that the document can't be
found on some storage device on the proxy-server after the transfer.
Cryptography, if legal in your country, is the solution if you want
the contents to be secret.

I<refresh> corresponds to the HTTP/1.0 header I<Pragma: no-cache> or
the identical HTTP/1.1 I<Cache-control> option. I<revalidate> and
I<no-store> correspond to I<max-age=0> and I<no-store> respectively.

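For example, to ask the proxy to revalidate its copy against the
upstream server (as noted above this is sent as the HTTP/1.1
I<max-age=0> cache control, so the proxy must understand HTTP/1.1):

 Proxy-Options: revalidate
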
=item Authorization

B<w3mir> supports only the I<basic> authentication of HTTP/1.0. This
method can assign a password to a given user/server/I<realm>. The
"user" is your user-name on the server. The "server" is the server.
The I<realm> is an HTTP concept. It is simply a grouping of files and
documents. One file or a whole directory hierarchy can belong to a
realm. One server may have many realms. A user may have separate
passwords for each realm, or the same password for all the realms the
user has access to. A combination of a server and a realm is called a
I<domain>.

=over 8

=item Auth-Domain: I<server:port/realm>

Give the server and port, and the belonging realm (making a domain)
that the following authentication data holds for. You may specify the
"*" wildcard for either of I<server:port> and I<realm>; this will work
well if you only have one username and password on all the servers
mirrored.

=item Auth-User: I<user>

Your user-name.

=item Auth-Passwd: I<password>

Your password.

=back

These three directives may be repeated, in clusters, as many times as
needed to give the necessary authentication information.

=item Disable-Headers: I<referer> | I<user>

Stop B<w3mir> from sending the given headers. This can be used for
anonymity, making your retrievals harder to track. It will be even
harder if you specify a generic B<Agent>, like Netscape.
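
A sketch of such an anonymizing configuration (the agent string is
just an illustration):

 Disable-Headers: referer
 Agent: Mozilla/4.0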

=item Fixup: I<...>

This directive controls some aspects of the separate program w3mfix.
w3mfix uses the same configuration file as w3mir since it needs a lot
of the information in the B<w3mir> configuration file to do its work
correctly. B<w3mfix> is used to make mirrors more browseable on
filesystems (disk or CDROM), and to fix redirected URLs and some other
URL editing. If you want a mirror to be browseable off disk or CDROM
you almost certainly need to run w3mfix. In many cases it is not
necessary when you run a mirror to be used through a WWW server.

To make B<w3mir> write the data files B<w3mfix> needs, and do nothing
else, simply put

=over 8

 Fixup: on

=back

in the configuration file. To make B<w3mir> run B<w3mfix>
automatically after each time B<w3mir> has completed a mirror run,
specify

=over 8

 Fixup: run

=back

L<w3mfix> is documented in a separate man page in an effort to not
prolong I<this> manpage unnecessarily.

=item Index-name: I<name-of-index-file>

When retrieving URLs ending in '/' w3mir needs to append a filename to
store them locally. The default value for this is 'index.html' (this
is the most used; its use originated in the NCSA HTTPD as far as I
know). Some WWW servers use the filename 'Welcome.html' or
'welcome.html' instead (this was the default in the old CERN HTTPD).
And servers running on limited OSes frequently use 'index.htm'. To
keep things consistent and sane w3mir and the server should use the
same name. Put

 Index-name: welcome.html

when mirroring from a site that uses that convention.

When doing a multiserver retrieval where the servers use two or more
different names for this you should use B<Apply> rules to make the
names consistent within the mirror.

When making a mirror for use with a WWW server, the mirror should use
the same name as the new server for this; to accomplish that,
B<Index-name> should be combined with B<Apply>.

Here is an example of use in the two latter cases when Welcome.html is
the preferred I<index> name:

 Index-name: Welcome.html
 Apply: s~/index.html$~/Welcome.html~

Similarly, if index.html is the preferred I<index> name:

 Apply: s~/Welcome.html~/index.html~

I<Index-name> is not needed since index.html is the default index name.

=back

=head1 EXAMPLES

=over 4

=item * Just get the latest Dr-Fun if it has been changed since the last
time

 w3mir http://sunsite.unc.edu/Dave/Dr-Fun/latest.jpg

=item * Recursively fetch everything on the Star Wars site, remove
what is no longer at the server from the mirror:

 w3mir -R -r http://www.starwars.com/

=item * Fetch the contents of the Sega site through a proxy, pausing
for 30 seconds between each document

 w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/

=item * Do everything according to F<w3mir.cfg>

 w3mir -cfgfile w3mir.cfg

=item * A simple configuration file

 # Remember, options first, as many as you like, comma separated
 Options: recurse, remove
 #
 # Start here:
 URL: http://www.starwars.com/
 #
 # Speed things up
 Pause: 0
 #
 # Don't get junk
 Ignore: *.cgi
 Ignore: *-cgi
 Ignore: *.map
 #
 # Proxy:
 HTTP-Proxy: www.foo.org:4321
 #
 # You _should_ cd away from the directory where the config file is.
 cd: starwars
 #
 # Authentication:
 Auth-domain: server:port/realm
 Auth-user: me
 Auth-passwd: my_password
 #
 # You can use '*' in place of server:port and/or realm:
 Auth-domain: */*
 Auth-user: otherme
 Auth-passwd: otherpassword

=item Also:

 # Retrieve all of janl's home pages:
 Options: recurse
 #
 # This is the two-argument form of URL:. It fetches the first into the second
 URL: http://www.math.uio.no/~janl/ math/janl
 #
 # These say that any documents referred to that live under these places
 # should be gotten too, into the named directories. Two arguments are
 # required for 'Also:'.
 Also: http://www.math.uio.no/drift/personer/ math/drift
 Also: http://www.ifi.uio.no/~janl/ ifi/janl
 Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
 #
 # The options above will result in this directory hierarchy under
 # where you started w3mir:
 # w3mir/math/janl files from http://www.math.uio.no/~janl
 # w3mir/math/drift from http://www.math.uio.no/drift/personer/
 # w3mir/ifi/janl from http://www.ifi.uio.no/~janl/
 # w3mir/math-uib/nicolai from http://www.mi.uib.no/~nicolai/

=item Ignore-RE and Fetch-RE

 # Get only jpeg/jpg files, no gifs
 Fetch-RE: m/\.jp(e)?g$/
 Ignore-RE: m/\.gif$/

=item Apply

As I said earlier, B<Apply> has not been used for Real Work yet, that
I know of. But B<Apply> I<could> be used to map all web servers at
the University of Oslo inside the scope of retrieval very easily:

 # Start at the main server
 URL: http://www.uio.no/
 # Change http://*.uio.no and http://129.240.* to be a subdirectory
 # of http://www.uio.no/.
 Apply: s~^http://(.*\.uio\.no(?::\d+)?)/~http://www.uio.no/$1/~i
 Apply: s~^http://(129\.240\.[^:]*(?::\d+)?)/~http://www.uio.no/$1/~i

=back

There are two rather extensive example files in the B<w3mir> distribution.

=head1 BUGS

=over 4

=item The -lc switch does not work too well.

=back

=head1 FEATURES

These are not bugs.

=over 4

=item URLs with two slashes ('//') in the path component do not work as
some might expect. According to my reading of the URL spec. it is an
illegal construct, which is a Good Thing, because I don't know how to
handle it if it's legal.

=item If you start at http://foo/bar/ then index.html might be gotten
twice.

=item Some documents point above the server root, i.e.,
http://some.server/../stuff.html. Netscape, and other browsers, in
defiance of the URL standard, will change such a URL to
http://some.server/stuff.html. W3mir will not.

=item Authentication is I<only> tried if the server requests it. This
might lead to a lot of extra connections going up and down, but that's
the way it's gotta work for now.

=back

=head1 SEE ALSO

L<w3mfix>

=head1 AUTHORS

B<w3mir>'s authors can be reached at I<[email protected]>.
B<w3mir>'s home page is at http://www.math.uio.no/~janl/w3mir/