source: gsdl/tags/gsdl-2_71-distribution/gsdl/packages/w3mir/w3mir-1.0.8/w3mir.PL@ 14121

Last change on this file since 14121 was 719, checked in by davidb, 25 years ago

added w3mir package

  • Property svn:keywords set to Author Date Id Revision
File size: 98.1 KB
# -*-perl-*-

use Config;

&read_makefile;
$fullperl = resolve_make_var('FULLPERL') || $Config{'perlpath'};
$islib = resolve_make_var('INSTALLSITELIB');

$name = $0;
$name =~ s~^.*/~~;
$name =~ s~.PL$~~;

open(OUT,"> $name") ||
  die "Could not open $name for writing: $!\n";

print "writing $name\n";

while (<DATA>) {
  if (m~^\#!/.*/perl.*$~o) {
    # This substitutes the path perl was installed at on this system
    # _and_ removes any (-w) options.
    print OUT "#!",$fullperl,"\n";
    next;
  }
  if (/^use lib/o) {
    # This substitutes the actual library install path
    print OUT "use lib '$islib';\n";
    next;
  }
  print OUT;
}

close(OUT);

# Make it executable and writable too
chmod 0755, $name;

#### The library

sub resolve_make_var ($) {

  my($var) = shift @_;
  my($val) = $make{$var};

#  print "Resolving: ",$var,"=",$val,"\n";

  while ($val =~ s~\$\((\S+)\)~$make{$1}~g) {}
#  print "Resolved: $var: $make{$var} -> $val\n";
  $val;
}
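# For illustration (the values here are hypothetical): given a Makefile
# containing
#   PREFIX = /usr/local
#   INSTALLSITELIB = $(PREFIX)/lib/perl5/site_perl
# resolve_make_var('INSTALLSITELIB') first looks up the raw value, then the
# while-loop above repeatedly substitutes every $(VAR) reference until none
# remain, yielding '/usr/local/lib/perl5/site_perl'.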


sub read_makefile {

  open(MAKEFILE, 'Makefile') ||
    die "Could not open Makefile for reading: $!\n";

  while (<MAKEFILE>) {
    chomp;
    next unless m/^([A-Z]+)\s*=\s*(\S+)$/;
    $make{$1}=$2;
#    print "Makevar: $1 = $2\n";
  }

  close(MAKEFILE)
}
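# Note (illustrative): the pattern above only matches simple assignments of
# the form 'VARNAME = value' where the name is all capitals and the value
# contains no whitespace, e.g.
#   FULLPERL = /usr/bin/perl
# Variable names containing underscores, or values with embedded spaces,
# are silently skipped.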

__END__
#!/local/bin/perl5 -w
# Perl 5.002 or later. w3mir is mostly tested with perl 5.004
#
# You might want to change or comment out this:
use lib '/hom/janl/lib/perl';
#
# Once upon a long time ago this was Oscar Nierstrasz's
# <[email protected]> htget script.
#
# Retrieves HTML pages, creating local copies in the _current_
# directory. The script will check for the last-modified stamp on the
# document, and will not fetch it if the document isn't changed.
#
# Bug list is in w3mir-README.
#
# Test cases for janl to use:
#   w3mir -r -fs http://www.eff.org/ - infinite recursion!
#   --- but cursory examination seems to indicate confused server...
#   http://java.sun.com/progGuide/index.html check out the img things.
#
# Copyright Holders:
#   Nicolai Langfeldt, [email protected]
#   Gorm Haug Eriksen, [email protected]
#   Chris Szurgot, [email protected]
#   Ed Jordan, [email protected]
#   Alex Knowles, [email protected] aka ark.
# Copying and modification is governed by the "Artistic License" enclosed in
# the w3mir distribution
#
# History (European format date: dd/mm/yy):
# oscar 25/03/94 -- added -s option to send output to stdout
# oscar 28/03/94 -- made HTTP 1.0 the default
# oscar 30/05/94 -- special handling of directory URLs missing a trailing "/"
# gorm  20/02/95 -- added mirror capacity + fixed a couple of bugs
# janl  28/03/95 -- added a working commandline parser.
# janl  18/09/95 -- Changed to use a net http library. Removed dependency on
#                   url.pl.
# janl  19/09/95 -- Extensive rewrite. Simplified a lot, works better.
#                   HTML files are now saved in a new and improved manner,
#                   which means they can be recognized as such w/o fancy
#                   filename extension type rules.
# szurgot 27/01/96 -- Added "Plaintextmode" wrapper to binmode PAGE.
#                   binmode page is required under Win32, but broke modified
#                   checking
#                 -- Minor change: added ; to "# '" strings for Emacs cperl-mode
# szurgot 07/02/96 -- When reading in local file for checking of URLs changed
#                   local ($/) =0; to equal undef;
# janl  08/02/96 -- Added szurgot's changes and changed them :-)
# szurgot 09/02/96 -- Added code to strip /#.*$/ from urls when reading from
#                   local file
#                 -- Added hasAlarm variable to w3http.pl. Set to 1 if you have
#                   alarm(). 0 otherwise.
#                 -- Moved code setting up the valid extensions list into the
#                   args processing where it belonged
# janl  20/02/96 -- Added szurgot changes again.
#                -- Make timeout code work.
#                -- and made another win32 test.
# janl  19/03/96 -- Worked through the code for handling not-modified
#                   documents, it was a bit shabby after htmlop was intro'ed.
# janl  20/03/96 -- -l fix
# janl  23/04/96 -- Added -fs by request (by Rik Faith)
# janl  16/05/96 -- Made -R mandatory, added use and support for
#                   w3http::SAVEBIN
# szurgot 19/05/96 -- Win95 adaptations.
# janl  19/05/96 -- -C did not exactly work as expected. Thanks to Petr
#                   Novak for bug descriptions.
# janl  19/05/96 -- Changed logic for @didntget, @got and so on to use
#                   @queue and %urlstat.
# janl  09/09/96 -- Removed -R switch.
# janl  14/09/96 -- Added ir (initial referer) switch
# janl  21/09/96 -- Made retry code saner. There probably needs to be a
#                   'sleep before retry commences' switch. When no tty is
#                   present it should be fairly long.
# gorm  15/09/96 -- Added cr (check robot) switch. Defaults to 1 (on)
# janl  22/09/96 -- Modified gorm's patch to use WWW::RobotRules. Changed
#                   robot switch to be consistent with current w3mir
#                   practice.
# janl  27/09/96 -- Spelling corrections from [email protected]
#                -- Folded in manual diffs from ark.
# ark   24/09/96 -- Simple facilities to edit the incoming file(s)
# janl  27/09/96 -- Added switch to enable <!--NOMIRROR--> editing and
#                   foolproofed ark's patch a bit.
# janl  02/10/96 -- Added -umask switch.
#                -- Redirected documents did not have a meaningful referer
#                   value (it was undefined).
#                -- Got w3mir into strict discipline, found some typos...
# janl  20/10/96 -- Mtime is preserved
# janl  21/10/96 -- -lc switch added. Mtime preservation works better.
# janl  06/11/96 -- Treat 301 like 302.
# janl  02/12/96 -- Added config file code, fetch/ignore rules, apply
# janl  04/12/96 -- Better checking of config input.
# janl  06/12/96 -- Putting together the URL selection/editing brains.
# janl  07/12/96 -- Checking out some bugs. Adding multiscope options.
# janl  12/12/96 -- Adding to and defeaturing the multiscope options.
# janl  13/12/96 -- Continuing work on multiscope stuff
#                -- Unreferenced file and empty directory removal works.
# janl  19/02/97 -- Can extract urls from adobe acrobat pdf files :-)
#                   Important: It does _not_ edit urls, so they still
#                   point at the original site(s).
# janl  21/02/97 -- Fix -lc bug related to case and the apply things.
#                -- only use SAVEURL if needed
# janl  11/03/97 -- Finish work on SAVEURL conditional.
#                -- Fixed directory removal code.
#                -- parse_args did not abort when unknown option/argument
#                   was specified.
# janl  12/03/97 -- Made test case for -lc. Didn't work. Fixed it. I think.
#                   Realized we have a bug w.r.t. hostname casing.
# janl  13/03/97 -- All redirected-to URLs within scope are now queued.
#                   That should make the mirror more complete, but it won't
#                   help browsability when it comes to the redirected doc.
#                -- Moved robot retrieval to the inside of the mirror loop
#                   since we now possibly mirror several sites.
#                -- Changed 'fetch-options' to 'options'.
#                -- Added 'proxy-options'/-pflush to control proxy server(s).
# janl  09/04/97 -- Started using URI::URL.
# janl  11/04/97 -- Debugging and using URI::URL more correctly in various
#                   places.
# janl  09/05/97 -- Added --agent switch
# janl  12/05/97 -- Simplified scope checks for root URL, changed URL 'apply'
#                   processing.
#                -- Small output formatting fix in the robot rules code.
#                -- Version is now 0.99
# janl  14/05/97 -- htmlop no longer puts '<!DOCTYPE...' into doc, so check
#                   for '<HTML' instead
# janl  11/06/97 -- Made :port optional in server part of auth-domain.
#                   Always removing :80 from server part to match netloc.
# janl  22/07/97 -- More debugging of rewrite for new features -B, -I.
# janl  01/08/97 -- Fixed bug in RE quoting for Ignore/Fetch
# janl  04/08/97 -- s/writepage/write_page/g
# janl  07/09/97 -- 0.99b1 is released
# janl  19/09/97 -- Kaj Hejer discovers omissions in non-html-url-mining code.
#                -- 0.99b2 is released
# janl  24/09/97 -- Matt Chapman found bug in realm-name extraction.
# janl  10/10/97 -- Referer: header suppression suppressed User: header instead
#                -- Added fixup handling, writes .redirs and .referers
#                   (no dot in win32)
#                -- Read .w3mirc (w3mir.ini on win32) if present
#                -- Stop file removal code from removing these files
# janl  16/10/97 -- process_tag was mangling url attributes in tags with more
#                   than one of them. Problem found by Robert L. Binkley
# janl  04/12/97 -- Fixed problem with authentication, misplaced +
#                -- default inter-document pause is 0. I figure it's better
#                   to keep one httpd occupied in a steady stream than to
#                   wait for it to die before we talk to it again.
# janl  13/12/97 -- The code handling arguments to index.html in the form of
#                   index.html/foo was incomplete. To make it complete would
#                   have been hard, so it was removed.
#                -- If a URL changes from file to directory or vice versa
#                   this is now handled.
# janl  11/01/98 -- PDF files with no URLs do not cause warnings now.
#                -- Close REFERERS and REDIRECTS before calling w3mfix
# janl  22/01/98 -- Proxy authentication as outlined by Christian Geuer
# janl  04/02/98 -- Version 1pre1
# janl  18/02/98 -- Fixed wild_re after tip by Prentiss Riddle.
#                -- Version 1pre2
# janl  20/02/98 -- w3http updated to handle complex content-types.
#                -- Fix wild_re more, bug noted by James Dumser
#                -- 1.0pre3
# janl  18/03/98 -- Version 1.0 is released
# janl  09/04/98 -- Added feature so user can disable newline conversion.
# janl  20/04/98 -- Only convert newlines in HTML files. -> 1.0.2
# janl  09/05/98 -- More careful clean_disk code.
#                -- Check if the redirected URL was a root url, if so
#                   issue a warning and exit.
# janl  12/05/98 -- use ->unix_path instead of ->as_string to derive local
#                   filename.
# janl  25/05/98 -- -B didn't work too well.
# janl  09/07/98 -- Redirect to fragment broke us, less broken now -> 1.0.4
# janl  24/09/98 -- Better error messages on errors -> 1.0.5
# janl  21/11/98 -- Fix error messages better.
# janl  05/01/99 -- Drop 'Referer: (commandline)'
# janl  13/04/99 -- Add initial referer to root urls in batch mode.
#
# Variable name discipline:
#  - remote, unmodified URL.  Variables prefixed 'rum_'
#  - local, filesystem.  Variables prefixed 'lf_'.
# Use these prefixes so we know what we're working with at all times.
# Also, URL objects are postfixed _o
#
# The apply rules and scope rules work this way:
# - First apply the user rules to the remote url.
# - Check if the document is within scope after this.
# - Then apply w3mir's rules to the result. The result is the local
#   filesystem name.
#
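# Illustration (the values here are hypothetical): when mirroring
# http://example.org/, the remote URL $rum_url_o might be
# 'http://example.org/docs/a.html' while the corresponding local name
# $lf_name is 'docs/a.html' under the current directory. The user's apply
# rules run on the former before the scope check, and w3mir's own rules
# then produce the latter.
#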
# We use features introduced in 5.002.
require 5.002;

# win32 and $nulldevice need to be globals, other modules use them.
use vars qw($win32 $nulldevice);

# To figure out what kind of system this is
BEGIN {
  use Config;
  $win32 = ( $Config{'osname'} eq 'MSWin32' );
}
# More ways to die:
use Carp;
# Http module:
use w3http;
# html url extraction and manipulation:
use htmlop;
# Extract urls from adobe acrobat pdf files:
use w3pdfuri;
# Date computer:
use HTTP::Date;
# URLs:
use URI::URL;
# For flush method
use FileHandle;

# Full discipline:
use strict;

# Set params in the http package, HTTP protocol version:
$w3http::version="1.0";

# The defaults should be for a robotic http agent on good behaviour.
my $debug=0;            # Debug level
my $verbose=0;          # Verbosity level, -1 = quiet, 0 = normal, 1...
my $pause=0;            # Pause between http requests
my $retryPause=600;     # Pause between retries. 10 minutes.
my $retry=3;            # Max 3 stabs pr. url.
my $r=0;                # Recurse? no recursion = absolutify links
my $remove=0;           # Remove files that are not there?
my $s=0;                # 0: save on disk 1: stdout 2: just forget 'em
my $useauth=0;          # Use authorization
my %authdata;           # Authorization data
my $check_robottxt = 1; # Check robots.txt
my $do_referer = 1;     # Send referer header
my $do_user = 1;        # Send user header
my $cache_header = '';  # The cache-control/pragma: no-cache header
my $using_proxy = 0;    # Using proxy server or not?
my $batch=0;            # Batch get URLs?
my $read_urls=0;        # Get urls from STDIN?
my $abs=0;              # Absolutify URLs?
my $immediate_redir=0;  # Immediately follow a redirect?
my @root_urls;          # This is where we start, the root documents
my @root_dirs;          # The corresponding directories. For remove
my $chdirto='';         # Place to chdir to after reading config file
my %nodelete=();        # Files that should not be deleted
my $numarg=0;           # Number of arguments accepted.

# Fixup related things
my $fixrc='';           # Name of w3mfix config file
my $fixup=1;            # Do things needed to run fixup
my $runfix=0;           # Run w3mfix for user?
my $fixopen=0;          # Fixup files open?

my $indexname='index.html';

my $VERSION;
$VERSION='1.0.8';
$w3http::agent = my $w3mir_agent = "w3mir/$VERSION-1999-05-28";
my $iref='';            # Initial referer. Must evaluate to false

# Derived settings
my $mine_urls=0;        # Mine URLs from documents?
my $process_urls=0;     # Perform (URL) processing of documents?

# Queue of urls to get.
my @rum_queue = ();
my @urls = ();
# URL status map.
my %rum_urlstat = ();
# Status codes:
my $QUEUED   = 0;       # Queued but not gotten yet.
my $TERROR   = 100;     # Transient error, retry later
my $HLERR    = 101;     # Permanent error, give up
my $GOTIT    = 200;     # Gotten. Note similarity to http result code
my $NOTMOD   = 304;     # Not modified.
# Negative codes for nonexistent files, easier to check.
my $NEVERMIND= -1;      # Don't want it
my $REDIR    = -302;    # Does not exist, redirected
my $ENOTFND  = -404;    # Does not exist.
my $OTHERERR = -600;    # Some other error happened
my $FROBOTS  = -601;    # Forbidden by robots.txt rule

# Directory/files survey:
my %lf_file;            # What files are present in FS? Disposition? One of:
my $FILEDEL=0;          # Delete file
my $FILEHERE=1;         # File present in filesystem only
my $FILETHERE=2;        # File present on server too.
my %lf_dir;             # Number of files/dirs in dir. If 0 dir is
                        # eligible for deletion.

my %fiddled=();         # If a file becomes a directory or a directory
                        # becomes a file it is considered fiddled and
                        # w3mir will not fiddle with it again in this
                        # run.

# Bitbucket device, very OS dependent.
$nulldevice='/dev/null';
$nulldevice='nul:' if ($win32);

# What to get, and not.
# Text of user supplied fetch/ignore rules
my $rule_text=" # User defined fetch/ignore rules\n";
# Code ref to the rule procedure
my $rule_code;

# Code to prefix and postfix the generated code. Prefix should make
# $_ contain the url to match. Postfix should return 1, the default
# is to get the url/file.
my $rule_prefix='$rule_code = sub { local($_) = shift;'."\n";
my $rule_postfix=" return 1;\n}";
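
# For illustration (hypothetical rule): a config line such as
#   Ignore: *.gif
# contributes a line like ' return 0 if m/\.gif$/;' to $rule_text, so the
# eval'ed result is roughly
#   $rule_code = sub { local($_) = shift;
#     return 0 if m/\.gif$/;
#     return 1;
#   }
# i.e. anything not explicitly ignored is fetched.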

# Scope tests generated by URL/Also directives in cfg. The scope code
# is just like the rule code, but used for program generated
# fetch/ignore rules related to multiscope retrieval.
my $scope_fetch=" # Automatic fetch rules for multiscope retrieval\n";
my $scope_ignore=" # Automatic ignore rules for multiscope retrieval\n";
my $scope_code;

my $scope_prefix='$scope_code = sub { local($_) = shift;'."\n";
my $scope_postfix=" return 0;\n}";

# Function to apply to urls, see rule comments.
my $user_apply_code;    # User specified apply code
my $apply_code;         # w3mir's apply code
my $apply_prefix='$apply_code = sub { local($_) = @_;'."\n";
my $apply_lc=' $_ = lc $_; ';
my $apply_postfix=' return $_;'."\n}";
my @user_apply;         # List of the user's apply rules.
my @internal_apply;     # List of w3mir's apply rules.

my $infoloss=0;         # 1 if any URL translations (which cause
                        # information loss) are in effect. If this is
                        # true we use the SAVEURL operation.
my $list;               # List url on STDOUT?
my $edit;               # Edit doc? Remove <!--NOMIRROR-->...<!--/NOMIRROR-->
my $header;             # Text to insert in header
my $lc=0;               # Convert urls/filenames to lowercase?
my $fetch=0;            # What to fetch: -1: Some, 0: not modified 1: all
my $convertnl=1;        # Convert newlines?

# Non text/html formats we can extract urls from. Function must take one
# argument: the filename.
my %knownformats = ( 'application/pdf',   \&w3pdfuri::list,
                     'application/x-pdf', \&w3pdfuri::list,
                   );

# Known 'magic numbers' of the known formats. The value is used as a
# key in %knownformats. The key part is an exact match for the string
# beginning at the first byte of the file.
# This should probably be made more flexible, but not until we need it.

my %knownmagic = ( '%PDF-', 'application/pdf' );
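
# Illustration (hypothetical helper, not part of w3mir): content sniffing
# with %knownmagic amounts to comparing the start of a file against each
# key, roughly:
#   sub sniff_type {
#     my($head)=shift;   # first bytes of a file
#     foreach my $magic (keys %knownmagic) {
#       return $knownmagic{$magic}
#         if substr($head,0,length($magic)) eq $magic;
#     }
#     return undef;
#   }
# so a file starting with '%PDF-' would be treated as application/pdf.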

my $iinline='';         # inline RE code to make RE case-insensitive
my $ipost='';           # RE postfix to make it case-insensitive

usage() unless parse_args(@ARGV);

{
  my $w3mirc='.w3mirc';

  $w3mirc='w3mir.ini' if $win32;

  if (-f $w3mirc) {
    parse_cfg_file($w3mirc);
    $nodelete{$w3mirc}=1;
  }
}

# Check arguments and options
if ($#root_urls>=0) {
  # OK
} else {
  print "URLs: $#rum_queue\n";
  usage("No URLs given");
}

# Are we converting newlines today?
$w3http::convert=0 unless $convertnl;

if ($chdirto) {
  &mkdir($chdirto.'/this-is-not-created-odd-or-what');
  chdir($chdirto) ||
    die "w3mir: Can't change working directory to '$chdirto': $!\n";
}

$SIG{'INT'}=sub { print STDERR "\nCaught SIGINT!\n"; exit 1; };
$SIG{'QUIT'}=sub { print STDERR "\nCaught SIGQUIT!\n"; exit 1; };
$SIG{'HUP'}=sub { print STDERR "\nCaught SIGHUP!\n"; exit 1; };

&open_fixup if $fixup;

# Derive how much document processing we should do.
$mine_urls=( $r || $list );
$process_urls=(!$batch && !$edit && !$header);
# $abs can be set explicitly with -abs, and implicitly if not recursing
$abs = 1 unless $r;
print "Absolute references\n" if $abs && $debug;

# Cache-control specified but proxy not in use?
die "w3mir: If you want to control a cache, use a proxy server!\n"
  if ($cache_header && !$using_proxy);

# Compile the second order code

# - The rum scope tests
my $full_rules=$scope_prefix.$scope_fetch.$scope_ignore.$scope_postfix;
# warn "Scope rules:\n-------------\n$full_rules\n---------------\n";
eval $full_rules;

die "w3mir: Program generated rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:\n----\n".
  $full_rules."\n----\n"
    if !defined($scope_code);

$full_rules=$rule_prefix.$rule_text.$rule_postfix;
# warn "Fetch rules:\n-------------\n$full_rules\n---------------\n";
eval $full_rules;

# - The user specified rum tests
die "w3mir: Ignore/Fetch rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:\n----\n".
  $full_rules."\n----\n"
    if !defined($rule_code);

# - The user specified apply rules

my $full_apply=$apply_prefix.($lc?$apply_lc:'').
  join($ipost.";\n",@user_apply).(($#user_apply>=0)?$ipost:"").";\n".
  $apply_postfix;
eval $full_apply;

die "w3mir: User apply rules did not compile.\nPlease report to w3mir-core\@usit.uio.no. The code is:
----
".$full_apply."
----\n" if !defined($apply_code);

$user_apply_code=$apply_code;

# - The w3mir generated apply rules

$full_apply=$apply_prefix.($lc?$apply_lc:'').
  join($ipost.";\n",@internal_apply).(($#internal_apply>=0)?$ipost:"").";\n".
  $apply_postfix;
eval $full_apply;

die "Internal apply rules did not compile. The code is:
----
".$full_apply."
----\n" if !defined($apply_code);

# - Information loss via -lc? There are other sources as well.
$infoloss=1 if $lc;

warn "Infoloss is $infoloss\n" if $debug;

# More setup:

$w3http::debug=$debug;

$w3http::verbose=$verbose;

my %rum_referers=();    # Hash of referer lists, key: rum_url
my $Robot_Blob;         # WWW::RobotRules object, decides if rum_url is
                        # forbidden to access for us.
my $rum_url_o;          # rum url, mostly the current, the one we're getting
my %gotrobots;          # Did I get robots.txt from site? key: url->netloc
my($authuser,$authpass);# Username and password for authentication with server
my @rum_newurls;        # List of rum_urls in document

if ($check_robottxt) {
  # Eval is the only way to defer loading of the module until we know
  # it's needed?
  eval 'use WWW::RobotRules;';

  die "Could not load WWW::RobotRules, try -drr switch\n"
    unless defined(&WWW::RobotRules::parse);

  $Robot_Blob = new WWW::RobotRules $w3mir_agent;
}

# We have several main-modes of operation. Here we select one
if ($r) {

  die "w3mir: No URLs? Try 'w3mir -h' for help.\n"
    if $#root_urls==-1;

  warn "Recursive retrieval commencing\n" if $debug;

  die "w3mir: Sorry, you cannot combine -r/recurse with -I/read_urls\n"
    if $read_urls;

  # Recursive
  my $url;
  foreach $url (@root_urls) {
    warn "Root url dequeued: $url\n" if $debug;
    if (want_this($url)) {
      queue($url);
      &add_referer($url,$iref);
    } else {
      die "w3mir: Inconsistent configuration: Specified $url is not inside retrieval scope\n";
    }
  }
  mirror();

} else {
  if ($batch) {
    warn "Batch retrieval commencing\n" if $debug;
    # Batch get
    if ($read_urls) {
      # Get URLs from <STDIN>
      while (<STDIN>) {
        chomp;
        &add_referer($_,$iref);
        batch_get($_);
      }
    } else {
      # Get URLs from commandline
      my $url;
      foreach $url (@root_urls) {
        &add_referer($url,$iref);
      }
      foreach $url (@root_urls) {
        batch_get($url);
      }
    }
  } else {
    warn "Single url retrieval commencing\n" if $debug;

    # A single URL, with all processing on
    die "w3mir: You specified several URLs and not -B/batch\n"
      if $#root_urls>0;
    queue($root_urls[0]);
    &add_referer($root_urls[0],$iref);
    mirror();
  }
}

&close_fixup if $fixup;

# This should clean up files:
&clean_disk if $remove;

warn "w3mir: That's all (".$w3http::xfbytes.'+'.$w3http::headbytes.
  " bytes of it).\n" unless $verbose<0;

if ($runfix) {
  eval 'use Config;';
  warn "Running w3mfix\n";
  if ($win32) {
    system($Config{'perlpath'}." w3mfix $fixrc");
  } else {
    system("w3mfix $fixrc");
  }
}

exit 0;

sub get_document {
  # Get one document by HTTP ($1/rum_url_o). Save in given filename ($2).
  # Possibly returning references found in the document. Caller must
  # set up referer array, check wantedness and everything else. We
  # handle authentication here though.

  my($rum_url_o)=shift;
  my($lf_url)=shift;
  croak("\$rum_url_o is empty") if !defined($rum_url_o) || !$rum_url_o;
  croak("\$lf_url is empty") if !defined($lf_url) || !$lf_url;

  # Make sure it's an object
  $rum_url_o = url $rum_url_o
    unless ref $rum_url_o;

  # Derive a filename from the url, the filename contains no URL-quoting
  my($lf_name) = (url "file:$lf_url")->unix_path;

  # Make all intermediate directories
  &mkdir($lf_name) if $s==0;

  my($rum_as_string) = $rum_url_o->as_string;

  print STDERR "GET_DOCUMENT: '",$rum_as_string,"' -> '",$lf_name,"'\n"
    if $debug;

  my $hostport;         # host:port of the server
  my $www_auth='';      # Value of the WWW-Authenticate reply header
  my $page_ref;         # Reference to the document text
  my @rum_newurls;      # List of URLs extracted
  my $url_extractor;    # Code ref: URL extractor for the content-type
  my $do_query;         # Do query or not?

  if (defined($rum_urlstat{$rum_as_string}) &&
      $rum_urlstat{$rum_as_string}>0) {
    warn "w3mir: Internal error, ".$rum_as_string.
      " queued several times\n";
    next;
  }

  # Goto here if we want to retry b/c of authentication
 try_again:

  # Start building the extra http::query arguments again
  my @EXTRASTUFF=();

  # We'll start by assuming that we're doing the query.
  $do_query=1;

  # If we're not checking the timestamp, or the file does not exist,
  # then we get the file unconditionally. Otherwise we only want it
  # if it's updated.

  if ($fetch==1) {
    # Nothing to do?
  } else {
    if (-f $lf_name) {
      if ($fetch==-1) {
        print STDERR "w3mir: ".($infoloss?$rum_as_string:$lf_name).
          ", already have it" if $verbose>=0;
        if (!$mine_urls) {
          # If -fs and the file exists and we don't need to mine URLs
          # we're finished!
          warn "Already have it, no mining, returning!\n" if $debug;
          print STDERR "\n" if $verbose>=0;
          return;
        }
        $w3http::result=1304;   # Pretend it was 'not modified'
        $do_query=0;
      } else {
        push(@EXTRASTUFF,$w3http::IFMODF,$lf_name);
      }
    }
  }

  if ($do_query) {

    # Does the server want authorization for this file? $www_auth is
    # only set if authentication was requested the first time around.

    # For testing:
    # $www_auth='Basic realm="foo"';

    if ($www_auth) {
      my($authdata,$method,$realm);

      ($method,$realm)= $www_auth =~ m/^(\S+)\s+realm=\"([^\"]+)\"/i;
      $method=lc $method;
      $realm=lc $realm;
      die "w3mir: '$method' authentication needed, don't know that.\n"
        if ($method ne 'basic');

      $hostport = $rum_url_o->netloc;
      $authdata=$authdata{$hostport}{$realm} || $authdata{$hostport}{'*'} ||
        $authdata{'*'}{$realm} || $authdata{'*'}{'*'};

      if ($authdata) {
        push(@EXTRASTUFF,$w3http::AUTHORIZ,$authdata);
      } else {
        print STDERR "w3mir: No authorization data for $hostport/$realm\n";
        $rum_urlstat{$rum_as_string}=$NEVERMIND;
        next;
      }
    }

    push(@EXTRASTUFF,$w3http::FREEHEAD,$cache_header)
      if ($cache_header);

    # Insert referer header data if at all
    push(@EXTRASTUFF,$w3http::REFERER,$rum_referers{$rum_as_string}[0])
      if ($do_referer && exists($rum_referers{$rum_as_string}));

    push(@EXTRASTUFF,$w3http::NOUSER)
      unless ($do_user);

    # YES, $lf_url is right, w3http::query handles this like an url so
    # the quoting must all be in place.
    my $binfile=$lf_url;
    $binfile='-' if $s==1;
    $binfile=$nulldevice if $s==2;

    if ($pause) {
      print STDERR "w3mir: sleeping\n" if $verbose>0;
      sleep($pause);
    }

    print STDERR "w3mir: ".($infoloss?$rum_as_string:$lf_name)
      unless $verbose<0;
    print STDERR "\nFile: $lf_name\n" if $debug;

    &w3http::query($w3http::GETURL,$rum_as_string,
                   $w3http::SAVEBIN,$binfile,
                   @EXTRASTUFF);

    print STDERR "w3http::result: '",$w3http::result,
      "' doc size: ", length($w3http::document),
      " doc type: -",$w3http::headval{'CONTENT-TYPE'},
      "- plaintexthtml: ",$w3http::plaintexthtml,"\n"
        if $debug;

    print "Result: ",$w3http::result," Recurse: $r, html: ",
      $w3http::plaintexthtml,"\n"
        if $debug;

  } # if $do_query

  if ($w3http::result==200) {   # 200 OK
    $rum_urlstat{$rum_as_string}=$GOTIT;

    if ($mine_urls || $process_urls) {

      if ($w3http::plaintexthtml) {
        # Only do URL manipulations if this is a html document with no
        # special content-encoding. We do not handle encodings, yet.

        my $page;

        print STDERR ($process_urls)?", processing":", url mining"
          if $verbose>0;

        print STDERR "\nurl:'$lf_url'\n"
          if $debug;

        print "\nMining URLs: $mine_urls, Process: $process_urls\n"
          if $debug;

        ($page,@rum_newurls) =
          &htmlop::process($w3http::document,
                           # Only get a new document if wanted
                           $process_urls?():($htmlop::NODOC),
                           $htmlop::CANON,
                           $htmlop::ABS,$rum_url_o,
                           # Only list urls if wanted
                           $mine_urls?($htmlop::LIST):(),

                           # If user wants absolute URLs do not
                           # relativize them

                           $abs?
                           ():
                           (
                            $htmlop::TAGCALLBACK,\&process_tag,$lf_url,
                           )
                          );

#       print "URL: ",join("\nURL: ",@rum_newurls),"\n";

        if ($process_urls) {
          $page_ref=\$page;
          $w3http::document='';
        } else {
          $page_ref=\$w3http::document;
        }

      } elsif ($s == 0 &&
               ($url_extractor =
                $knownformats{$w3http::headval{'CONTENT-TYPE'}})) {

        # The knownformats extractors only work on disk files so write
        # doc to disk if not there already (non-html text will not be)
        write_page($lf_name,$w3http::document,1);

        # Now we try our hand at fetching URIs from non-html files.
        print STDERR ", mining URLs" if $verbose>=1;
        @rum_newurls = &$url_extractor($lf_name);
        # warn "URLs from PDF: ",join(', ',@rum_newurls),"\n";
      }

    } # if ($mine_urls || $process_urls)

#   print "page_ref defined: ",defined($page_ref),"\n";
#   print "plaintext: ",$w3http::plaintext,"\n";

    $page_ref=\$w3http::document
      if !defined($page_ref) && $w3http::plaintexthtml;

    if ($w3http::plaintexthtml) {
      # ark: this is where I want to do my changes to the page: strip
      # out the <!--NOMIRROR-->...<!--/NOMIRROR--> stuff.
      $$page_ref=~ s/<(!--)?\s*NO\s*MIRROR\s*(--)?>[^\000]*?<(!--)?\s*\/NO\s*MIRROR\s*(--)?>//g
        if $edit;

      if ($header) {
        # ark: insert a header string at the start of the page
        my $mirrorstr=$header;
        $mirrorstr =~ s/\$url/$rum_as_string/g;
        insert_at_start( $mirrorstr, $page_ref );
      }
    }

    write_page($lf_name,$page_ref,0);

    # print "New urls: ",join("\n",@rum_newurls),"\n";

    return @rum_newurls;
  }

  if ($w3http::result==304 ||   # 304 Not modified
      $w3http::result==1304) {  # 1304 Have it

    {
      # last = out of nesting

      my $rum_urlstat;
      my $rum_newurls;

      @rum_newurls=();

      print STDERR ", not modified"
        if $verbose>=0 && $w3http::result==304;

      $rum_urlstat{$rum_as_string}=$NOTMOD;

      last unless $mine_urls;

      $rum_newurls=get_references($lf_name);

      # print "New urls: ",ref($rum_newurls),"\n";

      if (!ref($rum_newurls)) {
        last;
      } elsif (ref($rum_newurls) eq 'SCALAR') {
        $page_ref=$rum_newurls;
      } elsif (ref($rum_newurls) eq 'ARRAY') {
        @rum_newurls=@$rum_newurls;
        last;
      } else {
        die "\nw3mir: internal error: Unknown return type from get_references\n";
      }

      # Check if it's a html file. I know this tag is in all html
      # files, because I put it there as I pull them in.
      last unless $$page_ref =~ /<HTML/i;

      warn "$lf_name is a html file\n" if $debug;

      # It's a html document
      print STDERR ", mining URLs" if $verbose>=1;

      # This will give us a list of absolute urls
      (undef,@rum_newurls) =
        &htmlop::process($$page_ref,$htmlop::NODOC,
                         $htmlop::ABS,$rum_as_string,
                         $htmlop::USESAVED,'W3MIR',
                         $htmlop::LIST);
    }

    print STDERR "\n" if $verbose>=0;
    return @rum_newurls;
  }

  if ($w3http::result==302 || $w3http::result==301) { # Redirect
    # Cern and NCSA httpd send 302 'redirect' if an ending / is
    # forgotten on a url. More recent httpds send 301 'permanent
    # redirect' in this case. Here we check if the difference in URLs
    # is just a / and if so push the url again with the / added. This
    # code only works if the http server has the right idea about its
    # own name.
    #
    # 18/3/97: Added code to queue redirected-to-URLs that are within
    # the scope of the retrieval.
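    #
    # Illustration (hypothetical URLs): a request for
    #   http://www.example.org/docs
    # typically draws a 301/302 pointing at
    #   http://www.example.org/docs/
    # which differs only by the trailing slash, so the slashed form is
    # simply re-queued; any other redirect target is queued only if it
    # is within the mirror's scope.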
921 my $new_rum_url;
922
923 $rum_urlstat{$rum_as_string}=$REDIR;
924
925 # Absolutify the new url, it might be relative to the requested
926 # document. That's a ugly wart on some servers/admins.
927 $new_rum_url=url $w3http::headval{'location'};
928 $new_rum_url=$new_rum_url->abs($rum_url_o);
929
930 print REDIRS $rum_as_string,' -> ',$new_rum_url->as_string,"\n"
931 if $fixup;
932
933 if ($immediate_redir) {
934 print STDERR " =>> ",$new_rum_url->as_string,", getting that instead\n";
935 return get_document($new_rum_url,$lf_url);
936 }
937
938 # Some redirect to a fragment of another doc...
939 $new_rum_url->frag(undef);
940 $new_rum_url=$new_rum_url->as_string;
941
942 if ($rum_as_string.'/' eq $new_rum_url) {
943 if (grep { $rum_as_string eq $_; } @root_urls) {
944 print STDERR "\nw3mir: missing / in a start URL detected. Please fix commandline/config file.\n";
945 exit(1);
946 }
947 print STDERR ", missing /\n";
948 queue($new_rum_url);
949 # Initialize referer to something meaningful
950 $rum_referers{$new_rum_url}=$rum_referers{$rum_as_string};
951 } else {
952 print STDERR " =>> $new_rum_url";
953 if (want_this($new_rum_url)) {
954 print STDERR ", getting that\n";
955 queue($new_rum_url);
956 $rum_referers{$new_rum_url}=$rum_referers{$rum_as_string};
957 } else {
958 print STDERR ", don't want it\n";
959 }
960 }
961 return ();
962 }
963
964 if ($w3http::result==403 || # Forbidden
965 $w3http::result==404 || # Not found
966 $w3http::result==406 || # Not Acceptable, hmm, belongs here?
967 $w3http::result==410) { # Gone - no forwarding address known
968
969 $rum_urlstat{$rum_as_string}=$ENOTFND;
970 &handleerror;
971 print STDERR "Was referred from: ",
972 join(',',@{$rum_referers{$rum_as_string}}),
973 "\n" if exists($rum_referers{$rum_as_string});
974 return ();
975 }
976
977 if ($w3http::result==407) {
978 # Proxy authentication requested
979 die "Proxy server requests authentication but failed to return the\n".
980 "REQUIRED Proxy-Authenticate header for this condition\n"
981 unless exists($w3http::headval{'proxy-authenticate'});
982
983 die "Proxy authentication is required for ".$w3http::headval{'proxy-authenticate'}."\n";
984 }
985
986 if ($w3http::result==401) {
987 # A www-authenticate reply header should accompany a 401 message.
988 if (!exists($w3http::headval{'www-authenticate'})) {
989 warn "w3mir: Server indicated authentication failure but gave no www-authenticate reply\n";
990 $rum_urlstat{$rum_as_string}=$NEVERMIND;
991 } else {
992 # Unauthorized
993 if ($www_auth) {
994 # Failed when authorization data was supplied.
995 $rum_urlstat{$rum_as_string}=$NEVERMIND;
996 print STDERR ", authorization failed, data needed for ",
997 $w3http::headval{'www-authenticate'},"\n"
998 if ($verbose>=0);
999 } else {
1000 if ($useauth) {
1001 # First time failure, send back and retry at once with some known
1002 # user/passwd.
1003 $www_auth=$w3http::headval{'www-authenticate'};
1004 print STDERR ", retrying with authorization\n" unless $verbose<0;
1005 goto try_again;
1006 } else {
1007 print STDERR ", authorization needed: ",
1008 $w3http::headval{'www-authenticate'},"\n";
1009 $rum_urlstat{$rum_as_string}=$NEVERMIND;
1010 }
1011 }
1012 }
1013 return ();
1014 }
1015
1016 # Something else.
1017 &handleerror;
1018}
1019
1020
1021sub robot_check {
1022 # Check if URL is allowed by robots.txt, if we respect it at all
1023 # that is. Return 1 if allowed, 0 otherwise.
1024
1025 my($rum_url_o)=shift;
1026 my $hostport;
1027
1028 if ($check_robottxt) {
1029
1030 $hostport = $rum_url_o->netloc;
1031 if (!exists($gotrobots{$hostport})) {
1032 # Get robots.txt from the server
1033 $gotrobots{$hostport}=1;
1034 my $robourl="http://$hostport/robots.txt";
1035 print STDERR "w3mir: $robourl" if ($verbose>=0);
1036 &w3http::query($w3http::GETURL,$robourl);
1037 $w3http::document='' if ($w3http::result != 200);
1038 print STDERR ", processing" if $verbose>=1;
1039 print STDERR "\n" if ($verbose>=0);
1040 $Robot_Blob->parse($robourl,$w3http::document);
1041 }
1042
1043 if (!$Robot_Blob->allowed($rum_url_o->as_string)) {
1044 # It is forbidden
1045 $rum_urlstat{$rum_url_o->as_string}=$FROBOTS;
1046 warn "w3mir: ",$rum_url_o->as_string,": forbidden by robots.txt\n";
1047 return 0;
1048 }
1049 }
1050 return 1;
1051}
1052
1053
1054
1055sub batch_get {
1056 # Batch get _one_ document.
1057 my $rum_url=shift;
1058 my $lf_url;
1059
1060 $rum_url_o = url $rum_url;
1061
1062 return unless robot_check($rum_url_o);
1063
1064 ($lf_url=$rum_url) =~ s~.*/~~;
1065 if (!defined($lf_url) || $lf_url eq '') {
1066 ($lf_url=$rum_url) =~ s~/$~~;
1067 $lf_url =~ s~.*/~~;
1068 $lf_url .= "-$indexname";
1069 }
1070
1071 warn "Batch get: $rum_url -> $lf_url\n" if $debug;
1072
1073 $immediate_redir=1; # Do follow redirects immediately
1074
1075 get_document($rum_url,$lf_url);
1076}
1077
1078
1079
1080sub mirror {
1081 # Mirror (or get) the requested url(s). Possibly recursively.
1082 # Working from whatever cwd is at invocation we'll retrieve all
1083 # files under it in the file hierarchy.
1084
1085 my $rum_url; # URL of the document we're getting now, defined at main level
1086 my $lf_url; # rum_url after the apply rules have been run
1087 my $new_lf_url;
1088 my @new_rum_urls;
1089 my $rum_ref;
1090
1091 while (defined($rum_url = pop(@rum_queue))) {
1092
1093 warn "mirror: Popped $rum_url from queue\n" if $debug;
1094
1095 # Unwanted URLs should not be queued
1096 die "Found url $rum_url that I don't want in queue!\n"
1097 unless defined($lf_url=apply($rum_url));
1098
1099 $rum_url_o = url $rum_url;
1100
1101 next unless robot_check($rum_url_o);
1102
1103 # Figure out the filename for our local filesystem.
1104 $lf_url.=$indexname if $lf_url =~ m~/$~ || $lf_url eq '';
1105
1106 @new_rum_urls = get_document($rum_url_o,$lf_url);
1107
1108 print join("\n",@new_rum_urls),"\n" if ($list);
1109
1110 if ($r) {
1111 foreach $rum_ref (@new_rum_urls) {
1112 # warn "Recursive url: $rum_ref\n";
1113 $new_lf_url=apply($rum_ref);
1114 next unless $new_lf_url;
1115
1116 # warn "Want it\n";
1117 $rum_ref =~ s/\#.*$//s; # Clip off section marks
1118
1119 add_referer($rum_ref,$rum_url_o->as_string);
1120 queue($rum_ref);
1121 }
1122 }
1123
1124 @new_rum_urls=();
1125
1126 # Is the URL queue empty? Are there outstanding retries? Refill
1127 # the queue from the retry list.
1128 if ($#rum_queue<0 && $retry-->0) {
1129 foreach $rum_url_o (keys %rum_urlstat) {
1130 $rum_url_o = url $rum_url_o;
1131 if ($rum_urlstat{$rum_url_o->as_string}==100) {
1132 push(@rum_queue,$rum_url_o->as_string);
1133 $rum_urlstat{$rum_url_o->as_string}=0;
1134 }
1135 }
1136 if ($#rum_queue>=0) {
1137 warn "w3mir: Sleeping before retrying. $retry more times left\n"
1138 if $verbose>=0;
1139 sleep($retryPause);
1140 }
1141 }
1142
1143 }
1144}
1145
1146
1147sub get_references {
1148 # Get references from a non-html-on-disk file. Return references if
1149 # we know how to find them. Return a reference to the complete page
1150 # if it's HTML. Return a numeric 0 if the format is unknown.
1151
1152 my($lf_url)=shift;
1153 my($urlextractor)=shift;
1154
1155 my $read; # Buffer of stuff read from file to check filetype
1156 my $magic;
1157 my $url_extractor;
1158 my $rum_ref;
1159 my $page;
1160
1161 warn "w3mir: Looking at local $lf_url\n" if $debug;
1162
1163 # Open the file and read the first 10 kilobytes for file-type-test
1164 # purposes.
1165 if (!open(TMPF,$lf_url)) {
1166 warn "Unable to open $lf_url for reading: $!\n";
1167 return 0; # 'last' here was a bug: we are not inside a loop
1168 }
1169
1170 $page=' 'x10240;
1171 $read=sysread(TMPF,$page,length($page),0);
1172 close(TMPF);
1173
1174 die "Error reading $lf_url: $!\n" if (!defined($read));
1175
1176 if (!defined($url_extractor)) {
1177 $url_extractor=0;
1178
1179 # Check file against list of magic numbers.
1180 foreach $magic (keys %knownmagic) {
1181 if (substr($page,0,length($magic)) eq $magic) {
1182 $url_extractor = $knownformats{$knownmagic{$magic}};
1183 last;
1184 }
1185 }
1186 }
1187
1188 # Found an extraction method; apply it.
1189 if ($url_extractor) {
1190 print STDERR ", mining URLs" if $verbose>=1;
1191 return [&$url_extractor($lf_url)];
1192 }
1193
1194 if ($page =~ /<HTML/i) {
1195 open(TMPF,$lf_url) ||
1196 die "Could not open $lf_url for reading: $!\n";
1197 # read the whole file.
1198 local($/)=undef;
1199 $page = <TMPF>;
1200 close(TMPF);
1201 return \$page;
1202 }
1203
1204 return 0;
1205}
1206
1207
1208sub open_fixup {
1209 # Open the referers and redirects files
1210
1211 my $reffile='.referers';
1212 my $redirfile='.redirs';
1213
1214 if ($win32) {
1215 $reffile="referers";
1216 $redirfile="redirs";
1217 }
1218
1219 $nodelete{$reffile} = $nodelete{$redirfile} = 1;
1220
1221 open(REDIRS,"> $redirfile") ||
1222 die "Could not open $redirfile for writing: $!\n";
1223
1224 autoflush REDIRS 1;
1225
1226 open(REFERERS,"> $reffile") ||
1227 die "Could not open $reffile for writing: $!\n";
1228
1229 $fixopen=1;
1230 eval 'END { close_fixup; 0; }';
1231}
1232
1233
1234sub close_fixup {
1235 # Close the fixup data files. In the case of the referer file, also
1236 # write out its entire contents.
1237
1238 return unless $fixopen;
1239
1240 my $referer;
1241
1242 foreach $referer (keys %rum_referers) {
1243 print REFERERS $referer," <- ",join(' ',@{$rum_referers{$referer}}),"\n";
1244 }
1245
1246 close(REFERERS) || warn "Error closing referers file: $!\n";
1247 close(REDIRS) || warn "Error closing redirects file: $!\n";
1248 $fixopen=0;
1249}
1250
1251
1252sub clean_disk {
1253 # This procedure removes files that are not present on the server(s)
1254 # anymore.
1255
1256 # - To avoid removing files that were not fetched due to network
1257 # problems we only do blanket removal IFF all documents were
1258 # fetched w/o problems, eventually.
1259 # - In any case we can remove files the server said were not found
1260
1261 # The strategy has three main parts:
1262 # 1. Find all files we have
1263 # 2. Find what files we ought to have
1264 # 3. Remove the difference
1265
1266 my $complete_retrival=1; # Flag: true iff all documents were fetched
1267 my $urlstat; # Tmp storage
1268 my $rum_url;
1269 my $lf_url;
1270 my $lf_dir;
1271 my $dirs_to_remove;
1272
1273 # For fileremoval code
1274 eval "use File::Find;" unless defined(&find);
1275
1276 die "w3mir: Could not load File::Find module. Don't use -R switch.\n"
1277 unless defined(&find);
1278
1279 # This to shut up -w
1280 $lf_dir=$File::Find::dir;
1281
1282 # ***** 1. Find out what files we have *****
1283 #
1284 # This does two things: For each file or directory found:
1285 # - Increases entry count for the container directory
1286 # - If it's a file: $lf_file{relative_path}=$FILEHERE;
1287
1288 chop(@root_dirs);
1289 print STDERR "Looking in: ",join(", ",@root_dirs),"\n" if $debug;
1290
1291 find(\&find_files,@root_dirs);
1292
1293 # ***** 2. Find out what files we ought to have *****
1294 #
1295 # First we loop over %rum_urlstat to determine what files are not
1296 # present on the server(s).
1297 foreach $rum_url (keys %rum_urlstat) {
1298 # Figure out name of local file from rum_url
1299 next unless defined($lf_url=apply($rum_url));
1300
1301 $lf_url.=$indexname if $lf_url =~ m~/$~ || $lf_url eq '';
1302
1303 # find() prefixes paths with ./, so we must too.
1304 $lf_url="./".$lf_url unless substr($lf_url,0,1) eq '/';
1305
1306 # Ignore if file does not exist here.
1307 next unless exists($lf_file{$lf_url});
1308
1309 # The apply rules can map several remote files to same local
1310 # file. If we decided to keep file already we stay with that.
1311 next if $lf_file{$lf_url}==$FILETHERE;
1312
1313 $urlstat=$rum_urlstat{$rum_url};
1314
1315 # Figure out the status code.
1316 if ($urlstat==$GOTIT || $urlstat==$NOTMOD) {
1317 # Present on server. Keep.
1318 $lf_file{$lf_url}=$FILETHERE;
1319 next;
1320 } elsif ($urlstat==$ENOTFND || $urlstat==$NEVERMIND ) {
1321 # One of: not on server, can't get, don't want, access forbidden:
1322 # Schedule for removal.
1323 $lf_file{$lf_url}=$FILEDEL if exists($lf_file{$lf_url});
1324 next;
1325 } elsif ($urlstat==$OTHERERR || $urlstat==$TERROR) {
1326 # Some error occurred transferring.
1327 $complete_retrival=0; # The retrieval was not complete. Delete less
1328 } elsif ($urlstat==$QUEUED) {
1329 warn "w3mir: Internal inconsistency, $rum_url marked as queued after retrieval terminated\n";
1330 $complete_retrival=0; # Fishy. Be conservative about removing
1331 } else {
1332 $complete_retrival=0;
1333 warn "w3mir: Warning: $rum_url is marked as $urlstat.\n".
1334 "w3mir: Please report to w3mir-core\@usit.uio.no.\n";
1335 }
1336 } # foreach %rum_urlstat
1337
1338 # ***** 3. Remove the difference *****
1339
1340 # Loop over all found files:
1341 # - Should we have this file?
1342 # - If not: Remove file and decrease directory entry count
1343 # Loop as long as there are directories with 0 entry count:
1344 # - Loop over all directories with 0 entry count:
1345 # - Remove directory
1346 # - Decrease entry count of parent
1347
1348 warn "w3mir: Some error occurred, conservative file removal\n"
1349 if !$complete_retrival && $verbose>=0;
1350
1351 # Remove all files we don't want removed from list of files present:
1352 foreach $lf_url (keys %nodelete) {
1353 print STDERR "Not deleting: $lf_url\n" if $verbose>=1;
1354 delete $lf_file{$lf_url} || delete $lf_file{'./'.$lf_url};
1355 }
1356
1357 # Remove files
1358 foreach $lf_url (keys %lf_file) {
1359 if (($complete_retrival && $lf_file{$lf_url}==$FILEHERE) ||
1360 ($lf_file{$lf_url} == $FILEDEL)) {
1361 if (unlink $lf_url) {
1362 ($lf_dir)= $lf_url =~ m~^(.+)/~;
1363 $lf_dir{$lf_dir}--;
1364 $dirs_to_remove=1 if ($lf_dir{$lf_dir}==0);
1365 warn "w3mir: removed file $lf_url\n" if $verbose>=0;
1366 } else {
1367 warn "w3mir: removal of file $lf_url failed: $!\n";
1368 }
1369 }
1370 }
1371
1372 # Remove empty directories
1373 while ($dirs_to_remove) {
1374 $dirs_to_remove=0;
1375 foreach $lf_url (keys %lf_dir) {
1376 next if $lf_url eq '.';
1377 if ($lf_dir{$lf_url}==0) {
1378 if (rmdir($lf_url)) {
1379 warn "w3mir: removed directory $lf_url\n" if $verbose>=0;
1380 delete $lf_dir{$lf_url};
1381 ($lf_dir)= $lf_url =~ m~^(.+)/~;
1382 $lf_dir{$lf_dir}--;
1383 $dirs_to_remove=1 if ($lf_dir{$lf_dir}==0);
1384 } else {
1385 warn "w3mir: removal of directory $lf_url failed: $!\n";
1386 }
1387 }
1388 }
1389 }
1390}
1391
1392
1393sub find_files {
1394 # This is called by the find procedure for every file/dir found.
1395
1396 # This builds two hashes:
1397 # lf_file{<file>}: 1: file exists
1398 # lf_dir{<dir>): Number of files in directory.
1399
1400 lstat($_);
1401
1402 $lf_dir{$File::Find::dir}++;
1403
1404 if (-f _) {
1405 $lf_file{$File::Find::name}=$FILEHERE;
1406 } elsif (-d _) {
1407 # null
1408 # Bug: If an empty directory exists it will not be removed
1409 } else {
1410 warn "w3mir: File $File::Find::name has unknown type. Ignoring.\n";
1411 }
1412 return 0;
1413
1414}
1415
1416
1417sub handleerror {
1418 # Handle error status of last http connection, will set the rum_urlstat
1419 # appropriately and print an error message.
1420
1421 my $msg;
1422
1423 if ($verbose<0) {
1424 $msg="w3mir: ".$rum_url_o->as_string.": ";
1425 } else {
1426 $msg=": ";
1427 }
1428
1429 if ($w3http::result == 98) {
1430 # OS/Network error
1431 $msg .= "$!";
1432 $rum_urlstat{$rum_url_o->as_string}=$OTHERERR;
1433 } elsif ($w3http::result == 100) {
1434 # Some kind of error connecting or sending request
1435 $msg .= $w3http::restext || "Timeout";
1436 $rum_urlstat{$rum_url_o->as_string}=$TERROR;
1437 } else {
1438 # Other HTTP error
1439 $rum_urlstat{$rum_url_o->as_string}=$OTHERERR;
1440 $msg .= " ".$w3http::result." ".$w3http::restext;
1441 $msg .= " =>> ".$w3http::headval{'location'}
1442 if (defined($w3http::headval{'location'}));
1443 }
1444 print STDERR "$msg\n";
1445}
1446
1447
1448sub queue {
1449 # Queue given url if appropriate and create a status entry for it
1450 my($rum_url_o)=url $_[0];
1451
1452 croak("BUG: undefined \$rum_url_o")
1453 if !defined($rum_url_o);
1454
1455 croak("BUG: undefined \$rum_url_o->as_string")
1456 if !defined($rum_url_o->as_string);
1457
1458 croak("BUG: ".$rum_url_o->as_string." (fragment) queued")
1459 if $rum_url_o->as_string =~ /\#/;
1460
1461 return if exists($rum_urlstat{$rum_url_o->as_string});
1462 return unless want_this($rum_url_o->as_string);
1463
1464 warn "QUEUED: ",$rum_url_o->as_string,"\n" if $debug;
1465
1466 # Note lack of scope checks.
1467 $rum_urlstat{$rum_url_o->as_string}=$QUEUED;
1468 push(@rum_queue,$rum_url_o->as_string);
1469}
1470
1471
1472sub root_queue {
1473 # Queue function for root urls and directories. One or the other might
1474 # be boolean false, in that case, don't queue it.
1475
1476 my $root_url_o;
1477
1478 my($root_url)=shift;
1479 my($root_dir)=shift;
1480
1481 die "w3mir: No fragments in start URLs :".$root_url."\n"
1482 if $root_url =~ /\#/;
1483
1484 if ($root_dir) {
1485 print "Root dir: $root_dir\n" if $debug;
1486 $root_dir="./$root_dir" unless substr($root_dir,0,1) eq '/' or
1487 substr($root_dir,0,2) eq './';
1488 push(@root_dirs,$root_dir);
1489 }
1490
1491
1492 if ($root_url) {
1493 $root_url_o=url $root_url;
1494
1495 # URL canonification, or what we do of it at least.
1496 $root_url_o->host($root_url_o->host);
1497
1498 warn "Root queue: ".$root_url_o->as_string."\n" if $debug;
1499
1500 push(@root_urls,$root_url_o->as_string);
1501
1502 return $root_url_o;
1503 }
1504
1505}
1506
1507
1508sub write_page {
1509 # write a retrieved page to wherever it's supposed to be written.
1510 # Added difficulty: all files but plaintext files have already been
1511 # written to disk in w3http.
1512
1513 # $s == 0 save to disk
1514 # $s == 1 dump to stdout
1515 # $s == 2 forget
1516
1517 my($lf_name,$page_ref,$silent) = @_;
1518 my($verb);
1519
1520 if ($silent) {
1521 $verb=-1;
1522 } else {
1523 $verb=$verbose;
1524 }
1525
1526# confess("\n\$page_ref undefined") if !defined($page_ref);
1527
1528 if ($w3http::plaintexthtml) {
1529 # I have it in memory
1530 if ($s==0) {
1531 print STDERR ", saving" if $verb>0;
1532
1533 while (-d $lf_name) {
1534 # This will run once, maybe twice; $fiddled will be changed the
1535 # first time
1536 if (exists($fiddled{$lf_name})) {
1537 warn "Cannot save $lf_name, there is a directory in the way\n";
1538 return;
1539 }
1540
1541 $fiddled{$lf_name}=1;
1542
1543 rm_rf($lf_name);
1544 print STDERR "w3mir: $lf_name" if $verbose>=0;
1545 }
1546
1547 if (!open(PAGE,">$lf_name")) {
1548 warn "\nw3mir: can't open $lf_name for writing: $!\n";
1549 return;
1550 }
1551 if (!$convertnl) {
1552 binmode PAGE;
1553 warn "BINMODE\n" if $debug;
1554 }
1555 if ($$page_ref ne '') {
1556 print PAGE $$page_ref or die "w3mir: Error writing $lf_name: $!\n";
1557 }
1558 close(PAGE) || die "w3mir: Error closing $lf_name: $!\n";
1559 print STDERR ": ", length($$page_ref), " bytes\n"
1560 if $verb>=0;
1561 setmtime($lf_name,$w3http::headval{'last-modified'})
1562 if exists($w3http::headval{'last-modified'});
1563 } elsif ($s==1) {
1564 print $$page_ref ;
1565 } elsif ($s==2) {
1566 print STDERR ", got and forgot it.\n" unless $verb<0;
1567 }
1568 } else {
1569 # Already written by http module, just emit a message if wanted
1570 if ($s==0) {
1571 print STDERR ": ",$w3http::doclen," bytes\n"
1572 if $verb>=0;
1573 setmtime($lf_name,$w3http::headval{'last-modified'})
1574 if exists($w3http::headval{'last-modified'});
1575 } elsif ($s==2) {
1576 print STDERR ", got and forgot it.\n" if $verb>=0;
1577 }
1578 }
1579}
1580
1581
1582sub setmtime {
1583 # Set mtime of the given file
1584 my($file,$time)=@_;
1585 my($tm_sec,$tm_min,$tm_hour,$tm_mday,$tm_mon,$tm_year,$tm_wday,$tm_yday,
1586 $tm_isdst,$tics);
1587
1588 $tm_isdst=0;
1589 $tm_yday=-1;
1590
1591 carp("\$time is undefined"),return if !defined($time);
1592
1593 $tics=str2time($time);
1594 utime(time, $tics, $file) ||
1595 warn "Could not change mtime of $file: $!\n";
1596}
1597
1598
1599sub movefile {
1600 # Rename a file. Note that copy is not a good alternative, since
1601 # copying over NFS is something we want to Avoid.
1602
1603 # Returns 0 on failure and 1 on success.
1604
1605 (my $old,my $new) = @_;
1606
1607 # Remove anything that might have the name already.
1608 if (-d $new) {
1609 print STDERR "\n" if $verbose>=0;
1610 rm_rf($new);
1611 $fiddled{$new}=1;
1612 print STDERR "w3mir: $new" if $verbose>=0;
1613 } elsif (-e $new) {
1614 $fiddled{$new}=1;
1615 if (unlink($new)) {
1616 print STDERR "\nw3mir: removed $new\nw3mir: $new"
1617 if $verbose>=0;
1618 } else {
1619 return 0;
1620 }
1621
1622 }
1623
1624 if ($new ne '-' && $new ne $nulldevice) {
1625 warn "MOVING $old -> $new\n" if $debug;
1626 rename($old,$new) ||
1627 do { warn "Could not rename $old to $new: $!\n"; return 0; };
1628 }
1629 return 1;
1630}
1631
1632
1633sub mkdir {
1634 # Make all intermediate directories needed for a file, the file name
1635 # is expected to be included in the argument!
1636
1637 # Reasons for not using File::Path::mkpath:
1638 # - I already wrote this.
1639 # - I get to produce as good and precise error messages as
1640 # unix and perl will allow me. mkpath will not.
1641 # - It's easier to find out if it worked or not.
1642
1643 my($file) = @_;
1644 my(@dirs) = split("/",$file);
1645 my $path;
1646 my $dir;
1647 my $moved=0;
1648
1649 if (!$dirs[0]) {
1650 shift @dirs;
1651 $path='';
1652 } else {
1653 $path = '.';
1654 }
1655
1656 # This removes the last element of the array, it's meant to shave
1657 # off the file name leaving only the directory name, as a
1658 # convenience, for the caller.
1659 pop @dirs;
1660 foreach $dir (@dirs) {
1661 $path .= "/$dir";
1662 stat($path);
1663 # only make if it isn't already there
1664 next if -d _;
1665
1666 while (!-d _) {
1667 if (exists($fiddled{$path})) {
1668 warn "Cannot make directory $path, there is a file in the way.\n";
1669 return;
1670 }
1671
1672 $fiddled{$path}=1;
1673
1674 if (!-e _) {
1675 mkdir($path,0777);
1676 last;
1677 }
1678
1679 if (unlink($path)) {
1680 warn "w3mir: removed file $path\n" if $verbose>=0;
1681 } else {
1682 warn "Unable to remove $path: $!\n";
1683 next;
1684 }
1685
1686 warn "mkdir $path\n" if $debug;
1687 mkdir($path,0777) ||
1688 warn "Unable to create directory $path: $!\n";
1689
1690 stat($path);
1691 }
1692 }
1693}
1694
1695
1696sub add_referer {
1697 # Add a referer to the list of referers of a document. Unless it's
1698 # already there.
1699 # Don't mail me if you (only) think this is a bit of a tongue twister:
1700
1701 # Don't remember referers if BOTH fixup and referer header is disabled.
1702 return if $fixup==0 && $do_referer==0;
1703
1704 my($rum_referee,$rum_referer) = @_ ;
1705 my $re_rum_referer;
1706
1707 if (exists($rum_referers{$rum_referee})) {
1708 $re_rum_referer=quotemeta $rum_referer;
1709 if (!grep(m/^$re_rum_referer$/,@{$rum_referers{$rum_referee}})) {
1710 push(@{$rum_referers{$rum_referee}},$rum_referer);
1711 # warn "$rum_referee <- $rum_referer pushed\n";
1712 } else {
1713 # warn "$rum_referee <- $rum_referer NOT pushed\n";
1714 }
1715 } else {
1716 $rum_referers{$rum_referee}=[$rum_referer];
1717 # warn "$rum_referee <- $rum_referer pushed\n";
1718 }
1719}
1720
1721
1722sub user_apply {
1723 # Apply the user apply rules
1724
1725 return &$user_apply_code(shift);
1726
1727# Debug version:
1728# my ($foo,$bar);
1729# $foo=shift;
1730# $bar=&$apply_code($foo);
1731# print STDERR "Apply: $foo -> $bar\n";
1732# return $bar;
1733}
1734
1735sub internal_apply {
1736 # Apply the w3mir generated apply rules
1737
1738 return &$apply_code(shift);
1739}
1740
1741
1742sub apply {
1743 # Apply the user apply rules. Then if URL is wanted return result of
1744 # w3mir apply rules. Return the undefined value otherwise.
1745
1746 my $url = user_apply(shift);
1747
1748 return undef unless want_this($url);
1749
1750 internal_apply($url);
1751}
1752
1753
1754sub want_this {
1755 # Find out if we want the url passed. Just pass it on to the
1756 # generated functions.
1757 my($rum_url)=shift;
1758
1759 # What about robot rules?
1760
1761 # Does scope rule want this?
1762 return &$scope_code($rum_url) &&
1763 # Does user rule want this too?
1764 &$rule_code($rum_url)
1765
1766}
1767
1768
1769sub process_tag {
1770 # Process a tag in html file
1771 my $lf_referer = shift; # User argument
1772 my $base_url = shift; # Not used... why not?
1773 my $tag_name = shift;
1774 my $url_attrs = shift;
1775
1776 # Return quickly if there are no URL attributes
1777 return unless defined($url_attrs);
1778
1779 my $attrs = shift;
1780
1781 my $rum_url; # The absolute URL
1782 my $lf_url; # The local filesystem url
1783 my $lf_url_o; # ... and its object
1784 my $key;
1785
1786 print STDERR "\nProcess Tag: $tag_name, URL attributes: ",
1787 join(', ',@{$url_attrs}),"\nbase_url: ",$base_url,"\nlf_referer: ",
1788 $lf_referer,"\n"
1789 if $debug>2;
1790
1791 $lf_referer =~ s~^/~~;
1792 $lf_referer = "file:/$lf_referer";
1793
1794 foreach $key (@{$url_attrs}) {
1795 if (defined($$attrs{$key})) {
1796 $rum_url=$$attrs{$key};
1797 printf STDERR "$key = $rum_url\n" if $debug;
1798 $lf_url=apply($rum_url);
1799 if (defined($lf_url)) {
1800
1801 printf STDERR "Transformed to $lf_url\n" if $debug>2;
1802
1803 $lf_url =~ s~^/~~; # Remove leading / to avoid doubling
1804 $lf_url_o=url "file:/$lf_url";
1805
1806 # Save new value in the hash
1807 $$attrs{$key}=($lf_url_o->rel($lf_referer))->as_string;
1808 print STDERR "New value: ",$$attrs{$key},"\n" if $debug>2;
1809
1810 # If there is potential information loss save the old value too
1811 $$attrs{"W3MIR".$key}=$rum_url if $infoloss;
1812 }
1813 }
1814 }
1815}
1816
1817
1818sub version {
1819 eval 'require LWP;';
1820 print $w3mir_agent,"\n";
1821 print "LWP version ",$LWP::VERSION,"\n" if defined $LWP::VERSION;
1822 print "Perl version: ",$],"\n";
1823 exit(0);
1824}
1825
1826
1827sub parse_args {
1828 my $f;
1829 my $i;
1830
1831 $i=0;
1832
1833 while ($f=shift) {
1834 $i++;
1835 $numarg++;
1836 # This is a demonstration against Getopt::Long.
1837 if ($f =~ s/^-+//) {
1838 $s=1,next if $f eq 's'; # Stdout
1839 $r=1,next if $f eq 'r'; # Recurse
1840 $fetch=1,next if $f eq 'fa'; # Fetch all, no date test
1841 $fetch=-1,next if $f eq 'fs'; # Fetch those we don't already have.
1842 $verbose=-1,next if $f eq 'q'; # Quiet
1843 $verbose=1,next if $f eq 'c'; # Chatty
1844 &version,next if $f eq 'v'; # Version
1845 $pause=shift,next if $f eq 'p'; # Pause between requests
1846 $retryPause=shift,next if $f eq 'rp'; # Pause between retries.
1847 $s=2,$convertnl=0,next if $f eq 'f'; # Forget
1848 $retry=shift,next if $f eq 't'; # reTry
1849 $list=1,next if $f eq 'l'; # List urls
1850 $iref=shift,next if $f eq 'ir'; # Initial referer
1851 $check_robottxt = 0,next if $f eq 'drr'; # Disable robots.txt rules.
1852 umask(oct(shift)),next if $f eq 'umask';
1853 parse_cfg_file(shift),next if $f eq 'cfgfile';
1854 usage(),exit 0 if ($f eq 'help' || $f eq 'h' || $f eq '?');
1855 $remove=1,next if $f eq 'R';
1856 $cache_header = 'Pragma: no-cache',next if $f eq 'pflush';
1857 $w3http::agent=$w3mir_agent=shift,next if $f eq 'agent';
1858 $abs=1,next if $f eq 'abs';
1859 $convertnl=0,$batch=1,next if $f eq 'B';
1860 $read_urls = 1,next if $f eq 'I';
1861 $convertnl=0,next if $f eq 'nnc';
1862
1863 if ($f eq 'lc') {
1864 if ($i == 1) {
1865 $lc=1;
1866 $iinline=($lc?"(?i)":"");
1867 $ipost=($lc?"i":"");
1868 next;
1869 } else {
1870 die "w3mir: -lc must be the first argument on the commandline.\n";
1871 }
1872 }
1873
1874 if ($f eq 'P') { # Proxy
1875 ($w3http::proxyserver,$w3http::proxyport)=
1876 shift =~ /([^:]+):?(\d+)?/;
1877 $w3http::proxyport=80 unless $w3http::proxyport;
1878 $using_proxy=1;
1879 next;
1880 }
1881
1882 if ($f eq 'd') { # Debugging level
1883 $f=shift;
1884 unless (($debug = $f) > 0) {
1885 die "w3mir: debug level must be a number greater than zero.\n";
1886 }
1887 next;
1888 }
1889
1890 # Those were all the options...
1891 warn "w3mir: Unknown option: -$f. Use -h for usage info.\n";
1892 exit(1);
1893
1894 } elsif ($f =~ /^http:/) {
1895 my ($rum_url_o,$rum_reurl,$rum_rebase,$server);
1896
1897 $rum_url_o=root_queue($f,'./');
1898
1899 $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );
1900
1901 push(@internal_apply,"s/^".$rum_rebase."//");
1902 $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";
1903 $scope_ignore.="return 0 if m/^".
1904 quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";
1905
1906 } else {
1907 # If we get this far then the commandline is broken
1908 warn "Unknown commandline argument: $f. Use -h for usage info.\n";
1909 $numarg--;
1910 exit(1);
1911 }
1912 }
1913 return 1;
1914}
1915
1916
1917sub parse_cfg_file {
1918 # Read the configuration file. Aborts on errors. Not good to
1919 # mirror something using the wrong config.
1920
1921 my ( $file ) = @_ ;
1922 my ($key, $value, $authserver,$authrealm,$authuser,$authpasswd);
1923 my $i;
1924
1925 die "w3mir: config file $file is not a file.\n" unless -f $file;
1926 open(CFGF, $file) || die "Could not open config file $file: $!\n";
1927
1928 $i=0;
1929
1930 while (<CFGF>) {
1931 # Trim off various junk
1932 chomp;
1933 s/^#.*//;
1934 s/^\s+|\s$//g;
1935 # Anything left?
1936 next if $_ eq '';
1937 # Examine remains
1938 $i++;
1939 $numarg++;
1940
1941 ($key, $value) = split(/\s*:\s*/,$_,2);
1942 $key = lc $key;
1943
1944 $iref=$value,next if ( $key eq 'initial-referer' );
1945 $header=$value,next if ( $key eq 'header' );
1946 $pause=numeric($value),next if ( $key eq 'pause' );
1947 $retryPause=numeric($value),next if ( $key eq 'retry-pause' );
1948 $debug=numeric($value),next if ( $key eq 'debug' );
1949 $retry=numeric($value),next if ( $key eq 'retries' );
1950 umask(numeric($value)),next if ( $key eq 'umask' );
1951 $check_robottxt=boolean($value),next if ( $key eq 'robot-rules' );
1952 $edit=boolean($value),next if ($key eq 'remove-nomirror');
1953 $indexname=$value,next if ($key eq 'index-name');
1954 $s=nway($value,'save','stdout','forget'),next
1955 if ( $key eq 'file-disposition' );
1956 $verbose=nway($value,'quiet','brief','chatty')-1,next
1957 if ( $key eq 'verbosity' );
1958 $w3http::proxyuser=$value,next if $key eq 'http-proxy-user';
1959 $w3http::proxypasswd=$value,next if $key eq 'http-proxy-passwd';
1960
1961 if ( $key eq 'cd' ) {
1962 $chdirto=$value;
1963 warn "Use of 'cd' is discouraged\n" unless $verbose==-1;
1964 next;
1965 }
1966
1967 if ($key eq 'http-proxy') {
1968 ($w3http::proxyserver,$w3http::proxyport)=
1969 $value =~ /([^:]+):?(\d+)?/;
1970 $w3http::proxyport=80 unless $w3http::proxyport;
1971 $using_proxy=1;
1972 next;
1973 }
1974
1975 if ($key eq 'proxy-options') {
1976 my($val,$nval,@popts,$pragma);
1977 $pragma=1;
1978 foreach $val (split(/\s*,\s*/,lc $value)) { # was /\s*,\*/, which never split on commas
1979 $nval=nway($val,'no-pragma','revalidate','refresh','no-store',);
1980 # Force use of Cache-control: header
1981 $pragma=0 if ($nval==0);
1982 # use to force proxy to revalidate
1983 $pragma=0,push(@popts,'max-age=0') if ($nval==1);
1984 # use to force proxy to refresh
1985 push(@popts,'no-cache') if ($nval==2);
1986 # use if information transfered is sensitive
1987 $pragma=0,push(@popts,'no-store') if ($nval==3);
1988 }
1989 $cache_header=($pragma?'Pragma: ':'Cache-control: ').join(', ',@popts);
1990 next;
1991 }
1992
1993
    if ($key eq 'url') {
      my ($rum_url_o,$lf_dir,$rum_reurl,$rum_rebase);

      # A two argument URL: line?
      if ($value =~ m/^(.+)\s+(.+)/i) {
        # Two arguments.
        # The last is a directory, it must end in /
        $lf_dir=$2;
        $lf_dir.='/' unless $lf_dir =~ m~/$~;

        $rum_url_o=root_queue($1,$lf_dir);

        # The first is a URL, make it more canonical, find the base.
        # The namespace confusion in this section is correct.(??)
        $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );

        # print "URL: ",$rum_url_o->as_string,"\n";
        # print "Base: $rum_rebase\n";

        # Translate from rum space to lf space:
        push(@internal_apply,"s/^".$rum_rebase."/".quotemeta($lf_dir)."/");

        # That translation could lead to information loss.
        $infoloss=1;

        # Fetch rules test the rum_url_o->as_string.  Fetch whatever
        # matches the base.
        $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";

        # Ignore whatever did not match the base.
        $scope_ignore.="return 0 if m/^".
          quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";

      } else {
        $rum_url_o=root_queue($value,'./');

        $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );

        # Translate from rum space to lf space:
        push(@internal_apply,"s/^".$rum_rebase."//");

        $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";
        $scope_ignore.="return 0 if m/^".
          quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";
      }
      next;
    }

    if ($key eq 'also-quene') {
      print STDERR
        "Found 'also-quene' keyword, please replace with 'also-queue'\n";
      $key='also-queue';
    }

    if ($key eq 'also' || $key eq 'also-queue') {
      if ($value =~ m/^(.+)\s+(.+)/i) {
        my ($rum_url_o,$lf_dir,$rum_reurl,$rum_rebase);
        # Two arguments.
        # The last is a directory, it must end in /
        # print STDERR "URL ",$1," DIR ",$2,"\n";
        $lf_dir=$2;
        $lf_dir.='/' unless $lf_dir =~ m~/$~;

        if ($key eq 'also-queue') {
          $rum_url_o=root_queue($1,$lf_dir);
        } else {
          root_queue("",$lf_dir);
          $rum_url_o=url $1;
          $rum_url_o->host(lc $rum_url_o->host);
        }

        # The first is a URL, find the base
        $rum_rebase = quotemeta( ($rum_url_o->as_string =~ m~^(.*/)~)[0] );

#       print "URL: $rum_url_o->as_string\n";
#       print "Base: $rum_rebase\n";
#       print "Server: $server\n";

        # Ok, now we can transform and select stuff the right way
        push(@internal_apply,"s/^".$rum_rebase."/".quotemeta($lf_dir)."/");
        $infoloss=1;

        # Fetch rules test the rum_url_o->as_string.  Fetch whatever
        # matches the base.
        $scope_fetch.="return 1 if m/^".$rum_rebase."/".$ipost.";\n";

        # Ignore whatever did not match the base.  This cures a problem
        # with '..' from the base in rum space pointing within the
        # scope in lf space.  We introduced an extra level (or more) of
        # directories with the apply above.  Must do the same with
        # 'Also:' directives.
        $scope_ignore.="return 0 if m/^".
          quotemeta("http://".$rum_url_o->netloc."/")."/".$ipost.";\n";
      } else {
        die "Also: requires 2 arguments\n";
      }
      next;
    }

    if ($key eq 'quene') {
      print STDERR "Found 'quene' keyword, please replace with 'queue'\n";
      $key='queue';
    }

    if ($key eq 'queue') {
      root_queue($value,"");
      next;
    }

    if ($key eq 'ignore-re' || $key eq 'fetch-re') {
      # Check that it's an RE; better to be strict here than to let
      # perl produce compilation errors later.
      unless ($value =~ /^m(.).*\1[gimosx]*$/) {
        print STDERR "w3mir: $value is not a recognized regular expression\n";
        exit 1;
      }
      # Fall through to the next cases!
    }

    if ($key eq 'fetch' || $key eq 'fetch-re') {
      my $expr=$value;
      $expr = wild_re($expr).$ipost if ($key eq 'fetch');
      $rule_text.=' return 1 if '.$expr.";\n";
      next;
    }

    if ($key eq 'ignore' || $key eq 'ignore-re') {
      my $expr=$value;
      $expr = wild_re($expr).$ipost if ($key eq 'ignore');
      # print STDERR "Ignore expression: $expr\n";
      $rule_text.=' return 0 if '.$expr.";\n";
      next;
    }

    if ($key eq 'apply') {
      unless ($value =~ /^s(.).*\1.*\1[gimosxe]*$/) {
        print STDERR
          "w3mir: '$value' is not a recognized regular expression\n";
        exit 1;
      }
      push(@user_apply,$value);
      $infoloss=1;
      next;
    }

    if ($key eq 'agent') {
      $w3http::agent=$w3mir_agent=$value;
      next;
    }

    # The authorization stuff:
    if ($key eq 'auth-domain') {
      $useauth=1;
      ($authserver, $authrealm) = split('/',$value,2);
      die "w3mir: server part of auth-domain has format server[:port]\n"
        unless $authserver =~ /^(\S+(:\d+)?)$|^\*$/;
      $authserver =~ s/:80$//;
      die "w3mir: auth-domain '$value' is not valid\n"
        if !defined($authserver) || !defined($authrealm);
      $authrealm=lc $authrealm;
    }

    $authuser=$value if ($key eq 'auth-user');
    $authpasswd=$value if ($key eq 'auth-passwd');

    # Got a full authentication spec?
    if ($authserver && $authrealm && $authuser && $authpasswd) {
      $authdata{$authserver}{$authrealm}=$authuser.":".$authpasswd;
      print "Authentication for $authserver/$authrealm is ".
        "$authuser/$authpasswd\n" if $verbose>=0;
      # Invalidate tmp vars
      $authserver=$authrealm=$authuser=$authpasswd=undef;
      next;
    }

    next if $key eq 'auth-user' || $key eq 'auth-passwd' ||
      $key eq 'auth-domain';

    if ($key eq 'fetch-options') {
      warn "w3mir: The 'fetch-options' directive has been renamed to 'options'\nw3mir: Please change your configuration file.\n";
      $key='options';
      # Fall through to 'options'!
    }

    if ($key eq 'options') {

      my($val,$nval);
      foreach $val (split(/\s*,\s*/,lc $value)) {
        if ($i==1) {
          $nval=nway($val,'recurse','no-date-check','only-nonexistent',
                     'list-urls','lowercase','remove','batch','read-urls',
                     'abs','no-newline-conv');
          $r=1,next if $nval==0;
          $fetch=1,next if $nval==1;
          $fetch=-1,next if $nval==2;
          $list=1,next if $nval==3;
          if ($nval==4) {
            $lc=1;
            $iinline=($lc?"(?i)":"");
            $ipost=($lc?"i":"");
            next;
          }
          $remove=1,next if $nval==5;
          $convertnl=0,$batch=1,next if $nval==6;
          $read_urls=1,next if $nval==7;
          $abs=1,next if $nval==8;
          $convertnl=0,next if $nval==9;
        } else {
          die "w3mir: options must be the first directive in the config file.\n";
        }
      }
      next;
    }

    if ($key eq 'disable-headers') {
      my($val,$nval);
      foreach $val (split(/\s*,\s*/,lc $value)) {
        $nval=nway($val,'referer','user');
        $do_referer=0,next if $nval==0;
        $do_user=0,next if $nval==1;
      }
      next;
    }

    if ($key eq 'fixup') {

      $fixrc="$file";
      # warn "Fixrc: $fixrc\n";

      my($val,$nval);
      foreach $val (split(/\s*,\s*/,lc $value)) {
        $nval=nway($val,'on','run','noindex','off');
        $runfix=1,next if $nval==1;
        # Disable fixup
        $fixup=0,next if $nval==3;
        # Ignore everything else
      }
      next;
    }

    die "w3mir: Unrecognized directive ('$key') in config file $file at line $.\n";

  }
  close(CFGF);

  if (defined($w3http::proxypasswd) && $w3http::proxyuser) {
    warn "Proxy authentication: ".$w3http::proxyuser.":".
      $w3http::proxypasswd."\n" if $verbose>=0;
  }

}


sub wild_re {
  # Here we translate a unix wildcard subset to perlre
  local($_) = shift;

  # Quote anything that's RE and not wildcard: / ( ) \ | { } + $ ^
  s~([\/\(\)\\\|\{\}\+\$\^])~\\$1~g;
  # . -> \.
  s~\.~\\.~g;
  # * -> .*
  s~\*~\.\*~g;
  # ? -> .
  s~\?~\.~g;

  # print STDERR "wild_re: $_\n";

  return $_ = '/'.$_.'/';
}


sub numeric {
  # Check that the argument is numeric
  my ( $number ) = @_ ;
  return oct($number) if ($number =~ /^\d+$/ || $number =~ /^\d+\.\d+$/);
  die "Expected a number, got \"$number\"\n";
}


sub boolean {
  my ( $boolean ) = @_ ;

  $boolean = lc $boolean;

  return 0 if ($boolean eq 'false' || $boolean eq 'off' || $boolean eq '0');
  return 1 if ($boolean eq 'true' || $boolean eq 'on' || $boolean eq '1');
  die "Expected a boolean, got \"$boolean\"\n";
}


sub nway {
  my ( $value ) = shift;
  my ( @values ) = @_;
  my ( $val ) = 0;

  $value = lc $value;
  while (@_) {
    return $val if $value eq shift;
    $val++;
  }
  die "Expected one of ".join(", ",@values).", got \"$value\"\n";
}


sub insert_at_start {
  # ark: inserts the first arg at the top of the html in the second arg
  # janl: The second arg must be a reference to a scalar.
  my( $str, $text_ref ) = @_;
  my( @possible ) =("<BODY.*?>", "</HEAD.*?>", "</TITLE.*?>", "<HTML.*?>" );
  my( $f, $done );

  $done=0;
  @_=@possible;

  while( $done!=1 && ($f=shift) ){
    # print "Searching for: $f\n";
    if( $$text_ref =~ /$f/i ){
      # print "found it!\n";
      $$text_ref =~ s/($f)/$1\n$str/i;
      $done=1;
    }
  }
}


sub rm_rf {
  # Recursively remove directories and other files.
  # File::Path::rmtree does a similar thing but the messages are wrong

  my($remove)=shift;

  eval "use File::Find;" unless defined(&finddepth);

  die "w3mir: Could not load File::Find module when trying to remove $remove\n"
    unless defined(&finddepth);

  finddepth(\&remove_everything,$remove);

  if (rmdir($remove)) {
    print STDERR "\nw3mir: removed directory $remove\n" if $verbose>=0;
  } else {
    print STDERR "w3mir: could not remove $remove: $!\n";
  }
}


sub remove_everything {
  # This does the removal
  ((-d && rmdir($_)) || unlink($_)) && $verbose>=0 &&
    print STDERR "w3mir: removed $File::Find::name\n";
}


sub usage {
  my($message)=shift @_;

  print STDERR "w3mir: $message\n" if $message;

  die 'w3mir: usage: w3mir [options] <single-http-url>
   or: w3mir -B [-I] [options] [<http-urls>]

 Options :
    -agent <agent>  - Set the agent name.  Default is w3mir
    -abs            - Force all URLs to be absolute.
    -B              - Batch-get documents.
    -I              - The URLs to get are read from standard input.
    -c              - be more Chatty.
    -cfgfile <file> - Read config from file.
    -d <debug-level>- set debug level to 1 or 2
    -drr            - Disable robots.txt rules.
    -f              - Forget all files, nothing is saved to disk.
    -fa             - Fetch All, will not check timestamps.
    -fs             - Fetch Some, do not fetch the files we already have.
    -ir <referer>   - Initial referer.  For picky servers.
    -l              - List URLs in the documents retrieved.
    -lc             - Convert all URLs (and filenames) to lowercase.
                      This does not work reliably.
    -p <n>          - Pause n seconds before retrieving each doc.
    -q              - Quiet, error-messages only
    -rp <n>         - Retry Pause in seconds.
    -P <server:port>- Use host/port for proxy http requests
    -pflush         - Flush proxy server.
    -r              - Recursive mirroring.
    -R              - Remove files not referenced or not present on server.
    -s              - Send output to stdout instead of file
    -t <n>          - How many times to (re)try getting a failed doc?
    -umask <umask>  - Set umask for mirroring, must be usual octal format.
    -nnc            - No Newline Conversion.  Disable newline conversions.
    -v              - Show w3mir version.
';
}
__END__
# -*- perl -*- There must be a blank line here

=head1 NAME

w3mir - all purpose HTTP-copying and mirroring tool

=head1 SYNOPSIS

B<w3mir> [B<options>] [I<HTTP-URL>]

B<w3mir> B<-B> [B<options>] <I<HTTP-URLS>>

B<w3mir> is an all-purpose HTTP copying and mirroring tool.  The main
focus of B<w3mir> is to create and maintain a browsable copy of one,
or several, remote WWW site(s).

Used to the max, B<w3mir> can retrieve the contents of several related
sites and leave the mirror browsable via a local web server, or from a
filesystem, such as directly from a CDROM.

B<w3mir> has options for all operations that are simple enough for
options.  For authentication and passwords, multiple site retrievals
and such you will have to resort to a L</CONFIGURATION-FILE>.  If
browsing from a filesystem, references ending in '/' need to be
rewritten to end in '/index.html', and, in any case, URLs that are
redirected will need to be changed to make the mirror browsable; see
the documentation of B<Fixup> in the L</CONFIGURATION-FILE> section.

B<w3mir>'s default behavior is to do as little as possible and to be
as nice as possible to the server(s) it is getting documents from.
You will need to read through the options list to make B<w3mir> do
more complex, and useful, things.  Most of the things B<w3mir> can do
are also documented in the w3mir-HOWTO which is available at the
B<w3mir> home-page (F<http://www.math.uio.no/~janl/w3mir/>) as well as
in the w3mir distribution bundle.

=head1 DESCRIPTION

You may specify many options and one HTTP-URL on the w3mir command
line.

A single HTTP URL I<must> be specified either on the command line or
in a B<URL> directive in a configuration file.  If the URL refers to a
directory it I<must> end with a "/", otherwise you might get surprised
at what gets retrieved (e.g. rather more than you expect).

Options must be prefixed with at least one - as shown below; you can
use more if you want to.  B<-cfgfile> is equivalent to B<--cfgfile> or
even B<------cfgfile>.  Options cannot be I<clustered>, i.e., B<-r -R>
is not equivalent to B<-rR>.

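For example, a typical recursive mirroring run of a site, with a 5
second pause between documents, could be started like this (the URL
here is made up for illustration):

 w3mir -r -p 5 http://www.foo.org/gazonk/
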
=over 4

=item B<-h> | B<-help> | B<-?>

prints a brief summary of all command line options and exits.

=item B<-cfgfile> F<file>

Makes B<w3mir> read the given configuration file.  See the next
section for how to write such a file.

=item B<-r>

Puts B<w3mir> into recursive mode.  The default is to fetch only one
document and then quit.  'I<Recursive>' mode means that all the
documents the given document links to are fetched, and all the
documents they link to in turn, and so on; but only if they are in the
same directory or under the same directory as the start document.
Any document that is in or under the starting document's directory is
said to be within the I<scope of retrieval>.

=item B<-fa>

Fetch All.  Normally B<w3mir> will only get the document if it has
been updated since the last time it was fetched.  This switch turns
that check off.

=item B<-fs>

Fetch Some.  Not the opposite of B<-fa>, but rather: fetch the ones we
don't have already.  This is handy to restart copying of a site
incompletely copied by earlier, interrupted, runs of B<w3mir>.

=item B<-p> I<n>

Pause for I<n> seconds between getting each document.  The default is
30 seconds.

=item B<-rp> I<n>

Retry Pause, in seconds.  When B<w3mir> fails to get a document for
some technical reason (timeout mainly) the document will be queued for
a later retry.  The retry pause is how long B<w3mir> waits between
finishing a mirror pass and starting a new one to get the still
missing documents.  This should be a long time, so network conditions
have a chance to get better.  The default is 600 seconds (10 minutes),
which might be a bit too short; for batch running B<w3mir> I would
suggest an hour (3600 seconds) or more.

=item B<-t> I<n>

Number of reTries.  If B<w3mir> cannot get all the documents by the
I<n>th retry, B<w3mir> gives up.  The default is 3.

=item B<-drr>

Disable Robot Rules.  The robot exclusion standard is described in
http://info.webcrawler.com/mak/projects/robots/norobots.html.  By
default B<w3mir> honors this standard.  This option causes B<w3mir> to
ignore it.

=item B<-nnc>

No Newline Conversion.  Normally w3mir converts the newline format of
all files that the web server says are text files.  However, not all
web servers are reliable, and so binary files may become corrupted due
to the newline conversion w3mir performs.  Use this option to stop
w3mir from converting newlines.  This also causes the file to be
regarded as binary when written to disk, to disable the implicit
newline conversion when saving text files on most non-Unix systems.

This will probably be on by default in version 1.1 of w3mir, but not
in version 1.0.

=item B<-R>

Remove files.  Normally B<w3mir> will not remove files that are no
longer on the server/part of the retrieved web of files.  When this
option is specified all files no longer needed or found on the servers
will be removed.  If B<w3mir> fails to get a document for I<any> other
reason the file will not be removed.

=item B<-B>

Batch fetch documents whose URLs are given on the commandline.

In combination with the B<-r> and/or B<-l> switch all HTML and PDF
documents will be mined for URLs, but the documents will be saved on
disk unchanged.  When used with the B<-r> switch only one single URL
is allowed.  When not used with the B<-r> switch no HTML/URL
processing will be performed at all.  When the B<-B> switch is used
with B<-r>, w3mir will not do repeated mirrorings reliably since the
changes w3mir needs to make in the documents to work reliably are not
made.  In any case it's best not to use B<-R> in combination with
B<-B> since that can result in deleting rather more documents than
expected.  However, if the person writing the documents being copied
is good about making references relative and placing the <HTML> tag at
the beginning of documents there is a fair chance that things will
work even so.  But I wouldn't bet on it.  It will, however, work
reliably for repeated mirroring if the B<-r> switch is not used.

When the B<-B> switch is specified redirects for a given document will
be followed no matter where they point.  The redirected-to document
will be retrieved in the place of the original document.  This is a
potential weakness, since w3mir can be directed to fetch any document
anywhere on the web.

Unless used with B<-r> all retrieved files will be stored in one
directory using the remote filename as the local filename.  I.e.,
F<http://foo/bar/gazonk.html> will be saved as F<gazonk.html>.
F<http://foo/bar/> will be saved as F<bar-index.html> so as to avoid
name collisions for the common case of URLs ending in /.

=item B<-I>

This switch can only be used with the B<-B> switch, and only after it
on the commandline or in the configuration file.  When given, w3mir
will get URLs from standard input (i.e., w3mir can be used as the end
of a pipe that produces URLs.)  There should only be one URL per line
of input.

=item B<-q>

Quiet.  Turns off all informational messages, only errors will be
output.

=item B<-c>

Chatty.  B<w3mir> will output more progress information.  This can be
used if you're watching B<w3mir> work.

=item B<-v>

Version.  Output B<w3mir>'s version.

=item B<-s>

Copy the given document(s) to STDOUT.

=item B<-f>

Forget.  The retrieved documents are not saved on disk, they are just
forgotten.  This can be used to prime the cache in proxy servers, or
not save documents you just want to list the URLs in (see B<-l>).

=item B<-l>

List the URLs referred to in the retrieved document(s) on STDOUT.

=item B<-umask> I<n>

Sets the umask, i.e., the permission bits of all retrieved files.  The
number is taken as octal unless it starts with a 0x, in which case
it's taken as hexadecimal.  No matter what you set this to, make sure
you get write as well as read access to created files and directories.

Typical values are:

=over 8

=item 022

let everyone read the files (and directories), only you can change
them.

=item 027

you and everyone in the same file-group as you can read, only you can
change them.

=item 077

only you can read the files, only you can change them.

=item 0

everyone can read, write and change everything.

=back

The default is whatever was set when B<w3mir> was invoked.  022 is a
reasonable value.

This option has no meaning, or effect, on Win32 platforms.

=item B<-P> I<server:port>

Use the given server and port as an HTTP proxy server.  If no port is
given, port 80 is assumed (this is the normal HTTP port).  This is
useful if you are inside a firewall, or use a proxy server to save
bandwidth.

=item B<-pflush>

Proxy flush; force the proxy server to flush its cache and re-get the
document from the source.  The I<Pragma: no-cache> HTTP/1.0 header is
used to implement this.

=item B<-ir> I<referrer>

Initial Referrer.  Set the referrer of the first retrieved document.
Some servers are reluctant to serve certain documents unless this is
set right.

=item B<-agent> I<agent>

Set the HTTP User-Agent field's value.  Some servers will serve
different documents according to the WWW browser's capabilities.
B<w3mir> normally has B<w3mir>/I<version> in this header field.
Netscape uses things like B<Mozilla/3.01 (X11; I; Linux 2.0.30 i586)>
and MSIE uses things like B<Mozilla/2.0 (compatible; MSIE 3.02;
Windows NT)> (remember to enclose agent strings containing spaces in
double quotes (")).

=item B<-lc>

Lower Case URLs.  Some OSes, like W95 and NT, are not case sensitive
when it comes to filenames.  Thus web masters using such OSes can case
filenames differently in different places (apps.html, Apps.html,
APPS.HTML).  If you mirror to a Unix machine this can result in one
file on the server becoming many in the mirror.  This option
lowercases all filenames so the mirror corresponds better with the
server.

If given, it must be the first option on the command line.

This option does not work perfectly.  Most especially for mixed case
host-names.

=item B<-d> I<n>

Set the debug level.  A debug level higher than 0 will produce lots of
extra output for debugging purposes.

=item B<-abs>

Force all URLs to be absolute.  If you retrieve
F<http://www.ifi.uio.no/~janl/index.html> and it references foo.html,
the reference is made absolute:
F<http://www.ifi.uio.no/~janl/foo.html>.  In other words, you get
absolute references to the origin site if you use this option.

=back

=head1 CONFIGURATION-FILE

Most things can be mirrored with a (long) command line.  But multi
server mirroring, authentication and some other things are only
available through a configuration file.  A configuration file can be
specified with the B<-cfgfile> switch, but w3mir also looks for
.w3mirc (w3mir.ini on Win32 platforms) in the directory where w3mir is
started from.

The configuration file consists of lines of comments and directives.
A directive consists of a keyword followed by a colon (:) and then one
or several arguments.

 # This is a comment.  And the next line is a directive:
 Options: recurse, remove

A comment can only start at the beginning of a line.  The directive
keywords are not case-sensitive, but the arguments I<might> be.

=over 4

=item Options: I<recurse> | I<no-date-check> | I<only-nonexistent> | I<list-urls> | I<lowercase> | I<remove> | I<batch> | I<input-urls> | I<no-newline-conv>

This must be the first directive in a configuration file.

=over 8

=item I<recurse>

see B<-r> switch.

=item I<no-date-check>

see B<-fa> switch.

=item I<only-nonexistent>

see B<-fs> switch.

=item I<list-urls>

see B<-l> option.

=item I<lowercase>

see B<-lc> option.

=item I<remove>

see B<-R> option.

=item I<batch>

see B<-B> option.

=item I<input-urls>

see B<-I> option.

=item I<no-newline-conv>

see B<-nnc> option.

=back

=item URL: I<HTTP-URL> [I<target-directory>]

The URL directive may only appear once in any configuration file.

Without the optional target directory argument it corresponds directly
to the I<single-HTTP-URL> argument on the command line.

If the optional target directory is given, all documents from under
the given URL will be stored in that directory, and under.  The target
directory is most likely only specified if the B<Also> directive is
also specified.

If the URL given refers to a directory it I<must> end in a "/",
otherwise you might get quite surprised at what gets retrieved.

Either one URL: directive or the single-HTTP-URL at the command-line
I<must> be given.

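As an illustration, a two-argument URL: line could look like this (the
host and directory names are made up):

 URL: http://www.foo.org/gazonk/ gazonk/

All documents under http://www.foo.org/gazonk/ would then be stored
under the local directory gazonk/.
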
=item Also: I<HTTP-URL directory>

This directive is only meaningful if the I<recurse> (or B<-r>) option
is given.

The directive enlarges the scope of a recursive retrieval to contain
the given HTTP-URL and all documents in the same directory or under.
Any documents retrieved because of this directive will be stored in
the given directory of the mirror.

In practice this means that if the documents to be retrieved are
stored on several servers, or in several hierarchies on one server, or
any combination of those, the B<Also> directive ensures that we get
everything into one single mirror.

This also means that if you're retrieving

 URL: http://www.foo.org/gazonk/

but it has inline icons or images stored in http://www.foo.org/icons/
which you will also want to get, then that will be retrieved as well
by entering

 Also: http://www.foo.org/icons/ icons

As with the URL directive, if the URL refers to a directory it I<must>
end in a "/".

Another use for it is when mirroring sites that have several names
that all refer to the same (logical) server:

 URL: http://www.midifest.com/
 Also: http://midifest.com/ .

At this point in time B<w3mir> has no mechanism to easily enlarge the
scope of a mirror after it has been established.  That means that you
should survey the documents you are going to retrieve to find out what
icons, graphics and other things they refer to that you want.  And
what other sites you might like to retrieve.  If you find out that
something is missing you will have to delete the whole mirror, add the
needed B<Also> directives and then reestablish the mirror.  This lack
of flexibility in what to retrieve will be addressed at a later date.

See also the B<Also-queue> directive.

=item Also-queue: I<HTTP-URL directory>

This is like B<Also>, except that the URL itself is also queued.  The
B<Also> directive will not cause any documents to be retrieved UNLESS
they are referenced by some other document w3mir has already
retrieved.

=item Queue: I<HTTP-URL>

This queues the URL for retrieval, but does not enlarge the scope of
the retrieval.  If the URL is outside the scope of retrieval it will
not be retrieved anyway.

The observant reader will see that B<Also-queue> is like B<Also>
combined with B<Queue>.

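For example (the URLs are made up), to enlarge the scope with an icon
directory and also seed one page for retrieval even if nothing links
to it:

 Also-queue: http://www.foo.org/icons/ icons
 Queue: http://www.foo.org/gazonk/lonely.html
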
2831
2832see B<-ir> option.
2833
2834=item Ignore: F<wildcard>
2835
2836=item Fetch: F<wildcard>
2837
2838=item Ignore-RE: F<regular-expression>
2839
2840=item Fetch-RE: F<regular-expression>
2841
2842These four are used to set up rules about which documents, within the
2843scope of retrieval, should be gotten and which not. The default is to
2844get I<anything> that is within the scope of retrieval. That may not
2845be practical though. This goes for CGI scripts, and especially server
2846side image maps and other things that are executed/evaluated on the
2847server. There might be other things you want unfetched as well.
2848
2849B<w3mir> stores the I<Ignore>/I<Fetch> rules in a list. When a
2850document is considered for retrieval the URL is checked against the
2851list in the same order that the rules appeared in the configuration
2852file. If the URL matches any rule the search stops at once. If it
2853matched a I<Ignore> rule the document is not fetched and any URLs in
2854other documents pointing to it will point to the document at the
2855original server (not inside the mirror). If it matched a I<Fetch>
2856rule the document is gotten. If not matched by any ruøes the document
2857is gotten.
2858
2859The F<wildcard>s are a very limited subset of Unix-wildcards.
2860B<w3mir> understands only 'I<?>', 'I<*>', and 'I<[x-y]>' ranges.
2861
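Since the first matching rule decides, a specific I<Fetch> rule can be
placed before a broader I<Ignore> rule.  For instance (the script
names are hypothetical):

 Fetch: */search.cgi
 Ignore: *.cgi

fetches search.cgi but ignores all other CGI scripts within the scope
of retrieval.
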
The F<perl-regular-expression> is perl's superset of the normal Unix
regular expression syntax.  They must be completely specified,
including the prefixed m, a delimiter of your choice (except the
paired delimiters: parenthesis, brackets and braces), and any of the
RE modifiers.  E.g.,

 Ignore-RE: m/.gif$/i

or

 Ignore-RE: m~/.*/.*/.*/~

and so on.  "#" cannot be used as delimiter as it is the comment
character in the configuration file.  This also has the bad
side-effect of making you unable to match fragment names (#foobar)
directly.  Fortunately perl allows writing ``#'' as ``\043''.

You must be very careful when using the RE anchors (``^'' and ``$'')
with the RE versions of these and the I<Apply> directive.  Given the
rules:

 Fetch-RE: m/foobar.cgi$/
 Ignore: *.cgi

all files called ``foobar.cgi'' will be fetched.  However, if the file
is referenced as ``foobar.cgi?query=mp3'' it will I<not> be fetched
since the ``$'' anchor will prevent it from matching the I<Fetch-RE>
directive, and then it will match the I<Ignore> directive instead.  If
you want to match ``foobar.cgi'' but not ``foobar.cgifu'' you can use
perl's ``\b'' character class which matches a word boundary:

 Fetch-RE: m/foobar.cgi\b/
 Ignore: *.cgi

which will get ``foobar.cgi'' as well as ``foobar.cgi?query=mp3'' but
not ``foobar.cgifu''.  BUT, you must keep in mind that a lot of
different characters make a word boundary; maybe something more subtle
is needed.

=item Apply: I<regular-expression>

This is used to change a URL into another URL.  It is a potentially
I<very> powerful feature, and it also provides ample chance for you to
shoot your own foot.  The whole apparatus is somewhat tentative; if
you find there is a need for changes in how Apply rules work please
E-mail.  If you are going to use this feature please read the
documentation for I<Fetch-RE> and I<Ignore-RE> first.

The B<Apply> expressions are applied, in sequence, to the URLs in
their absolute form.  I.e., with the whole
http://host:port/dir/ec/tory/file URL.  It is only after this that
B<w3mir> checks if a document is within the scope of retrieval or not.
That means that B<Apply> rules can be used to change certain URLs to
fall inside the scope of retrieval, and vice versa.

The I<regular-expression> is perl's superset of the usual Unix regular
expressions for substitution.  As with I<Fetch> and I<Ignore> rules it
must be specified fully, with the I<s> and delimiting character.  It
has the same restrictions with regards to delimiters.  E.g.,

 Apply: s~/foo/~/bar/~i

to translate the path element I<foo> to I<bar> in all URLs.

"#" cannot be used as delimiter as it is the comment character in the
configuration file.

Please note that w3mir expects that URLs identifying 'directories'
keep identifying directories after application of Apply rules.  Ditto
for files.

2934=item Agent: I<agent>
2935
2936see B<-agent> option.
2937
2938=item Pause: I<n>
2939
2940see B<-p> option.
2941
2942=item Retry-Pause: I<n>
2943
2944see B<-rp> option.
2945
2946=item Retries: I<n>
2947
2948see B<-t> option.
2949
2950=item debug: I<n>
2951
2952see B<-d> option.
2953
2954=item umask I<n>
2955
2956see B<-umask> option.
2957
2958=item Robot-Rules: I<on> | I<off>
2959
2960Turn robot rules on of off. See B<-drr> option.

=item Remove-Nomirror: I<on> | I<off>

If this is enabled, sections between two consecutive

 <!--NO MIRROR-->

comments in a mirrored document will be removed. This editing is
performed even if batch getting is specified.
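
For example, given this (hypothetical) fragment in a retrieved
document:

 <!--NO MIRROR-->
 <p>Phone the webmaster at extension 1234.</p>
 <!--NO MIRROR-->

the paragraph between the two comments would be absent from the local
copy.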

=item Header: I<html/text>

Insert this I<complete> html/text at the start of the document.
This will be done even if batch is specified.
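
A minimal sketch, assuming the html/text is given on the directive
line itself (the inserted comment is just an illustration):

 Header: <!-- This is a mirror. The master copy lives elsewhere. -->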

=item File-Disposition: I<save> | I<stdout> | I<forget>

What to do with a retrieved file. The I<save> alternative is the
default. The two others correspond to the B<-s> and B<-f> options.
Only one may be specified.

=item Verbosity: I<quiet> | I<brief> | I<chatty>

How much B<w3mir> informs you of its progress. I<Brief> is the
default. The two others correspond to the B<-q> and B<-c> switches.

=item Cd: I<directory>

Change to the given directory before starting work. If it does not
exist it will be quietly created. Using this option breaks the 'fixup'
code, so consider not using it, ever.

=item HTTP-Proxy: I<server:port>

see the B<-P> switch.

=item HTTP-Proxy-user: I<username>

=item HTTP-Proxy-passwd: I<password>

These two are used to activate authentication with the proxy
server. L<w3mir> only supports I<basic> proxy authentication, and is
quite simpleminded about it; if proxy authentication is on, L<w3mir>
will always give it to the proxy. The domain concept is not supported
with proxy-authentication.
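
A sketch of the three proxy directives used together (host, port and
credentials are hypothetical):

 HTTP-Proxy: proxy.example.com:8080
 HTTP-Proxy-user: janedoe
 HTTP-Proxy-passwd: secret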

=item Proxy-Options: I<no-pragma> | I<revalidate> | I<refresh> | I<no-store>

Set proxy options. There are two ways to pass proxy options, HTTP/1.0
compatible and HTTP/1.1 compatible. Newer proxy-servers will
understand the 1.1 way as well as 1.0. With old proxy-servers only
the 1.0 way will work. L<w3mir> will prefer the 1.0 way.

The only 1.0 compatible proxy-option is I<refresh>; it corresponds to
the B<-pflush> option and forces the proxy server to pass the request
to an upstream server to retrieve a I<fresh> copy of the document.

The I<no-pragma> option forces w3mir to use the HTTP/1.1 proxy
control header; use this only with servers you know to be new,
otherwise it won't work at all. Use of any option but I<refresh> will
also cause HTTP/1.1 to be used.

I<revalidate> forces the proxy server to contact the upstream server
to validate that it has a fresh copy of the document. This is nicer
to the net than the I<refresh> option, which forces a re-get of the
document regardless of whether the proxy already has a fresh copy.

I<no-store> forbids the proxy from storing the document in anything
but transient storage. This can be used when transferring sensitive
documents, but is by no means a guarantee that the document can't be
found on some storage device on the proxy-server after the transfer.
Cryptography, if legal in your country, is the solution if you want
the contents to be secret.

I<refresh> corresponds to the HTTP/1.0 header I<Pragma: no-cache> or
the identical HTTP/1.1 I<Cache-control> option. I<revalidate> and
I<no-store> correspond to I<max-age=0> and I<no-store> respectively.

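For example, to ask the proxy to revalidate its copy against the
upstream server (as noted above this is sent as the HTTP/1.1
I<max-age=0> cache control, so the proxy must understand HTTP/1.1):

 Proxy-Options: revalidate
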
=item Authorization

B<w3mir> supports only the I<basic> authentication of HTTP/1.0. This
method can assign a password to a given user/server/I<realm>. The
"user" is your user-name on the server. The "server" is the server.
The I<realm> is an HTTP concept. It is simply a grouping of files and
documents. One file or a whole directory hierarchy can belong to a
realm. One server may have many realms. A user may have separate
passwords for each realm, or the same password for all the realms the
user has access to. A combination of a server and a realm is called a
I<domain>.

=over 8

=item Auth-Domain: I<server:port/realm>

Give the server and port, and the belonging realm (making a domain)
that the following authentication data holds for. You may specify the
"*" wildcard for either of I<server:port> and I<realm>; this will work
well if you only have one username and password on all the servers
mirrored.

=item Auth-User: I<user>

Your user-name.

=item Auth-Passwd: I<password>

Your password.

=back

These three directives may be repeated, in clusters, as many times as
needed to give the necessary authentication information.

=item Disable-Headers: I<referer> | I<user>

Stop B<w3mir> from sending the given headers. This can be used for
anonymity, making your retrievals harder to track. It will be even
harder if you specify a generic B<Agent>, like Netscape.
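
A sketch of such an anonymizing configuration (the agent string is
just an illustration):

 Disable-Headers: referer
 Agent: Mozilla/4.0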

=item Fixup: I<...>

This directive controls some aspects of the separate program w3mfix.
w3mfix uses the same configuration file as w3mir since it needs a lot
of the information in the B<w3mir> configuration file to do its work
correctly. B<w3mfix> is used to make mirrors more browseable on
filesystems (disk or CDROM), and to fix redirected URLs and some other
URL editing. If you want a mirror to be browseable off disk or CDROM
you almost certainly need to run w3mfix. In many cases it is not
necessary when you run a mirror to be used through a WWW server.

To make B<w3mir> write the data files B<w3mfix> needs, and do nothing
else, simply put

=over 8

 Fixup: on

=back

in the configuration file. To make B<w3mir> run B<w3mfix>
automatically after each time B<w3mir> has completed a mirror run,
specify

=over 8

 Fixup: run

=back

L<w3mfix> is documented in a separate man page in an effort to not
prolong I<this> manpage unnecessarily.

=item Index-name: I<name-of-index-file>

When retrieving URLs ending in '/' w3mir needs to append a filename to
store them locally. The default value for this is 'index.html' (this
is the most used; its use originated in the NCSA HTTPD as far as I
know). Some WWW servers use the filename 'Welcome.html' or
'welcome.html' instead (this was the default in the old CERN HTTPD).
And servers running on limited OSes frequently use 'index.htm'. To
keep things consistent and sane w3mir and the server should use the
same name. Put

 Index-name: welcome.html

when mirroring from a site that uses that convention.

When doing a multiserver retrieval where the servers use two or more
different names for this you should use B<Apply> rules to make the
names consistent within the mirror.

When making a mirror for use with a WWW server, the mirror should use
the same name as the new server for this; to accomplish that,
B<Index-name> should be combined with B<Apply>.

Here is an example of use in the two latter cases when Welcome.html is
the preferred I<index> name:

 Index-name: Welcome.html
 Apply: s~/index.html$~/Welcome.html~

Similarly, if index.html is the preferred I<index> name:

 Apply: s~/Welcome.html~/index.html~

I<Index-name> is not needed since index.html is the default index name.

=back

=head1 EXAMPLES

=over 4

=item * Just get the latest Dr-Fun if it has been changed since the last
time

 w3mir http://sunsite.unc.edu/Dave/Dr-Fun/latest.jpg

=item * Recursively fetch everything on the Star Wars site, remove
what is no longer at the server from the mirror:

 w3mir -R -r http://www.starwars.com/

=item * Fetch the contents of the Sega site through a proxy, pausing
for 30 seconds between each document

 w3mir -r -p 30 -P www.foo.org:4321 http://www.sega.com/

=item * Do everything according to F<w3mir.cfg>

 w3mir -cfgfile w3mir.cfg

=item * A simple configuration file

 # Remember, options first, as many as you like, comma separated
 Options: recurse, remove
 #
 # Start here:
 URL: http://www.starwars.com/
 #
 # Speed things up
 Pause: 0
 #
 # Don't get junk
 Ignore: *.cgi
 Ignore: *-cgi
 Ignore: *.map
 #
 # Proxy:
 HTTP-Proxy: www.foo.org:4321
 #
 # You _should_ cd away from the directory where the config file is.
 cd: starwars
 #
 # Authentication:
 Auth-domain: server:port/realm
 Auth-user: me
 Auth-passwd: my_password
 #
 # You can use '*' in place of server:port and/or realm:
 Auth-domain: */*
 Auth-user: otherme
 Auth-passwd: otherpassword

=item Also:

 # Retrieve all of janl's home pages:
 Options: recurse
 #
 # This is the two-argument form of URL:. It fetches the first into the second
 URL: http://www.math.uio.no/~janl/ math/janl
 #
 # These say that any documents referred to that live under these places
 # should be gotten too, into the named directories. Two arguments are
 # required for 'Also:'.
 Also: http://www.math.uio.no/drift/personer/ math/drift
 Also: http://www.ifi.uio.no/~janl/ ifi/janl
 Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
 #
 # The options above will result in this directory hierarchy under
 # where you started w3mir:
 # w3mir/math/janl files from http://www.math.uio.no/~janl
 # w3mir/math/drift from http://www.math.uio.no/drift/personer/
 # w3mir/ifi/janl from http://www.ifi.uio.no/~janl/
 # w3mir/math-uib/nicolai from http://www.mi.uib.no/~nicolai/

=item Ignore-RE and Fetch-RE

 # Get only jpeg/jpg files, no gifs
 Fetch-RE: m/\.jp(e)?g$/
 Ignore-RE: m/\.gif$/

=item Apply

As I said earlier, B<Apply> has not been used for Real Work yet, that
I know of. But B<Apply> I<could> be used to map all web servers at
the University of Oslo inside the scope of retrieval very easily:

 # Start at the main server
 URL: http://www.uio.no/
 # Change http://*.uio.no and http://129.240.* to be a subdirectory
 # of http://www.uio.no/.
 Apply: s~^http://(.*\.uio\.no(?::\d+)?)/~http://www.uio.no/$1/~i
 Apply: s~^http://(129\.240\.[^:]*(?::\d+)?)/~http://www.uio.no/$1/~i

=back

There are two rather extensive example files in the B<w3mir> distribution.

=head1 BUGS

=over 4

=item The -lc switch does not work too well.

=back

=head1 FEATURES

These are not bugs.

=over 4

=item URLs with two slashes ('//') in the path component do not work as
some might expect. According to my reading of the URL spec. it is an
illegal construct, which is a Good Thing, because I don't know how to
handle it if it's legal.

=item If you start at http://foo/bar/ then index.html might be gotten
twice.

=item Some documents point above the server root, i.e.,
http://some.server/../stuff.html. Netscape, and other browsers, in
defiance of the URL standard, will change such a URL to
http://some.server/stuff.html. W3mir will not.

=item Authentication is I<only> tried if the server requests it. This
might lead to a lot of extra connections going up and down, but that's
the way it's gotta work for now.

=back

=head1 SEE ALSO

L<w3mfix>

=head1 AUTHORS

B<w3mir>'s authors can be reached at I<[email protected]>.
B<w3mir>'s home page is at http://www.math.uio.no/~janl/w3mir/