
1<!doctype html public "-//W3C//DTD HTML 4.0//EN">
2<html>
3<head>
4<title>W3MIR HOWTO</title>
5<style type="text/css">
6<!--
7 body { background-color: white }
8 h1, h2, h3, b { font-family: sans-serif }
9 .red { color: red }
10-->
11</style>
12<body>
13<h1>W3MIR HOWTO</h1>
14
15<p><b>Corresponding to w3mir version 1.0.2 and above</b>
16
<p>W3mir is an all-purpose WWW copying and mirroring program. Its
main focus is copying complete directory structures while keeping your
copy browseable through a web server, or directly off a disk or CDROM
if you want. W3mir will fix URLs that are redirected and everything
else that needs to be fixed to make your copy browseable. But it also
does odd jobs: retrieving single documents, batch getting several
documents and more. You may tell w3mir not to change anything in the
retrieved documents. W3mir has been in development for quite a long
time, so you will find options for most things needed when copying
things off the web.
27
<p>With w3mir you may copy the entire contents of a web server, or just
a directory hierarchy, or several hierarchies off as many
servers as you like. They don't even have to be related.
31
32<p>W3mir supports HTML4, and has partial support for CSS, Java,
33ActiveX and Adobe Acrobat (PDF) files. And it works on Win32
34machines.
35
<p><b>Warning:</b> W3mir enables you to copy a lot of things off the
Web, but remember, the things you retrieve might be copyrighted and
the copy you make with w3mir might in fact be illegal to make and
possess.
40
41<hr>
42
43<h2><a name="contents">Contents</a></h2>
44
45<p><a href="#intro">README</a> (You want to read this! <b
46class="red">Really!</b>)
47
48<p><b>How do I...</b>
49<ol>
50 <li><p><a href="#copy">copy a file?</a>
51 <li><p><a href="#recurse">copy a directory hierarchy?</a>
52 <li><p><a href="#resources">copy the needed resource files from another
53 directory hierarchy?</a>
54 <li><p><a href="#ignore">avoid copying files I don't want or copy only
55 files I want?</a>
56 <li><p><a href="#rm">remove the files that are no longer on the
57 original site from the mirror?</a>
58 <li><p><a href="#depth">limit how deep w3mir will recurse?</a>
 <li><p><a href="#memory">limit w3mir's memory usage?</a>
60 <li><p><a href="#multi">copy files from multiple sites?</a>
61 <li><p><a href="#alias">copy files from one server with several names?</a>
62 <li><p><a href="#aborted">restart a mirror process after stopping it
63 prematurely?</a>
64 <li><p><a href="#enlarge">enlarge or prune an established mirror?</a>
65 <li><p><a href="#cat">'cat' a file?</a>
66 <li><p><a href="#list">list URLs in a document?</a>
67 <li><p><a href="#robots">disable robots.txt obedience?</a>
68 <li><p><a href="#corrupt">stop w3mir from corrupting binary files?</a>
69 <li><p><a href="#auth">copy a site that wants user-name and password?</a>
70 <li><p><a href="#mauth">access a site that wants several different
71 user-names and passwords?</a>
72 <li><p><a href="#proxy">use a proxy server?</a>
73 <li><p><a href="#pauth">authenticate myself to a proxy server?</a>
74 <li><p><a href="#proxytweak">ensure that the proxy server ...?</a>
75 <li><p><a href="#batchget">batch get files with w3mir?</a>
76 <li><p><a href="#cgi">handle CGI?</a>
77 <li><p><a href="#imap">handle server side image-maps?</a>
78 <li><p><a href="#java">handle Java and ActiveX?</a>
79 <li><p><a href="#script">handle java-script and other script languages?</a>
80 <li><p><a href="#css">handle the other things with 'partial support'?</a>
81 <li><p><a href="#anon">keep my identity secret?</a>
82 <li><p><a href="#ns">pretend that I'm using Netscape, Internet
83 Explorer or Lynx?</a>
84 <li><p><a href="#other">do other things?</a>
85</ol>
86
87<hr>
88
89<h2><a name="intro">README</a></h2>
90
<p>W3mir may be used in two main ways:
92
93<ul>
94 <li><p>To copy something random once.
95 <li><p>To keep a local mirror of some remote site
96</ul>
97
<p>To copy something random once, there is a good chance that you can
just start w3mir with some simple options and it will do the job you
want it to, provided that the remote site is not too complex and
your expectations of the copy aren't high :-) This is what wget, the
GNU w3 mirroring program, does and is good at.
103
104<h3>Configuration file</h3>
105
106<p>Once you want to keep a copy of a remote site up-to-date over time,
107mirror something with server side image-maps, redirects or
108authentication you have to write a configuration file for w3mir. This
109is what w3mir is good at, compared to wget. Writing the file is not
110hard, and there are two example files in the w3mir distribution. It
111will also be explained here. The configuration file is typically
112called <tt>.w3mirc</tt> (<tt>w3mir.ini</tt> on win32 machines), and
113can be written with a simple text editor. It is kept in the top
114directory of the mirror, where w3mir will find it when it starts.
115Please refer to the <a href="#contents">contents</a> for how to handle
116a specific problem with a configuration file.
117
118<hr>
119
120<h2>The answers:</h2>
121
122<hr>
123
124<h3><a name="copy">How do I copy a file?</a></h3>
125
126<p>To copy the top page off www.starwars.com:
127
128<p><tt>w3mir http://www.starwars.com/</tt>
129
130<p><b>Note:</b> it is <em>important</em> that you give the trailing
131slash for server names and directories.
132
133<hr>
134
<h3><a name="recurse">How do I copy a directory hierarchy?</a></h3>
136
137<p>To copy the entire stuff about episode I from www.starwars.com
138which is stored in <tt>http://www.starwars.com/episode-i/</tt> (I don't
139recommend this, it's quite a lot of data):
140
141<p><tt>w3mir -r http://www.starwars.com/episode-i/</tt>
142
143<p>The corresponding configuration file is simple:
144
145<pre>
146Options: recurse
147URL: http://www.starwars.com/episode-i/
148Fixup: run
149</pre>
150
151<p>The <tt>-r</tt> option makes w3mir recurse down from the starting
152point. It will only copy all the documents under
153http://www.starwars.com/episode-i/ that it sees referenced from those
154same documents. W3mir will <em>not</em> retrieve documents from
155http://www.starwars.com/ because it is considered to be 'over' the
156starting point.
157
158<p>The command-line will get you a copy that is definitely browseable
159via a WEB server, and possibly browseable directly from a CDROM or
160hard-disk. To ensure that it is browseable from CDROM and disk you
161need to use a configuration file with the <tt>Fixup: run</tt> line in.
162It causes w3mir to edit anything that needs editing after the mirror
163has completed, including fixing URLs that caused redirects. The dirty
work is done by w3mir's helper program w3mfix. The directive will
166
<p><b>Note:</b> it is <em>important</em> that you give the trailing
slash after the directory name. Specifying
<tt>http://www.starwars.com/episode-i</tt> and
<tt>http://www.starwars.com/episode-i/</tt> is quite different in
w3mir's eyes. In the former case episode-i is considered to be a
document within the / (top) directory of www.starwars.com and w3mir
will recurse from /, which is a lot more than you wanted. In the
latter case w3mir understands that episode-i is a directory and will
consider that directory to be the starting point, which is what you
wanted.
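
<p>The difference can be illustrated with a small Python sketch. This is
not w3mir's own code; the function name <tt>starting_directory</tt> is
made up for this example, and it only mimics the rule described above:
everything up to and including the last slash of the path is taken to be
the directory.

```python
from urllib.parse import urlsplit


def starting_directory(url):
    """Return the directory part of a URL path, up to the last slash.

    Hypothetical helper: it imitates the rule described in the text,
    not w3mir's actual implementation.
    """
    path = urlsplit(url).path or "/"
    return path[: path.rfind("/") + 1]


print(starting_directory("http://www.starwars.com/episode-i"))   # /
print(starting_directory("http://www.starwars.com/episode-i/"))  # /episode-i/
```

<p>Without the trailing slash the directory comes out as <tt>/</tt>, so
a recursion would start at the top of the server.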
177
178<hr>
179
180<h3><a name="resources">How do I copy the needed resource files from
181another directory hierarchy?</a></h3>
182
<p>Some sites store their documents in one place, and put their
banners, icons and such in a separate directory called
<tt>/images</tt>, <tt>/banners</tt>, <tt>/icons</tt>,
<tt>/resources</tt> or some such. Unless you retrieve these as well as
the documents, things will probably not be too colorful. So, imagine
that the starwars site stored all the images in one holding directory
called <tt>/imagery</tt> and you want to copy all the stuff in it that
the episode-i pages need. Then you do this:
191
192<pre>
193Options: recurse
194URL: http://www.starwars.com/episode-i/ episode-i
195Also: http://www.starwars.com/imagery/ imagery
196Fixup: run
197</pre>
198
199<p>There are two changes here compared to the simpler file we started
200with: There is an extra argument at the end of the URL directive. It
201tells w3mir to store everything gotten from
202<tt>http://www.starwars.com/episode-i/</tt> in the subdirectory
<tt>episode-i</tt>. The directory can be omitted, but I think it's
neater this way. Then there is the new directive 'Also:'. It tells
w3mir that you also want whatever the documents under
<tt>http://www.starwars.com/episode-i/</tt> reference under
<tt>http://www.starwars.com/imagery/</tt>.
208
<p><b>Note:</b> this will only get stuff that is used by the
documents under <tt>http://www.starwars.com/episode-i/</tt>. Anything
stored under <tt>http://www.starwars.com/imagery/</tt> which is not
used will not be retrieved. If you want everything under
<tt>imagery</tt> to be retrieved use the <tt>Also-quene:</tt>
directive.
215
216<hr>
217
218<h3><a name="ignore">How do I avoid copying files I don't want or copy
219only files I want?</a></h3>
220
<p>To control what files w3mir copies you can use the
<tt>Ignore:</tt>, <tt>Fetch:</tt>, <tt>Ignore-RE:</tt> and
<tt>Fetch-RE:</tt> directives in the configuration file. The embedded
references to any file you choose to ignore, i.e., not copy, will point
at the original site, <em>not</em> to the mirror. This means that the
mirror user may still get hold of the file from the original source
by simply clicking, if she so desires.
228
229<p>If a site contains huge .wav audio files that you are not
230interested in you put
231
232<pre>
233Ignore: *.wav
234</pre>
235
236<p>in the configuration file. You may ignore as many different
237filename patterns as you want. If you are mirroring a site you want
238very few, specific files from, say all HTML (named
239<em>something</em><tt>.html</tt>) and all Mpeg video files (named
240<em>something</em><tt>.mpg</tt>) you can write this:
241
242<pre>
243Fetch: *.html
244Fetch: *.mpg
245Ignore: *
246</pre>
247
<p>W3mir will test each filename against each Fetch/Ignore rule in
sequence. An HTML file will match the first line and be fetched. Any
mpg file will match the second line and be fetched. All other files
will match the third line, and be ignored. This last line is needed
because the default is to get any files which are not ignored. By
arranging fetch and ignore rules carefully you may retrieve exactly
the filename patterns you want and not retrieve anything else.
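
<p>The first-match-wins evaluation can be sketched in Python as follows.
This is only an illustration under stated assumptions: the names
<tt>RULES</tt> and <tt>decide</tt> are invented here, Python's
<tt>fnmatch</tt> module stands in for w3mir's own wild-card matching
(it accepts somewhat more than the '?', '*' and '[a-z]' forms w3mir
understands), and the sketch matches against a bare filename.

```python
from fnmatch import fnmatchcase

# The three rules from the example above, in configuration-file order.
RULES = [("fetch", "*.html"), ("fetch", "*.mpg"), ("ignore", "*")]


def decide(name, rules=RULES):
    """Return 'fetch' or 'ignore' for a name; the first matching rule wins."""
    for action, pattern in rules:
        if fnmatchcase(name, pattern):
            return action
    # No rule matched: the default is to get anything that is not ignored.
    return "fetch"


for name in ("index.html", "trailer.mpg", "theme.wav"):
    print(name, "->", decide(name))
```

<p>Moving <tt>Ignore: *</tt> to the top of the list would make every
name match it first, which is why the order of the rules matters.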
255
<p>If you decide, after the mirror has been established, that you also
want all Mpeg Layer 3 audio files (<em>something.</em><tt>mp3</tt>)
from the site, you add this:
259
260<pre>
261Fetch: *.mp3
262</pre>
263
<p>as the third line, making the <tt>Ignore: *</tt> line the fourth and
last. Then you must fix all references to .mp3 files within the
mirror by running w3mfix thus:
267
268<pre>
269w3mfix -editref .mp3
270</pre>
271
<p>which will edit all references to .mp3 files, pointing them to the
right place on your disk. Ditto when you remove a fetch rule, or add
274or remove an ignore rule. See the answer about <a
275href="#enlarge">enlarging and pruning</a> mirrors for more examples of
using <tt>w3mfix -editref ...</tt>
277
278<p><b>Note:</b> when retrieving only a very limited set of files, as
279in the example above, you <em>must</em> retrieve the html files,
280because how else will w3mir find URLs of files to retrieve? Only html
281files contain links to other files.
282
<p>Similarly, you may choose not to mirror whole branches of the
original site. If you, for example, mirror my home-pages and decide
not to mirror the comics pages you can put
286
287<pre>
288Ignore: /ts/
289</pre>
290
291<p>or more precisely
292
293<pre>
294Ignore: http://www.ifi.uio.no/~janl/ts/
295</pre>
296
297<p>in the configuration file. If you do this after having established
298the mirror you use w3mfix to fix the references:
299
300<pre>
301w3mfix -editref /ts/
302</pre>
303
304<p><tt>Fetch:</tt> and <tt>Ignore:</tt> rules can only use a very
305limited subset of the Unix wild-cards. w3mir understands only '?',
306'*', and '[a-z]' ranges.
307
<p><tt>Ignore-RE:</tt> and <tt>Fetch-RE:</tt> are the same as
<tt>Ignore:</tt> and <tt>Fetch:</tt> except that they give you access
to the full power of regular expressions to make rules for what to get
or not to get. They support Perl's superset of the normal Unix regular
expression syntax. They must be completely specified, including the
prefixed m, a delimiter of your choice (except the paired delimiters:
parentheses, brackets and braces), and any of the RE modifiers. For example,
315
316<pre>
317Ignore-RE: m/.gif$/i
318</pre>
319
320<p>or
321
322<pre>
323Ignore-RE: m~/.*/.*/.*/~
324</pre>
325
326<p>and so on. "#" cannot be used as delimiter as it is the comment
327character in the configuration file.
328
329<p>There are some traps when using <tt>Ignore-RE</tt> and
330<tt>Fetch-RE</tt>, please see their documentation in <tt>mandoc
331w3mir</tt> for a more complete explanation.
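
<p>One trap follows from ordinary regular-expression rules; whether it
is among the traps the manual has in mind is not stated here. An
unescaped dot matches <em>any</em> character, so <tt>m/.gif$/i</tt>
matches more than names ending in ".gif". The Python sketch below uses
the <tt>re</tt> module, whose syntax agrees with Perl's for these simple
patterns; the sample names are invented for the illustration.

```python
import re

# Python equivalents of the Perl patterns m/.gif$/i and m/\.gif$/i.
unescaped = re.compile(r".gif$", re.IGNORECASE)
escaped = re.compile(r"\.gif$", re.IGNORECASE)

for name in ("logo.GIF", "notagif", "page.html"):
    print(name, bool(unescaped.search(name)), bool(escaped.search(name)))
```

<p>Only the escaped form rejects "notagif", so writing
<tt>\.gif$</tt> is the safer choice when a literal dot is meant.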
332
333<hr>
334
335<h3><a name="depth">How do I limit how deep w3mir will recurse?</a></h3>
336
337<p>W3mir has no explicit mechanism to limit the depth of recursion,
338but the same result can be achieved with a simple <tt>Ignore</tt> rule:
339
340<pre>
341Ignore: /*/*/*/*/*/*/
342</pre>
343
<p>This will ignore any URLs that contain at least 7 slashes ("/").
Note that a URL contains three slashes that do not have anything to
do with depth:
347
348<pre>
349http://www.ifi.uio.no/
350</pre>
351
352<p>so only the surplus slashes are used for depth in this match. In the
353example above the limit is 4 levels from the top. The
354<tt>Ignore:</tt> rule that is used to limit recursion depth must be
355listed before any <tt>Fetch:</tt> rules to be effective.
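
<p>The slash counting can be checked with a short Python sketch. Both
function names are made up for this illustration, and the sketch simply
counts slashes in the URL, as the description above does; it does not
run w3mir's own pattern matcher.

```python
def ignore_pattern(total_slashes):
    """Build an Ignore pattern that contains exactly total_slashes slashes."""
    return "/" + "*/" * (total_slashes - 1)


def is_too_deep(url, total_slashes):
    """True if the URL contains at least total_slashes slashes."""
    return url.count("/") >= total_slashes


print(ignore_pattern(7))                                   # /*/*/*/*/*/*/
print(is_too_deep("http://www.ifi.uio.no/a/b/c/d/", 7))    # True
print(is_too_deep("http://www.ifi.uio.no/a/b/c/", 7))      # False
```

<p>Three of the seven slashes belong to <tt>http://host/</tt>, so the
rule starts to bite at the fourth directory level.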
356
357<hr>
358
<h3><a name="memory">How do I limit w3mir's memory usage?</a></h3>
360
<p>In a mirror consisting of <em>many</em> files, such as an archive of
an active mailing list, w3mir will build a very large referer table, in
part for w3mir to use in the <tt>Referer:</tt> header and in part for
w3mfix to use in fixing references.

<p>If you both disable the <tt>Referer:</tt> header and don't use
w3mfix, w3mir will not build a referer table. You do this in the
configuration file:
369
370<pre>
371Disable-headers: referer
372Fixup: off
373</pre>
374
<p>Please note the potential problems of turning off fixup described
earlier in this howto. There are normally no problems associated with
simple sites, but if there are redirects fixup <em>is</em> needed for
a consistent mirror.
379
380<hr>
381
<h3><a name="rm">How do I remove the files that are no longer on the
original site from the mirror?</a></h3>
384
<p>Over time the site you mirror will add files, and quite possibly
remove files. Or you might introduce new <tt>Ignore:</tt> rules after
establishing the mirror that reduce the set of files wanted in the
mirror.

<p>By default w3mir will not delete such old files, since some people
might want to keep the files even if they are removed from the original
site. To remove the old/unwanted files you add 'remove' to the
<tt>Options:</tt> line.
393
394<hr>
395
396<h3><a name="multi">How do I copy files from multiple sites?</a></h3>
397
<p>The answer about <a href="#resources">resource files</a> shows how
to copy from several related hierarchies. Several related sites can be
mirrored in the same way. For example, say you want to mirror all my
home-pages into one mirror:
401
402<pre>
Options: recurse
404URL: http://www.math.uio.no/~janl/ math/janl
405Also: http://www.math.uio.no/drift/personer/ math/drift
406Also: http://www.ifi.uio.no/~janl/ ifi/janl
407Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
408</pre>
409
<p>As in the resource-file example, this will only get documents that
are referenced. Any documents that are stored at these locations but to
which w3mir finds no references will not be retrieved. So this will
fail if the sites are not in any way related, or if you want
<em>everything</em> stored at each site.
415
416<p>To mirror unrelated sites, or get it all you may specify that the
417given URL should be considered a starting-point as well:
418
419<pre>
420Also-quene: http://www.math.uio.no/drift/personer/ math/drift
421</pre>
422
<p>and, if you want to add an additional starting-point within an already
named site:
425
426<pre>
427Quene: http://www.math.uio.no/drift/personer/foo.html
428</pre>
429
430<p>Armed with that you should be able to get pretty much anything you
431like.
432
433<hr>
434
435<h3><a name="alias">How do I copy files from one server with several
436names?</a></h3>
437
438<p>Simple, the same way you mirror several servers with different
439names. The math department at University of Oslo has a web server
440known under two names: math-www.uio.no and www.math.uio.no, and both
441names are used in documents stored on it. To copy the whole server,
442one time only, give these URL and Also lines:
443
444<pre>
445URL: http://www.math.uio.no/ .
446Also: http://math-www.uio.no/ .
447</pre>
448
<p>Note the period/dot (.) at the end of each line. It means that
w3mir will store the files in the current directory, i.e. documents
from both servers will be stored in the same place. Since w3mir
asks only for documents that are newer than the ones it already has,
a document already gotten under the www.math.uio.no name will not be
transferred again under the math-www.uio.no name. W3mir
will ask for the document, but the server will tell w3mir that its
copy is current and there will be no additional transfer of the
document.
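
<p>This text does not say which request header w3mir uses to ask only
for newer documents. In HTTP such a request is normally made with an
<tt>If-Modified-Since</tt> header, which the server answers with "304
Not Modified" when the copy is current. The Python sketch below, with
the invented helper name <tt>if_modified_since</tt>, shows how such a
header can be built from a file's modification time; treat the choice
of header as an assumption.

```python
from email.utils import formatdate


def if_modified_since(mtime):
    """Build an If-Modified-Since header from a POSIX timestamp.

    Assumption: this is the usual HTTP mechanism for "only send it if it
    is newer"; the HOWTO does not name the header w3mir sends.
    """
    return ("If-Modified-Since", formatdate(mtime, usegmt=True))


# In practice mtime would come from os.path.getmtime() on the local copy.
print(if_modified_since(0))
```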
458
459<hr>
460
461<h3><a name="enlarge">How do I enlarge or prune an established
462mirror?</a></h3>
463
464<p>This only works if you use a configuration file.
465
<p>If you want to add a site or directory to a mirror you simply add
the needed <tt>Also:</tt> or <tt>Also-Quene:</tt> to the configuration
file and then run w3mfix manually, with the -editref option. If,
for example, you have established a mirror of my home-pages, but want to
add my wife's home-page you add this
471
472<pre>
473Also: http://www.ifi.uio.no/~annen/ ifi/annen
474</pre>
475
476<p>to the configuration shown earlier. Then you run w3mfix, and you want
477it to fix all URLs referencing her home-page, the distinguishing
478characteristic is the name 'annen':
479
480<pre>
481w3mfix -editref annen
482</pre>
483
<p>Alternatively,

<pre>
w3mfix -editref http://www.ifi.uio.no/~annen/
</pre>

<p>would work too, but it's a lot more to type. This fixes all the
references to her home-page so that they point to the mirror instead of
the original pages.
493
<p>To prune (cut something out of) a mirror you do the same. Make the
change in the configuration file and run 'w3mfix -editref ...' to fix
the references to that which you removed.
497
498<hr>
499
500<h3><a name="cat">How do I 'cat' a file?</a></h3>
501
502<p>W3mir will output the fetched document to its standard output
503(normally your screen/window) if you specify the '-s' command line
504option. The corresponding configuration file directive is
505
506<pre>
507File-Disposition: stdout
508</pre>
509
510<hr>
511
512<h3><a name="list">How do I list URLs in a document?</a></h3>
513
514<p>To list the URLs in http://www.math.uio.no/:
515
516<pre>
517w3mir -q -f -l http://www.math.uio.no/
518</pre>
519
520<p>The <tt>-q</tt> switch causes w3mir to produce no other output
521which would disturb the URL listing. The <tt>-f</tt> switch tells
522w3mir to forget the document once it has been analyzed, i.e., not save
523it on disk. And finally, the <tt>-l</tt> switch makes w3mir list the
524URLs in the document. You may combine <tt>-l</tt> with <tt>-r</tt>
525and you need not use it with <tt>-f</tt>.
526
527<p>In the configuration file you put <tt>list</tt> on the
528<tt>Options:</tt> line.
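
<p>To give an idea of what listing the URLs in a document involves, here
is a minimal Python sketch. It is not how w3mir itself is implemented;
the class and function names are invented, and it only looks at
<tt>href</tt> and <tt>src</tt> attributes, which is far less than a real
mirroring tool has to handle.

```python
from html.parser import HTMLParser


class URLLister(HTMLParser):
    """Collect the values of href and src attributes, in document order."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.urls.append(value)


def list_urls(html_text):
    """Return every href/src value found in the given HTML text."""
    parser = URLLister()
    parser.feed(html_text)
    return parser.urls


sample = '<a href="a.html">A</a> <img src="pics/b.gif"> <p>no link</p>'
print(list_urls(sample))  # ['a.html', 'pics/b.gif']
```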
529
530<hr>
531
<h3><a name="aborted">How do I restart a mirror process after stopping
it prematurely?</a></h3>
534
535<p>You may just rerun the same command once more. But that makes
536w3mir request all the documents you have already once more to see if a
537more recent version is available on the server. You can save time by
538using the <tt>-fs</tt> (Fetch Some) option. This makes w3mir only
539request documents it does not find on your disk. E.g.:
540
541<p><tt>w3mir -fs -r http://www.starwars.com/</tt>
542
543<p>This is not something you would normally put in the configuration
544file, but you can, by adding 'only-nonexistent' on the 'Options:' line.
545
546<hr>
547
548<h3><a name="robots">How do I disable robots.txt obedience?</a></h3>
549
<p>Normally w3mir will read and obey each site's robots.txt file,
551because w3mir wants to be a nice tool. However robots.txt was designed
552with something slightly different than the normal use of w3mir in
553mind, so if you want w3mir to disregard the robot rules you can use
554<tt>-drr</tt> (Disable Robot Rules) on the command-line, or the line
555
556<pre>
557Robot-Rules: off
558</pre>
559
560<p>in the configuration file. The robot exclusion standard is
561described in <a
href="http://info.webcrawler.com/mak/projects/robots/norobots.html">http://info.webcrawler.com/mak/projects/robots/norobots.html</a>.
563
564<hr>
565
566<h3><a name="corrupt">How do I stop w3mir from corrupting binary
567files?</a></h3>
568
<p>During the normal course of events w3mir converts the newline
format of fetched HTML documents to your system's native newline
format. On Unix a newline consists of a single ASCII LF character, on
Macintoshes it's a single ASCII CR character and on Dos/Windows it's an
ASCII CR/LF pair. W3mir understands all these and all HTML files are
saved in the format your operating system prefers.
575
<p>If, and this is very unlikely, a web server identifies a binary
file as HTML, w3mir will very likely corrupt the file. If you discover
a file which is obviously ruined in the mirror, but is not ruined when
you view it on the original site, do this:
580
581<ol>
582
583<li>Notify the webmaster on the original site that the file has the
584wrong MIME type
585
586<li>Use the <tt>-nnc</tt> (No Newline Conversion) option on the
587command line, or
588
589<pre>
590Options: no-newline-conv
591</pre>
592
593in the configuration file.
594
595<li>Remove the corrupt file(s).
596
597<li>Run "<tt>w3mir -fs</tt>...", to fetch only the deleted file(s)
598again.
599
600</ol>
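
<p>Why newline conversion ruins binary data can be seen in a few lines
of Python. The function <tt>to_unix_newlines</tt> is a made-up stand-in
for a conversion to Unix format, not w3mir's actual routine. The binary
sample is the eight-byte signature that starts every PNG file, which
contains a CR/LF pair.

```python
def to_unix_newlines(data):
    """Convert CR/LF pairs and lone CRs to LF, as a text converter would."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


html = b"<p>hello\r\n</p>\r\n"
png_signature = b"\x89PNG\r\n\x1a\n"

print(to_unix_newlines(html))            # harmless for text
print(to_unix_newlines(png_signature))   # one byte is lost: no longer a PNG
```

<p>The text survives unharmed, but the binary data comes out one byte
shorter, and a program reading it will no longer recognize the file.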
601
602<hr>
603
604<h3><a name="auth">How do I copy a site that wants user-name and
605password?</a></h3>
606
607<p>This can only be done with a configuration file. Being able to
608give this on the command-line would give the user-name and password away
609to other users of the system, so the ability to give authentication
610information that way has not been put in w3mir.
611
612<p>In the configuration file you put:
613
614<pre>
615Auth-domain: */*
616Auth-user: me
617Auth-passwd: my-password
618</pre>
619
620<p>This will cause w3mir to give the user-name and password each time
621the server asks. There is no way to make w3mir give the user-name and
622password each time no matter if the server asks or not.
623
624<hr>
625
626<h3><a name="mauth">How do I access a site that wants several
627different user-names and passwords?</a></h3>
628
<p>If you have several user-names and passwords across
the server(s) that are copied you need a slightly more advanced
version of this that associates each user-name/password with an
authentication "domain". A "realm" is an HTTP concept. It is simply a
named grouping of files and documents on a server. One file or a whole
directory hierarchy can belong to a realm. One server may have many
realms. A user may have separate passwords for each realm, or the
same password for all the realms the user has access to. A
combination of a server name, server port and a realm is called a
domain.
639
640<pre>
641Auth-domain: theserver:theport/therealm
642Auth-user: me
643Auth-passwd: my-password
644
645Auth-domain: theserver:theport/otherrealm
646Auth-user: other-me
Auth-passwd: other-password
648</pre>
649
650W3mir will tell you what the name of the realm is if it is unable to
651authenticate itself with the server. You may also use '*' as the realm
652name if you only copy documents from one realm on that server.
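
<p>The association between domains and credentials can be pictured as a
lookup table. The Python sketch below is purely illustrative: the names
<tt>CREDENTIALS</tt> and <tt>find_credentials</tt> are invented, and the
order in which exact and '*' entries are tried is an assumption, not
documented w3mir behaviour.

```python
# Keys are (server:port, realm) pairs, mirroring the Auth-domain lines above.
CREDENTIALS = {
    ("theserver:theport", "therealm"): ("me", "my-password"),
    ("theserver:theport", "otherrealm"): ("other-me", "other-password"),
}


def find_credentials(server, realm, table=CREDENTIALS):
    """Return (user, password) for a domain, or None if nothing matches.

    Assumed precedence: exact match, then '*' as realm, then '*/*'.
    """
    for key in ((server, realm), (server, "*"), ("*", "*")):
        if key in table:
            return table[key]
    return None


print(find_credentials("theserver:theport", "otherrealm"))
print(find_credentials("elsewhere:80", "therealm"))
```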
653
654<hr>
655
656<h3><a name="proxy">How do I use a proxy server?</a></h3>
657
658<p>On some secured sites you have to access the Internet through proxy
659servers to get out of the internal network.
660
661<p>A proxy server has a host name, and a port you must use. On the
662command line you simply specify <tt>-P proxy-host-name:proxy-port</tt>. In
663the configuration file you put this:
664
665<pre>
666HTTP-Proxy: proxy-host-name:proxyport
667</pre>
668
<p>The main advantage of working through proxy servers, other than
security, is that you benefit from any caching the proxy server does,
which can speed up retrievals enormously.
672
<p>Another use of the proxy option is to "prime" the proxy server's
cache. I.e. you can use w3mir to fetch the documents through the proxy
675server to ensure that the documents are cached there later when you
676want to read them with your browser. If you also specify
677
678<pre>
679File-Disposition: forget
680</pre>
681
682<p>it won't even use any space on your disk, w3mir will just process
683the documents looking for URLs and then <em>not</em> save them.
684
685<hr>
686
687<h3><a name="pauth">How do I authenticate myself to a proxy
688server?</a></h3>
689
<p>Some proxy servers demand a user-name and password to let you use
691them. W3mir does not support the domain concept in connection with
692proxy authentication because the author cannot imagine that it will be
693needed. You need to put this in your configuration file:
694
695<pre>
696HTTP-Proxy-user: proxy-username
697HTTP-Proxy-passwd: proxy-password
698</pre>
699
700<hr>
701
702<h3><a name="proxytweak">How do I ensure that the proxy server
703...?</a></h3>
704
<p>HTTP/1.0 proxy servers may be told not to use their current copy of
a document if you specify the <tt>-pflush</tt> command-line option. Or
707
708<pre>
709Proxy-Options: refresh
710</pre>
711
712<p>in the configuration file. This is useful if the proxy has an old
713copy of some document and does not realize that a newer version exists
714on the origin site. W3mir uses the HTTP/1.0 version of this command
715by default. You can force w3mir to use the HTTP/1.1 version by adding
<tt>no-pragma</tt> to the line. If you do this it will not work
as you want unless the server knows the HTTP/1.1 protocol.
718
719
720<p>HTTP/1.1 proxy servers can be manipulated in a few more ways. The
721configuration file <tt>Proxy-Options:</tt> directive also takes
722<tt>revalidate</tt> and <tt>no-store</tt> options. The former tells
723the proxy server to check if there is any newer version available.
724This is, in principle, more network friendly than the <tt>refresh</tt>
725option since it will only cause a copy if there is a newer file
726available. The <tt>no-store</tt> option tells the proxy server to not
727store the documents you transfer. This might be useful if the
728documents are 'sensitive' or something like that, but if the proxy
729server does not understand HTTP/1.1 it will not obey this option, and
730it might store the document anyway because the functionality is not
731implemented, so you should not count on this to work.
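
<p>This text does not say which request headers these options produce.
The Python sketch below shows one plausible mapping onto standard HTTP
cache headers: <tt>Pragma: no-cache</tt> for HTTP/1.0, and the
<tt>Cache-Control</tt> directives <tt>no-cache</tt>, <tt>max-age=0</tt>
and <tt>no-store</tt> for HTTP/1.1. The function name
<tt>proxy_headers</tt> and the mapping itself are assumptions made for
illustration; check the w3mir manual for what is really sent.

```python
def proxy_headers(options):
    """Map Proxy-Options words to request headers (assumed mapping)."""
    opts = set(options)
    headers = []
    if "refresh" in opts:
        if "no-pragma" in opts:
            # HTTP/1.1 way of asking for a fresh copy.
            headers.append(("Cache-Control", "no-cache"))
        else:
            # HTTP/1.0 way of asking for a fresh copy.
            headers.append(("Pragma", "no-cache"))
    if "revalidate" in opts:
        # Ask the proxy to check with the origin server before answering.
        headers.append(("Cache-Control", "max-age=0"))
    if "no-store" in opts:
        # Ask the proxy not to keep a copy of the response.
        headers.append(("Cache-Control", "no-store"))
    return headers


print(proxy_headers(["refresh"]))
print(proxy_headers(["refresh", "no-pragma"]))
print(proxy_headers(["revalidate", "no-store"]))
```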
732
733<hr>
734
735<h3><a name="batchget">How do I batch get files with w3mir?</a></h3>
736
737<p>Normally when fetching files w3mir will process each html (and PDF)
738file to find URLs in them for further retrievals. This is
739time-consuming, and not always wanted. Sometimes you simply want to
740get a file, or more, and save it, untouched:
741
742<pre>
743w3mir -B http://www.starwars.com/ http://www.ifi.uio.no/~janl/
744</pre>
745
<p>There is a companion switch for <tt>-B</tt>, namely <tt>-I</tt>. It
makes w3mir read URLs from its standard input, one per line. Thus you
can use w3mir in a pipe to batch get several files whose URLs you find
in some way. This is a stupid example:
750
751<pre>
752w3mir -q -l -f http://www.ifi.uio.no/ | w3mir -I -B
753</pre>
754
<p><tt>-B</tt> may also be used with <tt>-r</tt>, but the only effect
it will have then is to save the html files unchanged on disk, because
to recurse w3mir <em>has</em> to examine all the html documents
for URLs.
759
<p><b>Please note</b> that using <tt>-B</tt> combined with <tt>-r</tt>
for mirroring will probably lead to an unstable mirror, because w3mir
does not get a chance to manipulate the URLs in the documents, which it
needs to do to be able to maintain a mirror later. Most important of
all, w3mir needs all html files to contain a &lt;HTML&gt; tag to be
able to recognize an HTML file as an HTML file. When running with the
<tt>-B</tt> switch w3mir will not ensure the presence of this and thus
we must rely on the original document's author to be nice. This is a
bad bet. In other words, <b>don't use <tt>-B</tt> for recursive
mirroring</b>, only for batch copying/mirroring of single documents.
770
771<hr>
772
773<h3><a name="cgi">How do I handle CGI?</a></h3>
774
775<p>There is no way w3mir can duplicate the process that happens on the
776Web server when it comes to CGI. For some CGI programs w3mir can
777simply copy the output and store on disk. For other CGI programs this
778is not possible, and the only way out is to make w3mir not get the
779involved files using Ignore rules in the configuration file. These
780will avoid a lot of cgi programs:
781
782<pre>
783Ignore: *.cgi
784Ignore: *-cgi
785</pre>
786
787<p>You might have to add other/more rules for some sites if they have
788other naming conventions or if it's simply impossible to tell from the
789file-name if it's a CGI or not.
790
791<p>When you add ignore rules this causes two things:
792
793<ol>
794<li><p>W3mir will not retrieve documents matching the rules
795<li><p>W3mir will make all references to matching documents point to
796 the site you mirrored from instead of pointing to a non-existent
797 file in the mirror.
798</ol>
799
800<hr>
801
802<h3><a name="imap">How do I handle server side image-maps?</a></h3>
803
<p>Server side image-maps are yet another thing that is impossible for
w3mir to deal with; w3mir simply cannot handle them. Put ignore
rules in the configuration file:
807
808<pre>
809Ignore: *.map
810</pre>
811
812<p>W3mir has full support for client side image-maps though.
813
814<hr>
815
816<h3><a name="java">How do I handle Java and ActiveX?</a></h3>
817
<p>Java and ActiveX objects are included in html pages with a
819<tt>&lt;OBJECT&gt;</tt> or <tt>&lt;APPLET&gt;</tt> tag. W3mir can
820handle these on one condition: The CODEBASE attribute names the
821directory where the program stores its resources (such as
822subprograms, graphic files, sound, text, and so on) and w3mir must
823have read access to this directory. Otherwise w3mir is without hope,
824it's impossible to extract the name of the resources the program needs
825in any reliable way.
826
<p>HTML4 supports an attribute that enumerates the resources the
program needs, but w3mir is not able to use this yet.

<hr>

<h3><a name="script">How do I handle JavaScript and other script
languages?</a></h3>

<p>W3mir does its best to pass scripts (JavaScript, PerlScript,
etc.) embedded in the HTML through undamaged. It cannot, however,
extract any URLs that the script generates when it runs in the
browser, whether the page would link to them or embed them.

<p>Things will still work if the script generates relative references
and w3mir has some other way to find the referenced files. And if
the script generates absolute references and the person browsing the
mirror has access to the site named, the user will be able to browse
the referenced documents on that other server.
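
<p>A small sketch of the problem, using a made-up file name: the image
URL below only comes into existence when the browser runs the script,
so the HTML itself holds no plain reference to <tt>banner.gif</tt>
for w3mir to find. The image works in the mirror only if w3mir
picks up <tt>banner.gif</tt> some other way, such as an ordinary
reference on another page.

<pre>
&lt;SCRIPT LANGUAGE="JavaScript"&gt;
&lt;!--
  var name = "banner";
  document.write('&lt;IMG SRC="' + name + '.gif"&gt;');
// --&gt;
&lt;/SCRIPT&gt;
</pre>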

<hr>

<h3><a name="css">How do I handle the other things with 'partial
support'?</a></h3>

<p>W3mir has partial support for CSS. This means that
<tt>&lt;style&gt;</tt> tags and the enclosed style data are passed
through undamaged by w3mir. W3mir will also retrieve the external
style sheets named in HTML documents. But w3mir will <em>not</em>
(yet) analyze the style sheet data to find URLs of other resources
(such as fonts) named in it.
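
<p>As a sketch of what this means in practice, take an external style
sheet like the one below (the file names are made up). W3mir would
fetch the style sheet itself when an HTML document names it, but it
would not find <tt>extra.css</tt> or <tt>paper.gif</tt> unless
something else refers to them, because those URLs appear only inside
the CSS:

<pre>
@import url(extra.css);
body { background-image: url(paper.gif) }
</pre>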

<p>W3mir also has partial support for Adobe Acrobat (PDF) files. This
means that w3mir can extract URLs from PDF files, and get the named
documents if you want them. But w3mir cannot edit those URLs so that
the PDF files point to the mirror instead of to wherever on the
original site they were pointing. If the PDF files contain absolute
URLs they will continue pointing to where they were pointing before.
However, if the PDF files contain relative references things will
work out.

<p>The reason that URLs in PDF files cannot be edited is that PDF
files are binary and contain byte pointers. If the length of a URL
is changed the byte pointers will point to the wrong place in the
document. Writing code to correct these pointers would be quite
complex. But if you write it I will use it.

<hr>

<h3><a name="anon">How do I keep my identity secret?</a></h3>

<p>The HTTP protocol has a header, <tt>User:</tt>, which robots such
as w3mir are recommended to send. Another way to track you is by
looking at the <tt>Referer:</tt> header w3mir gives in HTTP requests.
Both can be disabled:

<pre>
Disable-headers: referer, user
</pre>

<p>If you in addition use a proxy server that many other users use,
there is little probability that you can be tracked (easily) by the
server you are copying things from. You are however much easier to
track from the logs in the proxy server. And a court order is quite
likely to get you tracked in spite of any precautions you take.

<p>W3mir does not support cookies and thus you cannot be tracked with
the help of that mechanism.

<hr>

<h3><a name="ns">How do I pretend that I'm using Netscape, Internet
Explorer or Lynx?</a></h3>

<p>Some web sites give you different documents for the same URL
depending on what browser you use, or even what OS you appear to be
using. W3mir identifies itself with a string that looks like this:

<p><tt>w3mir/<em>version</em>-<em>release-date</em></tt>

<p>Netscape identifies itself with strings that look something like
this:

<p><tt>Mozilla/3.01 (X11; I; Linux 2.0.30 i586)</tt>

<p>and Internet Explorer says something like this:

<p><tt>Mozilla/2.0 (compatible; MSIE 3.02; Windows NT)</tt>

<p>and Lynx says something like this:

<p><tt>Lynx/2.6 libwww-FM/2.14</tt>

<p>You can change w3mir's identification with <tt>-agent 'string'</tt>
on the command line. In the configuration file you put

<pre>
Agent: Mozilla/3.01 (X11; I; Linux 2.0.30 i586)
</pre>

<p>to pretend w3mir is Netscape 3.01.
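
<p>The command line equivalent might look like the sketch below, where
the URL is a placeholder for the site you want to copy:

<pre>
w3mir -agent 'Mozilla/3.01 (X11; I; Linux 2.0.30 i586)' http://www.example.com/
</pre>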

<hr>

<h3><a name="other">How do I do other things?</a></h3>

<p>This document is by no means a complete list of the things you can
do with w3mir. The w3mir man page (<tt>man w3mir</tt> or <tt>perldoc
w3mir</tt>) lists more things, and goes into more detail about how
things work so you can use the knowledge to do neat things. Several
things mentioned only in the man page help you with tricky
multi-server mirroring, and give you better control of what to get
and not to get and under what name to save it on disk. And a couple
of other things...

<hr>
<address>Nicolai Langfeldt 9/7/1998</address>