source: gsdl/tags/gsdl-2_71-distribution/gsdl/packages/w3mir/w3mir-1.0.8/w3mir-HOWTO.html@ 14121

Last change on this file since 14121 was 719, checked in by davidb, 25 years ago
added w3mir package
1	<!doctype html public "-//W3C//DTD HTML 4.0//EN">
2	<html>
3	<head>
4	<title>W3MIR HOWTO</title>
5	<style type="text/css">
6	<!--
7	body { background-color: white }
8	h1, h2, h3, b { font-family: sans-serif }
9	.red { color: red }
10	-->
11	</style>
12	<body>
13	<h1>W3MIR HOWTO</h1>
14
15	<p><b>Corresponding to w3mir version 1.0.2 and above</b>
16
17	<p>W3mir is an all purpose WWW copying and mirroring program. Its
18	main focus is copying complete directory structures keeping your copy
19	browseable through a web server, or directly off a disk or CDROM if
20	you want. W3mir will fix URLs that are redirected and everything else
21	that needs to be fixed to make your copy browseable. But it also does
22	odd jobs, retrieving single documents, batch getting several documents
23	and more. You may tell w3mir not to change anything in the retrieved
24	documents. W3mir has been in development for quite a long time, so you
25	will find options to do a lot of things needed when copying things off the
26	web.
27
28	<p>With w3mir you may copy the entire contents of a web server. Or just
29	a directory hierarchy, or several related hierarchies off as many
30	servers as you like. They don't even have to be related.
31
32	<p>W3mir supports HTML4, and has partial support for CSS, Java,
33	ActiveX and Adobe Acrobat (PDF) files. And it works on Win32
34	machines.
35
36	<p><b>Warning:</b> W3mir enables you to copy a lot of things off the
37	Web, but remember, the things you retrieve might be copyrighted and
38	the copy you make with w3mir might in fact be illegal to make and
39	possess.
40
41	<hr>
42
43	<h2><a name="contents">Contents</a></h2>
44
45	<p><a href="#intro">README</a> (You want to read this! <b
46	class="red">Really!</b>)
47
48	<p><b>How do I...</b>
49	<ol>
50	<li><p><a href="#copy">copy a file?</a>
51	<li><p><a href="#recurse">copy a directory hierarchy?</a>
52	<li><p><a href="#resources">copy the needed resource files from another
53	directory hierarchy?</a>
54	<li><p><a href="#ignore">avoid copying files I don't want or copy only
55	files I want?</a>
56	<li><p><a href="#rm">remove the files that are no longer on the
57	original site from the mirror?</a>
58	<li><p><a href="#depth">limit how deep w3mir will recurse?</a>
59	<li><p><a href="#memory">limit w3mir's memory usage?</a>
60	<li><p><a href="#multi">copy files from multiple sites?</a>
61	<li><p><a href="#alias">copy files from one server with several names?</a>
62	<li><p><a href="#aborted">restart a mirror process after stopping it
63	prematurely?</a>
64	<li><p><a href="#enlarge">enlarge or prune an established mirror?</a>
65	<li><p><a href="#cat">'cat' a file?</a>
66	<li><p><a href="#list">list URLs in a document?</a>
67	<li><p><a href="#robots">disable robots.txt obedience?</a>
68	<li><p><a href="#corrupt">stop w3mir from corrupting binary files?</a>
69	<li><p><a href="#auth">copy a site that wants user-name and password?</a>
70	<li><p><a href="#mauth">access a site that wants several different
71	user-names and passwords?</a>
72	<li><p><a href="#proxy">use a proxy server?</a>
73	<li><p><a href="#pauth">authenticate myself to a proxy server?</a>
74	<li><p><a href="#proxytweak">ensure that the proxy server ...?</a>
75	<li><p><a href="#batchget">batch get files with w3mir?</a>
76	<li><p><a href="#cgi">handle CGI?</a>
77	<li><p><a href="#imap">handle server side image-maps?</a>
78	<li><p><a href="#java">handle Java and ActiveX?</a>
79	<li><p><a href="#script">handle java-script and other script languages?</a>
80	<li><p><a href="#css">handle the other things with 'partial support'?</a>
81	<li><p><a href="#anon">keep my identity secret?</a>
82	<li><p><a href="#ns">pretend that I'm using Netscape, Internet
83	Explorer or Lynx?</a>
84	<li><p><a href="#other">do other things?</a>
85	</ol>
86
87	<hr>
88
89	<h2><a name="intro">README</a></h2>
90
91	<p>W3mir may be used in two main ways:
92
93	<ul>
94	<li><p>To copy something random once.
95	<li><p>To keep a local mirror of some remote site
96	</ul>
97
98	<p>To copy something random once, there is a good chance you can
99	just start w3mir with some simple options and it will do the job you
100	want it to, provided that the remote site is not too complex and
101	your expectations of the copy aren't high :-) This is what wget, the
102	GNU w3 mirroring program, does and is good at.
103
104	<h3>Configuration file</h3>
105
106	<p>Once you want to keep a copy of a remote site up-to-date over time,
107	or mirror something with server side image-maps, redirects or
108	authentication, you have to write a configuration file for w3mir. This
109	is what w3mir is good at, compared to wget. Writing the file is not
110	hard, and there are two example files in the w3mir distribution. It
111	will also be explained here. The configuration file is typically
112	called <tt>.w3mirc</tt> (<tt>w3mir.ini</tt> on win32 machines), and
113	can be written with a simple text editor. It is kept in the top
114	directory of the mirror, where w3mir will find it when it starts.
115	Please refer to the <a href="#contents">contents</a> for how to handle
116	a specific problem with a configuration file.
117
118	<hr>
119
120	<h2>The answers:</h2>
121
122	<hr>
123
124	<h3><a name="copy">How do I copy a file?</a></h3>
125
126	<p>To copy the top page off www.starwars.com:
127
128	<p><tt>w3mir http://www.starwars.com/</tt>
129
130	<p><b>Note:</b> it is <em>important</em> that you give the trailing
131	slash for server names and directories.
132
133	<hr>
134
135	<h3><a name="recurse">How do I copy a directory hierarchy?</a></h3>
136
137	<p>To copy the entire stuff about episode I from www.starwars.com
138	which is stored in <tt>http://www.starwars.com/episode-i/</tt> (I don't
139	recommend this, it's quite a lot of data):
140
141	<p><tt>w3mir -r http://www.starwars.com/episode-i/</tt>
142
143	<p>The corresponding configuration file is simple:
144
145	<pre>
146	Options: recurse
147	URL: http://www.starwars.com/episode-i/
148	Fixup: run
149	</pre>
150
151	<p>The <tt>-r</tt> option makes w3mir recurse down from the starting
152	point. It will copy only the documents under
153	http://www.starwars.com/episode-i/ that it sees referenced from those
154	same documents. W3mir will <em>not</em> retrieve documents from
155	http://www.starwars.com/ because it is considered to be 'over' the
156	starting point.
157
158	<p>The command-line will get you a copy that is definitely browseable
159	via a WEB server, and possibly browseable directly from a CDROM or
160	hard-disk. To ensure that it is browseable from CDROM and disk you
161	need to use a configuration file with the <tt>Fixup: run</tt> line in.
162	It causes w3mir to edit anything that needs editing after the mirror
163	has completed, including fixing URLs that caused redirects. The dirty
164	work is done by w3mir's helper program w3mfix. The directive will
165	cause w3mfix to be run each time w3mir completes the mirror.
166
167	<p><b>Note:</b> it is <em>important</em> that you give the trailing
168	slash after the directory name. Specifying
169	<tt>http://www.starwars.com/episode-i</tt> and
170	<tt>http://www.starwars.com/episode-i/</tt> is quite different in
171	w3mir's eyes. In the former case episode-i is considered to be a
172	document within the / (top) directory of www.starwars.com and w3mir
173	will recurse from /, which is a lot more than you wanted. In the
174	latter case w3mir understands that episode-i is a directory and will
175	consider that directory to be the starting point, which is what you
176	wanted.
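
<p>The difference a trailing slash makes can be sketched with ordinary
URL resolution. This is a minimal illustration using Python's standard
library, not w3mir's own code; the helper name
<tt>starting_directory</tt> is made up for this example.

```python
from urllib.parse import urljoin


def starting_directory(url: str) -> str:
    """Return the directory a URL belongs to under standard URL resolution."""
    # Resolving "." against a base URL yields the directory of that base.
    # Without a trailing slash the last path segment counts as a document.
    return urljoin(url, ".")


if __name__ == "__main__":
    # Treated as a document in the top directory:
    print(starting_directory("http://www.starwars.com/episode-i"))
    # Treated as a directory in its own right:
    print(starting_directory("http://www.starwars.com/episode-i/"))
```

<p>The first call resolves to the top directory of the server and the
second to the episode-i directory, which mirrors the two cases
described in the note above.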
177
178	<hr>
179
180	<h3><a name="resources">How do I copy the needed resource files from
181	another directory hierarchy?</a></h3>
182
183	<p>Some sites store their documents in one place, and put their
184	banners, icons and such in a separate directory called
185	<tt>/images</tt>, <tt>/banners</tt>, <tt>/icons</tt>,
186	<tt>/resources</tt> or some such. Unless you retrieve these as well as
187	the documents things will probably not be too colorful. So, imagine
188	that the starwars site stored all the images in one holding directory
189	called <tt>/imagery</tt> and you want to copy all the stuff in it that
190	the episode-i pages need. Then you do this:
191
192	<pre>
193	Options: recurse
194	URL: http://www.starwars.com/episode-i/ episode-i
195	Also: http://www.starwars.com/imagery/ imagery
196	Fixup: run
197	</pre>
198
199	<p>There are two changes here compared to the simpler file we started
200	with: There is an extra argument at the end of the URL directive. It
201	tells w3mir to store everything gotten from
202	<tt>http://www.starwars.com/episode-i/</tt> in the subdirectory
203	<tt>episode-i</tt>. The directory can be omitted, but I think it's
204	neater this way. Then the new directive 'Also:'. It tells w3mir that
205	you also want whatever the documents under
206	<tt>http://www.starwars.com/episode-i/</tt> references under
207	<tt>http://www.starwars.com/imagery/</tt>.
208
209	<p><b>Note:</b> this will only get stuff that was used by the
210	documents under <tt>http://www.starwars.com/episode-i/</tt>, anything
211	stored under <tt>http://www.starwars.com/imagery/</tt> which is not
212	used will not be retrieved. If you want everything under
213	<tt>imagery</tt> to be retrieved, use the <tt>Also-quene:</tt>
214	directive.
215
216	<hr>
217
218	<h3><a name="ignore">How do I avoid copying files I don't want or copy
219	only files I want?</a></h3>
220
221	<p>To control what files w3mir copies you can use the
222	<tt>Ignore:</tt>, <tt>Fetch:</tt>, <tt>Ignore-RE:</tt> and
223	<tt>Fetch-RE:</tt> directives in the configuration file. The embedded
224	references to any file you choose to ignore, i.e., not copy, will point
225	at the original site, <em>not</em> to the mirror. This means that the
226	mirror user may still get ahold of the file from the original source
227	by simply clicking if she so desires.
228
229	<p>If a site contains huge .wav audio files that you are not
230	interested in you put
231
232	<pre>
233	Ignore: *.wav
234	</pre>
235
236	<p>in the configuration file. You may ignore as many different
237	filename patterns as you want. If you are mirroring a site you want
238	very few, specific files from, say all HTML (named
239	<em>something</em><tt>.html</tt>) and all Mpeg video files (named
240	<em>something</em><tt>.mpg</tt>) you can write this:
241
242	<pre>
243	Fetch: *.html
244	Fetch: *.mpg
245	Ignore: *
246	</pre>
247
248	<p>W3mir will test each filename against each Fetch/Ignore rule in
249	sequence. An html file will match the first line and be fetched. Any
250	mpg file will match the second line and be fetched. All other files
251	will match the third line, and be ignored. This last line is needed
252	because the default is to get any files which are not ignored. By
253	arranging fetch and ignore rules carefully you may retrieve exactly
254	the filename patterns you want and not retrieve anything else.
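
<p>The first-match-wins behaviour described above can be sketched as
follows. This is an approximation under stated assumptions: it uses
Python's <tt>fnmatch</tt> module as a stand-in for w3mir's own
wild-card matching, and the names <tt>RULES</tt> and <tt>decide</tt>
are invented for the illustration.

```python
from fnmatch import fnmatchcase

# The rules from the example above, in the order they appear in the file.
RULES = [
    ("Fetch", "*.html"),
    ("Fetch", "*.mpg"),
    ("Ignore", "*"),
]


def decide(filename, rules=RULES):
    """Return 'Fetch' or 'Ignore' for a filename; the first matching rule wins."""
    for action, pattern in rules:
        if fnmatchcase(filename, pattern):
            return action
    # No rule matched: the default is to get the file.
    return "Fetch"


if __name__ == "__main__":
    for name in ("index.html", "trailer.mpg", "theme.wav"):
        print(name, decide(name))
```

<p>With an empty rule list every file is fetched, which is why the
final <tt>Ignore: *</tt> line is needed in the example.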
255
256	<p>If you decide you also want all Mpeg Layer 3 audio files
257	(<em>something.</em><tt>mp3</tt>) from the site after the mirror has
258	been established, you add this:
259
260	<pre>
261	Fetch: *.mp3
262	</pre>
263
264	<p>as the third line, making the <tt>Ignore: *</tt> line the fourth and
265	last. Then you must fix all references to .mp3 files within the
266	mirror by running w3mfix thus:
267
268	<pre>
269	w3mfix -editref .mp3
270	</pre>
271
272	<p>which will edit all references to .mp3 files, pointing them to the
273	right place on your disk. Ditto when you remove a fetch rule, or add
274	or remove an ignore rule. See the answer about <a
275	href="#enlarge">enlarging and pruning</a> mirrors for more examples of
276	using <tt>w3mfix -editref ...</tt>
277
278	<p><b>Note:</b> when retrieving only a very limited set of files, as
279	in the example above, you <em>must</em> retrieve the html files,
280	because how else will w3mir find URLs of files to retrieve? Only html
281	files contain links to other files.
282
283	<p>Similarly, you may choose to not mirror whole branches of the
284	original site. If, for example, you mirror my home-pages, and you decide
285	not to mirror the comics pages you can put
286
287	<pre>
288	Ignore: /ts/
289	</pre>
290
291	<p>or more precisely
292
293	<pre>
294	Ignore: http://www.ifi.uio.no/~janl/ts/
295	</pre>
296
297	<p>in the configuration file. If you do this after having established
298	the mirror you use w3mfix to fix the references:
299
300	<pre>
301	w3mfix -editref /ts/
302	</pre>
303
304	<p><tt>Fetch:</tt> and <tt>Ignore:</tt> rules can only use a very
305	limited subset of the Unix wild-cards. w3mir understands only '?',
306	'*', and '[a-z]' ranges.
307
308	<p><tt>Ignore-RE:</tt> and <tt>Fetch-RE:</tt> are the same as
309	<tt>Fetch:</tt> and <tt>Ignore:</tt> except that they give you access
310	to the full power of Regular Expressions to make rules for what to get
311	or not to get. They support Perl's superset of the normal Unix regular
312	expression syntax. They must be completely specified, including the
313	prefixed m, a delimiter of your choice (except the paired delimiters:
314	parentheses, brackets and braces), and any of the RE modifiers. E.g.,
315
316	<pre>
317	Ignore-RE: m/.gif$/i
318	</pre>
319
320	<p>or
321
322	<pre>
323	Ignore-RE: m~/././.*/~
324	</pre>
325
326	<p>and so on. "#" cannot be used as delimiter as it is the comment
327	character in the configuration file.
328
329	<p>There are some traps when using <tt>Ignore-RE</tt> and
330	<tt>Fetch-RE</tt>, please see their documentation in <tt>mandoc
331	w3mir</tt> for a more complete explanation.
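
<p>One general regular-expression pitfall, not necessarily the one the
manual has in mind, is that an unescaped dot matches any character. The
sketch below uses Python's <tt>re</tt> module, which behaves like Perl
on this point; the example.org URLs and the helper <tt>matches</tt> are
made up for the illustration.

```python
import re

# Unescaped dot: "any character followed by gif at the end".
loose = re.compile(r".gif$", re.IGNORECASE)
# Escaped dot: "a literal .gif at the end".
strict = re.compile(r"\.gif$", re.IGNORECASE)


def matches(pattern, url):
    """Return True if the pattern is found anywhere in the URL."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    for url in ("http://example.org/pics/logo.GIF",
                "http://example.org/docs/bigif"):
        print(url, matches(loose, url), matches(strict, url))
```

<p>Both patterns match the image, but only the loose one also matches
the document whose name merely ends in the letters "gif".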
332
333	<hr>
334
335	<h3><a name="depth">How do I limit how deep w3mir will recurse?</a></h3>
336
337	<p>W3mir has no explicit mechanism to limit the depth of recursion,
338	but the same result can be achieved with a simple <tt>Ignore</tt> rule:
339
340	<pre>
341	Ignore: ///////
342	</pre>
343
344	<p>This will ignore any URLs that contain at least 7 slashes ("/").
345	Note that a URL contains three slashes that do not have anything to
346	do with depth:
347
348	<pre>
349	http://www.ifi.uio.no/
350	</pre>
351
352	<p>so only the surplus slashes are used for depth in this match. In the
353	example above the limit is 4 levels from the top. The
354	<tt>Ignore:</tt> rule that is used to limit recursion depth must be
355	listed before any <tt>Fetch:</tt> rules to be effective.
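
<p>The slash arithmetic can be checked with a few lines of Python. This
is only a sketch of the counting described above, not of how w3mir
evaluates the rule internally; the function names and the sample paths
are invented for the example.

```python
def depth(url):
    """Directory depth below the server root: slashes beyond the first three."""
    return url.count("/") - 3


def is_too_deep(url, slash_limit=7):
    """True if the URL contains at least slash_limit slashes."""
    return url.count("/") >= slash_limit


if __name__ == "__main__":
    for url in ("http://www.ifi.uio.no/",
                "http://www.ifi.uio.no/a/b/c/d.html",
                "http://www.ifi.uio.no/a/b/c/d/e.html"):
        print(url, depth(url), is_too_deep(url))
```

<p>A document three directories down has six slashes and is kept; one
more directory level brings the count to seven and the URL is ignored.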
356
357	<hr>
358
359	<h3><a name="memory">How do I limit w3mir's memory usage?</a></h3>
360
361	<p>In a mirror consisting of <em>many</em> files, such as an archive of
362	an active mailing list, w3mir will build a very large referer table, in
363	part for w3mir to use in the <tt>Referer:</tt> header and in part for
364	w3mfix to use in fixing references.
365
366	<p>If you both disable the <tt>Referer:</tt> header and don't use
367	w3mfix, w3mir will not build a referer table. You do this in the
368	configuration file:
369
370	<pre>
371	Disable-headers: referer
372	Fixup: off
373	</pre>
374
375	<p>Please note the potential problems of turning off fixup described
376	earlier in this howto. There are normally no problems associated with
377	simple sites, but if there are redirects fixup <em>is</em> needed for
378	a consistent mirror.
379
380	<hr>
381
382	<h3><a name="rm">How do I remove the files that are no longer on the
383	original site from the mirror?</a></h3>
384
385	<p>Over time the site you mirror will add files, and quite possibly
386	remove files. Or you might introduce new <tt>Ignore:</tt> rules after
387	establishing the mirror that reduce the files wanted in the mirror.
388
389	<p>By default w3mir will not delete such old files, since some people might
390	want to keep the files even if they are removed from the original
391	site. To remove the old/unwanted files you add 'remove' to the
392	<tt>Options:</tt> line.
393
394	<hr>
395
396	<h3><a name="multi">How do I copy files from multiple sites?</a></h3>
397
398	<p>In the answer about <a href="#resources">resource files</a> we saw how to mirror several
399	related sites. For example, say you want to mirror all my home-pages
400	into one mirror:
401
402	<pre>
403	Options: recurse
404	URL: http://www.math.uio.no/~janl/ math/janl
405	Also: http://www.math.uio.no/drift/personer/ math/drift
406	Also: http://www.ifi.uio.no/~janl/ ifi/janl
407	Also: http://www.mi.uib.no/~nicolai/ math-uib/nicolai
408	</pre>
409
410	<p>As in the previous example this will only get documents that are
411	referenced. Any documents that are stored at these locations but to
412	which w3mir finds no references will not be retrieved. So this will
413	fail if the sites are not in any way related, or if you wanted
414	<em>everything</em> stored at each site.
415
416	<p>To mirror unrelated sites, or get it all you may specify that the
417	given URL should be considered a starting-point as well:
418
419	<pre>
420	Also-quene: http://www.math.uio.no/drift/personer/ math/drift
421	</pre>
422
423	<p>and, if you want to add an additional starting-point within an already
424	named site:
425
426	<pre>
427	Quene: http://www.math.uio.no/drift/personer/foo.html
428	</pre>
429
430	<p>Armed with that you should be able to get pretty much anything you
431	like.
432
433	<hr>
434
435	<h3><a name="alias">How do I copy files from one server with several
436	names?</a></h3>
437
438	<p>Simple, the same way you mirror several servers with different
439	names. The math department at University of Oslo has a web server
440	known under two names: math-www.uio.no and www.math.uio.no, and both
441	names are used in documents stored on it. To copy the whole server,
442	one time only, give these URL and Also lines:
443
444	<pre>
445	URL: http://www.math.uio.no/ .
446	Also: http://math-www.uio.no/ .
447	</pre>
448
449	<p>Note the period/dot (.) at the end of each line. It means that
450	w3mir will store the files in the current directory, i.e. documents
451	from both servers will be stored in the same place. But since w3mir
452	asks to only get documents that are newer than the ones it already has,
453	any document gotten from the server under the www.math.uio.no name
454	will not be gotten from the math-www.uio.no name as well. ... w3mir
455	will ask for the document, but the server will tell w3mir that its
456	copy is current and there will be no additional transfer of the
457	document.
458
459	<hr>
460
461	<h3><a name="enlarge">How do I enlarge or prune an established
462	mirror?</a></h3>
463
464	<p>This only works if you use a configuration file.
465
466	<p>If you want to add a site or directory to a mirror you simply add
467	the needed <tt>Also:</tt> or <tt>Also-Quene:</tt> to the configuration
468	file and then you run w3mfix manually, with the -editref option. If,
469	for example, you have established a mirror of my home-pages, but want to
470	add my wife's home-page you add this
471
472	<pre>
473	Also: http://www.ifi.uio.no/~annen/ ifi/annen
474	</pre>
475
476	<p>to the configuration shown earlier. Then you run w3mfix, and you want
477	it to fix all URLs referencing her home-page, the distinguishing
478	characteristic is the name 'annen':
479
480	<pre>
481	w3mfix -editref annen
482	</pre>
483
484	<p>but
485
486	<pre>
487	w3mfix -editref http://www.ifi.uio.no/~annen/
488	</pre>
489
490	<p>would work too; it's just a lot more to type. This fixes all the
491	references to her home-page so that they point to the mirror instead of
492	the original pages.
493
494	<p>To prune (cut out something) a mirror you do the same. Make the
495	change in the configuration file and run 'w3mfix -editref ...' to fix
496	the references to that which you removed.
497
498	<hr>
499
500	<h3><a name="cat">How do I 'cat' a file?</a></h3>
501
502	<p>W3mir will output the fetched document to its standard output
503	(normally your screen/window) if you specify the '-s' command line
504	option. The corresponding configuration file directive is
505
506	<pre>
507	File-Disposition: stdout
508	</pre>
509
510	<hr>
511
512	<h3><a name="list">How do I list URLs in a document?</a></h3>
513
514	<p>To list the URLs in http://www.math.uio.no/:
515
516	<pre>
517	w3mir -q -f -l http://www.math.uio.no/
518	</pre>
519
520	<p>The <tt>-q</tt> switch causes w3mir to produce no other output
521	which would disturb the URL listing. The <tt>-f</tt> switch tells
522	w3mir to forget the document once it has been analyzed, i.e., not save
523	it on disk. And finally, the <tt>-l</tt> switch makes w3mir list the
524	URLs in the document. You may combine <tt>-l</tt> with <tt>-r</tt>
525	and you need not use it with <tt>-f</tt>.
526
527	<p>In the configuration file you put <tt>list</tt> on the
528	<tt>Options:</tt> line.
529
530	<hr>
531
532	<h3><a name="aborted">How do I restart a mirror process after stopping
533	it prematurely?</a></h3>
534
535	<p>You may just rerun the same command once more. But that makes
536	w3mir request all the documents you have already once more to see if a
537	more recent version is available on the server. You can save time by
538	using the <tt>-fs</tt> (Fetch Some) option. This makes w3mir only
539	request documents it does not find on your disk. E.g.:
540
541	<p><tt>w3mir -fs -r http://www.starwars.com/</tt>
542
543	<p>This is not something you would normally put in the configuration
544	file, but you can, by adding 'only-nonexistent' on the 'Options:' line.
545
546	<hr>
547
548	<h3><a name="robots">How do I disable robots.txt obedience?</a></h3>
549
550	<p>Normally w3mir will read and obey each site's robots.txt file,
551	because w3mir wants to be a nice tool. However robots.txt was designed
552	with something slightly different than the normal use of w3mir in
553	mind, so if you want w3mir to disregard the robot rules you can use
554	<tt>-drr</tt> (Disable Robot Rules) on the command-line, or the line
555
556	<pre>
557	Robot-Rules: off
558	</pre>
559
560	<p>in the configuration file. The robot exclusion standard is
561	described in <a
562	href="http://info.webcrawler.com/mak/projects/robots/norobots.html">http://info.webcrawler.com/mak/projects/robots/norobots.html</a>.
563
564	<hr>
565
566	<h3><a name="corrupt">How do I stop w3mir from corrupting binary
567	files?</a></h3>
568
569	<p>During the normal course of events w3mir converts the newline
570	format of fetched HTML documents to your system's native newline
571	format. On Unix a newline consists of a single ASCII LF character, on
572	Macintoshes it's a single ASCII CR character and on Dos/Windows it's an
573	ASCII CR/LF pair. W3mir understands all these and all HTML files are
574	saved in the format your operating system prefers.
575
576	<p>If, and this is very unlikely, a web server identifies a binary
577	file as HTML w3mir will very likely corrupt the file. If you discover
578	a file which is obviously ruined in the mirror, but is not ruined when
579	you view it on the original site do this:
580
581	<ol>
582
583	<li>Notify the webmaster on the original site that the file has the
584	wrong MIME type
585
586	<li>Use the <tt>-nnc</tt> (No Newline Conversion) option on the
587	command line, or
588
589	<pre>
590	Options: no-newline-conv
591	</pre>
592
593	in the configuration file.
594
595	<li>Remove the corrupt file(s).
596
597	<li>Run "<tt>w3mir -fs</tt>...", to fetch only the deleted file(s)
598	again.
599
600	</ol>
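
<p>Why newline conversion ruins a binary file can be seen in a small
sketch. It assumes a Unix-style target format and uses a made-up helper,
<tt>to_unix_newlines</tt>; it is not w3mir's actual conversion code.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CR/LF pairs and lone CR characters to a single LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


if __name__ == "__main__":
    # Text survives the conversion: only the line endings change.
    html = b"<p>one\r\n<p>two\r<p>three\n"
    print(to_unix_newlines(html))

    # The 8-byte PNG signature contains a CR/LF pair, so a PNG file that
    # is wrongly treated as HTML loses a byte and is no longer valid.
    png_signature = b"\x89PNG\r\n\x1a\n"
    damaged = to_unix_newlines(png_signature)
    print(len(png_signature), len(damaged))
```

<p>The same loss happens anywhere a CR byte occurs in binary data,
which is why the file must be fetched again without conversion.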
601
602	<hr>
603
604	<h3><a name="auth">How do I copy a site that wants user-name and
605	password?</a></h3>
606
607	<p>This can only be done with a configuration file. Being able to
608	give this on the command-line would give the user-name and password away
609	to other users of the system, so the ability to give authentication
610	information that way has not been put in w3mir.
611
612	<p>In the configuration file you put:
613
614	<pre>
615	Auth-domain: /
616	Auth-user: me
617	Auth-passwd: my-password
618	</pre>
619
620	<p>This will cause w3mir to give the user-name and password each time
621	the server asks. There is no way to make w3mir give the user-name and
622	password each time no matter if the server asks or not.
623
624	<hr>
625
626	<h3><a name="mauth">How do I access a site that wants several
627	different user-names and passwords?</a></h3>
628
629	<p>If you have several user-names and passwords across
630	the server(s) that are copied you need a slightly more advanced
631	version of this that associates each user-name/password with an
632	authentication "domain". "Domain" is an HTTP concept. It is simply a
633	grouping of files and documents within a "realm". One file or a whole
634	directory hierarchy can belong to a realm. One server may have many
635	realms. A user may have separate passwords for each realm, or the
636	same password for all the realms the user has access to. A
637	combination of a server name, server port and a realm is called a
638	domain.
639
640	<pre>
641	Auth-domain: theserver:theport/therealm
642	Auth-user: me
643	Auth-passwd: my-password
644
645	Auth-domain: theserver:theport/otherrealm
646	Auth-user: other-me
647	Auth-passwd: other-password
648	</pre>
649
650	W3mir will tell you what the name of the realm is if it is unable to
651	authenticate itself with the server. You may also use '*' as the realm
652	name if you only copy documents from one realm on that server.
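
<p>One way to picture the domain lookup is as a table keyed by
server, port and realm, with '*' as a fallback realm. This is a
hypothetical sketch: the port numbers, the <tt>otherserver</tt> entry
and the function <tt>find_credentials</tt> are assumptions made for the
example and say nothing about how w3mir stores this internally.

```python
# Keys follow the "server:port/realm" form used by Auth-domain.
CREDENTIALS = {
    "theserver:80/therealm": ("me", "my-password"),
    "theserver:80/otherrealm": ("other-me", "other-password"),
    "otherserver:8080/*": ("guest", "guest-password"),  # hypothetical entry
}


def find_credentials(host, port, realm, table=CREDENTIALS):
    """Look up (user, password) for a domain, falling back to the '*' realm."""
    exact = f"{host}:{port}/{realm}"
    if exact in table:
        return table[exact]
    # No exact entry: try the wildcard realm for that server and port.
    return table.get(f"{host}:{port}/*")


if __name__ == "__main__":
    print(find_credentials("theserver", 80, "otherrealm"))
    print(find_credentials("otherserver", 8080, "anything"))
    print(find_credentials("theserver", 80, "unknown"))
```

<p>The last lookup returns nothing, which corresponds to the case where
w3mir cannot authenticate and reports the realm name to you.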
653
654	<hr>
655
656	<h3><a name="proxy">How do I use a proxy server?</a></h3>
657
658	<p>On some secured sites you have to access the Internet through proxy
659	servers to get out of the internal network.
660
661	<p>A proxy server has a host name, and a port you must use. On the
662	command line you simply specify <tt>-P proxy-host-name:proxy-port</tt>. In
663	the configuration file you put this:
664
665	<pre>
666	HTTP-Proxy: proxy-host-name:proxyport
667	</pre>
668
669	<p>The main advantage of working through proxy servers, other than
670	security, is that you take advantage of any caching the proxy server
671	does, which can speed up retrievals enormously.
672
673	<p>Another use of the proxy option is to "prime" the proxy server's
674	cache. I.e. you can use w3mir to fetch the documents through the proxy
675	server to ensure that the documents are cached there later when you
676	want to read them with your browser. If you also specify
677
678	<pre>
679	File-Disposition: forget
680	</pre>
681
682	<p>it won't even use any space on your disk, w3mir will just process
683	the documents looking for URLs and then <em>not</em> save them.
684
685	<hr>
686
687	<h3><a name="pauth">How do I authenticate myself to a proxy
688	server?</a></h3>
689
690	<p>Some proxy servers demand a user-name and password to let you use
691	them. W3mir does not support the domain concept in connection with
692	proxy authentication because the author cannot imagine that it will be
693	needed. You need to put this in your configuration file:
694
695	<pre>
696	HTTP-Proxy-user: proxy-username
697	HTTP-Proxy-passwd: proxy-password
698	</pre>
699
700	<hr>
701
702	<h3><a name="proxytweak">How do I ensure that the proxy server
703	...?</a></h3>
704
705	<p>HTTP/1.0 proxy servers may be told to not use their current copy of
706	a document if you specify the <tt>-pflush</tt> command-line option. Or
707
708	<pre>
709	Proxy-Options: refresh
710	</pre>
711
712	<p>in the configuration file. This is useful if the proxy has an old
713	copy of some document and does not realize that a newer version exists
714	on the origin site. W3mir uses the HTTP/1.0 version of this command
715	by default. You can force w3mir to use the HTTP/1.1 version by adding
716	<tt>no-pragma</tt> to the line. If you do this it will not work at
717	all as you want unless the server knows the HTTP/1.1 protocol.
718
719
720	<p>HTTP/1.1 proxy servers can be manipulated in a few more ways. The
721	configuration file <tt>Proxy-Options:</tt> directive also takes
722	<tt>revalidate</tt> and <tt>no-store</tt> options. The former tells
723	the proxy server to check if there is any newer version available.
724	This is, in principle, more network friendly than the <tt>refresh</tt>
725	option since it will only cause a copy if there is a newer file
726	available. The <tt>no-store</tt> option tells the proxy server to not
727	store the documents you transfer. This might be useful if the
728	documents are 'sensitive' or something like that, but if the proxy
729	server does not understand HTTP/1.1 it will not obey this option, and
730	it might store the document anyway because the functionality is not
731	implemented, so you should not count on this to work.
732
733	<hr>
734
735	<h3><a name="batchget">How do I batch get files with w3mir?</a></h3>
736
737	<p>Normally when fetching files w3mir will process each html (and PDF)
738	file to find URLs in them for further retrievals. This is
739	time-consuming, and not always wanted. Sometimes you simply want to
740	get a file, or more, and save it, untouched:
741
742	<pre>
743	w3mir -B http://www.starwars.com/ http://www.ifi.uio.no/~janl/
744	</pre>
745
746	<p>There is a companion switch for <tt>-B</tt>, namely <tt>-I</tt>, it
747	makes w3mir read URLs from its standard input, one per line. Thus you
748	can use w3mir in a pipe to batch get several files whose URLs you find
749	in some way. This is a stupid example:
750
751	<pre>
752	w3mir -q -l -f http://www.ifi.uio.no/ | w3mir -I -B
753	</pre>
754
755	<p><tt>-B</tt> may also be used with <tt>-r</tt>, but the only effect
756	it will have then is to save the html files unchanged on disk, because
757	to recurse w3mir <em>has</em> to examine all the html documents
758	for URLs.
759
760	<p><b>Please note</b> that using <tt>-B</tt> combined with <tt>-r</tt>
761	for mirroring will probably lead to an unstable mirror, because w3mir
762	does not get a chance to manipulate the URLs in the documents as it
763	needs to be able to maintain a mirror later, and most important of
764	all, w3mir needs all html files to contain a &lt;HTML&gt; tag to be
765	able to recognize an HTML file as an HTML file. When running with the
766	<tt>-B</tt> switch w3mir will not ensure the presence of this and thus
767	we must rely on the original documents author to be nice. This is a
768	bad bet. In other words, <b>don't use <tt>-B</tt> for recursive
769	mirroring</b>, only for batch copying/mirroring of single documents.
770
771	<hr>
772
773	<h3><a name="cgi">How do I handle CGI?</a></h3>
774
775	<p>There is no way w3mir can duplicate the process that happens on the
776	Web server when it comes to CGI. For some CGI programs w3mir can
777	simply copy the output and store on disk. For other CGI programs this
778	is not possible, and the only way out is to make w3mir not get the
779	involved files using Ignore rules in the configuration file. These
780	will avoid a lot of cgi programs:
781
782	<pre>
783	Ignore: *.cgi
784	Ignore: *-cgi
785	</pre>
786
787	<p>You might have to add other/more rules for some sites if they have
788	other naming conventions or if it's simply impossible to tell from the
789	file-name if it's a CGI or not.
790
791	<p>When you add ignore rules this causes two things:
792
793	<ol>
794	<li><p>W3mir will not retrieve documents matching the rules
795	<li><p>W3mir will make all references to matching documents point to
796	the site you mirrored from instead of pointing to a non-existent
797	file in the mirror.
798	</ol>
799
800	<hr>
801
802	<h3><a name="imap">How do I handle server side image-maps?</a></h3>
803
804	<p>Server side image-maps are yet another thing it's impossible for
805	w3mir to relate to. w3mir simply cannot handle them. Put ignore
806	rules in the configuration file:
807
808	<pre>
809	Ignore: *.map
810	</pre>
811
812	<p>W3mir has full support for client side image-maps though.

<hr>

<h3><a name="java">How do I handle Java and ActiveX?</a></h3>

<p>Java and ActiveX objects are included in HTML pages with an
<tt>&lt;OBJECT&gt;</tt> or <tt>&lt;APPLET&gt;</tt> tag. W3mir can
handle these on one condition: the CODEBASE attribute names the
directory where the program stores its resources (such as
subprograms, graphic files, sound, text, and so on) and w3mir must
have read access to this directory. Otherwise w3mir is without hope:
it's impossible to extract the names of the resources the program
needs in any reliable way.

<p>HTML4 supports an attribute that enumerates the resources the
program needs, but w3mir is not able to use this yet.
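
<p>As an illustration, an applet might be embedded like this. The
class and directory names are invented for the example:

<pre>
&lt;applet code="Clock.class" codebase="classes/" width="100" height="100"&gt;
Your browser does not support Java.
&lt;/applet&gt;
</pre>

<p>Here the CODEBASE is <tt>classes/</tt>, so for this to work w3mir
would have to be able to read that directory on the server to find
<tt>Clock.class</tt> and whatever else the applet loads from there.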

<hr>

<h3><a name="script">How do I handle JavaScript and other script
languages?</a></h3>

<p>W3mir does its best to pass scripts (JavaScript, PerlScript,
etc.) embedded in the HTML undamaged. It cannot, however, extract
any URLs that a script generates and that the browser would then make
the document refer to or embed in the page.

<p>Things will still work if the script generates relative references
and w3mir has some other way to find the referenced files. And if
the script generates absolute references and the person browsing the
mirror has access to the site named, the user will be able to browse
the referenced documents on that other server.
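
<p>A small made-up example shows the problem. The image URL never
appears in the document as such. It is only put together when the
browser runs the script, so w3mir has nothing to extract:

<pre>
&lt;script type="text/javascript"&gt;
var n = 3;
// the URL "pics/photo3.gif" only exists once the script has run
document.write('&lt;img src="pics/photo' + n + '.gif"&gt;');
&lt;/script&gt;
</pre>

<p>Since the reference is relative, the mirrored page will still work
if w3mir happens to get <tt>pics/photo3.gif</tt> some other way, for
instance because another page links to it directly.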

<hr>

<h3><a name="css">How do I handle the other things with 'partial
support'?</a></h3>

<p>W3mir has partial support for CSS. This means that
<tt>&lt;style&gt;</tt> tags and the enclosed style data are passed
undamaged by w3mir. W3mir will also retrieve the external style
sheets named in HTML documents. But w3mir will <em>not</em> (yet)
analyze the style sheet data to find URLs of other resources (such as
fonts) named in it.
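
<p>For example, take a page containing the following lines (the file
names are made up). Going by the description above, w3mir should pass
the style block through untouched and retrieve <tt>site.css</tt>, but
it will not discover <tt>paper.gif</tt>, since that URL appears only
inside style data:

<pre>
&lt;link rel="stylesheet" type="text/css" href="site.css"&gt;
&lt;style type="text/css"&gt;
body { background-image: url(paper.gif) }
&lt;/style&gt;
</pre>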

<p>W3mir also has partial support for Adobe Acrobat (PDF) files. This
means that w3mir can extract URLs from PDF files, and get the named
documents if you want them. But w3mir cannot edit those URLs so that
the PDF files point to the mirror instead of wherever on the original
site they were pointing. If the PDF files contain absolute URLs they
will continue pointing to where they were pointing before. However,
if the PDF files contain relative references things will work out.

<p>The reason that URLs in PDF files cannot be edited is that PDF
files are binary and contain byte pointers. If the length of a URL is
changed the byte pointers will point to the wrong place in the
document. Writing code to correct these pointers would be quite
complex. But if you write it I will use it.

<hr>

<h3><a name="anon">How do I keep my identity secret?</a></h3>

<p>The HTTP protocol has a header, <tt>User:</tt>, which robots such
as w3mir are recommended to use. Another way to track you is to look
at the <tt>Referer:</tt> header w3mir gives in HTTP requests. Both can
be disabled:

<pre>
Disable-headers: referer, user
</pre>

<p>If, in addition, you use a proxy server shared with many other
users, it is unlikely that the server you are copying things from can
(easily) track you. You are, however, much easier to track from the
logs in the proxy server. And a court order is quite likely to get
you tracked in spite of any precautions you take.

<p>W3mir does not support cookies and thus you cannot be tracked with
the help of that mechanism.

<hr>

<h3><a name="ns">How do I pretend that I'm using Netscape, Internet
Explorer or Lynx?</a></h3>

<p>Some web sites give you different documents when you ask for a
specific URL based on what browser you use, or even what OS you appear
to be using. W3mir identifies itself with a string that looks like
this:

<p><tt>w3mir/<em>version</em>-<em>release-date</em></tt>

<p>Netscape identifies itself with strings that look something like
this:

<p><tt>Mozilla/3.01 (X11; I; Linux 2.0.30 i586)</tt>

<p>and Internet Explorer says it's something like this:

<p><tt>Mozilla/2.0 (compatible; MSIE 3.02; Windows NT)</tt>

<p>and Lynx says something like this:

<p><tt>Lynx/2.6 libwww-FM/2.14</tt>

<p>You can change w3mir's identification with <tt>-agent 'string'</tt>
on the command line. In the configuration file you put

<pre>
Agent: Mozilla/3.01 (X11; I; Linux 2.0.30 i586)
</pre>

<p>to pretend w3mir is Netscape 3.01.

<hr>

<h3><a name="other">How do I do other things?</a></h3>

<p>This document is by no means a complete list of the things you can
do with w3mir. The w3mir man page (<tt>man w3mir</tt> or <tt>perldoc
w3mir</tt>) lists more things, and goes into more detail about how
things work so you can use the knowledge to do neat things. Several
things are mentioned only in the man page that help you with tricky
multi-server mirroring, and give you better control of what to get
and not to get and under what name to save it on disk. And a couple
of other things...

<hr>
<address>Nicolai Langfeldt 9/7/1998</address>
