PROPAGANDA:

See http://www.math.uio.no/~janl/w3mir/ for propaganda.

--------------------------------------------------------------------------
Q: Where can I get a new version of w3mir?
A: The w3mir homepage is at http://www.math.uio.no/~janl/w3mir/.
   W3mir is also distributed on CPAN: http://www.perl.com/

Q: Are there any mailing lists?
A: Yes, see below.

Q: Should I subscribe to any of the mailing lists?
A: Yes, if you use w3mir at all you should subscribe to
   [email protected]; send e-mail to [email protected] to be
   subscribed.

Q: I found a bug!
A: See below.

Q: How do I...
A: There is a w3mir-HOWTO.html file in the distribution. Read it.

--------------------------------------------------------------------------
BUGS:

- -lc switch does not work too well.

Please see below for how to report bugs.

--------------------------------------------------------------------------
FEATURES (NOT bugs):

- URLs with two /es ('//') in the path component do not work
  as some might expect. According to my reading of the http/url spec.
  it is an illegal construct, which is a Good Thing, because I don't
  know how to handle it if it's legal.
- If you start at http://foo/bar/ then index.html might be fetched twice.
- Some documents link to a location above the server root, e.g.,
  http://some.server/../stuff.html. Netscape, and other browsers, in
  defiance of the URL standard documents, will change the URL to
  http://some.server/stuff.html. W3mir will not.
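
The difference between the two treatments of '..' above the server root
can be sketched in a few lines. This is an illustration in Python, not
w3mir's own Perl code; the function name resolve_dot_segments and its
clamp_at_root flag are made up for this example, and the logic is a
simplified path cleaner rather than the algorithm of any URL standard.

```python
def resolve_dot_segments(path, clamp_at_root):
    """Collapse '.' and '..' segments in an absolute URL path.

    clamp_at_root=True silently drops any '..' that would climb above
    '/', which is the browser behaviour described in the text.
    clamp_at_root=False keeps such segments, which is what the text
    says w3mir does. (Hypothetical helper, for illustration only.)
    """
    out = []
    for seg in path.split("/")[1:]:
        if seg == ".":
            continue
        if seg == "..":
            if out and out[-1] != "..":
                out.pop()             # ordinary "go up one directory"
            elif not clamp_at_root:
                out.append("..")      # keep the climb above the root
            continue
        out.append(seg)
    return "/" + "/".join(out)


print(resolve_dot_segments("/../stuff.html", clamp_at_root=True))   # /stuff.html
print(resolve_dot_segments("/../stuff.html", clamp_at_root=False))  # /../stuff.html
```

With clamping the two URLs in the example become the same document;
without it they stay distinct, so a link above the root is requested
exactly as written.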

--------------------------------------------------------------------------
MAIL LISTS, REPORTING BUGS:

Please send ideas (see the todo lists further down first, please) and bug
reports to [email protected]. Please include the URL and the command
line/config file that triggered the bug. [email protected] is mainly used for
announcing new versions or bugs to w3mir users. You will get more
information by following that list than anywhere else, and it's very low
volume. To subscribe to these lists, email [email protected]. The
w3mir-core list is intended for w3mir hackers only.

--------------------------------------------------------------------------
COPYRIGHTS:

w3mir, w3http.pm, w3pdfuri.pm and htmlop.pm are free, but they are
copyrighted by the various hackers involved. If you want to copy,
hack or distribute w3mir you can do so, provided you comply with the
'Artistic License' enclosed in the w3mir distribution in the file
named Artistic.

--------------------------------------------------------------------------
CREDITS:

- Oscar Nierstrasz: Wrote htget.
- Gorm Haug Eriksen: Started w3mir on the foundations of htget,
  contributed code later.
- Nicolai Langfeldt: Learning from Oscar's and Gorm's mistakes, rewrote
  everything.
- Chris Szurgot: Adapting to win32, good ideas and code contribs,
  debugging. And criticism.
- Rik Faith: Uses w3mir extensively, not shy about complaining and
  commenting and suggesting.
- The libwww-perl author(s), who made adding some new features
  ridiculously easy.

--------------------------------------------------------------------------
TODO LIST:

* TODO, version 1.1:

Some of these are speculative, some others are very useful.

- CSS parsing/support at the same level as HTML.
- Full support for APPLET/OBJECT tags.
- Alias rules. These would enable w3mir to map ukoln.bath.ac.uk and
  bubl.bath.ac.uk to www.ukoln.ac.uk and know that the objects contained
  in these are all the same. Another use would be to mirror from a mirror
  instead of the _real_ site, when the original site that your references
  point to is on a slow link while the mirror is on a fast link.
- FTP support (easy if through an http style ftp proxy, but is that what
  we want?)
- SHTTP/SSL support.
- Add a 3rd argument to URL/Also that denotes the depth of documents to
  be retrieved under each; this would automatically generate fetch/ignore
  rules. How well these work would depend on the order of the
  URL/Also directives, though. Must be documented.
- Retrieve recursively until N-th order links are reached. This
  differs significantly from the directory recursion which we do now. When
  this is done w3mir should also know the difference between inline and
  external links; inline links should always be retrieved. Trivia question:
  What order is needed to reach every findable document on the web from
  Yahoo?
- Integrate with cvs or rcs (or another version control system) to make
  the retriever able to reproduce the mirrored site for any given date.
- Some text processing: Adding and removing text/sgml comments when suitable
  options and tags are found.
- Put retrieval date-stamps as comments in html files, to document the when
  and how of the document's retrieval.
- Example: If you're mirroring a site primarily to get to the papers, but
  the site has n versions of each paper: foo.ps.gz, foo.ps.Z, foo.dvi.gz,
  foo.dvi.Z, foo.tar.gz, foo.zip and you only need one version. Implement
  a way to get only one version of documents provided in multiple versions,
  something like a multi-axis preference list to get only the most attractive
  version of the doc.
- Logging of retrievals to file; need to change every print to a function call.
- Your suggestion here.
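
The "N-th order links" item in the list above could work roughly like the
sketch below. It is written in Python purely for illustration and is not
part of w3mir; the function plan_retrieval, the links table and the
(target, is_inline) pairs are all assumptions made for this example. A
real implementation would discover links by parsing fetched documents
instead of reading them from a prepared table.

```python
from collections import deque


def plan_retrieval(start, links, max_order):
    """Return {url: order} for everything that should be fetched.

    links maps a document URL to a list of (target, is_inline) pairs.
    External links are followed breadth-first up to max_order hops from
    start. Inline objects (images and the like) are always fetched with
    the page that uses them and do not count as a hop.
    """
    order = {start: 0}
    queue = deque([start])
    while queue:
        url = queue.popleft()
        for target, is_inline in links.get(url, ()):
            if target in order:
                continue                      # already planned
            if is_inline:
                order[target] = order[url]    # part of the page itself
            elif order[url] < max_order:
                order[target] = order[url] + 1
                queue.append(target)          # follow its links later
    return order


links = {
    "a.html": [("logo.gif", True), ("b.html", False)],
    "b.html": [("c.html", False), ("pic.gif", True)],
    "c.html": [("d.html", False)],
}
print(plan_retrieval("a.html", links, max_order=1))
# {'a.html': 0, 'logo.gif': 0, 'b.html': 1, 'pic.gif': 1}
```

Note that this counts link hops, not directory levels: c.html is left out
at order 1 no matter where it lives on the server, while pic.gif is kept
because b.html needs it to display.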

* TODO, http related
- Use Keep-alive. Then we should probably stop using 30 second pauses
  between document retrievals.
- HTTP/1.1? HTTP/1.1 servers should do keep-alive even with 1.0 requests.
- Separate queues for each server, interleave requests.
- Authentication is only done if challenged now. If a document in a
  directory needs authentication it is very likely that all other documents
  in that directory and its subdirectories require authentication too.
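
One possible reading of "separate queues for each server, interleave
requests" is a round-robin over per-server queues, sketched here in
Python for illustration only (w3mir itself is Perl and has no such
function; interleave_by_server is a name invented for this example).
Servers are told apart by scheme plus host:port, which is an assumption.

```python
from collections import deque
from urllib.parse import urlsplit


def interleave_by_server(urls):
    """Order urls so that consecutive requests go to different servers
    whenever more than one server still has work queued."""
    queues = {}
    for url in urls:
        parts = urlsplit(url)
        key = (parts.scheme, parts.netloc)    # one queue per server
        queues.setdefault(key, deque()).append(url)

    schedule = []
    while queues:
        for key in list(queues):              # one request per server per round
            schedule.append(queues[key].popleft())
            if not queues[key]:
                del queues[key]               # this server is finished
    return schedule


print(interleave_by_server([
    "http://a/1", "http://a/2", "http://a/3",
    "http://b/1", "http://c/1", "http://b/2",
]))
# ['http://a/1', 'http://b/1', 'http://c/1', 'http://a/2', 'http://b/2', 'http://a/3']
```

Each server then sees its requests spread out in time, so the pause
between retrievals only needs to apply per server rather than globally.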

* If perl gets threads:
- Make the retrieval and analysis engines separate threads, and have
  one retrieval thread per method/server/port and do parallel retrievals.
- This makes http/1.1 easier too.
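
The one-thread-per-server idea can be tried out in a language that
already has threads. The Python sketch below is an illustration only, not
w3mir code: retrieve_in_parallel is a made-up name, the fetch argument is
a stand-in for the real retrieval engine, and the analysis engine is left
out entirely.

```python
import threading
from urllib.parse import urlsplit


def retrieve_in_parallel(urls, fetch):
    """Fetch urls with one worker thread per (method, server:port).

    Requests to the same server stay sequential and in order; different
    servers are handled at the same time. fetch(url) is supplied by the
    caller and must return the document body.
    """
    per_server = {}
    for url in urls:
        parts = urlsplit(url)
        per_server.setdefault((parts.scheme, parts.netloc), []).append(url)

    results = {}
    lock = threading.Lock()

    def worker(server_urls):
        for url in server_urls:
            body = fetch(url)
            with lock:                        # results is shared between threads
                results[url] = body

    threads = [threading.Thread(target=worker, args=(group,))
               for group in per_server.values()]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Stand-in fetch function so the sketch runs without a network.
docs = retrieve_in_parallel(
    ["http://a/1", "http://a/2", "http://b:8080/1"],
    fetch=lambda url: "body of " + url,
)
print(docs["http://b:8080/1"])  # body of http://b:8080/1
```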