source: main/trunk/greenstone2/build-src/packages/w3mir/w3mir-1.0.8/README@ 22694

Last change on this file since 22694 was 719, checked in by davidb, 25 years ago

added w3mir package

  • Property svn:keywords set to Author Date Id Revision
File size: 5.7 KB
PROPAGANDA:

See http://www.math.uio.no/~janl/w3mir/ for propaganda.

--------------------------------------------------------------------------
Q: Where can I get a new version of w3mir?
A: The w3mir homepage is at http://www.math.uio.no/~janl/w3mir/.
   W3mir is also distributed on CPAN: http://www.perl.com/

Q: Are there any mailing lists?
A: Yes, see below.

Q: Should I subscribe to any of the mailing lists?
A: Yes, if you use w3mir at all you should subscribe to
   [email protected]. Send e-mail to [email protected] to be
   subscribed.

Q: I found a bug!
A: See below.

Q: How do I...
A: There is a w3mir-HOWTO.html file in the distribution. Read it.

--------------------------------------------------------------------------
BUGS:

- The -lc switch does not work too well.

Please see below for how to report bugs.

--------------------------------------------------------------------------
FEATURES (NOT bugs):

- URLs with two slashes ('//') in the path component do not work
  as some might expect. According to my reading of the HTTP/URL specs
  it is an illegal construct, which is a Good Thing, because I don't
  know how to handle it if it's legal.
- If you start at http://foo/bar/ then index.html might be fetched twice.
- Some documents point to a point above the server root, e.g.,
  http://some.server/../stuff.html. Netscape and other browsers, in
  defiance of the URL standard documents, will change the URL to
  http://some.server/stuff.html. W3mir will not.
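
To make the two URL cases above concrete, here is a minimal sketch in
Python. It is not w3mir code (w3mir is written in Perl), and the
function names has_empty_segment and clamp_above_root are made up for
this illustration. The second function is a simplified version of the
browser-style rewrite, not a full implementation of any URL standard.

```python
from urllib.parse import urlsplit, urlunsplit


def has_empty_segment(url):
    """Return True if the path component contains '//' (an empty segment)."""
    return "//" in urlsplit(url).path


def clamp_above_root(url):
    """Resolve '.' and '..' in the path the way the browsers described above
    do: a '..' that would climb above the server root is silently dropped."""
    parts = urlsplit(url)
    kept = []
    for segment in parts.path.split("/")[1:]:
        if segment == "..":
            if kept:
                kept.pop()      # step up one directory, if there is one
        elif segment != ".":
            kept.append(segment)
    path = "/" + "/".join(kept)
    return urlunsplit((parts.scheme, parts.netloc, path,
                       parts.query, parts.fragment))


print(has_empty_segment("http://foo//bar/"))                 # True
print(clamp_above_root("http://some.server/../stuff.html"))  # http://some.server/stuff.html
```

A mirroring tool that wants the w3mir behaviour simply skips the second
step and requests the URL as written.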

--------------------------------------------------------------------------
MAIL LISTS, REPORTING BUGS:

Please send ideas (but see the todo lists further down first) and bug
reports to [email protected]. Please include the URL and the command
line/config file that triggered the bug. [email protected] is mainly used
for announcing new versions or bugs to w3mir users. You will get more
information following that list than anywhere else, and it's very low
volume. To subscribe to these lists email [email protected]. The
w3mir-core list is intended for w3mir hackers only.

--------------------------------------------------------------------------
COPYRIGHTS:

w3mir, w3http.pm, w3pdfuri.pm and htmlop.pm are free, but they are
copyrighted by the various hackers involved. If you want to copy,
hack or distribute w3mir you can do so provided you comply with the
'Artistic License' enclosed in the w3mir distribution in the file
named Artistic.

--------------------------------------------------------------------------
CREDITS:

- Oscar Nierstrasz: Wrote htget.
- Gorm Haug Eriksen: Started w3mir on the foundations of htget,
  contributed code later.
- Nicolai Langfeldt: Learning from Oscar's and Gorm's mistakes, rewrote
  everything.
- Chris Szurgot: Adapting to win32, good ideas and code contributions,
  debugging. And criticism.
- Rik Faith: Uses w3mir extensively, not shy about complaining and
  commenting and suggesting.
- The libwww-perl author(s), who made adding some new features
  ridiculously easy.

--------------------------------------------------------------------------
TODO LIST:

* TODO, version 1.1:

Some of these are speculative, some others are very useful.

- CSS parsing/support at the same level as HTML.
- Full support for APPLET/OBJECT tags.
- Alias rules. These would enable w3mir to map ukoln.bath.ac.uk and
  bubl.bath.ac.uk to www.ukoln.ac.uk and know that the objects contained
  in these are all the same. Another use would be to mirror from a mirror
  instead of the _real_ site, when the original site you have references
  to is on a slow link while the mirror is on a fast link.
- FTP support (easy if through a http style ftp proxy, but is that what
  we want?)
- SHTTP/SSL support.
- Add a 3rd argument to URL/Also that denotes the depth of documents to
  be retrieved under each; this would automatically generate fetch/ignore
  rules. How well these work would depend on the order of the
  URL/Also directives though. Must be documented.
- Retrieve recursively until N-th order links are reached. This
  differs significantly from the directory recursion we do now. When
  this is done w3mir should also know the difference between inline and
  external links; inline links should always be retrieved. Trivia question:
  What order is needed to reach every findable document on the web from
  Yahoo?
- Integrate with cvs or rcs (or another version control system) to make
  the retriever able to reproduce the mirrored site for any given date.
- Some text processing: Adding and removing text/sgml comments when suitable
  options and tags are found.
- Put retrieval date-stamps as comments in html files, to document when
  and how the document was retrieved.
- Example: You're mirroring a site primarily to get to the papers, but
  the site has n versions of each paper (foo.ps.gz, foo.ps.Z, foo.dvi.gz,
  foo.dvi.Z, foo.tar.gz, foo.zip) and you only need one version. Implement
  a way to get only one version of documents provided in multiple versions,
  something like a multi-axis preference list to get only the most
  attractive version of the doc.
- Logging of retrievals to file; need to change every print to a
  function call.
- Your suggestion here.
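
The "multi-axis preference list" item above could work roughly as in
the following Python sketch. This is only an illustration of the idea,
not something w3mir implements. The two preference orders and the
function names are assumptions invented for the example; the file
names are the ones from the TODO item.

```python
# Hypothetical preference axes: earlier entries are more attractive.
FORMAT_PREFERENCE = ["ps", "dvi", "tar", "zip"]
COMPRESSION_PREFERENCE = ["gz", "Z", ""]


def split_name(filename):
    """Split 'foo.ps.gz' into ('foo', 'ps', 'gz') and 'foo.zip' into
    ('foo', 'zip', '')."""
    parts = filename.split(".")
    if len(parts) >= 3 and parts[-1] in ("gz", "Z"):
        return ".".join(parts[:-2]), parts[-2], parts[-1]
    return ".".join(parts[:-1]), parts[-1], ""


def rank(filename):
    """Lower tuples are better; unknown formats sort last on their axis."""
    _, fmt, comp = split_name(filename)
    fmt_rank = (FORMAT_PREFERENCE.index(fmt)
                if fmt in FORMAT_PREFERENCE else len(FORMAT_PREFERENCE))
    comp_rank = (COMPRESSION_PREFERENCE.index(comp)
                 if comp in COMPRESSION_PREFERENCE
                 else len(COMPRESSION_PREFERENCE))
    return (fmt_rank, comp_rank)


def pick_versions(filenames):
    """Keep only the most attractive version of each document."""
    best = {}
    for name in filenames:
        stem = split_name(name)[0]
        if stem not in best or rank(name) < rank(best[stem]):
            best[stem] = name
    return best


versions = ["foo.ps.gz", "foo.ps.Z", "foo.dvi.gz",
            "foo.dvi.Z", "foo.tar.gz", "foo.zip"]
print(pick_versions(versions))   # {'foo': 'foo.ps.gz'}
```

Comparing (format, compression) tuples makes the format axis dominate
and uses compression only as a tie-breaker; a real implementation
would presumably take both lists from the config file.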

* TODO, http related
- Use Keep-alive. Then we should probably stop using 30 second pauses
  between document retrievals.
- HTTP/1.1? HTTP/1.1 servers should do keep-alive even with 1.0 requests.
- Separate queues for each server, interleave requests.
- Authentication is only done if challenged now. If a document in a
  directory needs authentication it is very likely that all other documents
  in that directory and its subdirectories require authentication.
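
The "separate queues for each server, interleave requests" item can be
pictured with the small Python sketch below. It is an assumption about
how such a scheduler might look, not a description of w3mir; the
function name interleave_by_server and the example URLs are made up.

```python
from collections import deque
from urllib.parse import urlsplit


def interleave_by_server(urls):
    """Put each URL in a queue for its server (host:port), then take one
    URL from each queue in turn until all queues are empty."""
    queues = {}
    for url in urls:
        queues.setdefault(urlsplit(url).netloc, deque()).append(url)

    order = []
    while queues:
        for server in list(queues):      # copy: we delete while looping
            order.append(queues[server].popleft())
            if not queues[server]:
                del queues[server]
    return order


pending = ["http://a/1", "http://a/2", "http://a/3",
           "http://b/1", "http://c/1", "http://c/2"]
print(interleave_by_server(pending))
# ['http://a/1', 'http://b/1', 'http://c/1', 'http://a/2', 'http://c/2', 'http://a/3']
```

Consecutive requests then rarely hit the same server, so the pause
between retrievals would only need to apply per server rather than
globally.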

* If perl gets threads:
- Make the retrieval and analysis engines separate threads, and have
  one retrieval thread per method/server/port and do parallel retrievals.
- This makes http/1.1 easier too.