Opened 15 years ago
Closed 14 years ago
#634 closed defect (fixed)
FTP URLs are broken?
Reported by: | kjdon | Owned by: | nobody |
---|---|---|---|
Priority: | moderate | Milestone: | 2.84 Release |
Component: | Collection Building | Severity: | minor |
Keywords: | Cc: |
Description (last modified by )
Reported by Stuart Yeates in 2001 on sourceforge. Is this still a problem?
I built a collection by mirroring HTTP and FTP servers using the supplied wget program. After import, the URLs for all documents from the FTP servers started with http:// not ftp://. I checked and it's happening on import not on build.
Some documents appear to be completely missing a URL, but this may be a separate problem.
Comment from John McP on sourceforge:
Look in HTMLPlug, line 147(?): my $web_url = "http://$file"; and line 367(?): return ("http://" . $before_hash, $hash_part, 1);
Change History (2)
comment:1 by , 15 years ago
Description: | modified (diff) |
---|
comment:2 by , 14 years ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Notes from Richard Managh at DL Consulting:
Yes, this is a problem. Unfortunately there is no foolproof way of telling (as far as I can see) whether wget has mirrored an FTP site or an HTTP site.
Let's say we issue the following command with wget in our collection's import directory:
wget --mirror -p --convert-links ftp://ftp-stud.fht-esslingen.de
This creates a directory structure beginning with the
ftp-stud.fht-esslingen.de directory.
Wget doesn't create (and neither Windows nor Linux lets the user create) a
"ftp://ftp-stud.fht-esslingen.de" directory.
And by default wget doesn't do anything helpful for Greenstone, like downloading into a ftp/ftp-stud.fht-esslingen.de/* directory.
In conclusion, there does not seem to be an infallible way of determining whether we have mirrored an HTTP site or an FTP site. As it stands, prior to the change below, HTMLPlugin always treats URLs as http://.
So all we can do, it seems, is look at the top-level directory produced by the wget mirror operation and search for the string "ftp" in it. FTP servers often give themselves away because they are named ftp.something.com or ftp-blah.something.com. Although by no means perfect, this is better than nothing.
Fix committed to HTMLPlugin.pm.
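The heuristic described in this comment can be sketched roughly as follows. This is an illustration in Python, not the Perl code committed to HTMLPlugin.pm. The function names guess_scheme and rebuild_url are hypothetical, and the sketch assumes the path is given relative to the import directory, so that its first component is the host directory wget created.

```python
def guess_scheme(relative_path: str) -> str:
    """Guess 'ftp' or 'http' from the top-level directory of a wget mirror.

    Assumption: relative_path is relative to the collection's import
    directory, so its first component is the mirrored host name.
    """
    normalized = relative_path.replace("\\", "/").lstrip("/")
    host_dir = normalized.split("/", 1)[0].lower()
    # Plain substring search, as described in the ticket. This is a
    # heuristic: a host such as "swiftpages.example.com" also contains
    # "ftp" and would be misclassified.
    return "ftp" if "ftp" in host_dir else "http"


def rebuild_url(relative_path: str) -> str:
    """Rebuild a source URL from a mirrored file's relative path."""
    normalized = relative_path.replace("\\", "/").lstrip("/")
    return f"{guess_scheme(normalized)}://{normalized}"


if __name__ == "__main__":
    print(rebuild_url("ftp-stud.fht-esslingen.de/pub/index.html"))
    print(rebuild_url("www.example.com/index.html"))
```

A stricter variant might only match host names beginning with "ftp." or "ftp-", which would avoid some false positives at the cost of missing hosts that carry "ftp" elsewhere in the name.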
from John McP on sourceforge: