Opened 15 years ago

Closed 14 years ago

#634 closed defect (fixed)

FTP URLs are broken?

Reported by: kjdon Owned by: nobody
Priority: moderate Milestone: 2.84 Release
Component: Collection Building Severity: minor
Keywords: Cc:

Description (last modified by kjdon)

Reported by Stuart Yeates in 2001 on sourceforge. Is this still a problem?

I built a collection by mirroring HTTP and FTP servers using the supplied wget program. After import, the URLs for all documents from the FTP servers started with http:// not ftp://. I checked and it's happening on import not on build.

Some documents appear to be completely missing a URL, but this may be a separate problem.

Comment from John McP on sourceforge:

Look in HTMLPlug, line 147(?): my $web_url = "http://$file"; and line 367(?): return ("http://" . $before_hash, $hash_part, 1);

Change History (2)

comment:1 by kjdon, 15 years ago

Description: modified (diff)

from John McP on sourceforge:

comment:2 by mdewsnip, 14 years ago

Resolution: fixed
Status: new → closed

Notes from Richard Managh at DL Consulting:

Yes, this is a problem. Unfortunately, as far as I can see, there is no foolproof way of telling whether wget has mirrored an FTP site or an HTTP site.

Let's say we issue the following wget command in our collection's import directory:

wget --mirror -p --convert-links ftp://ftp-stud.fht-esslingen.de

This creates a directory structure beginning with the ftp-stud.fht-esslingen.de directory.

Wget doesn't create an "ftp://ftp-stud.fht-esslingen.de" directory (and neither Windows nor Linux would let the user create one).

And by default wget doesn't do anything helpful for Greenstone, such as downloading into an ftp/ftp-stud.fht-esslingen.de/* directory.

In conclusion, there does not seem to be an infallible way of determining whether we have mirrored an HTTP site or an FTP site. Prior to the change below, HTMLPlugin always treats URLs as http://.

So all we can do, it seems, is look at the top-level directory produced by the wget mirror operation and search for the string "ftp" in it. FTP servers often give themselves away because they are named ftp.something.com or ftp-blah.something.com. This, although by no means perfect, is better than nothing.
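
The heuristic described above can be sketched roughly as follows. This is only an illustration in Python, not the Perl code committed to HTMLPlugin.pm. The function names guess_scheme and web_url, and the example paths, are made up for this sketch. It assumes the file path is given relative to the import directory, so that its first component is the host directory wget created.

```python
def guess_scheme(relative_path: str) -> str:
    """Guess 'ftp' or 'http' from the top-level mirror directory name.

    Hypothetical helper: it only checks whether the first path component
    (the host directory created by wget) contains the string "ftp".
    """
    # Normalise Windows separators and drop any leading slash.
    path = relative_path.replace("\\", "/").lstrip("/")
    # The first component is the host name, e.g. "ftp-stud.fht-esslingen.de".
    top_dir = path.split("/", 1)[0]
    return "ftp" if "ftp" in top_dir.lower() else "http"


def web_url(relative_path: str) -> str:
    """Build a URL for a mirrored file using the guessed scheme."""
    path = relative_path.replace("\\", "/").lstrip("/")
    return f"{guess_scheme(path)}://{path}"


if __name__ == "__main__":
    # Example paths are illustrative only.
    print(web_url("ftp-stud.fht-esslingen.de/pub/index.html"))
    print(web_url("www.example.org/index.html"))
```

Because this is a plain substring test on the host directory, it will also misclassify any HTTP host whose name happens to contain "ftp" (for example a name containing "softpedia"), and it will miss FTP servers whose host name does not contain "ftp" at all.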

Fix committed to HTMLPlugin.pm.
