Ticket #634 (closed defect: fixed)

Opened 11 years ago

Last modified 10 years ago

FTP URLs are broken?

Reported by: kjdon Owned by: nobody
Priority: moderate Milestone: 2.84 Release
Component: Collection Building Severity: minor
Keywords: Cc:

Description (last modified by kjdon)

Reported by Stuart Yeates in 2001 on SourceForge. Is this still a problem?

I built a collection by mirroring HTTP and FTP servers using the supplied wget program. After import, the URLs for all documents from the FTP servers started with http://, not ftp://. I checked, and it's happening on import, not on build.

Some documents appear to be completely missing a URL, but this may be a separate problem.

Comment from John McP on SourceForge:

Look in HTMLPlug, line 147(?): my $web_url = "http://$file"; and line 367(?): return ("http://" . $before_hash, $hash_part, 1);

Change History

Changed 11 years ago by kjdon

  • description modified

from John McP on sourceforge:

Changed 10 years ago by mdewsnip

  • status changed from new to closed
  • resolution set to fixed

Notes from Richard Managh at DL Consulting:

Yes, this is a problem. Unfortunately, there is no foolproof way of telling (as far as I can see) whether wget has mirrored an FTP site or an HTTP site.

Let's say we issue the following command with wget in our collection's import directory:

wget --mirror -p --convert-links ftp://ftp-stud.fht-esslingen.de

It creates a directory structure beginning with the ftp-stud.fht-esslingen.de directory.

Wget doesn't create (and neither Windows nor Linux allows the user to create) a "ftp://ftp-stud.fht-esslingen.de" directory.

And by default wget doesn't do anything helpful for Greenstone, like downloading into a ftp/ftp-stud.fht-esslingen.de/* directory.

In conclusion, there does not seem to be an infallible way of determining whether we have mirrored an HTTP site or an FTP site. As it stands, prior to the change below, HTMLPlugin always treats URLs as http://.

So all we can do, it seems, is look at the top-level directory produced by the wget mirror operation and search for the string "ftp" in it. Often FTP servers give themselves away because they are named ftp.something.com or ftp-blah.something.com. This, although by no means perfect, is better than nothing.
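For illustration only, the heuristic described above might be sketched as follows. This is a Python sketch, not the Perl code committed to HTMLPlugin.pm. The function names guess_scheme and rebuild_url are hypothetical, the input is assumed to be a path relative to the collection's import directory, and the case-insensitive match is an assumption rather than something stated in the ticket.

```python
from pathlib import PurePosixPath


def guess_scheme(mirrored_path: str) -> str:
    """Guess the URL scheme from the top-level directory of a wget mirror.

    mirrored_path is assumed to be relative to the import directory,
    e.g. "ftp-stud.fht-esslingen.de/pub/README".
    """
    parts = PurePosixPath(mirrored_path).parts
    top_level = parts[0] if parts else ""
    # Plain substring search for "ftp" in the host-named directory only.
    # Lower-casing is an assumption made for this sketch.
    return "ftp" if "ftp" in top_level.lower() else "http"


def rebuild_url(mirrored_path: str) -> str:
    """Prefix the mirrored path with the guessed scheme."""
    return f"{guess_scheme(mirrored_path)}://{mirrored_path}"


if __name__ == "__main__":
    print(rebuild_url("ftp-stud.fht-esslingen.de/pub/README"))
    print(rebuild_url("www.example.org/index.html"))
    # A false positive: "liftplan" happens to contain "ftp".
    print(rebuild_url("liftplan.example.com/index.html"))
```

The last call shows why the approach is only a best guess: any host name that happens to contain the letters "ftp" is treated as an FTP server, and an FTP server without "ftp" in its name is still treated as HTTP.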

Fix committed to HTMLPlugin.pm.
