Ticket #633 (closed defect: worksforme)

Opened 10 years ago

Last modified 9 years ago

indexing URLs with ?

Reported by: kjdon
Owned by: nobody
Priority: moderate
Milestone: 2.84 Release
Component: Collection Building
Severity: minor
Keywords:
Cc:

Description

Reported on the SourceForge bug tracking system by Marcio Marchini (mqm), 2003-02-03 14:47. Is this still a problem?

I am trying to index all documents reached from this URL:

 http://localhost/cgi-bin/e2html?class=string

This will return an HTML document with text and more hyperlinks, all of them in the form of:

 http://localhost/cgi-bin/e2html?class=[something here]

It turns out that Greenstone cannot index the full graph of linked pages.

I tried using wget manually and noticed that it has the same limitation: it complains when it tries to save the file. You have to pass it -FILE or something similar to specify the file name of the page to save, or it will fail because "?" is not a valid character in a file name.

So, I believe Greenstone needs to tell wget to use valid local file names. Not sure if this can be achieved when using wget in "get all pages" mode.
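The idea of giving each fetched page a valid local file name can be sketched as follows. This is only an illustration in Python, not how Greenstone or wget actually name files: the function name url_to_filename, the set of allowed characters, and the ".html" suffix are all assumptions made for the example.

```python
import hashlib
import re
from urllib.parse import urlsplit

# Assumption: only these characters are considered safe in a local file name.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def url_to_filename(url: str, max_len: int = 100) -> str:
    """Map a URL (including its query string) to a file-system-safe name.

    Hypothetical helper: every unsafe character, such as "?", "=", "/" or "&",
    is replaced by "_", and a short hash of the full URL is appended so that
    two URLs that sanitize to the same text still get different names.
    """
    parts = urlsplit(url)
    raw = parts.netloc + parts.path
    if parts.query:
        raw += "?" + parts.query
    safe = _UNSAFE.sub("_", raw).strip("_") or "index"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:8]
    return f"{safe[:max_len]}-{digest}.html"


if __name__ == "__main__":
    print(url_to_filename("http://localhost/cgi-bin/e2html?class=string"))
```

Because the original URL cannot be recovered from such a name, a real implementation would also need to record the URL alongside each saved file so the index can point back at it.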

If the downloaded pages could be pipelined to Greenstone as they are fetched, then this would be possible. In my case I want -no_text: I want Greenstone to index the pages without keeping a cache of the files, and to point at the original URLs. So the names of the local pages shouldn't really matter.

Can't Greenstone be instrumented to fetch and index the pages in a loop, one after the other, from Perl? It would then index incrementally as it fetches.
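A rough sketch of such a fetch-and-index loop is given below. It is written in Python purely for illustration (Greenstone's build scripts are Perl) and does not reflect any existing Greenstone code. The names crawl_and_index, fetch, index and prefix are hypothetical: fetch is assumed to return the HTML for a URL, and index is assumed to hand the page to the indexer keyed by its original URL, so no local file name is ever needed.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects the href values of all <a> tags in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_and_index(
    start_url: str,
    fetch: Callable[[str], str],        # hypothetical: returns HTML for a URL
    index: Callable[[str, str], None],  # hypothetical: indexes (url, html)
    prefix: str,                        # only URLs starting with this are followed
    max_pages: int = 100,
) -> int:
    """Fetch pages breadth-first and index each one as soon as it arrives."""
    seen = {start_url}
    queue = deque([start_url])
    count = 0
    while queue and count < max_pages:
        url = queue.popleft()
        html = fetch(url)
        index(url, html)  # the page is identified by its URL, not a file name
        count += 1
        parser = LinkCollector()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            target, _ = urldefrag(urljoin(url, href))
            if target.startswith(prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return count
```

In real use, fetch could wrap something like urllib.request.urlopen, and prefix could be http://localhost/cgi-bin/e2html? so that only the generated pages are followed.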

Anyway, it would be very nice if it could index such URLs.

Change History

Changed 10 years ago by kjdon

Just tried wget on  http://nzdl.org/cgi-bin/library.cgi?a=p&p=home and it downloaded the page fine, using a ?.

Still to test: in Greenstone, and with multiple pages.

Changed 9 years ago by kjdon

  • status changed from new to closed
  • resolution set to worksforme

We now have a Download panel in GLI. wget will get all files as instructed, and the user can then choose which ones to add to the collection.
