Opened 15 years ago

Closed 14 years ago

#633 closed defect (worksforme)

indexing URLs with ?

Reported by: kjdon
Owned by: nobody
Priority: moderate
Milestone: 2.84 Release
Component: Collection Building
Severity: minor
Keywords:
Cc:

Description

Reported on the SourceForge bug tracking system by Marcio Marchini (mqm), 2003-02-03 14:47. Is this still a problem?

I am trying to index all documents reached from this URL:

http://localhost/cgi-bin/e2html?class=string

This will return an HTML document with text and more hyperlinks, all of them in the form of:

http://localhost/cgi-bin/e2html?class=[something here]

It turns out that Greenstone cannot index the full web/graph of pages.

I tried using wget manually and noticed that it has the same limitation: it complains when it tries to save the file. You have to tell it -FILE or something to specify the name of the file to save the page to, or it will fail because "?" is not a valid character in a file name.

So, I believe Greenstone needs to tell wget to use valid local file names. Not sure if this can be achieved when using wget in "get all pages" mode.
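To illustrate what "valid local file names" could mean here, the following is a minimal Python sketch that maps a URL with a query string to a name that is safe to save on disk. The function name `safe_local_name` and the set of allowed characters are assumptions made for this example; this is not how wget or Greenstone actually name downloaded files.

```python
import re
from urllib.parse import urlsplit

# Anything outside this set is replaced with "_". The allowed set is an
# assumption picked to be safe on common file systems.
_UNSAFE = re.compile(r"[^A-Za-z0-9._-]")


def safe_local_name(url: str) -> str:
    """Map a URL, including its query string, to a file-system-safe name."""
    parts = urlsplit(url)
    raw = parts.path.lstrip("/")
    if parts.query:
        # Keep the query so that pages differing only in "?class=..." get
        # different names; the "?" itself is replaced below.
        raw += "?" + parts.query
    name = _UNSAFE.sub("_", raw)
    return name or "index"


print(safe_local_name("http://localhost/cgi-bin/e2html?class=string"))
# prints: cgi-bin_e2html_class_string
```

Note that a simple replacement like this can map two different URLs to the same name, so a real implementation would probably also need a hash or counter suffix.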

If the downloaded pages could be pipelined to Greenstone as they are fetched, then this would be possible. In my case I want -no_text: I want Greenstone to index the pages but not keep a cache of the files, and to point at the original URLs. So the names of the local pages shouldn't really matter.

Can't Greenstone be instrumented to get the pages and index them in a loop, one after the other, from Perl? It would index incrementally as it fetches.
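The fetch-and-index loop asked for above could look roughly like the sketch below. It is written in Python rather than Perl, and every name in it (`crawl_and_index`, `index_page`, `fetch_page`, `max_pages`) is hypothetical: Greenstone does not expose such an API, and the `index_page` callback only stands in for whatever indexing step would be used. Pages are handed to the indexer as they are fetched and nothing is written to disk, so no local file names are needed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch(url):
    """Download a page and return its text (default fetcher)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_and_index(start_url, index_page, fetch_page=fetch, max_pages=100):
    """Fetch pages breadth-first and pass each one to index_page(url, html)."""
    # Only follow links to the same script, e.g. .../e2html?class=...
    prefix = start_url.split("?", 1)[0]
    seen = {start_url}
    queue = deque([start_url])
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        html = fetch_page(url)
        index_page(url, html)  # index immediately; nothing is saved locally
        visited.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)
            if target.startswith(prefix) and target not in seen:
                seen.add(target)
                queue.append(target)
    return visited
```

A call such as `crawl_and_index("http://localhost/cgi-bin/e2html?class=string", my_indexer)` would then index each reachable page under its original URL, where `my_indexer` is any function taking a URL and the page text.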

Anyway, it would be very nice if it could index such URLs.

Change History (2)

comment:1 by kjdon, 14 years ago

Just tried wget on http://nzdl.org/cgi-bin/library.cgi?a=p&p=home and it downloaded the page fine, using a ?.

To do: test in Greenstone, and with multiple pages.

comment:2 by kjdon, 14 years ago

Resolution: worksforme
Status: new → closed

We now have a Download panel in GLI. wget will get all files as instructed, then the user can choose which ones to add to the collection.
