Opened 16 years ago

Closed 16 years ago

#256 closed defect (fixed)

Non-Roman encoded filenames of non-text content not indexed for searching

Reported by: ak19
Owned by: kjdon
Priority: high
Milestone: Release 2.81
Component: Greenstone2&3
Severity: blocker
Keywords: Languages, multilingual, encoding, UTF8
Cc:

Description

  1. Problem received in mailing list (Linux):

The problems:
a) I have a different number of texts in the aut
b) All files with non-latin original filenames, e.g. "Ελεφάντης218.pdf" (greek text), do not appear in the indexes (in the web interface), or the search page.

For the complete message, see https://list.scms.waikato.ac.nz/mailman/private/greenstone-users/2008-March/006485.html

  2. I could reproduce such a problem in my Binary installation of Greenstone 2.80 (Linux):

I built a collection containing:
- a text file containing "Ελεφάντης" several times and with the filename "Ελεφάντης218.txt"
- two bitmap images with the filenames "Ελεφάντης218.bmp" and "Ελεφάντης210.bmp"

I accepted all the default and offered plugins and built the collection.

In the browser, browsing the collection by title and filename shows the dc.title and filename for each document. All the titles should be the same as the filenames, and they should be in the correct encoding for Greek. BUT:
- for the images, the dc.title and filenames were garbled and not in the proper encoding;
- for the textfile, the dc.title was not in the proper encoding, but the filename was in correct Greek encoding ("Ελεφάντης218.txt").

As expected,
- searching on the filename field for Ελεφάντης218 retrieved the text file, but not the bmp image file of the same name. And Ελεφάντης210 (the filename of one of the images) didn't turn up either in a search on the filename field.
- nothing turned up when searching on the title field.
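One plausible way such a search miss can arise is sketched below. This is an illustration only, not Greenstone code: it assumes the filename is stored as UTF-8 bytes and that the indexing side decodes those bytes as Latin-1, which may not be what actually happened here. The helper name `matches` is hypothetical.

```python
# Illustration only (not Greenstone code): a filename whose bytes are
# decoded with the wrong encoding no longer matches the user's query.
name = "Ελεφάντης218"

raw = name.encode("utf-8")        # bytes as stored on a UTF-8 filesystem
correct = raw.decode("utf-8")     # decoded with the right encoding
garbled = raw.decode("latin-1")   # ASSUMED wrong decoding: one char per byte


def matches(query: str, indexed: str) -> bool:
    """Hypothetical stand-in for a search: plain substring match."""
    return query in indexed


print(matches(name, correct))   # the correctly decoded value is found
print(matches(name, garbled))   # the garbled value is not

# The damage is reversible as long as the wrong decoding lost no bytes.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == name)
```

The garbled value still contains every original byte, so an index built from it is searchable only with an equally garbled query, which is consistent with the Greek search terms above finding nothing.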

  3. An improvement when I compiled and installed GS2 from source (svn checkout, Wed 5 March):

This was likely due to an update to the BasPlug.pm code, lines 813 to 861 at http://trac.greenstone.org/browser/gsdl/trunk/perllib/plugins/BasPlug.pm#813 - changes which David Bainbridge had discussed and explained (the code uses the Locale to determine encoding). These lines of code were not there in my GS2.80 Binary installation's BasPlug.pm. (Might similar code changes elsewhere also have contributed to the improvement?)
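The general idea of using the locale to choose an encoding can be sketched as follows. This is a Python analogue written for illustration, not the Perl code in BasPlug.pm; the function name `decode_filename`, its parameters, and the UTF-8 fallback are all assumptions.

```python
import locale
from typing import Optional


def decode_filename(raw: bytes,
                    encoding: Optional[str] = None,
                    fallback: str = "utf-8") -> str:
    """Decode raw filename bytes, defaulting to the locale's encoding.

    Hypothetical helper: if no encoding is given, ask the locale for its
    preferred one; if decoding fails, fall back to UTF-8 and replace any
    undecodable bytes instead of raising.
    """
    if encoding is None:
        # False = do not call setlocale(); just report the current setting.
        encoding = locale.getpreferredencoding(False) or fallback
    try:
        return raw.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        return raw.decode(fallback, errors="replace")


# Usage: the same Greek name stored under two different byte encodings.
name = "Ελεφάντης218.bmp"
print(decode_filename(name.encode("utf-8"), "utf-8") == name)
print(decode_filename(name.encode("iso-8859-7"), "iso-8859-7") == name)
```

A locale-based guess is only as good as the assumption that files were named under the same locale that is active at build time; files copied from a machine with a different encoding would still be decoded incorrectly.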

Using the same contents to create a collection as in (2) above, the results for browsing by titles and filenames are now as follows:
- for the images, the dc.title was in correct Greek encoding but not the filename;
- for the textfile it was as before in (2): the dc.title was not in Greek encoding but the filename was.

This improvement was reflected when searching:
- searching on the title field for the Greek titles retrieved the images. But searching on the filename field did not retrieve any images.
- searching on the filename field for the filename "Ελεφάντης218" returned the textfile. But searching on the title field for the same did not return the textfile.

Attachments (2)

GS2-80_BinaryInstall.png (126.5 KB ) - added by ak19 16 years ago.
Screenshot of browsing the collection created with GS 2.80 Binary install
GS2_FromSVN_5Mar08.png (107.6 KB ) - added by ak19 16 years ago.
Screenshot of browsing the collection created with GS svn source install of 05March08


Change History (7)

by ak19, 16 years ago

Attachment: GS2-80_BinaryInstall.png added

Screenshot of browsing the collection created with GS 2.80 Binary install

by ak19, 16 years ago

Attachment: GS2_FromSVN_5Mar08.png added

Screenshot of browsing the collection created with GS svn source install of 05March08

comment:1 by ak19, 16 years ago

Component: changed from Building to Greenstone2&3

comment:2 by kjdon, 16 years ago

Priority: changed from moderate to high
Severity: blocker

comment:3 by kjdon, 16 years ago

Owner: changed from nobody to kjdon
Status: changed from new to assigned

comment:4 by kjdon, 16 years ago

Keywords: multilingual added; multi-lingual removed

comment:5 by kjdon, 16 years ago

Resolution: fixed
Status: changed from assigned to closed

I think this will be fixed based on fixes done for #294
