Ticket #256 (closed defect: fixed)

Opened 11 years ago

Last modified 11 years ago

Non-Roman encoded filenames of non-text content not indexed for searching

Reported by: ak19 Owned by: kjdon
Priority: high Milestone: Release 2.81
Component: Greenstone2&3 Severity: blocker
Keywords: Languages, multilingual, encoding, UTF8 Cc:

Description

1. Problem received in mailing list (Linux):

The problems:
a) I have a different number of texts in the aut
b) All files with non-latin original filenames, e.g. "Ελεφάντης218.pdf" (greek text), do not appear in the indexes (in the web interface), or the search page.

For the complete message, see https://list.scms.waikato.ac.nz/mailman/private/greenstone-users/2008-March/006485.html

2. I could reproduce such a problem in my Binary installation of Greenstone 2.80 (Linux): I built a collection containing:
- a text file containing "Ελεφάντης" several times and with the filename "Ελεφάντης218.txt"
- two bitmap images with the filenames "Ελεφάντης218.bmp" and "Ελεφάντης210.bmp"

I accepted all the default plugins offered and built the collection.
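
For anyone wanting to recreate the test material, the following is a minimal sketch in Python that writes the three files described above. The function names, the number of repetitions in the text file, and the 1x1 placeholder image are assumptions made for illustration; the original files used in this report are not available.

```python
import struct
from pathlib import Path
from typing import List


def minimal_bmp() -> bytes:
    """Return the bytes of a valid 1x1 white 24-bit BMP (placeholder image)."""
    pixel_data = b"\xff\xff\xff\x00"  # one BGR pixel, row padded to 4 bytes
    dib_header = struct.pack(
        "<IiiHHIIiiII",
        40,               # size of the BITMAPINFOHEADER
        1, 1,             # width and height in pixels
        1,                # colour planes
        24,               # bits per pixel
        0,                # no compression
        len(pixel_data),  # size of the pixel data
        2835, 2835,       # horizontal and vertical resolution (pixels per metre)
        0, 0,             # palette size, important colours
    )
    offset = 14 + len(dib_header)  # pixel data starts after both headers
    file_header = struct.pack(
        "<2sIHHI", b"BM", offset + len(pixel_data), 0, 0, offset
    )
    return file_header + dib_header + pixel_data


def make_test_files(target_dir: Path) -> List[Path]:
    """Create the text file and two bitmaps with Greek filenames."""
    target_dir.mkdir(parents=True, exist_ok=True)

    # Text file containing the Greek word several times (5 is an assumption).
    text_file = target_dir / "Ελεφάντης218.txt"
    text_file.write_text("Ελεφάντης\n" * 5, encoding="utf-8")
    created = [text_file]

    for name in ("Ελεφάντης218.bmp", "Ελεφάντης210.bmp"):
        image = target_dir / name
        image.write_bytes(minimal_bmp())
        created.append(image)
    return created
```

Calling `make_test_files(Path("some_import_dir"))` with a directory of your choice should produce files that can then be imported into a new collection. How the names are stored on disk depends on the filesystem encoding of the machine running the script.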

In the browser, browsing the collection by title and filename shows the dc.title and filename for each. All the titles should match the filenames, and both should display in the correct encoding for Greek. BUT:
- for the images, the dc.title and filenames were garbled and not in proper encoding;
- for the textfile, the dc.title was not in proper encoding, but the filename was in correct Greek encoding ("Ελεφάντης218.txt").

As expected,
- searching on the filename field for Ελεφάντης218 retrieved the text file, but not the bmp image file of the same name. And Ελεφάντης210 (the filename of one of the images) didn't turn up either in a search on the filename field.
- nothing turned up when searching on the title field.

3. An improvement when I compiled and installed GS2 from source (svn checkout, Wed 5 March):

This was likely due to an update to the BasPlug.pm code, lines 813 to 861 at http://trac.greenstone.org/browser/gsdl/trunk/perllib/plugins/BasPlug.pm#813 - changes which David Bainbridge had discussed and explained (the code uses the Locale to determine encoding). These lines of code were not present in the BasPlug.pm of my GS2.80 Binary installation. (Might similar code changes elsewhere also have contributed to the improvement?)
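
To illustrate the general idea of using the locale to determine encoding, here is a minimal sketch in Python. It is not the BasPlug.pm code (which is Perl) and makes no claim about how that code works internally; the function name `decode_filename` and the UTF-8 fallback are assumptions made for this example.

```python
import locale
from typing import Optional


def decode_filename(raw_name: bytes, encoding: Optional[str] = None) -> str:
    """Decode raw filename bytes, defaulting to the locale's preferred encoding."""
    if encoding is None:
        # Ask the operating system locale which encoding it prefers.
        encoding = locale.getpreferredencoding(False)
    try:
        return raw_name.decode(encoding)
    except (UnicodeDecodeError, LookupError):
        # Assumed fallback: try UTF-8 and mark undecodable bytes
        # instead of aborting the import.
        return raw_name.decode("utf-8", errors="replace")


greek = "Ελεφάντης218.bmp"
utf8_bytes = greek.encode("utf-8")

# Decoding with the encoding the bytes were written in recovers the name.
print(decode_filename(utf8_bytes, "utf-8") == greek)       # True

# Decoding the same bytes with a mismatched encoding yields a different,
# garbled string, so a search for the Greek name would not match it.
print(decode_filename(utf8_bytes, "iso-8859-1") == greek)  # False
```

The second call shows one way in which a filename can end up displayed incorrectly and unsearchable: the bytes are intact, but they are interpreted with the wrong encoding. Whether this is the exact mechanism behind the behaviour reported here is not established by this ticket.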

Using the same contents to create a collection as in (2) above, the results for browsing by titles and filenames are now as follows:
- for the images, the dc.title was in correct Greek encoding but not the filename;
- for the textfile it was as before in (2): the dc.title was not in Greek encoding but the filename was.

This improvement was reflected when searching:
- searching on the title field for the Greek titles retrieved the images. But searching on the filename field did not retrieve any images.
- searching on the filename field for the filename "Ελεφάντης218" returned the textfile. But searching on the title field for the same did not return the textfile.

Attachments

GS2-80_BinaryInstall.png (126.5 KB) - added by ak19 11 years ago.
Screenshot of browsing the collection created with GS 2.80 Binary install
GS2_FromSVN_5Mar08.png (107.6 KB) - added by ak19 11 years ago.
Screenshot of browsing the collection created with GS svn source install of 05March08

Change History

Changed 11 years ago by ak19

Screenshot of browsing the collection created with GS 2.80 Binary install

Changed 11 years ago by ak19

Screenshot of browsing the collection created with GS svn source install of 05March08

Changed 11 years ago by ak19

  • component changed from Building to Greenstone2&3

Changed 11 years ago by kjdon

  • priority changed from moderate to high
  • severity set to blocker

Changed 11 years ago by kjdon

  • owner changed from nobody to kjdon
  • status changed from new to assigned

Changed 11 years ago by kjdon

  • keywords multilingual, added; multi-lingual, removed

Changed 11 years ago by kjdon

  • status changed from assigned to closed
  • resolution set to fixed

I think this will be fixed based on fixes done for #294
