Ticket #256 (assigned defect)

Opened 5 months ago

Last modified 2 months ago

Non-Roman encoded filenames of non-text content not indexed for searching

Reported by: ak19 Assigned to: kjdon (accepted)
Priority: high Milestone: Release 2.81
Component: Greenstone2&3 Severity: blocker
Keywords: Languages, multilingual, encoding, UTF8 Cc:

Description

1. Problem received in mailing list (Linux):

The problems:
a) I have a different number of texts in the aut
b) All files with non-latin original filenames, e.g. "Ελεφάντης218.pdf" (greek text), do not appear in the indexes (in the web interface), or the search page.

For the complete message, see https://list.scms.waikato.ac.nz/mailman/private/greenstone-users/2008-March/006485.html

2. I could reproduce such a problem in my Binary installation of Greenstone 2.80 (Linux): I built a collection containing:
- a text file containing "Ελεφάντης" several times and with the filename "Ελεφάντης218.txt"
- two bitmap images with the filenames "Ελεφάντης218.bmp" and "Ελεφάντης210.bmp"

I accepted all the default and offered plugins and built the collection.

In the browser, browsing the collection by title and filename shows the dc.title and filename for each. All the titles should be the same as the filenames. And they should be in the correct encoding for Greek. BUT:
- for the images, the dc.title and filenames were garbled and not in proper encoding;
- for the textfile, the dc.title was not in proper encoding, but the filename was in correct Greek encoding ("Ελεφάντης218.txt").
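
One common way for this kind of garbling to arise is that UTF-8 bytes get read as a single-byte encoding. Whether that is what actually happens inside Greenstone here has not been confirmed. The sketch below is purely illustrative and is not taken from Greenstone: it uses Latin-1 as an assumed wrong encoding, and the variable names are hypothetical.

```python
# Illustrative only: shows how a UTF-8 filename turns into garbled text
# when its bytes are decoded with the wrong (single-byte) encoding.
# Latin-1 is an assumption chosen for the demo, not a confirmed cause.

name = "Ελεφάντης218.bmp"

raw = name.encode("utf-8")        # bytes as a UTF-8 filesystem would store them
garbled = raw.decode("latin-1")   # wrong: every byte becomes one character

# The damage is reversible as long as no bytes were dropped along the way.
restored = garbled.encode("latin-1").decode("utf-8")

print(garbled)
print(restored == name)
```

If the garbled titles in the screenshots follow this pattern, the underlying bytes are probably intact and only the decoding step is wrong.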

As expected,
- searching on the filename field for Ελεφάντης218 retrieved the text file, but not the bmp image file of the same name. And Ελεφάντης210 (the filename of one of the images) didn't turn up either in a search on the filename field.
- nothing turned up when searching on the title field.

3. An improvement when I compiled and installed GS2 from source (svn checkout, Wed 5 March):

This was likely due to an update to the BasPlug.pm code, lines 813 to 861 at http://trac.greenstone.org/browser/gsdl/trunk/perllib/plugins/BasPlug.pm#813 - changes which David Bainbridge had discussed and explained (the code uses the Locale to determine encoding). These lines of code were not there in my GS2.80 Binary installation's BasPlug.pm. (Could similar code changes elsewhere also have contributed to the improvement?)
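
To make the idea of using the locale to determine encoding concrete, here is a minimal sketch in Python. It is not the BasPlug.pm code, which is Perl and is not reproduced here. The function name `decode_filename` and the UTF-8 fallback are assumptions made for illustration only.

```python
import locale
from typing import Optional


def decode_filename(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode raw filename bytes with the given or locale-preferred encoding.

    Hypothetical helper: falls back to UTF-8 with replacement characters
    if the chosen encoding is unknown or cannot decode the bytes.
    """
    enc = encoding or locale.getpreferredencoding(False)
    try:
        return raw.decode(enc)
    except (UnicodeDecodeError, LookupError):
        # Assumed fallback so that one bad filename does not stop a build.
        return raw.decode("utf-8", errors="replace")


# Example: the same Greek name stored under two different encodings.
utf8_bytes = "Ελεφάντης218.txt".encode("utf-8")
greek_bytes = "Ελεφάντης218.txt".encode("iso-8859-7")

print(decode_filename(utf8_bytes, "utf-8"))
print(decode_filename(greek_bytes, "iso-8859-7"))
```

The point of the sketch is that the filename and the metadata derived from it both need to pass through the same decoding step. Otherwise one can end up correct and the other garbled, as in the results reported here.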

Using the same contents to create a collection as in (2) above, the results for browsing by titles and filenames are now as follows:
- for the images, the dc.title was in correct Greek encoding but not the filename;
- for the textfile it was as before in (2): the dc.title was not in Greek encoding but the filename was.

This improvement was reflected when searching:
- searching on the title field for the Greek titles retrieved the images. But searching on the filename field did not retrieve any images.
- searching on the filename field for the filename "Ελεφάντης218" returned the textfile. But searching on the title field for the same did not return the textfile.

Attachments

GS2-80_BinaryInstall.png (126.5 kB) - added by ak19 on 2008-03-06 17:51:05.
Screenshot of browsing the collection created with GS 2.80 Binary install
GS2_FromSVN_5Mar08.png (107.6 kB) - added by ak19 on 2008-03-06 17:52:16.
Screenshot of browsing the collection created with GS svn source install of 05March08

Change History

2008-03-06 17:51:05 changed by ak19

  • attachment GS2-80_BinaryInstall.png added.

Screenshot of browsing the collection created with GS 2.80 Binary install

2008-03-06 17:52:16 changed by ak19

  • attachment GS2_FromSVN_5Mar08.png added.

Screenshot of browsing the collection created with GS svn source install of 05March08

2008-03-06 17:56:07 changed by ak19

  • component changed from Building to Greenstone2&3.

2008-04-03 11:57:33 changed by kjdon

  • priority changed from moderate to high.
  • severity set to blocker.

2008-05-19 10:53:09 changed by kjdon

  • owner changed from nobody to kjdon.
  • status changed from new to assigned.

2008-05-19 10:53:18 changed by kjdon

  • keywords changed from Languages, multi-lingual, encoding, UTF8 to Languages, multilingual, encoding, UTF8.