Non-Roman encoded filenames of non-text content not indexed for searching
|Reported by:||ak19||Owned by:||kjdon|
|Keywords:||Languages, multilingual, encoding, UTF8||Cc:|
- Problem received in mailing list (Linux):
a) I have a different number of texts in the aut
b) All files with non-Latin original filenames, e.g. "Ελεφάντης218.pdf" (Greek text), do not appear in the indexes (in the web interface) or on the search page.
For the complete message, see https://list.scms.waikato.ac.nz/mailman/private/greenstone-users/2008-March/006485.html
- I could reproduce such a problem in my Binary installation of Greenstone 2.80 (Linux):
I built a collection containing:
- a text file containing "Ελεφάντης" several times and with the filename "Ελεφάντης218.txt"
- two bitmap images with the filenames "Ελεφάντης218.bmp" and "Ελεφάντης210.bmp"
I accepted all the default and offered plugins and built the collection.
In the browser, browsing the collection by title and by filename shows the dc.title and filename for each document. All the titles should be the same as the filenames, and they should be displayed in the correct encoding for Greek. BUT:
- for the images, the dc.title and filenames were garbled and not in the proper encoding;
- for the textfile, the dc.title was not in proper encoding, but the filename was in correct Greek encoding ("Ελεφάντης218.txt").
- searching on the filename field for Ελεφάντης218 retrieved the text file, but not the bmp image file of the same name. And Ελεφάντης210 (the filename of one of the images) didn't turn up either in a search on the filename field.
- nothing turned up when searching on the title field.
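The symptoms above are consistent with filename bytes being decoded with the wrong character set before indexing. The following is a minimal Python sketch of that effect, not Greenstone code: the choice of ISO-8859-1 as the wrong charset and the `matches` helper are assumptions made purely for illustration.

```python
# Hypothetical illustration of the symptom; not taken from Greenstone.
filename = "Ελεφάντης218.bmp"

# Bytes as a UTF-8 Linux filesystem would store the name.
raw = filename.encode("utf-8")

# Decoding with the wrong charset (assumed here: ISO-8859-1) never raises,
# because every byte maps to some Latin-1 character, but the text is garbled.
garbled = raw.decode("iso-8859-1")

# Decoding with the right charset round-trips the original name.
correct = raw.decode("utf-8")


def matches(query: str, indexed_value: str) -> bool:
    """Naive substring test standing in for a search on the filename field."""
    return query in indexed_value


print(matches("Ελεφάντης218", correct))  # the correctly decoded name is found
print(matches("Ελεφάντης218", garbled))  # the garbled name is not found
```

If the indexed metadata holds the garbled form, a query typed in Greek cannot match it, which would explain why the image files are missing from the search results.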
- An improvement when I compiled and installed GS2 from source (svn checkout, Wed 5 March):
This was likely due to an update to the BasPlug.pm code, lines 813 to 861 at http://trac.greenstone.org/browser/gsdl/trunk/perllib/plugins/BasPlug.pm#813. David Bainbridge had discussed and explained these changes (the code uses the locale to determine the encoding). These lines of code were not present in the BasPlug.pm of my GS2.80 binary installation. (Might similar code changes elsewhere also have contributed to the improvement?)
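To illustrate the general idea of using the locale to pick an encoding for filenames, here is a rough Python sketch. It is not a port of the BasPlug.pm lines referenced above; the function names `filename_encoding_from_locale` and `decode_filename`, and the UTF-8 fallback, are hypothetical choices for this example.

```python
import codecs
import locale


def filename_encoding_from_locale(default: str = "utf-8") -> str:
    """Return the charset implied by the current locale, or `default`."""
    name = locale.getpreferredencoding(False)
    try:
        # Normalise the name and confirm Python knows this codec.
        return codecs.lookup(name).name
    except LookupError:
        return default


def decode_filename(raw: bytes, encoding: str) -> str:
    """Decode raw filename bytes; undecodable bytes become U+FFFD."""
    return raw.decode(encoding, errors="replace")


raw_name = "Ελεφάντης218.txt".encode("utf-8")
print(filename_encoding_from_locale())
print(decode_filename(raw_name, "utf-8"))
```

A locale-derived guess is only right when the files were named under the same locale that the build runs in, so a collection copied from another machine could still end up with garbled names.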
Using the same contents to create a collection as in the binary-installation test above, the results for browsing by title and by filename are now as follows:
- for the images, the dc.title was in correct Greek encoding but not the filename;
- for the textfile it was as before in the binary-installation test: the dc.title was not in Greek encoding but the filename was.
This improvement was reflected when searching:
- searching on the title field for the Greek titles retrieved the images. But searching on the filename field did not retrieve any images.
- searching on the filename field for the filename "Ελεφάντης218" returned the textfile. But searching on the title field for the same did not return the textfile.