Opened 15 years ago

Closed 13 years ago

#357 closed defect (fixed)

Dragging and dropping files with alien file encodings on Linux

Reported by: ak19
Owned by: ak19
Priority: moderate
Milestone: Collection building wishlist
Component: GLI
Severity: major
Keywords: GLI, encoding, multilingual
Cc:


The problem: On Linux, GLI does not recognise that files with Latin-1 (ISO-8859-1) filenames exist at all, so dragging and dropping them does not work. Perl and Linux itself are able to copy and delete such files.

At present, people who want to use GLI to build collections containing such files need to put them into the import folder manually, using the Linux file explorer.

The problem lies in the fact that Java's File class stores file information (the pathname) as a String. On Linux, it presumes that filenames must therefore be UTF-8. Instead of preserving the byte values or a URL-encoded URI of a filename, it replaces any byte values that are invalid UTF-8 with the Unicode replacement character (U+FFFD). The same character is substituted for every invalid byte, so the conversion from bytes to UTF-8 is destructive and the String filename stored in the File data structure is wrong.
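The loss of information described above can be reproduced without touching the filesystem. The following is a minimal sketch, not GLI code: the class name `LossyDecodeDemo` and the sample filenames are made up for illustration. It decodes Latin-1 filename bytes as UTF-8, which is roughly what happens to such names under a UTF-8 locale.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of GLI.
final class LossyDecodeDemo {
    /** Decodes raw filename bytes as UTF-8; malformed bytes become U+FFFD. */
    static String decodeAsUtf8(byte[] rawName) {
        return new String(rawName, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Two different Latin-1 names: e-acute is byte 0xE9, e-grave is byte 0xE8.
        byte[] acute = "caf\u00E9.txt".getBytes(StandardCharsets.ISO_8859_1);
        byte[] grave = "caf\u00E8.txt".getBytes(StandardCharsets.ISO_8859_1);

        String a = decodeAsUtf8(acute);
        String b = decodeAsUtf8(grave);

        // Both names collapse to the same String, so they can no longer be told apart.
        System.out.println("same string: " + a.equals(b));

        // Re-encoding does not give the original bytes back.
        byte[] back = a.getBytes(StandardCharsets.UTF_8);
        System.out.println("round trip ok: " + Arrays.equals(acute, back));
    }
}
```

Because two distinct byte sequences map to the same String, a pathname built from that String no longer identifies the file on disk.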

The proposed solution (Dr Bainbridge's idea):

  1. Another listFiles() should be implemented in Perl and return an array of URL-encoded file and directory names. This should be called everywhere listFiles() was called before, instead of Java's default File.listFiles().
  2. The FileNode and FileJob/FileQueue classes of GLI will not only have to call the new custom listFiles(), they will also have to call Perl code for copying, moving and deleting files (and for checking whether they exist). All calls to these operations have to go through the Perl code rather than through Java's File class.
  3. Since invoking Perl will be more time-consuming than using Java's File, we can provide an option in GLI called "Recognise alien file system encodings" for filenames. That way the extra processing, which is only ever required for specially encoded filenames, need not be done unless the GLI user is sure that they are working with such files.
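To illustrate why URL-encoded names would avoid the problem, here is a rough sketch of a byte-preserving percent-encoding on the Java side. It is an assumption about how such names could be represented, not the proposed Perl listFiles() itself; the class `FilenameBytesCodec` and its methods are hypothetical.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper, not part of GLI: keeps filename bytes intact as ASCII text.
final class FilenameBytesCodec {
    /** Percent-encodes every byte that is not an unreserved ASCII character. */
    static String encode(byte[] raw) {
        StringBuilder sb = new StringBuilder();
        for (byte b : raw) {
            int v = b & 0xFF;
            boolean safe = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '_' || v == '.' || v == '~';
            if (safe) {
                sb.append((char) v);
            } else {
                sb.append('%').append(String.format("%02X", v));
            }
        }
        return sb.toString();
    }

    /** Reverses encode(); assumes well-formed, ASCII-only input. */
    static byte[] decode(String encoded) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < encoded.length(); i++) {
            char c = encoded.charAt(i);
            if (c == '%') {
                out.write(Integer.parseInt(encoded.substring(i + 1, i + 3), 16));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }
}
```

The encoded form is plain ASCII, so it survives being held in a Java String regardless of the platform encoding, and the original bytes can be recovered when the name is handed back to the external code.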

Change History (3)

comment:1 by ak19, 15 years ago

Milestone: Next Release (2 or 3) → Release 3.05

comment:2 by ak19, 15 years ago

This is a general problem on Linux machines, and the assumptions Java makes about Linux filenames are in fact not really wrong. (See further below.)

  1. On Max's suggestion, we looked for and downloaded various File Explorers written in Java and tried them out on the Linux machine. None of these File Explorers was able to display the files with filenames in ISO-8859-1 encoding. When viewing the contents of the directories containing these files, the File Explorers simply skipped them.
  2. The same was the case with Mozilla Firefox: pointing Firefox to the directory containing these specially encoded filenames made it show all the other files but not the ones with ISO-8859-1 names.

However, the URL encoded versions of the filenames were displayable in Firefox. This could not be replicated with GLI's file display since it uses Java's File class. The File constructor which takes the URI does not help even when we encode the special filename with URL encoding, because the File object still stores the given filename as a regular (non-URL-encoded) String.
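The behaviour of the URI-based File constructor can be checked with a few lines. This is only an illustrative sketch: the class `FileUriDemo` and the `/tmp/caf...` paths are made up, and no file needs to exist for it to run.

```java
import java.io.File;
import java.net.URI;

// Hypothetical demo class, not part of GLI.
final class FileUriDemo {
    /** Builds a File from a file: URI and returns the name the File actually stores. */
    static String nameFromUri(String uriText) {
        // File(URI) keeps only the decoded path. java.net.URI decodes percent-escapes
        // as UTF-8 and substitutes U+FFFD for octets that are not valid UTF-8.
        return new File(URI.create(uriText)).getName();
    }

    public static void main(String[] args) {
        // %C3%A9 is e-acute in UTF-8 and decodes cleanly.
        System.out.println(nameFromUri("file:///tmp/caf%C3%A9.txt"));
        // %E9 is e-acute in Latin-1; it is not valid UTF-8, so the byte is lost.
        System.out.println(nameFromUri("file:///tmp/caf%E9.txt"));
    }
}
```

The percent-escapes are gone by the time the File object exists, so the URL-encoded form cannot be carried through GLI's existing File-based code.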

The reason the above problem occurs is that whenever you use straightforward ways to transfer files onto a Linux machine from Windows or elsewhere, the Linux OS automatically converts all the filenames into UTF-8: the filenames are now in UTF-8 but look exactly as they looked on the Windows machine they were transferred from. (This means that UTF-8 filenames transferred to Windows and then transferred back to Linux will now look funny on Linux, because Linux has renamed them to be UTF-8 but to LOOK like the funny characters displayed for UTF-8 on Windows.)

Max installed Ubuntu on the research Windows machine and we encountered the same behaviour: the files with special filename encodings were immediately renamed into UTF-8 filenames when we used a flash drive to transfer them.

This renaming-upon-transfer that Linux does happens most of the time (when using common methods of file transfer such as a flash drive). However, the renaming was bypassed when using SSH to transfer the specially named files onto Linux from Windows. In that case the original filename encodings were preserved, because Linux did not do its usual renaming to make sure they were safe.

Java's File class therefore assumes that the Linux renaming procedure is not bypassed (illegally) and that therefore all filenames (Strings) dealt with on Linux are in fact in UTF8.

GLI is presently still able to build docs with ISO-8859-1 filenames on Linux, but users have to manually copy these files into the import folder.

One solution would be to do an (!isFile() && !isDirectory()) check in GLI on all files in the Workspace view. When this returns true, we know the file is not recognised on Linux. We can make the file icon for such files red in the Workspace view, to hopefully get users to be curious enough to hover over files marked like this. The tooltip could tell them these files need to be manually copied into the import folder, after which GLI can probably build them.
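A minimal sketch of that check follows. The class and method names are hypothetical, and GLI's actual Workspace code is not shown in this ticket.

```java
import java.io.File;

// Hypothetical helper, not part of GLI.
final class WorkspaceFileCheck {
    /**
     * Returns true when Java can see neither a regular file nor a directory at
     * this path. For an entry returned by a directory listing, that suggests the
     * stored name no longer matches the bytes on disk.
     */
    static boolean isUnrecognised(File f) {
        return !f.isFile() && !f.isDirectory();
    }

    public static void main(String[] args) {
        File dir = new File(args.length > 0 ? args[0] : ".");
        File[] entries = dir.listFiles(); // null if dir cannot be listed
        if (entries == null) {
            return;
        }
        for (File entry : entries) {
            if (isUnrecognised(entry)) {
                System.out.println("needs manual copy: " + entry.getName());
            }
        }
    }
}
```

Note that the check is a heuristic: it is also true for paths that genuinely do not exist and for special files such as sockets or device nodes, so the tooltip wording would need to allow for that.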

comment:3 by ak19, 13 years ago

Milestone: Greenstone 3 wishlist → Collection building wishlist
Resolution: fixed
Status: new → closed

This has been solved differently. A native Latin-1 filesystem setting on Linux is actually preferable to a UTF-8 setting. (Linux systems can be set to different character encodings, depending on what is installed, available and set as the default.) If the filesystem is set to Latin-1 (ISO-8859-1), the underlying byte values are preserved. If it is set to UTF-8, then byte values that are not valid UTF-8, in filenames that use other encodings, get replaced with the Unicode replacement character U+FFFD.

The solution on Linux has been to set the filesystem to Latin-1. The user can then set the gs.filenameEncoding metadata in GLI's Enrich Pane so that Java can interpret the byte values of the filename in the correct encoding and display them correctly. Perl code can fortunately already deal with the filenames correctly. It was only GLI, which is written in Java, that would force other encodings to UTF-8 if the filesystem was UTF-8 and, because of the substituted U+FFFD replacement character, would not see the underlying files whose characters had been replaced in this way.

If the Linux filesystem is set to UTF-8, the above solution will still not work for filenames in any encoding other than the native one (UTF-8).
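The reason a Latin-1 setting helps is that ISO-8859-1 maps every byte value to exactly one character, so the original bytes can always be recovered from the String and decoded again in the encoding the user specifies. The sketch below shows that idea only. How GLI actually applies gs.filenameEncoding is not described in this ticket, and the class and method names are assumptions.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of GLI.
final class FilenameReinterpreter {
    /**
     * Takes a filename that the JVM decoded as Latin-1 and decodes its bytes
     * again using the encoding the user says the name is really in.
     * Assumes every char is in the range U+0000..U+00FF, which holds when the
     * name really was decoded as ISO-8859-1.
     */
    static String reinterpret(String nameAsLatin1, String actualEncoding) {
        // Latin-1 is a 1:1 byte-to-char mapping, so this recovers the raw bytes.
        byte[] raw = nameAsLatin1.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, Charset.forName(actualEncoding));
    }

    public static void main(String[] args) {
        // A UTF-8 name seen through a Latin-1 locale: bytes C3 A9 show up as two chars.
        String seen = "caf\u00C3\u00A9.txt";
        System.out.println(reinterpret(seen, "UTF-8"));
    }
}
```

The same trick is not available once the name has been decoded as UTF-8, because bytes replaced by U+FFFD cannot be recovered, which is why the fix does not extend to UTF-8 filesystems.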
