Timestamp:
2020-10-22T01:48:03+13:00
Author:
ak19
Message:

Redoing work of commit revision 34394: redoing Bugfix 1 for the GLI doc.xml metadata slowdown, which resulted from an earlier bugfix to help GLI cope with filenames and assigned meta that have non-ASCII chars in them. The slowdown happened when gathered files got selected in GLI and was fixed in commit 34394, but that fix was not ideal for 2 reasons:

1. It put a new form of filename encoding (hexed unicode) into doc.xml, instead of existing encodings like URL and base64, though those existing encodings weren't the right ones for my first solution.
2. The solution was specific to Windows, to cope with special chars in filenames, and relied on a new meta field gsdlfullsourcepath being written out to doc.xml by doc.pm. So a collection built on Linux and then moved to Windows won't show its doc.xml meta in GLI, as it won't have the new doc.xml meta field that Windows is expecting.

Have a better solution for 1 that doesn't require the new field. But still can't fix all of point 2, as the existing gsdlsourcefilename meta field in doc.xml can contain Windows Short filenames when the coll is built on Windows, and this won't be backwards compatible on Linux anyway. This problem existed before too, except I didn't realise it until now. But the new solution fixes more issues.

Second step: modified DocXMLFile to no longer use the new field gsdlfullsourcepath, but to return to using the gsdlsourcefilename field. This time, however, the code is optimised to detect a filename match between doc.xml and any file selected in GLI by storing gsdlsourcefilename in its Long filename form whenever doc.xml had stored it in Win 8.3 Short filename form. The Long filename can be obtained for any file that exists by calling getCanonicalPath(). Of course, gsdlsourcefilename does not store the full filename, only the filename from the import folder onwards. So to ensure a file by that filename in long form has a chance of existing, the code first prefixes the current collection folder and then checks for existence before obtaining the canonical form. This is then stored in the hashmap in place of any Win short filename.

Now a match is more readily found without using any hex-encoded unicode filenames stored by doc.pm, and without using the older and inefficient method of making cmd calls to DOS to calculate the Win 8.3 Short filename for each selected file.
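
The lookup-key normalisation described in the message can be sketched roughly as follows. This is an illustration only, not the code from DocXMLFile: the class name LongFilenameKey, the method toLongRelativePath and its parameters are hypothetical, and it assumes gsdlsourcefilename holds a path relative to the collection folder (starting at the import folder).

```java
import java.io.File;
import java.io.IOException;

// Hypothetical helper, not part of GLI: turns a gsdlsourcefilename value
// into the key that would be stored in the hashmap.
final class LongFilenameKey {

    /**
     * Prefixes the collection folder, checks that the file exists, and if so
     * returns its canonical (long) form relative to the collection folder.
     * Falls back to the value as stored in doc.xml when the file is missing
     * or cannot be canonicalised.
     */
    static String toLongRelativePath(File collectionDir, String gsdlSourceFilename) {
        File candidate = new File(collectionDir, gsdlSourceFilename);
        if (!candidate.exists()) {
            return gsdlSourceFilename; // nothing to canonicalise
        }
        try {
            // Canonicalise both paths so the prefix can be stripped reliably.
            String collectionPath = collectionDir.getCanonicalPath();
            String candidatePath = candidate.getCanonicalPath();
            String prefix = collectionPath + File.separator;
            if (candidatePath.startsWith(prefix)) {
                return candidatePath.substring(prefix.length());
            }
            return gsdlSourceFilename; // resolved outside the collection folder
        } catch (IOException e) {
            return gsdlSourceFilename; // keep the original key on failure
        }
    }
}
```

With keys stored this way, the relative path of a file selected in GLI can be compared against the map directly, which is why the caller in the diff below no longer needs to hex-encode the filename first.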

File:
1 edited

  • main/trunk/gli/src/org/greenstone/gatherer/metadata/DocXMLFileManager.java

--- DocXMLFileManager.java (r34394)
+++ DocXMLFileManager.java (r34507)
@@ -53,6 +53,5 @@
         file_relative_path = file_relative_path.substring(import_index + "import".length() + 1);
     }
-    String searchFileName = DocXMLFile.isWin ? Utility.stringToHex(file_relative_path) : file_relative_path;
-        
+   
     // Build up a list of metadata values extracted from this file
     ArrayList metadata_values = new ArrayList();
@@ -62,5 +61,5 @@
         DocXMLFile doc_xml_file = (DocXMLFile) doc_xml_files.get(i);
         ///System.err.println("@@@@ Looking at doc.xml file: " + doc_xml_files.get(i));
-        metadata_values.addAll(doc_xml_file.getMetadataExtractedFromFile(file, searchFileName));
+        metadata_values.addAll(doc_xml_file.getMetadataExtractedFromFile(file, file_relative_path));
     }
 