Timestamp:
2020-09-22T00:57:33+12:00
Author:
ak19
Message:

Bugfix for the slowdown when assigning metadata to multiple gathered docs in GLI's Enrich pane. Tested on Windows. This is the simplest way I could think of to solve the problem: XML parsing always resolves HTML entities (except possibly when using the StAX parser, but that may not return the Document object the code expects). Entities start with an ampersand and are resolved upon parsing, and the same goes for standalone ampersand signs. The earlier code, a bugfix for metadata not sticking to filenames/import folder structures containing non-ASCII characters, ampersands or plus signs, had caused the slowdown: after each XML parse of the current metadata.xml file, the code would loop through every FileName element of that file and reintroduce the resolved HTML entities. The simplest solution that worked is to escape ampersands as %26 when writing out values for the FileName element, and to compare against filenames that have had the same substitution done. Still to test on Linux, but this reincorporates recent ideas for the bugfix that had worked on Linux (but then broke on Windows), so I feel somewhat confident that this commit will largely work on Linux when I test it tomorrow.
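To illustrate why a %26 escape avoids the per-parse fix-up loop, here is a minimal sketch using the JDK's DOM parser. It is not GLI's code: the class `FileNameEscapeSketch` and the helpers `escapeAmpersands` and `parseFileNameText` are hypothetical names, and the one-element XML string only stands in for a real metadata.xml. It shows that an `&#x26;` entity is resolved back to `&` on parsing, while `%26` comes through unchanged and can be compared directly against a filename escaped the same way.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical sketch, not part of GLI.
final class FileNameEscapeSketch {
    // Same literal value as the HEX_AMPERSAND constant in this changeset.
    static final String HEX_AMPERSAND = "%26";

    // Escape every literal ampersand before the value is written into XML.
    static String escapeAmpersands(String filename) {
        return filename.replace("&", HEX_AMPERSAND);
    }

    // Parse a one-element XML snippet and return the element's text content.
    static String parseFileNameText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XML snippet", e);
        }
    }

    public static void main(String[] args) {
        String onDisk = "fish & chips.txt";
        String stored = escapeAmpersands(onDisk);

        // The %26 form is plain text to the parser, so it survives as written.
        String parsed = parseFileNameText("<FileName>" + stored + "</FileName>");
        System.out.println(parsed.equals(escapeAmpersands(onDisk)));

        // The hex entity form is resolved back to a bare ampersand on parsing.
        System.out.println(parseFileNameText("<FileName>fish &#x26; chips.txt</FileName>"));
    }
}
```

Because the stored value and the comparison value go through the same substitution, no pass over the parsed FileName elements is needed to restore anything.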

File:
1 edited

  • main/trunk/gli/src/org/greenstone/gatherer/metadata/FilenameEncoding.java

--- main/trunk/gli/src/org/greenstone/gatherer/metadata/FilenameEncoding.java (r33748)
+++ main/trunk/gli/src/org/greenstone/gatherer/metadata/FilenameEncoding.java (r34415)
@@ -77,9 +77,9 @@
 public static final Pattern HEX_PATTERN = Pattern.compile("(&#x[0-9a-zA-Z]{1,4}+;)");
 
-/** The hex entity version of the ampersand character.
+/** The hex version of the ampersand character: previously hex entity (&#x26) now hex url encoded (%26).
   * We use this in place of the ampersand character in filenames in metadata.xml files to
   * preserve the reference to the literal ampersand in the real file name on the file system.
   */
-public static final String HEX_ENTITY_AMPERSAND = FilenameEncoding.hexEntityForChar("&"); //"&#x26;";
+public static final String HEX_AMPERSAND = "%26"; //= FilenameEncoding.hexEntityForChar("&"); //"&#x26;";
 
 
@@ -257,12 +257,6 @@
 
 /** URL encoded version of the byte codes of the given file's name */
-public static String calcURLEncodedFilePath(File file) {
-    if(!MULTIPLE_FILENAME_ENCODINGS_SUPPORTED) {
-        return file.getAbsolutePath();
-    }
-    else {
-        String filename = fileToURLEncoding(file);
-        return filename;
-    }
+public static String calcURLEncodedFilePath(File file) {
+    return fileToURLEncoding(file);
 }
 
@@ -380,7 +374,5 @@
 // just return input filename param, but with any & in the filename replaced with its hex entity
     if(!MULTIPLE_FILENAME_ENCODINGS_SUPPORTED) {
-        // protect ampersands in filenames by converting it to its hex entity
         String filepath = file.getAbsolutePath();
-        filepath = filepath.replace("&", HEX_ENTITY_AMPERSAND);
         return filepath;
     }
@@ -430,5 +422,5 @@
 
         // Before proceeding, protect & in the filename too.
-        // &'s ASCII code is 36 in decimal, and 26 in hex, so replace with &#x26; (HEX_ENTITY_AMPERSAND)
+        // &'s ASCII code is 36 in decimal, and 26 in hex, so replace with &#x26; (HEX_AMPERSAND)
         // But dangerous to do simple replace if there are &#x...; entities in the filename already!
         // That is, we'll want to protect & by replacing with &'s hex value, but we don't want to replace the & in "&#x....;" with the same!
@@ -445,5 +437,5 @@
         //filename_url_encoded = filename_url_encoded.replace("%2B", "+"); // Don't do this, won't get regex escaped when converted back to a + by caller
         filename_url_encoded = filename_url_encoded.replace("%2B", "+"); // + signs are special, as they will need to be escaped since the caller wants the filename representing a regex
-        filename_url_encoded = filename_url_encoded.replace("%26", HEX_ENTITY_AMPERSAND); // convert URL encoding for ampersand into hex entity for ampersand
+        filename_url_encoded = filename_url_encoded.replace("%26", "&"); // now putting back ampersands too, instead of replacing with HEX_ENTITY_AMPERSAND (&#x26;)
     }
     catch (Exception e) {
@@ -544,5 +536,5 @@
     // just return input filename param, but with any & in the filename replaced with its hex entity
     if(!MULTIPLE_FILENAME_ENCODINGS_SUPPORTED) {
-        return filename.replace("&", HEX_ENTITY_AMPERSAND);
+        return filename; //return filename.replace("&", HEX_AMPERSAND);
     }
 
@@ -567,5 +559,5 @@
 public static String relativeFilenameToURLEncoding(String filename) {
     if(!MULTIPLE_FILENAME_ENCODINGS_SUPPORTED) { // on a UTF-8 file system, DO NOT do the stuff below, just return input param
-        return filename.replace("&", HEX_ENTITY_AMPERSAND);
+        return filename; // return filename.replace("&", HEX_AMPERSAND);
     }
 
@@ -580,5 +572,5 @@
 public static String filenameToURLEncodingWithPrefixRemoved(String filename, String removeFilePathPrefix) {
     if(!MULTIPLE_FILENAME_ENCODINGS_SUPPORTED) { // on a UTF-8 file system, DO NOT do the stuff below, just return input param
-        return filename.replace("&", HEX_ENTITY_AMPERSAND);
+        return filename; //return filename.replace("&", HEX_AMPERSAND);
     }
 
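
The comments in the diff warn that a plain replace of & is dangerous when the filename already contains &#x...; entities. The changeset does not show how FilenameEncoding.java handles that case, so the following is only one possible approach, sketched under assumptions: a negative lookahead that skips any ampersand that begins a hex entity of the shape matched by HEX_PATTERN. The class `ProtectAmpersandSketch`, the pattern `BARE_AMPERSAND` and the method `protectBareAmpersands` are hypothetical names.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of GLI.
final class ProtectAmpersandSketch {
    // Same literal value as the HEX_AMPERSAND constant in this changeset.
    // '&' is code point 38 in decimal, 0x26 in hex.
    static final String HEX_AMPERSAND = "%26";

    // An ampersand that is NOT the start of a hex entity such as &#x4e2d;
    // The entity shape mirrors HEX_PATTERN: "&#x", 1 to 4 alphanumerics, ";".
    static final Pattern BARE_AMPERSAND =
            Pattern.compile("&(?!#x[0-9a-zA-Z]{1,4};)");

    // Replace only bare ampersands, leaving existing hex entities untouched.
    static String protectBareAmpersands(String filename) {
        return BARE_AMPERSAND.matcher(filename).replaceAll(HEX_AMPERSAND);
    }

    public static void main(String[] args) {
        // The first & is bare and gets escaped; the entity is kept as written.
        System.out.println(protectBareAmpersands("R&D &#x4e2d;.txt"));
    }
}
```

The lookahead keeps the replacement to a single pass over the string, at the cost of treating anything entity-shaped as an entity even if it was meant literally.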