Timestamp:
2019-11-05T21:04:09+13:00 (4 years ago)
Author:
ak19
Message:
  1. Incorporated Dr Nichols' earlier suggestion of storing page modified time and char-encoding metadata, if present in the crawl dump output. However, neither the modifiedTime nor the fetchTime metadata in the dump file appears to be a webpage's actual modified time: both are from 2019 and fall around the period we've been crawling.
  2. Moved the getDomainFromURL() function from CCWETProcessor.java to Utility.java, since it's been reused.
  3. The MongoDBAccess class successfully connects (at least, no exceptions are thrown) and uses the newly added properties in config.properties to make the connection.
File:
1 edited

  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/Utility.java

    r33604 r33623  
    82 82    }
    83 83   
     84    /** Work out the 'domain' for a given url.
     85     * This retains any www. or subdomain prefix.
     86     */
     87    public static String getDomainForURL(String url, boolean withProtocol) {
     88    int startIndex = url.indexOf("//"); // for http:// or https:// prefix
     89    startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
     90    // keep the protocol around in case param withProtocol=true
     91    String protocol = url.substring(0, startIndex); // empty string when there is no protocol
     92   
     93    String domain = url.substring(startIndex);
     94    int endIndex = domain.indexOf("/");
     95    if(endIndex == -1) endIndex = domain.length();
     96    domain = domain.substring(0, endIndex);
     97
     98    if(withProtocol) {
     99        // now that we have the domain (everything to the first / when there is no protocol)
     100        // can glue the protocol back on
     101        domain = protocol + domain;
     102    }
     103   
     104    return domain;
     105    }
     106   
    84 107    public static boolean isDomainInCountry(String domainWithProtocol,
    85 108                        String countryCode, File geoLiteCityDatFile)
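As a rough illustration of how the added getDomainForURL() behaves, the following standalone sketch copies its logic into a hypothetical class named DomainSketch (the class name and the example URLs are assumptions for illustration, not part of the changeset). It shows that any www. or subdomain prefix is retained, and that the protocol is glued back on only when withProtocol is true.

```java
public class DomainSketch {
    /**
     * Mirrors the logic of getDomainForURL() shown in the diff:
     * returns everything between the protocol's "//" and the first "/".
     */
    public static String getDomainForURL(String url, boolean withProtocol) {
        int startIndex = url.indexOf("//"); // for http:// or https:// prefix
        startIndex = (startIndex == -1) ? 0 : (startIndex + 2); // skip past "//"
        // empty string when the URL carries no protocol
        String protocol = url.substring(0, startIndex);

        String domain = url.substring(startIndex);
        int endIndex = domain.indexOf("/");
        if (endIndex == -1) endIndex = domain.length(); // no path component
        domain = domain.substring(0, endIndex);

        return withProtocol ? protocol + domain : domain;
    }

    public static void main(String[] args) {
        // Example URLs are made up for illustration.
        System.out.println(getDomainForURL("https://www.example.com/a/b.html", false));
        System.out.println(getDomainForURL("https://www.example.com/a/b.html", true));
        System.out.println(getDomainForURL("example.org/page", true));
    }
}
```

Note that, because the method only looks for "//" and the next "/", a port number (as in localhost:8080) stays part of the returned domain.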