Timestamp:
2019-11-05T21:04:09+13:00
Author:
ak19
Message:
  1. Incorporated Dr Nichols' earlier suggestion of storing the page modified time and char-encoding metadata if present in the crawl dump output. Neither the modifiedTime nor the fetchTime metadata of the dump file appears to be a webpage's actual modified time, however, as both are from 2019 and fall within the period we've been crawling.
  2. Moved the getDomainFromURL() function from CCWETProcessor.java to Utility.java, since it is now reused.
  3. The MongoDBAccess class successfully connects (at least, no exceptions are thrown) and uses the newly added properties in config.properties to make the connection.
File:
1 edited

  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/TextDumpPage.java

    --- TextDumpPage.java (r33615)
    +++ TextDumpPage.java (r33623)
    @@ -84,4 +84,8 @@
                     String k = line.substring(0, endIndex);
                     String v = line.substring(endIndex+1);
    +                if(k.startsWith("metadata")) {
    +                    k = k.substring("metadata".length());
    +                }
    +
                     tuples.put(k.trim(), v.trim());
                 } else {
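
The added lines strip a leading "metadata" from a key before it is stored, so that a dump line whose key reads "metadata OriginalCharEncoding" is stored under "OriginalCharEncoding". A minimal, self-contained sketch of that step follows. The class name, the putTuple() helper, the use of the first colon as the separator, and the sample lines are all assumptions made for illustration; the changeset does not show how endIndex is computed or what the real dump lines look like.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the parsing step in TextDumpPage; not the actual class.
class TupleParseSketch {

    /** Splits a line at its first colon and drops a leading "metadata" from the key. */
    static void putTuple(Map<String, String> tuples, String line) {
        int endIndex = line.indexOf(':'); // assumption: the first colon separates key and value
        if (endIndex == -1) {
            return; // no separator; what the real else branch does is not shown in the diff
        }
        String k = line.substring(0, endIndex);
        String v = line.substring(endIndex + 1);
        if (k.startsWith("metadata")) {
            k = k.substring("metadata".length());
        }
        // trim() removes the whitespace left around the key and value
        tuples.put(k.trim(), v.trim());
    }

    public static void main(String[] args) {
        Map<String, String> tuples = new LinkedHashMap<>();
        // Invented sample lines, only shaped like key/value dump output
        putTuple(tuples, "modifiedTime:\t0");
        putTuple(tuples, "metadata OriginalCharEncoding : \tutf-8");
        System.out.println(tuples);
    }
}
```

Because the prefix is removed before trim(), the space between "metadata" and the field name does not end up in the stored key.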
     
    @@ -134,4 +138,26 @@
         }
     
    +    /* Dr Nichols suggested storing timestamp and char encoding. Not sure which timestamp
    +       or encoding he meant, but storing 2 of several timestamps and selecting
    +       original character encoding (presumably the char encoding of the page) out of 2
    +       pieces of char encoding metadata to store. */
    +    public String getModifiedTime() {
    +        // is this the webpage's last mod time?
    +        String time = tuples.get("modifiedTime");
    +        time = time.equals("0") ? "" : time; // zero will be assumed to be epoch, rather than unset
    +        return time;
    +    }
    +    public String getFetchTime() {
    +        // is this the nutch crawl time
    +        String time = tuples.get("fetchTime");
    +        time = time.equals("0") ? "" : time; // zero will be assumed to be epoch, rather than unset
    +        return time;
    +
    +    }
    +    public String getOriginalCharEncoding() {
    +        // is this the web page's char-encoding?
    +        return tuples.get("OriginalCharEncoding");
    +    }
    +
         public String get(String key) {
             return tuples.get(key);
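
One point a reviewer might raise: getModifiedTime() and getFetchTime() call time.equals("0") directly on the result of tuples.get(). If tuples is a standard java.util.Map (its declared type is not visible in this diff, so that is an assumption), a page whose dump has no such field would cause a NullPointerException. The sketch below shows one possible null-safe variant. The class name and the timeOrEmpty() helper are hypothetical, and returning "" for a missing field is a design choice made here, not something the changeset specifies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the getters added in r33623; not the actual class.
class TimeFieldSketch {
    // Assumption: the real tuples field behaves like a Map<String, String>
    private final Map<String, String> tuples = new HashMap<>();

    void put(String key, String value) {
        tuples.put(key, value);
    }

    /** Returns "" when the field is absent or "0", otherwise the stored value. */
    private String timeOrEmpty(String key) {
        String time = tuples.get(key);
        if (time == null || time.equals("0")) {
            // "0" would otherwise be read downstream as the epoch rather than as unset
            return "";
        }
        return time;
    }

    String getModifiedTime() {
        return timeOrEmpty("modifiedTime");
    }

    String getFetchTime() {
        return timeOrEmpty("fetchTime");
    }

    public static void main(String[] args) {
        TimeFieldSketch page = new TimeFieldSketch();
        page.put("modifiedTime", "0"); // invented sample value
        System.out.println("[" + page.getModifiedTime() + "]"); // stored "0" is reported as unset
        System.out.println("[" + page.getFetchTime() + "]");    // absent field, no exception
    }
}
```

Sharing one helper keeps the two getters consistent, so a later change to how unset values are represented only needs to be made in one place.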