Timestamp:
2019-09-24T20:30:40+12:00
Author:
ak19
Message:
  1. Blacklists were introduced so that too many instances of camelcased words no longer disqualify WET records from inclusion in the keep pile. Camelcasing is still checked, however, since such words are not counted as valid words in the valid-word count that determines whether a WET record has sufficient content. 2. Some more commenting.
File:
1 edited

  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
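
The commit message describes skipping camelcased words when counting valid words in a WET record, without disqualifying the record outright. A minimal sketch of that idea, assuming hypothetical names (isCamelCased, countValidWords, MIN_VALID_WORDS) rather than the actual CCWETProcessor API:

```java
// Sketch of the camelcase / valid-word check described in the commit
// message. All names here (isCamelCased, countValidWords,
// MIN_VALID_WORDS) are illustrative assumptions, not the real code.
public class CamelCaseCheckSketch {
    // Hypothetical threshold for "sufficient content" in a WET record.
    static final int MIN_VALID_WORDS = 10;

    // A word counts as camelcased if an uppercase letter directly
    // follows a lowercase one, e.g. "keepURLs".
    static boolean isCamelCased(String word) {
        for (int i = 1; i < word.length(); i++) {
            if (Character.isLowerCase(word.charAt(i - 1))
                    && Character.isUpperCase(word.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // Camelcased words are skipped rather than disqualifying the
    // record: they simply don't count towards the valid-word total.
    static int countValidWords(String text) {
        int count = 0;
        for (String word : text.split("\\s+")) {
            if (!word.isEmpty() && !isCamelCased(word)) {
                count++;
            }
        }
        return count;
    }

    static boolean hasSufficientContent(String text) {
        return countValidWords(text) >= MIN_VALID_WORDS;
    }
}
```
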

    r33515 r33517

      public final File greyListedFile;

    + /** Possible values stored in the blackList/whiteList/greyList Maps */
      private final Integer LIST_ENTRY_CONTAINS = new Integer(0);
      private final Integer LIST_ENTRY_STARTSWITH = new Integer(1);
      private final Integer LIST_ENTRY_ENDSWITH = new Integer(2);
      private final Integer LIST_ENTRY_MATCHES = new Integer(3);
    -
    +
    + /**
    +  * Store url patterns as keys and values indicating whether a url should
    +  * match it exactly, start/end with it, or contain it
    +  */
      private HashMap<String, Integer> blackList;
      private HashMap<String, Integer> greyList;
      private HashMap<String, Integer> whiteList;

    + /** Map of domains we keep and the full urls we're keeping that are of that domain.
    +  * Choosing a TreeMap to preserve natural (alphabetical) ordering of keys,
    +  * since a HashMap has no notion of ordering.
    +  */
    + private TreeMap<String, TreeSet<String>> domainsToURLsMap;
    +
      // Keep a count of all the records that all WETProcessors instantiated
      // by our main method combined have processed
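
The four LIST_ENTRY_* values drive how a candidate url is compared against the patterns stored in each map. A sketch of that lookup under assumed names (isListedURL does not appear in this hunk; only the constants and maps do):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of how a url might be tested against one of the filter maps,
// given the four LIST_ENTRY_* modes added in this changeset. The
// method name isListedURL and the matching order are assumptions.
public class URLFilterSketch {
    static final Integer LIST_ENTRY_CONTAINS = Integer.valueOf(0);
    static final Integer LIST_ENTRY_STARTSWITH = Integer.valueOf(1);
    static final Integer LIST_ENTRY_ENDSWITH = Integer.valueOf(2);
    static final Integer LIST_ENTRY_MATCHES = Integer.valueOf(3);

    // Returns true if the url matches any stored pattern under that
    // pattern's mode: contains, starts with, ends with, or exact match.
    static boolean isListedURL(String url, HashMap<String, Integer> list) {
        for (Map.Entry<String, Integer> entry : list.entrySet()) {
            String pattern = entry.getKey();
            int mode = entry.getValue().intValue();
            if ((mode == LIST_ENTRY_CONTAINS && url.contains(pattern))
                    || (mode == LIST_ENTRY_STARTSWITH && url.startsWith(pattern))
                    || (mode == LIST_ENTRY_ENDSWITH && url.endsWith(pattern))
                    || (mode == LIST_ENTRY_MATCHES && url.equals(pattern))) {
                return true;
            }
        }
        return false;
    }
}
```
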
     
      }

    + // prepare our blacklist, greylist (for inspection) and whitelist
      System.err.println("Loading blacklist.");
      blackList = new HashMap<String, Integer>();
      initURLFilterList(blackList, "url-blacklist-filter.txt");
    +
      System.err.println("Loading greylist.");
      greyList = new HashMap<String, Integer>();
      initURLFilterList(greyList, "url-greylist-filter.txt");
    +
      System.err.println("Loading whitelist.");
      whiteList = new HashMap<String, Integer>();
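
initURLFilterList presumably reads one pattern per line from a filter file into the given map. The actual file format is not shown in the diff, so the marker convention below (leading/trailing "*" selecting the match mode) is purely an assumption for illustration:

```java
import java.util.HashMap;

// Sketch of how initURLFilterList might turn one line of a filter file
// into a map entry. The "*" marker convention here is an assumption;
// the real url-blacklist-filter.txt format isn't shown in the diff.
public class FilterLineSketch {
    static final Integer LIST_ENTRY_CONTAINS = Integer.valueOf(0);
    static final Integer LIST_ENTRY_STARTSWITH = Integer.valueOf(1);
    static final Integer LIST_ENTRY_ENDSWITH = Integer.valueOf(2);
    static final Integer LIST_ENTRY_MATCHES = Integer.valueOf(3);

    // "*foo*" -> contains, "foo*" -> starts with, "*foo" -> ends with,
    // plain "foo" -> exact match. Blank lines and #-comments skipped.
    static void addFilterLine(String line, HashMap<String, Integer> map) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) {
            return;
        }
        boolean starAtStart = line.startsWith("*");
        boolean starAtEnd = line.endsWith("*");
        if (starAtStart && starAtEnd) {
            map.put(line.substring(1, line.length() - 1), LIST_ENTRY_CONTAINS);
        } else if (starAtEnd) {
            map.put(line.substring(0, line.length() - 1), LIST_ENTRY_STARTSWITH);
        } else if (starAtStart) {
            map.put(line.substring(1), LIST_ENTRY_ENDSWITH);
        } else {
            map.put(line, LIST_ENTRY_MATCHES);
        }
    }
}
```
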
     

      /**
    -  * Takes as input the keepURLs.txt file generated by running WETProcessor instances.
    -  * As output produces the URL seed list and regex-urlfilter text files required by nutch,
    +  * Using the keepURLs.txt file generated by running WETProcessor instances, produces
    +  * as output the URL seed list and regex-urlfilter text files required by nutch, see
      * https://cwiki.apache.org/confluence/display/nutch/NutchTutorial
      */
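
For the shape of that output: nutch's seed list is one URL per line, and regex-urlfilter.txt accepts or rejects URLs via lines beginning with + or - followed by a regex. A sketch of generating both from the domainsToURLsMap declared earlier, writing to strings instead of files to stay self-contained; the exact regex shape CCWETProcessor emits is an assumption:

```java
import java.util.TreeMap;
import java.util.TreeSet;

// Sketch of the kind of output createSeedURLsFiles produces for nutch:
// a seed-URL list plus regex-urlfilter lines. The regex shape below is
// an assumption, not taken from the real code.
public class SeedFilesSketch {
    // One kept url per line, in domain order (TreeMap) then url order
    // (TreeSet), matching the natural ordering noted in the diff.
    static String makeSeedURLs(TreeMap<String, TreeSet<String>> domainsToURLsMap) {
        StringBuilder sb = new StringBuilder();
        for (TreeSet<String> urls : domainsToURLsMap.values()) {
            for (String url : urls) {
                sb.append(url).append('\n');
            }
        }
        return sb.toString();
    }

    // One "+" accept-regex per kept domain, then reject everything else.
    static String makeURLFilters(TreeMap<String, TreeSet<String>> domainsToURLsMap) {
        StringBuilder sb = new StringBuilder();
        for (String domain : domainsToURLsMap.keySet()) {
            sb.append("+^https?://").append(domain.replace(".", "\\.")).append('\n');
        }
        sb.append("-.\n");
        return sb.toString();
    }
}
```
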
     
      File urlFilterFile = new File(outFolder, "regex-urlfilter.txt");
      ccWETFilesProcessor.createSeedURLsFiles(seedURLsFile, urlFilterFile);
    +
    + System.out.println("\n*** Inspect urls in greylist at " + ccWETFilesProcessor.greyListedFile + "\n");
    +
      } catch(Exception e) {
      // can get an exception when instantiating CCWETProcessor instance