Changeset 33561

Timestamp: 11.10.2019 20:49:05 (5 weeks ago)
Author: ak19
Message:

1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated list.
2. After the discussion with Dr Bainbridge that SINGLEPAGE is not what we want for docs.google.com, I found that the tentative switch to SUBDOMAIN-COPY for docs.google.com will not work, precisely because of the important change we had to make yesterday: with SUBDOMAIN-COPY, only copy SUBdomains, not root domains. If a root domain is marked SUBDOMAIN-COPY, the seedURL gets written out to unprocessed-topsite-matches.txt and its site doesn't get crawled.
3. This revealed a lacuna in the possible list of values for sites-too-big-to-exhaustively-crawl.txt, and I had to invent a new value, which I introduce and have tested with this commit: FOLLOW_LINKS_WITHIN_TOPSITE. This value so far applies only to docs.google.com and will keep following any links originating in a seedURL on docs.google.com, but only as long as they stay within that topsite domain (docs.google.com).
4. Tidied some old-fashioned use of Iterator, replacing it with the newer style of for loop that works with generic types.

Committing before updating the code to use the Apache Commons CSV API.
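The commit message's final sentence anticipates moving to the Apache Commons CSV API. A minimal sketch of what reading the now comma-separated file might look like with commons-csv; the file path and topSitesMap name come from this changeset, while the class name, comment-marker handling, and everything else are assumptions, not the committed code:

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    public class TopSitesCsvSketch {
        public static void main(String[] args) {
            Map<String, String> topSitesMap = new HashMap<>();
            // Treat '#' lines as comments and skip blanks, matching the conf file's layout
            CSVFormat format = CSVFormat.DEFAULT
                    .withCommentMarker('#')
                    .withIgnoreEmptyLines(true);
            try (Reader reader = new FileReader("conf/sites-too-big-to-exhaustively-crawl.txt")) {
                for (CSVRecord record : format.parse(reader)) {
                    String topsite = record.get(0).trim();
                    // column 2 holds the <value>, e.g. SINGLEPAGE or FOLLOW-LINKS-WITHIN-TOPSITE
                    String value = (record.size() > 1) ? record.get(1).trim() : "";
                    topSitesMap.put(topsite, value);
                }
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
            System.err.println("Read " + topSitesMap.size() + " topsite entries");
        }
    }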

Location: gs3-extensions/maori-lang-detection
Files: 2 modified

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

--- r33559
+++ r33561

@@ -12,5 +12,5 @@
 
 # FORMAT OF THIS FILE'S CONTENTS:
-#    <topsite-base-url><tabspace><value>
+#    <topsite-base-url>,<value>
 # where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
 #
     
    2929#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go 
    3030#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled. 
     31#   - FOLLOW-LINKS-WITHIN-TOPSITE: if pages linked from the seedURL page can be followed and 
     32#     downloaded, as long as it's within the same subdomain matching the topsite-base-url. 
     33#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but 
     34#     restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything 
     35#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at 
     36#     depth specified for the nutch crawl) as long as they're within the topsite-base-url. 
     37#     e.g. seedURLs on docs.google.com containing links will have those linked pages and any 
     38#     they link to etc. downloaded as long as they're on docs.google.com. 
    3139#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided 
    3240#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a 
     
    3846#     Remember to leave out any protocol <from url-form-without-protocol>. 
    3947 
    40  
    41  
    42 docs.google.com  SINGLEPAGE 
    43 drive.google.com    SINGLEPAGE 
    44 forms.office.com    SINGLEPAGE 
    45 player.vimeo.com    SINGLEPAGE 
    46 static-promote.weebly.com   SINGLEPAGE 
     48# column 3: whether nutch should do fetch all or not 
     49# column 4: number of crawl iterations 
     50 
     51# docs.google.com is a special case: not all pages are public and any interlinking is likely to 
     52# be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com 
     53# which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that 
     54# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file. 
     55#docs.google.com,SUBDOMAIN-COPY 
     56docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE 
     57 
     58drive.google.com,SINGLEPAGE 
     59forms.office.com,SINGLEPAGE 
     60player.vimeo.com,SINGLEPAGE 
     61static-promote.weebly.com,SINGLEPAGE 
    4762 
    4863# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos 
    4964# The page's containing folder is whitelisted in case the photos are there. 
    50 korora.econ.yale.edu        SINGLEPAGE 
     65korora.econ.yale.edu,,SINGLEPAGE 
    5166 
    5267000webhost.com 
     
@@ -115,5 +130,5 @@
 blackberry.com
 blogger.com
-blogspot.com    SUBDOMAIN-COPY
+blogspot.com,SUBDOMAIN-COPY
 bloomberg.com
 booking.com
     
@@ -171,5 +186,5 @@
 dreniq.com
 dribbble.com
-dropbox.com SINGLEPAGE
+dropbox.com,SINGLEPAGE
 dropboxusercontent.com
 dw.com
     
@@ -303,5 +318,5 @@
 lonelyplanet.com
 lycos.com
-m.wikipedia.org mi.m.wikipedia.org
+m.wikipedia.org,mi.m.wikipedia.org
 mail.ru
 marketwatch.com
     
@@ -315,5 +330,5 @@
 merriam-webster.com
 metro.co.uk
-microsoft.com   microsoft.com/mi-nz/
+microsoft.com,microsoft.com/mi-nz/
 microsoftonline.com
 mirror.co.uk
     
@@ -382,5 +397,5 @@
 photobucket.com
 php.net
-pinterest.com   SINGLEPAGE
+pinterest.com,SINGLEPAGE
 pixabay.com
 playstation.com
     
@@ -456,5 +471,5 @@
 stores.jp
 storify.com
-stuff.co.nz SINGLEPAGE
+stuff.co.nz,SINGLEPAGE
 surveymonkey.com
 symantec.com
     
@@ -534,11 +549,11 @@
 wikihow.com
 wikimedia.org
-wikipedia.org   mi.wikipedia.org
-wiktionary.org  mi.wiktionary.org
+wikipedia.org,mi.wikipedia.org
+wiktionary.org,mi.wiktionary.org
 wiley.com
 windowsphone.com
 wired.com
 wix.com
-wordpress.org   SUBDOMAIN-COPY
+wordpress.org,SUBDOMAIN-COPY
 worldbank.org
 wp.com
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

--- r33560
+++ r33561

@@ -69,4 +69,5 @@
     public final String SUBDOMAIN_COPY = "SUBDOMAIN-COPY";
     public final String SINGLEPAGE = "SINGLEPAGE";
+    public final String FOLLOW_LINKS_WITHIN_TOPSITE = "FOLLOW-LINKS-WITHIN-TOPSITE";
 
     /**
     
@@ -74,8 +75,9 @@
      * https://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions
      * https://www.regular-expressions.info/refcharacters.html
-     */
-    //public final String[] ESCAPE_CHARS_FOR_RE = [".", "^", "$", "*", "+", "?", "(", ")", "[", "{", "\\", "|"];
-    // put the \\ at start so we don't the escape character for chars escaped earlier
+     * Put the \\ (escape char) at start so we don't double-escape chars already escaped,
+     * as would happen for any chars appearing earlier in this list than \\
+     */
     public final String ESCAPE_CHARS_FOR_RE = "\\.^$*+?()[{|";
+    //public final String[] ESCAPE_CHARS_FOR_RE = ["\\", ".", "^", "$", "*", "+", "?", "(", ")", "[", "{", "|"];
 
     private Properties configProperties = new Properties();
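The rewritten comment explains why the backslash must come first in ESCAPE_CHARS_FOR_RE. A minimal sketch of the escaping loop that ordering implies; the real escapeStringForRegex() is not part of this hunk, so this body is an assumption:

    static final String ESCAPE_CHARS_FOR_RE = "\\.^$*+?()[{|";

    // Assumed illustration; the committed escapeStringForRegex() may differ.
    static String escapeStringForRegex(String str) {
        // Walk the chars in list order: '\' is first, so it gets escaped before
        // later replacements introduce new backslashes (no double-escaping).
        for (int i = 0; i < ESCAPE_CHARS_FOR_RE.length(); i++) {
            String ch = Character.toString(ESCAPE_CHARS_FOR_RE.charAt(i));
            str = str.replace(ch, "\\" + ch);
        }
        return str;
    }

    // escapeStringForRegex("docs.google.com") yields docs\.google\.com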
     
@@ -212,10 +214,11 @@
         }
 
-        int tabindex = str.indexOf("\t");
-        if(tabindex == -1) {
+        // comma separated list of values
+        int splitindex = str.indexOf(",");
+        if(splitindex == -1) {
             topSitesMap.put(str, "");
         } else {
-            String topsite = str.substring(0, tabindex).trim();
-            String allowed_url_pattern = str.substring(tabindex+1).trim();
+            String topsite = str.substring(0, splitindex).trim();
+            String allowed_url_pattern = str.substring(splitindex+1).trim();
             topSitesMap.put(topsite, allowed_url_pattern);
         }
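Note that only the first comma is split on, so everything after it becomes the stored value. A quick trace of the logic above:

    String str = "microsoft.com,microsoft.com/mi-nz/";
    int splitindex = str.indexOf(",");                                  // 13
    String topsite = str.substring(0, splitindex).trim();               // "microsoft.com"
    String allowed_url_pattern = str.substring(splitindex + 1).trim();  // "microsoft.com/mi-nz/"

    // By the same logic, "korora.econ.yale.edu,,SINGLEPAGE" stores the value
    // ",SINGLEPAGE", since the substring starts right after the first comma.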
     
@@ -352,10 +355,13 @@
         while(domainIterator.hasNext()) {
         String domainWithProtocol = domainIterator.next();
+        // Also get domain without protocol prefix
         int startIndex = domainWithProtocol.indexOf("//"); // http:// or https:// prefix
         startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
         String domain = domainWithProtocol.substring(startIndex);
-
-        System.err.println("domain with protocol: " + domainWithProtocol);
-        System.err.println("domain: " + domain);
+
+        /*if(domain.contains("docs.google.com")) {
+            System.err.println("domain with protocol: " + domainWithProtocol);
+            System.err.println("domain: " + domain);
+            }*/
 
         String allowedURLPatternRegex = isURLinTopSitesMap(domain);
     
@@ -372,10 +378,8 @@
 
             Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
-            Iterator<String> urlIterator = urlsForDomainSet.iterator();
-            while(urlIterator.hasNext()) {
-            String url = urlIterator.next();
+            for(String url : urlsForDomainSet) {
             topSiteMatchesWriter.write("\t" + url + "\n");
             }
 
             continue; // done with this domain
         }
     
@@ -451,4 +455,12 @@
                 siteRegexWriter.write(regexed_url + "\n");
                 }
+            } else if(allowedURLPatternRegex.equals(FOLLOW_LINKS_WITHIN_TOPSITE)) {
+
+                // DON'T write out domain into siteURLs file,
+                // BUT DO write it into urlFilter file
+                String regexed_domain = PROTOCOL_REGEX_PREFIX + escapeStringForRegex(domain) + "/";
+
+                urlFilterWriter.write(regexed_domain + "\n");
+                siteRegexWriter.write(regexed_domain + "\n");
             } else { // allowedURLPatternRegex is a url-form - convert to regex
                 if(!allowedURLPatternRegex.endsWith("/")) {
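For a FOLLOW-LINKS-WITHIN-TOPSITE entry such as docs.google.com, the new branch writes one domain-wide regex into both the urlFilter and siteRegex files. PROTOCOL_REGEX_PREFIX is defined outside this diff, so the concrete output below is an assumption for illustration:

    String domain = "docs.google.com";
    String regexed_domain = PROTOCOL_REGEX_PREFIX + escapeStringForRegex(domain) + "/";
    // If PROTOCOL_REGEX_PREFIX were, say, "https?://", the line written to both
    // files would be:
    //     https?://docs\.google\.com/
    // letting the crawl follow links anywhere on docs.google.com but nowhere else,
    // while keeping the domain out of the siteURLs file.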
     
@@ -467,10 +479,9 @@
             // also write into the global seeds file (with a tab prefixed to each?)
             Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
-            Iterator<String> urlIterator = urlsForDomainSet.iterator();
-            while(urlIterator.hasNext()) {
-            String url = urlIterator.next();
+            for(String url : urlsForDomainSet) {
             seedURLsWriter.write(url + "\n"); // global seedURLs file
             siteURLsWriter.write(url + "\n");
             }
+
         } catch (IOException ioe) {
             ioe.printStackTrace();