Changeset 33561

Timestamp: 11.10.2019 20:49:05 (5 weeks ago)
Author: ak19
Message:

1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated list.
2. After the discussion with Dr Bainbridge that SINGLEPAGE is not what we want for docs.google.com, I found that the tentative switch to SUBDOMAIN-COPY for docs.google.com will not work, precisely because of the important change we had to make yesterday: with SUBDOMAIN-COPY, only copy SUBdomains, not root domains. If a root domain is marked SUBDOMAIN-COPY, the seedURL gets written out to unprocessed-topsite-matches.txt and its site doesn't get crawled.
3. This revealed a lacuna in the possible list of values for sites-too-big-to-exhaustively-crawl.txt, and I had to invent a new value, which I introduce and have tested with this commit: FOLLOW_LINKS_WITHIN_TOPSITE. This value so far applies only to docs.google.com and will keep following any links originating in a seedURL on docs.google.com, but only as long as they stay within that topsite domain (docs.google.com).
4. Tidied some old-fashioned use of Iterator, replacing it with the newer style of for loop that works with generic types.

Committing before updating the code to use the Apache Commons CSV API.
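The commit message's final sentence anticipates moving to the Apache Commons CSV API. A minimal sketch of what reading the now comma-separated file might look like with commons-csv; the file path and topSitesMap name come from this changeset, while the class name, comment-marker handling, and everything else are assumptions, not the committed code:

    import java.io.FileReader;
    import java.io.IOException;
    import java.io.Reader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVRecord;

    public class TopSitesCsvSketch {
        public static void main(String[] args) {
            Map<String, String> topSitesMap = new HashMap<>();
            // Treat '#' lines as comments and skip blanks, matching the conf file's layout
            CSVFormat format = CSVFormat.DEFAULT
                    .withCommentMarker('#')
                    .withIgnoreEmptyLines(true);
            try (Reader reader = new FileReader("conf/sites-too-big-to-exhaustively-crawl.txt")) {
                for (CSVRecord record : format.parse(reader)) {
                    String topsite = record.get(0).trim();
                    // column 2 holds the <value>, e.g. SINGLEPAGE or FOLLOW-LINKS-WITHIN-TOPSITE
                    String value = (record.size() > 1) ? record.get(1).trim() : "";
                    topSitesMap.put(topsite, value);
                }
            } catch (IOException ioe) {
                ioe.printStackTrace();
            }
            System.err.println("Read " + topSitesMap.size() + " topsite entries");
        }
    }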

Location: gs3-extensions/maori-lang-detection
Files: 2 modified

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

--- r33559
+++ r33561

@@ -12,5 +12,5 @@
 
 # FORMAT OF THIS FILE'S CONTENTS:
-#    <topsite-base-url><tabspace><value>
+#    <topsite-base-url>,<value>
 # where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
 #
     
    2929#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go 
    3030#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled. 
     31#   - FOLLOW-LINKS-WITHIN-TOPSITE: if pages linked from the seedURL page can be followed and 
     32#     downloaded, as long as it's within the same subdomain matching the topsite-base-url. 
     33#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but 
     34#     restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything 
     35#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at 
     36#     depth specified for the nutch crawl) as long as they're within the topsite-base-url. 
     37#     e.g. seedURLs on docs.google.com containing links will have those linked pages and any 
     38#     they link to etc. downloaded as long as they're on docs.google.com. 
    3139#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided 
    3240#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a 
     
    3846#     Remember to leave out any protocol <from url-form-without-protocol>. 
    3947 
    40  
    41  
    42 docs.google.com  SINGLEPAGE 
    43 drive.google.com    SINGLEPAGE 
    44 forms.office.com    SINGLEPAGE 
    45 player.vimeo.com    SINGLEPAGE 
    46 static-promote.weebly.com   SINGLEPAGE 
     48# column 3: whether nutch should do fetch all or not 
     49# column 4: number of crawl iterations 
     50 
     51# docs.google.com is a special case: not all pages are public and any interlinking is likely to 
     52# be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com 
     53# which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that 
     54# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file. 
     55#docs.google.com,SUBDOMAIN-COPY 
     56docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE 
     57 
     58drive.google.com,SINGLEPAGE 
     59forms.office.com,SINGLEPAGE 
     60player.vimeo.com,SINGLEPAGE 
     61static-promote.weebly.com,SINGLEPAGE 
    4762 
    4863# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos 
    4964# The page's containing folder is whitelisted in case the photos are there. 
    50 korora.econ.yale.edu        SINGLEPAGE 
     65korora.econ.yale.edu,,SINGLEPAGE 
    5166 
    5267000webhost.com 
     
@@ -115,5 +130,5 @@
 blackberry.com
 blogger.com
-blogspot.com    SUBDOMAIN-COPY
+blogspot.com,SUBDOMAIN-COPY
 bloomberg.com
 booking.com
     
@@ -171,5 +186,5 @@
 dreniq.com
 dribbble.com
-dropbox.com SINGLEPAGE
+dropbox.com,SINGLEPAGE
 dropboxusercontent.com
 dw.com
     
@@ -303,5 +318,5 @@
 lonelyplanet.com
 lycos.com
-m.wikipedia.org mi.m.wikipedia.org
+m.wikipedia.org,mi.m.wikipedia.org
 mail.ru
 marketwatch.com
     
@@ -315,5 +330,5 @@
 merriam-webster.com
 metro.co.uk
-microsoft.com   microsoft.com/mi-nz/
+microsoft.com,microsoft.com/mi-nz/
 microsoftonline.com
 mirror.co.uk
     
@@ -382,5 +397,5 @@
 photobucket.com
 php.net
-pinterest.com   SINGLEPAGE
+pinterest.com,SINGLEPAGE
 pixabay.com
 playstation.com
     
@@ -456,5 +471,5 @@
 stores.jp
 storify.com
-stuff.co.nz SINGLEPAGE
+stuff.co.nz,SINGLEPAGE
 surveymonkey.com
 symantec.com
     
@@ -534,11 +549,11 @@
 wikihow.com
 wikimedia.org
-wikipedia.org   mi.wikipedia.org
-wiktionary.org  mi.wiktionary.org
+wikipedia.org,mi.wikipedia.org
+wiktionary.org,mi.wiktionary.org
 wiley.com
 windowsphone.com
 wired.com
 wix.com
-wordpress.org   SUBDOMAIN-COPY
+wordpress.org,SUBDOMAIN-COPY
 worldbank.org
 wp.com
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

--- r33560
+++ r33561

@@ -69,4 +69,5 @@
     public final String SUBDOMAIN_COPY = "SUBDOMAIN-COPY";
     public final String SINGLEPAGE = "SINGLEPAGE";
+    public final String FOLLOW_LINKS_WITHIN_TOPSITE = "FOLLOW-LINKS-WITHIN-TOPSITE";
 
     /**
     
@@ -74,8 +75,9 @@
      * https://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions
      * https://www.regular-expressions.info/refcharacters.html
-     */
-    //public final String[] ESCAPE_CHARS_FOR_RE = [".", "^", "$", "*", "+", "?", "(", ")", "[", "{", "\\", "|"];
-    // put the \\ at start so we don't the escape character for chars escaped earlier
+     * Put the \\ (escape char) at start so we don't double-escape chars already escaped,
+     * as would happen for any chars appearing earlier in this list than \\
+     */
     public final String ESCAPE_CHARS_FOR_RE = "\\.^$*+?()[{|";
+    //public final String[] ESCAPE_CHARS_FOR_RE = ["\\", ".", "^", "$", "*", "+", "?", "(", ")", "[", "{", "|"];
 
     private Properties configProperties = new Properties();
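The rewritten comment explains why the backslash must come first in ESCAPE_CHARS_FOR_RE. A minimal sketch of the escaping loop that ordering implies; the real escapeStringForRegex() is not part of this hunk, so this body is an assumption:

    static final String ESCAPE_CHARS_FOR_RE = "\\.^$*+?()[{|";

    // Assumed illustration; the committed escapeStringForRegex() may differ.
    static String escapeStringForRegex(String str) {
        // Walk the chars in list order: '\' is first, so it gets escaped before
        // later replacements introduce new backslashes (no double-escaping).
        for (int i = 0; i < ESCAPE_CHARS_FOR_RE.length(); i++) {
            String ch = Character.toString(ESCAPE_CHARS_FOR_RE.charAt(i));
            str = str.replace(ch, "\\" + ch);
        }
        return str;
    }

    // escapeStringForRegex("docs.google.com") yields docs\.google\.com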
     
@@ -212,10 +214,11 @@
         }
 
-        int tabindex = str.indexOf("\t");
-        if(tabindex == -1) {
+        // comma separated list of values
+        int splitindex = str.indexOf(",");
+        if(splitindex == -1) {
             topSitesMap.put(str, "");
         } else {
-            String topsite = str.substring(0, tabindex).trim();
-            String allowed_url_pattern = str.substring(tabindex+1).trim();
+            String topsite = str.substring(0, splitindex).trim();
+            String allowed_url_pattern = str.substring(splitindex+1).trim();
             topSitesMap.put(topsite, allowed_url_pattern);
         }
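Note that only the first comma is split on, so everything after it becomes the stored value. A quick trace of the logic above:

    String str = "microsoft.com,microsoft.com/mi-nz/";
    int splitindex = str.indexOf(",");                                  // 13
    String topsite = str.substring(0, splitindex).trim();               // "microsoft.com"
    String allowed_url_pattern = str.substring(splitindex + 1).trim();  // "microsoft.com/mi-nz/"

    // By the same logic, "korora.econ.yale.edu,,SINGLEPAGE" stores the value
    // ",SINGLEPAGE", since the substring starts right after the first comma.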
     
@@ -352,10 +355,13 @@
         while(domainIterator.hasNext()) {
         String domainWithProtocol = domainIterator.next();
+        // Also get domain without protocol prefix
         int startIndex = domainWithProtocol.indexOf("//"); // http:// or https:// prefix
         startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
         String domain = domainWithProtocol.substring(startIndex);
-
-        System.err.println("domain with protocol: " + domainWithProtocol);
-        System.err.println("domain: " + domain);
+
+        /*if(domain.contains("docs.google.com")) {
+            System.err.println("domain with protocol: " + domainWithProtocol);
+            System.err.println("domain: " + domain);
+            }*/
 
         String allowedURLPatternRegex = isURLinTopSitesMap(domain);
     
@@ -372,10 +378,8 @@
 
             Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
-            Iterator<String> urlIterator = urlsForDomainSet.iterator();
-            while(urlIterator.hasNext()) {
-            String url = urlIterator.next();
+            for(String url : urlsForDomainSet) {
             topSiteMatchesWriter.write("\t" + url + "\n");
             }
 
             continue; // done with this domain
         }
     
@@ -451,4 +455,12 @@
                 siteRegexWriter.write(regexed_url + "\n");
                 }
+            } else if(allowedURLPatternRegex.equals(FOLLOW_LINKS_WITHIN_TOPSITE)) {
+
+                // DON'T write out domain into siteURLs file,
+                // BUT DO write it into urlFilter file
+                String regexed_domain = PROTOCOL_REGEX_PREFIX + escapeStringForRegex(domain) + "/";
+
+                urlFilterWriter.write(regexed_domain + "\n");
+                siteRegexWriter.write(regexed_domain + "\n");
             } else { // allowedURLPatternRegex is a url-form - convert to regex
                 if(!allowedURLPatternRegex.endsWith("/")) {
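For a FOLLOW-LINKS-WITHIN-TOPSITE entry such as docs.google.com, the new branch writes one domain-wide regex into both the urlFilter and siteRegex files. PROTOCOL_REGEX_PREFIX is defined outside this diff, so the concrete output below is an assumption for illustration:

    String domain = "docs.google.com";
    String regexed_domain = PROTOCOL_REGEX_PREFIX + escapeStringForRegex(domain) + "/";
    // If PROTOCOL_REGEX_PREFIX were, say, "https?://", the line written to both
    // files would be:
    //     https?://docs\.google\.com/
    // letting the crawl follow links anywhere on docs.google.com but nowhere else,
    // while keeping the domain out of the siteURLs file.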
     
@@ -467,10 +479,9 @@
             // also write into the global seeds file (with a tab prefixed to each?)
             Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
-            Iterator<String> urlIterator = urlsForDomainSet.iterator();
-            while(urlIterator.hasNext()) {
-            String url = urlIterator.next();
+            for(String url : urlsForDomainSet) {
             seedURLsWriter.write(url + "\n"); // global seedURLs file
             siteURLsWriter.write(url + "\n");
             }
+
         } catch (IOException ioe) {
             ioe.printStackTrace();