CCWETProcessor.java

Timestamp:

2019-10-11T20:49:05+13:00

Author:

ak19

Message:

1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated list.
2. After the discussion with Dr Bainbridge that SINGLEPAGE is not what we want for docs.google.com, I found that the tentative switch to SUBDOMAIN-COPY for docs.google.com will not work, precisely because of the important change we had to make yesterday: if SUBDOMAIN-COPY, then only copy SUBdomains, and not root domains. If a root domain has SUBDOMAIN-COPY, then the seedURL gets written out to unprocessed-topsite-matches.txt and its site doesn't get crawled.
3. This revealed a lacuna in the list of possible values in sites-too-big-to-exhaustively-crawl.txt, and I had to invent a new value, which I introduce and have tested with this commit: FOLLOW_LINKS_WITHIN_TOPSITE. This value so far applies only to docs.google.com and will keep following any links originating in a seedURL on docs.google.com, but only as long as they stay within that topsite domain (docs.google.com).
4. Tidied some old-fashioned use of Iterator, replaced with the newer style of for loops that work with Types. Committing before updating the code to use the Apache CSV API.
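
To illustrate point 1, the sketch below shows one way a line of sites-too-big-to-exhaustively-crawl.txt could be split at its first comma into a topsite and its crawl value. It mirrors the splitindex logic in the diff, but the class name, the parse method and the skipping of blank lines are assumptions made for the example, not code from CCWETProcessor.java.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, for illustration only.
class TopSitesListSketch {

    /** Splits each line at its first comma: topsite before it, crawl value after it. */
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> topSitesMap = new LinkedHashMap<>();
        for (String line : lines) {
            String str = line.trim();
            if (str.isEmpty()) {
                continue; // assumption: blank lines carry no entry
            }
            int splitindex = str.indexOf(",");
            if (splitindex == -1) {
                // no value given: store the site with an empty value
                topSitesMap.put(str, "");
            } else {
                String topsite = str.substring(0, splitindex).trim();
                String allowedUrlPattern = str.substring(splitindex + 1).trim();
                topSitesMap.put(topsite, allowedUrlPattern);
            }
        }
        return topSitesMap;
    }

    public static void main(String[] args) {
        Map<String, String> map = parse(Arrays.asList(
            "docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE",
            "example.org"));
        System.out.println(map.get("docs.google.com")); // prints FOLLOW-LINKS-WITHIN-TOPSITE
        System.out.println(map.get("example.org").isEmpty()); // prints true
    }
}
```

Splitting at the first comma only means a value that itself contains a comma is kept whole, which is presumably one reason the message mentions moving to the Apache CSV API next.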

Files:

1 edited

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) (7 diffs)


gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-              r33560
+              r33561
     public final String SUBDOMAIN_COPY = "SUBDOMAIN-COPY";
     public final String SINGLEPAGE = "SINGLEPAGE";
+    public final String FOLLOW_LINKS_WITHIN_TOPSITE = "FOLLOW-LINKS-WITHIN-TOPSITE";
     /**
 …
      * https://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions
      * https://www.regular-expressions.info/refcharacters.html
-    */
-    //public final String[] ESCAPE_CHARS_FOR_RE = [".", "^", "$", "*", "+", "?", "(", ")", "[", "{", "\\", "|"];
-    // put the \\ at start so we don't the escape character for chars escaped earlier
+     * Put the \\ (escape char) at start so we don't double-escape chars already escaped,
+     * as would happen for any chars appearing earlier in this list than \\
+    */
     public final String ESCAPE_CHARS_FOR_RE = "\\.^$*+?()[{|";
+    //public final String[] ESCAPE_CHARS_FOR_RE = ["\\", ".", "^", "$", "*", "+", "?", "(", ")", "[", "{", "|"];
     private Properties configProperties = new Properties();
 …
+        }
-        int tabindex = str.indexOf("\t");
-        if(tabindex == -1) {
+        // comma separated list of values
+        int splitindex = str.indexOf(",");
+        if(splitindex == -1) {
             topSitesMap.put(str, "");
         } else {
-            String topsite = str.substring(0, tabindex).trim();
-            String allowed_url_pattern = str.substring(tabindex+1).trim();
+            String topsite = str.substring(0, splitindex).trim();
+            String allowed_url_pattern = str.substring(splitindex+1).trim();
             topSitesMap.put(topsite, allowed_url_pattern);
+        }
 …
         while(domainIterator.hasNext()) {
         String domainWithProtocol = domainIterator.next();
+        // Also get domain without protocol prefix
         int startIndex = domainWithProtocol.indexOf("//"); // http:// or https:// prefix
         startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
         String domain = domainWithProtocol.substring(startIndex);
+        System.err.println("domain with protocol: " + domainWithProtocol);
+        System.err.println("domain: " + domain);
+        /*if(domain.contains("docs.google.com")) {
+            System.err.println("domain with protocol: " + domainWithProtocol);
+            System.err.println("domain: " + domain);
+            }*/
         String allowedURLPatternRegex = isURLinTopSitesMap(domain);
 …
             Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
-            Iterator<String> urlIterator = urlsForDomainSet.iterator();
-            while(urlIterator.hasNext()) {
-            String url = urlIterator.next();
+            for(String url : urlsForDomainSet) {
             topSiteMatchesWriter.write("\t" + url + "\n");
+            }
             continue; // done with this domain
+        }
 …
                 siteRegexWriter.write(regexed_url + "\n");
+                }
+            } else if(allowedURLPatternRegex.equals(FOLLOW_LINKS_WITHIN_TOPSITE)) {
+                // DON'T write out domain into siteURLs file,
+                // BUT DO write it into urlFilter file
+                String regexed_domain = PROTOCOL_REGEX_PREFIX + escapeStringForRegex(domain) + "/";
+                urlFilterWriter.write(regexed_domain + "\n");
+                siteRegexWriter.write(regexed_domain + "\n");
             } else { // allowedURLPatternRegex is a url-form - convert to regex
                 if(!allowedURLPatternRegex.endsWith("/")) {
 …
             // also write into the global seeds file (with a tab prefixed to each?)
             Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
-            Iterator<String> urlIterator = urlsForDomainSet.iterator();
-            while(urlIterator.hasNext()) {
-            String url = urlIterator.next();
+            for(String url : urlsForDomainSet) {
             seedURLsWriter.write(url + "\n"); // global seedURLs file
             siteURLsWriter.write(url + "\n");
+            }
         } catch (IOException ioe) {
             ioe.printStackTrace();
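
The FOLLOW_LINKS_WITHIN_TOPSITE branch builds a regex from the escaped domain, and the comment on ESCAPE_CHARS_FOR_RE explains why the backslash has to be first in that list. The sketch below shows both ideas under stated assumptions: the body of escapeStringForRegex is not part of this changeset, so the loop here is only one plausible implementation, and the value given to PROTOCOL_REGEX_PREFIX is hypothetical. The class name and the withinTopsite helper are also invented for the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TopsiteFilterSketch {
    // Same character list as in the diff: the backslash comes first.
    static final String ESCAPE_CHARS_FOR_RE = "\\.^$*+?()[{|";

    // Assumed value; the real PROTOCOL_REGEX_PREFIX is not shown in this changeset.
    static final String PROTOCOL_REGEX_PREFIX = "^https?://";

    /**
     * Prefixes each special character with a backslash. Because the backslash
     * is handled first, the backslashes added for later characters are not
     * escaped a second time.
     */
    static String escapeStringForRegex(String str) {
        for (int i = 0; i < ESCAPE_CHARS_FOR_RE.length(); i++) {
            String c = Character.toString(ESCAPE_CHARS_FOR_RE.charAt(i));
            str = str.replace(c, "\\" + c); // literal (non-regex) replacement
        }
        return str;
    }

    /** Regex matching any URL that lives on the given topsite domain. */
    static Pattern withinTopsite(String domain) {
        return Pattern.compile(PROTOCOL_REGEX_PREFIX + escapeStringForRegex(domain) + "/");
    }

    public static void main(String[] args) {
        Pattern p = withinTopsite("docs.google.com");
        System.out.println(p.matcher("https://docs.google.com/document/d/abc").find()); // prints true
        System.out.println(p.matcher("https://docsXgoogle.com/").find()); // prints false
    }
}
```

If the backslash were processed after the dot, the backslash inserted in front of each dot would itself be doubled, and the resulting pattern would match a literal backslash followed by any character instead of a literal dot.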
