Changeset 33666 for other-projects/maori-lang-detection/src/org

Timestamp:

2019-11-13T23:08:37+13:00

Author:

ak19

Message:

Having finished sending all the crawl data to MongoDB:

1. Recrawled the two sites I had earlier noted as requiring recrawling, 00152 and 00332. Site 00152 required changes to how it was crawled: MP3 files needed to be blocked, as there were HBase error messages about key values being too large.
2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too (at the location in the file where jpg etc. were already blocked by nutch's default regex url filters).
3. Further had to restrict the 00152 site to be crawled only under its /maori/ sub-domain. Since the seedURL maori.html was not under a /maori/ url, this revealed that the CCWETProcessor code did not let the filters accept seedURLs when the crawl was restricted to a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but the seedURL itself didn't match those restricting regex filters. So now, in such cases, the CCWETProcessor also adds the non-matching seedURLs to the filters (so we get just the single page of each such seedURL), in addition to the filter on the requested subdomain, so we follow all pages linked by the seedURLs that match the subdomain expression.
4. Adding to_crawl.tar.gz to svn: the tarball of the to_crawl sites that I actually ran nutch over, containing all the sites folders with their seedURL.txt and regex-urlfilter.txt files that batchcrawl.sh runs over. This didn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since the latest version was regenerated after the final modifications to CCWETProcessor, which were made after crawling had finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.
5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt file results were added into MongoDB).
7. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.
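The filter ordering described in point 3 can be sketched roughly as follows. This is a minimal illustration, not the CCWETProcessor code itself: the class name SeedFilterSketch, the ASSUMED_FILTER_PREFIX value, the example.org URLs and the use of Pattern.quote in place of escapeStringForRegex are all assumptions made for the example, since the changeset does not show the value of FILTER_REGEX_PREFIX or the body of escapeStringForRegex.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

final class SeedFilterSketch {
    // ASSUMPTION: stand-in for FILTER_REGEX_PREFIX, whose real value is not shown here.
    static final String ASSUMED_FILTER_PREFIX = "+^https?://(www\\.)?";

    // Same idea as Utility.stripProtocolAndWWWFromURL in this changeset.
    static String stripProtocolAndWWW(String url) {
        int i = url.indexOf("//");
        String rest = (i == -1) ? url : url.substring(i + 2);
        return rest.startsWith("www.") ? rest.substring(4) : rest;
    }

    // Returns the filter lines in the order they would be written:
    // first any seed URLs outside the allowed part of the site (as exact,
    // single-page patterns), then the pattern restricting everything else.
    static List<String> buildFilters(List<String> seedURLs, String allowedURLPattern) {
        List<String> filters = new ArrayList<>();
        String allowed = stripProtocolAndWWW(allowedURLPattern);
        for (String seed : seedURLs) {
            if (!stripProtocolAndWWW(seed).startsWith(allowed)) {
                // no trailing slash appended: this should match one page only
                filters.add("+^" + Pattern.quote(seed));
            }
        }
        filters.add(ASSUMED_FILTER_PREFIX + Pattern.quote(allowed));
        return filters;
    }

    public static void main(String[] args) {
        // Hypothetical site: the seed page sits outside /maori/, its links inside.
        List<String> seeds = Arrays.asList(
            "https://www.example.org/maori.html",
            "https://www.example.org/maori/page1.html");
        for (String line : buildFilters(seeds, "https://www.example.org/maori/")) {
            System.out.println(line);
        }
    }
}
```

In this sketch only the first seed gets its own filter line, because the second already falls under the /maori/ pattern. The ordering matters on the assumption that the url filter acts on the first matching rule, so the exact seed patterns are emitted before the broader restriction.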

Location:

other-projects/maori-lang-detection/src/org/greenstone/atea

Files:

2 edited

CCWETProcessor.java (modified) (3 diffs)
Utility.java (modified) (3 diffs)


other-projects/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-              r33624
+              r33666
  * e.g. (from maori-lang-detection/src)
+ *
  *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl
  *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl 2>&1 | less
+ *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/maori-lang-detection/to_crawl
+ *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/maori-lang-detection/to_crawl 2>&1 | less
+ *
 */
 …
             // Only write urls and no domain into single global seedurls file
             // But write domain and tabbed urls into individual sites/0000#/seedURLs.txt
+            // But write domain and tab-spaced urls into individual sites/0000#/seedURLs.txt
             // files (and write regexed domain into each sites/0000#/regex-urlfilter.txt)
             // If we ever run nutch on a single seedURLs listing containing
 …
                 allowedURLPatternRegex += "/";
+                }
                 String regexed_pattern = PROTOCOL_REGEX_PREFIX+escapeStringForRegex(allowedURLPatternRegex);
+                String regexed_pattern = FILTER_REGEX_PREFIX+escapeStringForRegex(allowedURLPatternRegex);
                 //String regexed_pattern = PROTOCOL_REGEX_PREFIX+allowedURLPatternRegex.replace(".", "\\.");
+                // In case any of the seedURLs themselves are not within the
+                // allowedURLPatternRegex part of the site, FIRST write out such
+                // seedURLs as allowed regex patterns, so they get downloaded
+                // as single pages.
+                urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
+                for(String urlInDomain : urlsForDomainSet) {
+                String urlWithoutProtocolAndWWW = Utility.stripProtocolAndWWWFromURL(urlInDomain);
+                String allowedURLPatternWithoutProtocolAndWWW = Utility.stripProtocolAndWWWFromURL(allowedURLPatternRegex);
+                if(!urlWithoutProtocolAndWWW.startsWith(allowedURLPatternWithoutProtocolAndWWW)) {
+                    // don't append slash to end this time
+                    String regexed_url = "+^"+escapeStringForRegex(urlInDomain);
+                    urlFilterWriter.write(regexed_url + "\n");
+                    siteRegexWriter.write(regexed_url + "\n");
+                }
+                }
                 siteURLsWriter.write(domainWithProtocol + "\n");
+                // write out allowedURLPatternRegex instead of the domain
+                //siteURLsWriter.write(allowedURLPatternRegex + "\n");
+                // Now restrict any other URLs found to be within the allowedURLPattern
+                // part of the site
                 urlFilterWriter.write(regexed_pattern + "\n");
+                siteRegexWriter.write(regexed_pattern + "\n");
+                siteRegexWriter.write(regexed_pattern + "\n");
+            }
+            }

other-projects/maori-lang-detection/src/org/greenstone/atea/Utility.java

-              r33623
+              r33666
     throws Exception
+    {
     int startIndex = domainWithProtocol.indexOf("//"); // http:// or https:// prefix
     startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
     String domain = domainWithProtocol.substring(startIndex);
+    //int startIndex = domainWithProtocol.indexOf("//"); // http:// or https:// prefix
+    //startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
+    String domain = stripProtocolFromURL(domainWithProtocol); //domainWithProtocol.substring(startIndex);
     // pass in the GeoLiteCity.dat file to be able to do the location lookup for domain's IP
 …
+    }
+    public static String stripProtocolAndWWWFromURL(String url) {
+    url = stripProtocolFromURL(url);
+    if(url.startsWith("www.")) { // strip any "www." at start as well
+        url = url.substring(4);
+    }
+    return url;
+    }
+    public static String stripProtocolFromURL(String url) {
+    int startIndex = url.indexOf("//"); // for http:// or https:// prefix
+    startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
+    return url.substring(startIndex);
+    }
     /** Work out the 'domain' for a given url.
      * This retains any www. or subdomain prefix.
 …
     int startIndex = startIndex = url.indexOf("//"); // for http:// or https:// prefix
     startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
     // the keep the URL around in case param withProtocol=true
+    // keep the protocol around in case param withProtocol=true
     String protocol = (startIndex == -1) ? "" : url.substring(0, startIndex);
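To illustrate what the two new Utility helpers return, here is a small standalone sketch. The method bodies follow the added lines above; the wrapper class UrlStripSketch and the example.org inputs are hypothetical and exist only for the demonstration.

```java
final class UrlStripSketch {
    // Mirrors Utility.stripProtocolFromURL as added in this changeset.
    static String stripProtocolFromURL(String url) {
        int startIndex = url.indexOf("//"); // for http:// or https:// prefix
        startIndex = (startIndex == -1) ? 0 : (startIndex + 2); // skip past the //
        return url.substring(startIndex);
    }

    // Mirrors Utility.stripProtocolAndWWWFromURL as added in this changeset.
    static String stripProtocolAndWWWFromURL(String url) {
        url = stripProtocolFromURL(url);
        if (url.startsWith("www.")) { // only a leading "www." is removed
            url = url.substring(4);
        }
        return url;
    }

    public static void main(String[] args) {
        String[] samples = {
            "https://www.example.org/maori.html", // protocol and www. both removed
            "example.org",                        // no protocol: returned unchanged
            "http://wwwexample.org",              // "www" without the dot is kept
            "example.org/a//b"                    // first "//" is not a protocol here
        };
        for (String s : samples) {
            System.out.println(s + " -> " + stripProtocolAndWWWFromURL(s));
        }
    }
}
```

One thing the last sample suggests: because the helper looks for the first "//" anywhere in the string, a protocol-less URL that contains a double slash in its path loses everything up to that point. Whether such URLs occur in the crawl data is not something this changeset shows.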
