Changeset 33565 for gs3-extensions


Timestamp:
2019-10-14T21:04:58+13:00 (5 years ago)
Author:
ak19
Message:

1. CCWETProcessor: the domain URL now goes in as a seedURL after the individual seedURLs, after Dr Bainbridge explained why the original ordering didn't make sense. 2. conf: we inspected the first site to be crawled; it was a non-top site, but we still wanted to control its crawling in the same way we control top sites. 3. Documented use of the nutch command for testing which URLs pass and fail the existing regex-urlfilter checks.

Location:
gs3-extensions/maori-lang-detection
Files:
3 edited

Legend:

    unmarked lines: unmodified context
    lines prefixed with +: added
    lines prefixed with -: removed
  • gs3-extensions/maori-lang-detection/MoreReading/crawling-Nutch.txt

    r33558 → r33565

     ---

    +----------------------------------------------------------------------
    +    Testing URLFilters: testing a URL to see if it's accepted
    +----------------------------------------------------------------------
    +Use the command
    +    ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    +(mentioned at https://lucene.472066.n3.nabble.com/Correct-syntax-for-regex-urlfilter-txt-trying-to-exclude-single-path-results-td3600376.html)
    +
    +Use it as follows:
    +
    +    cd apache-nutch-2.3.1/runtime/local
    +
    +    ./bin/nutch org.apache.nutch.net.URLFilterChecker -allCombined
    +
    +Then paste the URL you want to test and press Enter.
    +    A + in front of the response means the URL is accepted.
    +    A - in front of the response means the URL is rejected.
    +You can keep pasting URLs to test against the filters until you press Ctrl-D to terminate input.
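The + / - in the checker's response mirrors the + / - rules in Nutch's regex-urlfilter.txt, where the first rule whose regex matches a URL decides whether it is included or ignored. The Java sketch below only illustrates that first-match-wins idea; it is not Nutch's actual URLFilter or URLFilterChecker code, and the class name and example rules are made up.

    // Illustrative sketch only, NOT Nutch's URLFilterChecker code: demonstrates the
    // first-match-wins semantics of +/- rules in a regex-urlfilter.txt-style file.
    import java.util.LinkedHashMap;
    import java.util.Map;
    import java.util.regex.Pattern;

    public class RegexUrlFilterSketch {
        // insertion-ordered map: rule pattern -> true for '+' (accept), false for '-' (reject)
        private final Map<Pattern, Boolean> rules = new LinkedHashMap<>();

        public void addRule(String line) {
            char sign = line.charAt(0); // '+' or '-'
            rules.put(Pattern.compile(line.substring(1).trim()), sign == '+');
        }

        // Mimics the checker's output: "+url" if the first matching rule accepts,
        // "-url" if it rejects or if no rule matches at all.
        public String check(String url) {
            for (Map.Entry<Pattern, Boolean> rule : rules.entrySet()) {
                if (rule.getKey().matcher(url).find()) {
                    return (rule.getValue() ? "+" : "-") + url;
                }
            }
            return "-" + url;
        }

        public static void main(String[] args) {
            RegexUrlFilterSketch filter = new RegexUrlFilterSketch();
            filter.addRule("-\\.(gif|jpg|png|css|js)$"); // made-up rule: reject common asset suffixes
            filter.addRule("+^https?://");               // made-up rule: otherwise accept http(s) URLs
            System.out.println(filter.check("https://example.org/page.html")); // prints +https://...
            System.out.println(filter.check("https://example.org/logo.png"));  // prints -https://...
        }
    }

In the real setup the rules come from the Nutch filter configuration combined by -allCombined, which is what makes the checker useful for verifying the per-site regex-urlfilter.txt files this changeset's code generates.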
  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33562 → r33565

     #   column 3: whether nutch should do fetch all or not
     #   column 4: number of crawl iterations
    +
    +
    +# NOT TOP SITES, BUT SITES WE INSPECTED AND WANT TO CONTROL SIMILARLY TO TOP SITES
    +00.gs,SINGLEPAGE
    +
    +
    +# TOP SITES
     
     # docs.google.com is a special case: not all pages are public and any interlinking is likely to
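For context, this is roughly how a line such as 00.gs,SINGLEPAGE could be read: the first comma-separated field is the domain, and the remaining fields are the crawl-control columns described by the file's own comments (fetch-all flag, number of crawl iterations, and so on). The sketch below is hypothetical; it is not the parsing code actually used by CCWETProcessor, and the class name and the opaque handling of values like SINGLEPAGE are assumptions.

    // Hypothetical reader for sites-too-big-to-exhaustively-crawl.txt style lines;
    // not CCWETProcessor's actual parsing code.
    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    public class TopSitesConfSketch {
        public static Map<String, String[]> load(String confFile) throws IOException {
            Map<String, String[]> domainToFields = new HashMap<>();
            try (BufferedReader reader = new BufferedReader(new FileReader(confFile))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    line = line.trim();
                    if (line.isEmpty() || line.startsWith("#")) {
                        continue; // skip blank lines and comment lines
                    }
                    String[] fields = line.split(",");
                    // fields[0] = domain (e.g. 00.gs); the rest are crawl-control columns
                    // such as SINGLEPAGE, the fetch-all flag or the iteration count
                    domainToFields.put(fields[0].trim(), fields);
                }
            }
            return domainToFields;
        }
    }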
  • gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

    r33562 → r33565

                 // files (and write regexed domain into each sites/0000#/regex-urlfilter.txt)
                 // If we ever run nutch on a single seedURLs listing containing
    -            // all seed pages to crawl sites from, the above two files will work for that. 
    +            // all seed pages to crawl sites from, the above two files will work for that.
    +
    +            // first write out the urls for the domain into the sites/0000x/seedURLs.txt file
    +            // also write into the global seeds file (with a tab prefixed to each?)
    +            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    +            for(String url : urlsForDomainSet) {
    +            seedURLsWriter.write(url + "\n"); // global seedURLs file
    +            siteURLsWriter.write(url + "\n");
    +            }
    +
     
                 if(allowedURLPatternRegex == null) { // entire site can be crawled

                     // since we will only be downloading the single page
     
    -                Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    +                urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
                     for(String urlInDomain : urlsForDomainSet) {
                     // don't append slash to end this time

     
                 }
    -            }
    -
    -            // next write out the urls for the domain into the sites/0000x/seedURLs.txt file
    -            // also write into the global seeds file (with a tab prefixed to each?)
    -            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
    -            for(String url : urlsForDomainSet) {
    -            seedURLsWriter.write(url + "\n"); // global seedURLs file
    -            siteURLsWriter.write(url + "\n");
    -            }
                 }
     
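Reading the two hunks together, the per-domain flow after this changeset writes the domain's individual URLs into the global seedURLs file and the per-site sites/0000x/seedURLs.txt before the allowedURLPatternRegex branch is evaluated. The sketch below only illustrates that reordering; the method signature, writer setup and the handling inside each branch are simplified assumptions, not the full CCWETProcessor code.

    // Simplified sketch of the reordered per-domain logic after r33565;
    // not the full CCWETProcessor method (regex-urlfilter output and error handling omitted).
    import java.io.IOException;
    import java.io.Writer;
    import java.util.Map;
    import java.util.Set;

    public class SeedURLOrderingSketch {
        static void processDomain(String domainWithProtocol,
                                  Map<String, Set<String>> domainsToURLsMap,
                                  String allowedURLPatternRegex,
                                  Writer seedURLsWriter,
                                  Writer siteURLsWriter) throws IOException {
            // 1. first write out the domain's individual URLs: into the global seedURLs
            //    file and into this site's sites/0000x/seedURLs.txt
            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
            for (String url : urlsForDomainSet) {
                seedURLsWriter.write(url + "\n"); // global seedURLs file
                siteURLsWriter.write(url + "\n"); // per-site seedURLs.txt
            }

            // 2. only then decide how the site itself may be crawled
            if (allowedURLPatternRegex == null) {
                // entire site can be crawled: the whole domain goes into regex-urlfilter.txt
            } else {
                // crawling is restricted (e.g. to the listed pages only), so the same
                // urlsForDomainSet is reused to build the site's URL filter patterns
            }
        }
    }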