Timestamp: 2019-11-13T23:08:37+13:00 (4 years ago)
Author: ak19
Message:

Having finished sending all the crawl data to MongoDB:

1. Recrawled the 2 sites I had earlier noted as requiring recrawling, 00152 and 00332. 00152 required changes to how it was crawled: MP3 files needed to be blocked, as there were HBase error messages about key values being too large.
2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too, in the place in the file where jpg etc. were already blocked by Nutch's default regex URL filters (a sketch of the resulting filter rules follows below).
3. Further had to restrict the 00152 site to be crawled only under its /maori/ subdomain. Since the seedURL maori.html was not under a /maori/ URL, this revealed that the CCWETProcessor code didn't yet allow the filters to admit seedURLs in the case where the crawl is restricted to a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but a seedURL doesn't match those restricting regex filters. In such cases CCWETProcessor now adds the non-matching seedURLs to the filters as well (so we fetch just the single page of each such seedURL), in addition to the filter on the requested subdomain, so we still follow all pages linked from seedURLs that do match the subdomain expression.
4. Adding to_crawl.tar.gz to svn: the tarball of the to_crawl sites I actually ran Nutch over, i.e. all the site folders with their seedURLs.txt and regex-urlfilter.txt files that batchcrawl.sh runs over. This doesn't use the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since that latest version was regenerated after the final modifications to CCWETProcessor, which came after crawling was finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and it uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.
5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt results were added into MongoDB).
6. MoreReading/mongodb.txt now contains the results of some queries I ran against the total Nutch-crawled data.
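For illustration only, the combined effect of points 2 and 3 on site 00152's generated regex-urlfilter.txt would look roughly like the sketch below. The domain example.org is a hypothetical stand-in (the real 00152 URL isn't given in this message), and the exact suffix list in Nutch's default blocking rule differs between Nutch versions; the relevant parts are the added mp3|MP3 suffixes and the two "+^" allow rules written by CCWETProcessor.

    # Nutch-style suffix block from regex-urlfilter.GS_TEMPLATE, with mp3 now included
    # (suffix list abbreviated here; the template carries Nutch's full default list)
    -\.(gif|GIF|jpg|JPG|png|PNG|css|CSS|zip|ZIP|exe|EXE|mp3|MP3)$

    # the seedURL lying outside /maori/ is allowed explicitly, so it is fetched as a single page
    +^https://example\.org/maori\.html

    # everything else the crawl follows is restricted to the /maori/ part of the site
    +^https://example\.org/maori/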

File: 1 edited

  • other-projects/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

r33624 → r33666

@@ -57,6 +57,6 @@
  * e.g. (from maori-lang-detection/src)
  *
- *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl
- *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/gs3-extensions/maori-lang-detection/to_crawl 2>&1 | less
+ *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/maori-lang-detection/to_crawl
+ *    - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../ccrawl-data /Scratch/ak19/maori-lang-detection/to_crawl 2>&1 | less
  *
  */

@@ -452,5 +452,5 @@

             // Only write urls and no domain into single global seedurls file
-            // But write domain and tabbed urls into individual sites/0000#/seedURLs.txt
+            // But write domain and tab-spaced urls into individual sites/0000#/seedURLs.txt
             // files (and write regexed domain into each sites/0000#/regex-urlfilter.txt)
             // If we ever run nutch on a single seedURLs listing containing

@@ -515,10 +515,32 @@
                 allowedURLPatternRegex += "/";
                 }
-                String regexed_pattern = PROTOCOL_REGEX_PREFIX+escapeStringForRegex(allowedURLPatternRegex);
+                String regexed_pattern = FILTER_REGEX_PREFIX+escapeStringForRegex(allowedURLPatternRegex);
                 //String regexed_pattern = PROTOCOL_REGEX_PREFIX+allowedURLPatternRegex.replace(".", "\\.");
+
+                // In case any of the seedURLs themselves are not within the
+                // allowedURLPatternRegex part of the site, FIRST write out such
+                // seedURLs as allowed regex patterns, so they get downloaded
+                // as single pages.
+                urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
+                for(String urlInDomain : urlsForDomainSet) {
+
+                    String urlWithoutProtocolAndWWW = Utility.stripProtocolAndWWWFromURL(urlInDomain);
+                    String allowedURLPatternWithoutProtocolAndWWW = Utility.stripProtocolAndWWWFromURL(allowedURLPatternRegex);
+                    if(!urlWithoutProtocolAndWWW.startsWith(allowedURLPatternWithoutProtocolAndWWW)) {
+                        // don't append slash to end this time
+                        String regexed_url = "+^"+escapeStringForRegex(urlInDomain);
+                        urlFilterWriter.write(regexed_url + "\n");
+                        siteRegexWriter.write(regexed_url + "\n");
+                    }
+                }
+
                 siteURLsWriter.write(domainWithProtocol + "\n");
+                // write out allowedURLPatternRegex istead of the domain
+                //siteURLsWriter.write(allowedURLPatternRegex + "\n");
+
+                // Now restrict any other URLs found to be within the allowedURLPattern
+                // part of the site
                 urlFilterWriter.write(regexed_pattern + "\n");
-                siteRegexWriter.write(regexed_pattern + "\n");
-
+                siteRegexWriter.write(regexed_pattern + "\n");
             }
             }
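To make the behaviour added in the last hunk concrete, here is a minimal, self-contained Java sketch of the same idea; it is not the project's code. SeedURLs falling outside the allowedURLPatternRegex are written out first as individual allow rules, then the general restricting pattern follows. The helper methods are simplified, hypothetical stand-ins for Utility.stripProtocolAndWWWFromURL() and escapeStringForRegex(), and the URLs are made-up examples rather than the real 00152 site.

    import java.util.List;

    public class SeedFilterSketch {

        // Simplified stand-in: drop "http(s)://" and a leading "www."
        static String stripProtocolAndWWW(String url) {
            String s = url.replaceFirst("^https?://", "");
            return s.startsWith("www.") ? s.substring(4) : s;
        }

        // Simplified stand-in: escape the regex metacharacters that matter here
        static String escapeForRegex(String s) {
            return s.replace(".", "\\.").replace("?", "\\?");
        }

        public static void main(String[] args) {
            // The part of the site the crawl is restricted to
            // (as listed in sites-too-big-to-exhaustively-crawl)
            String allowedURLPattern = "https://example.org/maori/";
            // Seed URLs for the site; the first one is NOT under /maori/
            List<String> seedURLs = List.of(
                    "https://example.org/maori.html",
                    "https://example.org/maori/index.html");

            StringBuilder regexUrlFilter = new StringBuilder();

            // First: any seedURL outside the allowed pattern becomes its own
            // allow rule, so that single page still gets fetched.
            String allowedStripped = stripProtocolAndWWW(allowedURLPattern);
            for (String seed : seedURLs) {
                if (!stripProtocolAndWWW(seed).startsWith(allowedStripped)) {
                    regexUrlFilter.append("+^").append(escapeForRegex(seed)).append("\n");
                }
            }
            // Then: the general rule restricting the rest of the crawl to /maori/
            regexUrlFilter.append("+^").append(escapeForRegex(allowedURLPattern)).append("\n");

            System.out.print(regexUrlFilter);
            // Prints:
            // +^https://example\.org/maori\.html
            // +^https://example\.org/maori/
        }
    }

Run as-is, the sketch just prints the two rules shown in the trailing comment; in the changeset above, CCWETProcessor writes the equivalent strings to urlFilterWriter and siteRegexWriter instead.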