root/other-projects/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt @ 33666

Revision 33666, 11.2 KB (checked in by ak19, 2 months ago)

Having finished sending all the crawl data to mongodb:
1. Recrawled the 2 sites which I had earlier noted required recrawling, 00152 and 00332. 00152 required changes to how it needed to be crawled: MP3 files needed to be blocked, as there were HBase error messages about key values being too large.
2. Modified the regex-urlfilter.GS_TEMPLATE file to block mp3 files in general for future crawls too (in the location of the file where jpg etc. were already blocked by nutch's default regex url filters).
3. Further had to restrict the 00152 site to only be crawled under its /maori/ sub-domain. Since the seedURL maori.html was not off a /maori/ url, this revealed that the CCWETProcessor code didn't yet allow the filters to accept seedURLs in cases where the crawl was restricted to a subdomain (as expressed in the conf/sites-too-big-to-exhaustively-crawl file) but the seedURL didn't match those controlled regex filters. In such cases, CCWETProcessor now adds the non-matching seedURLs to the filters too (so we get just the single page of each such seedURL), alongside a filter on the requested subdomain, so that we follow all pages linked from seedURLs that do match the subdomain expression.
4. Added to_crawl.tar.gz to svn: the tarball of the to_crawl sites that I actually ran nutch over, i.e. all the site folders with the seedURL.txt and regex-urlfilter.txt files that the runs used. This isn't the latest version of the sites folder and blacklist/whitelist files generated by CCWETProcessor, since the latest version was regenerated after the final modifications to CCWETProcessor, which came after crawling had finished. But to_crawl.tar.gz does have a manually modified 00152, with the correct regex-urlfilter file, and uses the newer regex-urlfilter.GS_TEMPLATE file that blocks mp3 files.
5. crawledNode6.tar.gz now contains the dump output for sites 00152 and 00332, which were crawled on node6 today (after which their processed dump.txt file results were added into MongoDB).
7. MoreReading/mongodb.txt now contains the results of some queries I ran against the total nutch-crawled data.
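Point 2 above relies on nutch's suffix-based URL filtering: the stock conf/regex-urlfilter.txt rejects URLs ending in known binary extensions, so blocking mp3 amounts to extending that suffix group. A shortened sketch of what the amended GS_TEMPLATE rule might look like (the actual line lists many more suffixes):

```
# skip URLs ending in common image/media suffixes; mp3/MP3 added
# alongside the types nutch's default filters already block
-\.(gif|GIF|jpg|JPG|png|PNG|mpg|mov|MOV|exe|mp3|MP3)$
```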

# Mapping of top sites in base url form to a value
# This file contains sites that are too large to crawl exhaustively.
# The domains are from Alexa top sites (where only the first 50 were visible)
# Added further top sites from
# Finally also added by downloading its CSV file and
# adding its URLs to the existing listing here from alexa/wiki.
# Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
# just <site>.ext
# And finally, re-sorted the reduced list alphabetically and pasted it in here.
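The manual cleanup steps above (strip subdomain variants, sort, dedupe) can be sketched in a few lines of Python. This is illustrative only: the real cleanup was done by hand in Calc and Gedit, and the naive keep-the-last-two-labels rule below would mishandle multi-part TLDs like .co.nz.

```python
import re

def reduce_to_base(domains):
    """Collapse <subdomain>.<site>.ext variants down to <site>.ext,
    then sort alphabetically and drop duplicates, mirroring the
    manual LibreOffice/Gedit steps described above."""
    # strip any leading labels so that only the last two remain
    reduced = (re.sub(r'^(?:[^.]+\.)+(?=[^.]+\.[^.]+$)', '', d)
               for d in domains)
    return sorted(set(reduced))
```

For example, `reduce_to_base(["news.example.com", "example.com", "blog.example.com"])` collapses all three variants to the single entry `example.com`.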
#    <topsite-base-url>,<value>
# where <value> is one of:
#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
#   - if value is left empty: if a seedurl contains topsite-base-url, the seedurl will go into the
#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
#   - SINGLEPAGE: if a seedurl matches topsite-base-url, then only download the page at that seedurl.
#     For example, if the seedurl is, then it
#     matches the topsite-base-url of and its value of SINGLEPAGE will add the
#     seedurl itself as the regex url-filter, restricting the crawl to just the specified page.
#   - SUBDOMAIN-COPY: if a seedurl CONTAINS topsite-base-url, then whatever the seedurl's subdomain
#     (or else domain) is will make up the urlfilter, so we don't leak out into a larger domain.
#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if the seedurl is
#, it will match the topsite-base-url of, but SUBDOMAIN-COPY
#     will ensure we restrict crawling to pages on
#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
#   - FOLLOW-LINKS-WITHIN-TOPSITE: download the seedURL pages, and pages linked from each seedURL
#     page should be followed and downloaded too, as long as they're within the same subdomain
#     matching the topsite-base-url.
#     This is different from SUBDOMAIN-COPY: that can download all of a specific subdomain but
#     restricts against downloading the entire domain (all of one subdomain and nothing else),
#     whereas FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at the
#     depth specified for the nutch crawl) as long as they're within the topsite-base-url.
#     e.g. seedURLs on containing links will have those linked pages, and any pages
#     they link to etc., downloaded as long as they're on
#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
#     larger part of the domain. For example, if the seedurl is, it will
#     match the topsite-base-url of for which the <url-form-without-protocol>
#     value is, which should be all that's accepted for The
#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
#     crawl to just
#     Remember to leave out the protocol from <url-form-without-protocol>.
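As an illustration of the rules above, the mapping from each <value> to a regex url-filter line could be sketched as below. The function name and the exact filter syntax are hypothetical, not the actual CCWETProcessor code.

```python
import re
from urllib.parse import urlparse

def filter_for(seedurl, value):
    """Sketch: translate a topsite <value> into a regex url-filter line
    (or None when the seedurl should not be crawled automatically)."""
    if value == "":
        # goes into unprocessed-topsite-matches.txt; not crawled
        return None
    if value == "SINGLEPAGE":
        # restrict the crawl to exactly the seed page
        return "+^" + re.escape(seedurl) + "$"
    if value == "SUBDOMAIN-COPY":
        # keep whatever subdomain (or domain) the seedurl itself uses
        host = urlparse(seedurl).netloc
        return "+^https?://" + re.escape(host) + "/"
    if value == "FOLLOW-LINKS-WITHIN-TOPSITE":
        # handled at crawl time: follow links while they stay on the topsite
        return None
    # otherwise <value> is a <url-form-without-protocol>
    return "+^https?://" + re.escape(value)
```

For instance, `filter_for("http://a.b/c", "SUBDOMAIN-COPY")` yields a filter anchored on the host `a.b`, so the crawl can't leak out into a larger domain.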
# TODO if useful:
#   column 3: whether nutch should do fetch all or not
#   column 4: number of crawl iterations
# May be a large site with only seedURLs of real relevance
,SINGLEPAGE
,SINGLEPAGE
# 2 pages of declarations of human rights in Maori, rest in other languages
,SINGLEPAGE
# special case
,SINGLEPAGE
# we want the seed URL but also
# pages within the following subsection
,
# is a special case: not all pages are public and any interlinking is likely to
# be intentional. Grab all linked pages, for the link depth set with nutch's crawl, as long as the
# links are within the given topsite-base-url
,FOLLOW-LINKS-WITHIN-TOPSITE
# Just crawl a single page for these:
,SINGLEPAGE
,SINGLEPAGE
,SINGLEPAGE
,SINGLEPAGE
# Special case: its Rapa-Nui pages are on the blacklist, but we want this page + its photos
# The page's containing folder is whitelisted in case the photos are there.
,SINGLEPAGE