Timestamp: 2019-10-11T20:49:05+13:00
Author: ak19
Message:
  1. sites-too-big-to-exhaustively-crawl.txt is now a comma-separated list.
  2. After the discussion with Dr Bainbridge that SINGLEPAGE is not what we want for docs.google.com, I found that the tentative switch to SUBDOMAIN-COPY for docs.google.com will not work, precisely because of the important change we had to make yesterday: with SUBDOMAIN-COPY, only subdomains are copied, not root domains. If a root domain is paired with SUBDOMAIN-COPY, the seedURL gets written out to unprocessed-topsite-matches.txt and its site doesn't get crawled.
  3. This revealed a lacuna in the possible values for sites-too-big-to-exhaustively-crawl.txt, so I had to invent a new value, which I introduce and have tested with this commit: FOLLOW-LINKS-WITHIN-TOPSITE. So far this value applies only to docs.google.com; it keeps following any links originating in a seedURL on docs.google.com, but only as long as they stay within that topsite domain (docs.google.com).
  4. Tidied some old-fashioned use of Iterator, replacing it with the newer style of for loop that works with types. Committing before updating the code to use the Apache CSV API.
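Point 4 describes replacing raw, untyped Iterator loops with generics-aware for-each loops. The following is a minimal sketch of that kind of tidy-up; the class and method names are illustrative only, not taken from the actual maori-lang-detection source:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class IteratorTidyup {

    // Old-fashioned style: raw Iterator with an explicit cast on each element.
    static String joinOldStyle(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        for (Iterator it = urls.iterator(); it.hasNext(); ) {
            sb.append((String) it.next()).append('\n');
        }
        return sb.toString();
    }

    // Newer style: a typed for-each loop, no Iterator boilerplate, no casts.
    static String joinNewStyle(List<String> urls) {
        StringBuilder sb = new StringBuilder();
        for (String url : urls) {
            sb.append(url).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("docs.google.com", "blogspot.com");
        // Both styles produce identical output; only the loop form changes.
        if (!joinOldStyle(urls).equals(joinNewStyle(urls))) {
            throw new AssertionError("styles disagree");
        }
        System.out.println("ok");
    }
}
```

The behaviour is unchanged; the refactor only removes the raw-type Iterator and its casts.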
File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33559 r33561

     # FORMAT OF THIS FILE'S CONTENTS:
    -#    <topsite-base-url><tabspace><value>
    +#    <topsite-base-url>,<value>
     # where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
     #
    …
     #     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
     #     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
    +#   - FOLLOW-LINKS-WITHIN-TOPSITE: if pages linked from the seedURL page can be followed and
    +#     downloaded, as long as it's within the same subdomain matching the topsite-base-url.
    +#     This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
    +#     restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
    +#     else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
    +#     depth specified for the nutch crawl) as long as they're within the topsite-base-url.
    +#     e.g. seedURLs on docs.google.com containing links will have those linked pages and any
    +#     they link to etc. downloaded as long as they're on docs.google.com.
     #   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
     #     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
    …
     #     Remember to leave out any protocol <from url-form-without-protocol>.

    -
    -
    -docs.google.com  SINGLEPAGE
    -drive.google.com    SINGLEPAGE
    -forms.office.com    SINGLEPAGE
    -player.vimeo.com    SINGLEPAGE
    -static-promote.weebly.com   SINGLEPAGE
    +# column 3: whether nutch should do fetch all or not
    +# column 4: number of crawl iterations
    +
    +# docs.google.com is a special case: not all pages are public and any interlinking is likely to
    +# be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com
    +# which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that
    +# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file.
    +#docs.google.com,SUBDOMAIN-COPY
    +docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
    +
    +drive.google.com,SINGLEPAGE
    +forms.office.com,SINGLEPAGE
    +player.vimeo.com,SINGLEPAGE
    +static-promote.weebly.com,SINGLEPAGE

     # Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
     # The page's containing folder is whitelisted in case the photos are there.
    -korora.econ.yale.edu        SINGLEPAGE
    +korora.econ.yale.edu,,SINGLEPAGE

     000webhost.com
    …
     blackberry.com
     blogger.com
    -blogspot.com    SUBDOMAIN-COPY
    +blogspot.com,SUBDOMAIN-COPY
     bloomberg.com
     booking.com
    …
     dreniq.com
     dribbble.com
    -dropbox.com SINGLEPAGE
    +dropbox.com,SINGLEPAGE
     dropboxusercontent.com
     dw.com
    …
     lonelyplanet.com
     lycos.com
    -m.wikipedia.org mi.m.wikipedia.org
    +m.wikipedia.org,mi.m.wikipedia.org
     mail.ru
     marketwatch.com
    …
     merriam-webster.com
     metro.co.uk
    -microsoft.com   microsoft.com/mi-nz/
    +microsoft.com,microsoft.com/mi-nz/
     microsoftonline.com
     mirror.co.uk
    …
     photobucket.com
     php.net
    -pinterest.com   SINGLEPAGE
    +pinterest.com,SINGLEPAGE
     pixabay.com
     playstation.com
    …
     stores.jp
     storify.com
    -stuff.co.nz SINGLEPAGE
    +stuff.co.nz,SINGLEPAGE
     surveymonkey.com
     symantec.com
    …
     wikihow.com
     wikimedia.org
    -wikipedia.org   mi.wikipedia.org
    -wiktionary.org  mi.wiktionary.org
    +wikipedia.org,mi.wikipedia.org
    +wiktionary.org,mi.wiktionary.org
     wiley.com
     windowsphone.com
     wired.com
     wix.com
    -wordpress.org   SUBDOMAIN-COPY
    +wordpress.org,SUBDOMAIN-COPY
     worldbank.org
     wp.com
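The comma-separated format and the FOLLOW-LINKS-WITHIN-TOPSITE semantics introduced in this changeset could be consumed along the following lines. This is a hypothetical sketch, not the project's actual implementation: the class and method names are invented, and the plain String.split parsing is an assumption (the commit message notes the code was about to move to the Apache CSV API):

```java
import java.util.HashMap;
import java.util.Map;

public class TopsiteRules {
    final Map<String, String> rules = new HashMap<>();

    // Parse one line of sites-too-big-to-exhaustively-crawl.txt:
    // <topsite-base-url>,<value> with '#' comment lines skipped.
    void addLine(String line) {
        line = line.trim();
        if (line.isEmpty() || line.startsWith("#")) return;
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        String value = fields.length > 1 ? fields[1] : "";
        rules.put(fields[0], value);
    }

    // Decide whether a link found on a seed page may be followed.
    boolean mayFollow(String seedDomain, String linkDomain) {
        String value = rules.get(seedDomain);
        if ("FOLLOW-LINKS-WITHIN-TOPSITE".equals(value)) {
            // Follow links, but only within the same topsite domain.
            return linkDomain.equals(seedDomain)
                || linkDomain.endsWith("." + seedDomain);
        }
        if ("SINGLEPAGE".equals(value)) {
            return false; // only the seed page itself is fetched
        }
        return true; // no special restriction modelled in this sketch
    }

    public static void main(String[] args) {
        TopsiteRules r = new TopsiteRules();
        r.addLine("docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE");
        r.addLine("drive.google.com,SINGLEPAGE");
        System.out.println(r.mayFollow("docs.google.com", "docs.google.com"));  // true
        System.out.println(r.mayFollow("docs.google.com", "drive.google.com")); // false
        System.out.println(r.mayFollow("drive.google.com", "drive.google.com")); // false
    }
}
```

This captures the distinction drawn in the file's comments: SINGLEPAGE stops at the seed page, while FOLLOW-LINKS-WITHIN-TOPSITE keeps following links so long as they remain on the topsite-base-url's domain.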