sites-too-big-to-exhaustively-crawl.txt

Timestamp:

2019-10-10T23:44:31+13:00

Author:

ak19

Message:

1. Special string COPY changed to SUBDOMAIN-COPY after Dr Bainbridge explained why that name describes the behaviour more accurately. 2. Added comments to explain how sites-too-big-to-exhaustively-crawl.txt should be formatted, what values are expected and how they work. 3. Special blacklisting and whitelisting of URLs on yale.edu, coupled with special treatment in the topsites file too.

File:

1 edited

gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (modified) (9 diffs)

gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

-              r33555
+              r33559
+# top sites - base url forms
+# Contains alexa top sites (where only the first 50 were visible)
+# Mapping of top sites in base url forms to value
+# This file contains sites that are too large to crawl exhaustively.
+# The domains are from Alexa top sites (where only the first 50 were visible)
 # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
 # Finally also added https://moz.com/top500 by downloading its CSV file and
 …
 # And finally, re-sorted the reduced list alphabetically and pasted into here.
+# FORMAT OF THIS FILE'S CONTENTS:
+#    <topsite-base-url><tabspace><value>
+# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
+#
+#   - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
+#     unprocessed-topsite-matches.txt and the site/page won't be crawled.
+#     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
+#   - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
+#     For example, if the seedurl is http://docs.google.com/some-long-suffix-in-base64, then it
+#     matches the topsite-base-url of docs.google.com and its value of SINGLEPAGE will add the
+#     seedurl itself as the regex url-filter, to restrict the crawl to just the specified page.
+#   - SUBDOMAIN-COPY: if seedurl CONTAINS topsite-base-url, then the seedurl's own subdomain
+#     (or, failing that, its domain) will make up the urlfilter, so we don't leak out into a larger domain.
+#     Use SUBDOMAIN-COPY to restrict to a domain prefix/subdomain. For example, if seedurl is
+#     pinky.blogspot.com, it will match the topsite-base-url of blogspot.com, but SUBDOMAIN-COPY
+#     will ensure we restrict crawling to pages on pinky.blogspot.com.
+#     However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
+#     into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
+#   - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
+#     url-form-without-protocol will make up the urlfilter, again preventing leaking into a
+#     larger part of the domain. For example, if the seedurl is mi.wikipedia.org/SomePage, it will
+#     match the topsite-base-url of wikipedia.org for which the <url-form-without-protocol>
+#     value is mi.wikipedia.org, which should be all that's accepted for wikipedia.org. The
+#     <url-form-without-protocol> ends up in the regex urlfilter file, thereby restricting the
+#     crawl to just mi.wikipedia.org.
+#     Remember to leave out any protocol from <url-form-without-protocol>.
+docs.google.com  SINGLEPAGE
+drive.google.com    SINGLEPAGE
+forms.office.com    SINGLEPAGE
+player.vimeo.com    SINGLEPAGE
+static-promote.weebly.com   SINGLEPAGE
+# Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
+# The page's containing folder is whitelisted in case the photos are there.
+korora.econ.yale.edu        SINGLEPAGE
 webhost.com
 …
 blackberry.com
 blogger.com
 blogspot.com
+blogspot.com    SUBDOMAIN-COPY
 bloomberg.com
 booking.com
 …
 dreniq.com
 dribbble.com
 dropbox.com
+dropbox.com SINGLEPAGE
 dropboxusercontent.com
 dw.com
 …
 lonelyplanet.com
 lycos.com
 m.wikipedia.org
+m.wikipedia.org mi.m.wikipedia.org
 mail.ru
 marketwatch.com
 …
 merriam-webster.com
 metro.co.uk
 microsoft.com
+microsoft.com   microsoft.com/mi-nz/
 microsoftonline.com
 mirror.co.uk
 …
 photobucket.com
 php.net
 pinterest.com
+pinterest.com   SINGLEPAGE
 pixabay.com
 playstation.com
 …
 stores.jp
 storify.com
 stuff.co.nz
+stuff.co.nz SINGLEPAGE
 surveymonkey.com
 symantec.com
 …
 wikihow.com
 wikimedia.org
 wikipedia.org
 wiktionary.org
+wikipedia.org   mi.wikipedia.org
+wiktionary.org  mi.wiktionary.org
 wiley.com
 windowsphone.com
 wired.com
 wix.com
 wordpress.org
+wordpress.org   SUBDOMAIN-COPY
 worldbank.org
 wp.com
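
A minimal sketch, in Python, of how the value column described in the file's comments might be interpreted for a given seed URL. This is not the project's implementation: the function names `parse_topsites` and `decide`, the action labels, the longest-match-first ordering, and the behaviour when no entry matches are all assumptions made for illustration. Converting the returned string into an actual regex urlfilter line is deliberately left out.

```python
def parse_topsites(text):
    """Parse '<topsite-base-url><whitespace><value>' lines into a dict.

    Comment lines (starting with '#') and blank lines are skipped.
    A missing value is stored as the empty string.
    """
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        entries[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return entries


def decide(seed_url, entries):
    """Return (action, url_filter) for a seed URL.

    Action labels are hypothetical:
      UNPROCESSED - seed would be listed in unprocessed-topsite-matches.txt
      CRAWL       - crawl, restricted to url_filter
      NO_MATCH    - seed is not covered by any entry (handling is assumed)
    """
    without_protocol = seed_url.split("://", 1)[-1]
    host = without_protocol.split("/", 1)[0]
    # Assumption: try longer base urls first so that docs.google.com
    # is preferred over a shorter entry such as google.com.
    for base in sorted(entries, key=len, reverse=True):
        if base not in without_protocol:
            continue
        value = entries[base]
        if value == "":
            return ("UNPROCESSED", None)
        if value == "SINGLEPAGE":
            # Restrict the crawl to exactly the seed page.
            return ("CRAWL", without_protocol)
        if value == "SUBDOMAIN-COPY":
            if host == base:
                # Exact match on the top site itself: do not crawl.
                return ("UNPROCESSED", None)
            return ("CRAWL", host)
        # Otherwise the value is a <url-form-without-protocol>.
        return ("CRAWL", value)
    return ("NO_MATCH", None)


if __name__ == "__main__":
    sample = (
        "docs.google.com\tSINGLEPAGE\n"
        "blogspot.com\tSUBDOMAIN-COPY\n"
        "bloomberg.com\n"
        "wikipedia.org\tmi.wikipedia.org\n"
    )
    entries = parse_topsites(sample)
    for seed in ("https://pinky.blogspot.com", "https://mi.wikipedia.org/SomePage"):
        print(seed, decide(seed, entries))
```

Plain substring matching mirrors the "contains" wording in the comments, but it can match more than intended (a base url appearing in a path, for example), so a real implementation may want to compare against the host only.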
