Timestamp:
2019-10-11T21:52:40+13:00
Author:
ak19
Message:
  1. The sites-too-big-to-exhaustively-crawl.txt file is now a CSV file of a semi-custom format, and
     the Java code now uses the Apache Commons CSV jar file (v1.7 for Java 8) to parse its contents.
  2. Tidied up the code to reuse a single reference to the ClassLoader.
File:
1 edited
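
The changeset message notes that the Java code now parses this file with Apache Commons CSV 1.7. The following is a minimal sketch of how such parsing might look; it is not the code from this revision, and the class name, resource path and map-based return type are assumptions made for illustration only.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.nio.charset.StandardCharsets;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.commons.csv.CSVFormat;
    import org.apache.commons.csv.CSVParser;
    import org.apache.commons.csv.CSVRecord;

    public class TopSitesReader {

        /**
         * Reads sites-too-big-to-exhaustively-crawl.txt off the classpath and returns a map
         * of topsite-base-url to its (possibly empty) value column.
         * The ClassLoader is passed in so callers can reuse a single reference to it.
         */
        public static Map<String, String> loadTopSitesMap(ClassLoader classLoader) throws IOException {
            Map<String, String> topSitesMap = new HashMap<>();

            // Lines starting with '#' are comments; empty lines and surrounding whitespace are skipped.
            CSVFormat format = CSVFormat.DEFAULT
                    .withCommentMarker('#')
                    .withIgnoreEmptyLines(true)
                    .withTrim(true);

            InputStream in = classLoader.getResourceAsStream("sites-too-big-to-exhaustively-crawl.txt");
            if (in == null) {
                throw new IOException("sites-too-big-to-exhaustively-crawl.txt not found on the classpath");
            }
            try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
                 CSVParser parser = format.parse(reader)) {

                for (CSVRecord record : parser) {
                    String topsiteBaseURL = record.get(0);
                    // Column 2 is optional: empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE,
                    // SINGLEPAGE, or a url-form-without-protocol.
                    String value = record.size() > 1 ? record.get(1) : "";
                    topSitesMap.put(topsiteBaseURL, value);
                }
            }
            return topSitesMap;
        }
    }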

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

--- gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33561)
+++ gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33562)
@@ -13,8 +13,9 @@
 # FORMAT OF THIS FILE'S CONTENTS:
 #    <topsite-base-url>,<value>
-# where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
+# where <value> can or is one of
+#    empty, SUBDOMAIN-COPY, FOLLOW-LINKS-WITHIN-TOPSITE, SINGLEPAGE, <url-form-without-protocol>
 #
-#   - if value is empty: if seedurl contains topsite-base-url, the seedurl will go into the file
-#     unprocessed-topsite-matches.txt and the site/page won't be crawled.
+#   - if value is left empty: if seedurl contains topsite-base-url, the seedurl will go into the
+#     file unprocessed-topsite-matches.txt and the site/page won't be crawled.
 #     The user will be notified to inspect the file unprocessed-topsite-matches.txt.
 #   - SINGLEPAGE: if seedurl matches topsite-base-url, then only download the page at that seedurl.
@@ -45,15 +46,15 @@
 #     crawl to just mi.wikipedia.org.
 #     Remember to leave out any protocol <from url-form-without-protocol>.
-
-# column 3: whether nutch should do fetch all or not
-# column 4: number of crawl iterations
+#
+# TODO If useful:
+#   column 3: whether nutch should do fetch all or not
+#   column 4: number of crawl iterations
 
 # docs.google.com is a special case: not all pages are public and any interlinking is likely to
-# be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com
-# which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that
-# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file.
-#docs.google.com,SUBDOMAIN-COPY
+# be intentional. Grab all linked pages, for link depth set with nutch's crawl, as long as the
+# links are within the given topsite-base-url
 docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
 
+# Just crawl a single page for these:
 drive.google.com,SINGLEPAGE
 forms.office.com,SINGLEPAGE
@@ -63,5 +64,5 @@
 # Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
 # The page's containing folder is whitelisted in case the photos are there.
-korora.econ.yale.edu,,SINGLEPAGE
+korora.econ.yale.edu,SINGLEPAGE
 
 000webhost.com
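
The comments in the revised file above describe what each <value> means for a seed URL that falls under a known topsite-base-url. The sketch below is a hypothetical illustration of branching on those values; the enum and method names are invented here and are not taken from the project's code.

    public class TopSiteValueSketch {

        /** Possible handling decisions, mirroring the values documented in the file's comments. */
        enum Action {
            UNPROCESSED,                 // empty value: log to unprocessed-topsite-matches.txt, don't crawl
            SINGLE_PAGE,                 // SINGLEPAGE: download only the page at the seed URL
            SUBDOMAIN_COPY,              // SUBDOMAIN-COPY: restrict the crawl to the seed URL's subdomain
            FOLLOW_LINKS_WITHIN_TOPSITE, // FOLLOW-LINKS-WITHIN-TOPSITE: follow links that stay within the topsite
            CONSTRAIN_TO_URL_FORM        // anything else: a url-form-without-protocol to constrain the crawl to
        }

        static Action actionFor(String value) {
            if (value == null || value.isEmpty()) {
                return Action.UNPROCESSED;
            }
            switch (value) {
                case "SINGLEPAGE":
                    return Action.SINGLE_PAGE;
                case "SUBDOMAIN-COPY":
                    return Action.SUBDOMAIN_COPY;
                case "FOLLOW-LINKS-WITHIN-TOPSITE":
                    return Action.FOLLOW_LINKS_WITHIN_TOPSITE;
                default:
                    return Action.CONSTRAIN_TO_URL_FORM;
            }
        }
    }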