Changeset 33561 for gs3-extensions/maori-lang-detection/conf
Timestamp: 2019-10-11T20:49:05+13:00 (5 years ago)
File: 1 edited
gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt
--- gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33559)
+++ gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33561)
@@ -12,5 +12,5 @@
 
 # FORMAT OF THIS FILE'S CONTENTS:
-# <topsite-base-url> <tabspace><value>
+# <topsite-base-url>,<value>
 # where <value> can be empty or one of SUBDOMAIN-COPY, SINGLEPAGE, <url-form-without-protocol>
 #
@@ -29,4 +29,12 @@
 # However, if the seedurl's domain is an exact match on topsite-base-url, the seedurl will go
 # into the file unprocessed-topsite-matches.txt and the site/page won't be crawled.
+# - FOLLOW-LINKS-WITHIN-TOPSITE: if pages linked from the seedURL page can be followed and
+# downloaded, as long as it's within the same subdomain matching the topsite-base-url.
+# This is different from SUBDOMAIN-COPY, as that can download all of a specific subdomain but
+# restricts against downloading the entire domain (e.g. all pinky.blogspot.com and not anything
+# else within blogspot.com). FOLLOW-LINKS-WITHIN-TOPSITE can download all linked pages (at
+# depth specified for the nutch crawl) as long as they're within the topsite-base-url.
+# e.g. seedURLs on docs.google.com containing links will have those linked pages and any
+# they link to etc. downloaded as long as they're on docs.google.com.
 # - <url-form-without-protocol>: if a seedurl contains topsite-base-url, then the provided
 # url-form-without-protocol will make up the urlfilter, again preventing leaking into a
@@ -38,15 +46,22 @@
 # Remember to leave out any protocol <from url-form-without-protocol>.
 
-
-
-docs.google.com SINGLEPAGE
-drive.google.com SINGLEPAGE
-forms.office.com SINGLEPAGE
-player.vimeo.com SINGLEPAGE
-static-promote.weebly.com SINGLEPAGE
+# column 3: whether nutch should do fetch all or not
+# column 4: number of crawl iterations
+
+# docs.google.com is a special case: not all pages are public and any interlinking is likely to
+# be intentional. But SUBDOMAIN-COPY does not work: as seedURL's domain becomes docs.google.com
+# which, when combined with SUBDOMAIN-COPY, the Java code treats as a special case so that
+# any seedURL on docs.google.com ends up pushed out into the "unprocessed....txt" text file.
+#docs.google.com,SUBDOMAIN-COPY
+docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
+
+drive.google.com,SINGLEPAGE
+forms.office.com,SINGLEPAGE
+player.vimeo.com,SINGLEPAGE
+static-promote.weebly.com,SINGLEPAGE
 
 # Special case of yale.edu: its Rapa-Nui pages are on blacklist, but we want this page + its photos
 # The page's containing folder is whitelisted in case the photos are there.
-korora.econ.yale.edu
+korora.econ.yale.edu,,SINGLEPAGE
 
 000webhost.com
@@ -115,5 +130,5 @@
 blackberry.com
 blogger.com
-blogspot.com
+blogspot.com,SUBDOMAIN-COPY
 bloomberg.com
 booking.com
@@ -171,5 +186,5 @@
 dreniq.com
 dribbble.com
-dropbox.com
+dropbox.com,SINGLEPAGE
 dropboxusercontent.com
 dw.com
@@ -303,5 +318,5 @@
 lonelyplanet.com
 lycos.com
-m.wikipedia.org
+m.wikipedia.org,mi.m.wikipedia.org
 mail.ru
 marketwatch.com
@@ -315,5 +330,5 @@
 merriam-webster.com
 metro.co.uk
-microsoft.com
+microsoft.com,microsoft.com/mi-nz/
 microsoftonline.com
 mirror.co.uk
@@ -382,5 +397,5 @@
 photobucket.com
 php.net
-pinterest.com
+pinterest.com,SINGLEPAGE
 pixabay.com
 playstation.com
@@ -456,5 +471,5 @@
 stores.jp
 storify.com
-stuff.co.nz
+stuff.co.nz,SINGLEPAGE
 surveymonkey.com
 symantec.com
@@ -534,11 +549,11 @@
 wikihow.com
 wikimedia.org
-wikipedia.org
-wiktionary.org
+wikipedia.org,mi.wikipedia.org
+wiktionary.org,mi.wiktionary.org
 wiley.com
 windowsphone.com
 wired.com
 wix.com
-wordpress.org
+wordpress.org,SUBDOMAIN-COPY
 worldbank.org
 wp.com
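The comma-separated layout this changeset introduces can be read with a few lines of code. The following is only a rough sketch in Python, not the project's Java reader: the names TopsiteEntry, parse_line, parse_entries and the labels "EMPTY" and "URL-FORM" are hypothetical and chosen for illustration. It assumes that lines starting with # and blank lines are skipped, that column 1 is the topsite-base-url, that column 2 is the value, and that any further columns (the file's comments mention a column 3 and a column 4) are kept uninterpreted.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Keyword values named in the file's header comments.
KEYWORDS = {"SUBDOMAIN-COPY", "SINGLEPAGE", "FOLLOW-LINKS-WITHIN-TOPSITE"}


@dataclass
class TopsiteEntry:
    """One non-comment line of the file (hypothetical helper type)."""
    base_url: str
    value: str = ""
    extra: List[str] = field(default_factory=list)  # columns 3+, left uninterpreted

    @property
    def kind(self) -> str:
        # "EMPTY" and "URL-FORM" are labels invented for this sketch.
        if not self.value:
            return "EMPTY"
        if self.value in KEYWORDS:
            return self.value
        return "URL-FORM"


def parse_line(line: str) -> Optional[TopsiteEntry]:
    """Return an entry, or None for blank lines and # comments."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    columns = [c.strip() for c in stripped.split(",")]
    value = columns[1] if len(columns) > 1 else ""
    return TopsiteEntry(columns[0], value, columns[2:])


def parse_entries(text: str) -> List[TopsiteEntry]:
    """Parse a whole file's text, dropping comments and blank lines."""
    parsed = (parse_line(line) for line in text.splitlines())
    return [entry for entry in parsed if entry is not None]


# A few lines taken from the new revision of the file.
SAMPLE = """\
#docs.google.com,SUBDOMAIN-COPY
docs.google.com,FOLLOW-LINKS-WITHIN-TOPSITE
drive.google.com,SINGLEPAGE
korora.econ.yale.edu,,SINGLEPAGE
000webhost.com
microsoft.com,microsoft.com/mi-nz/
"""

if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry.base_url, entry.kind, entry.extra)
```

Under these assumptions the commented-out docs.google.com line is ignored, and korora.econ.yale.edu,,SINGLEPAGE comes back with an empty column 2 and SINGLEPAGE in the extra columns, since the sketch does not guess what the third column is meant to hold.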