Changeset 33553


Timestamp:
2019-10-04T22:19:20+13:00 (5 years ago)
Author:
ak19
Message:

Comments

File:
1 edited

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

    r33551 r33553
    @@ -1,9 +1,17 @@
    +# URL blacklist
    +# FORMAT:
    +# precede URL by ^ to blacklist urls that match the given prefix
    +# succeed URL by $ to blacklist urls that match the given suffix
    +# ^url$ will blacklist urls that match the given url completely
    +# Without either ^ or $ symbol, urls containing the given url will get blacklisted
     
    -# Add alexa top sites (only 50 visible)
    +# Contains alexa top sites (where only the first 50 were visible)
     # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
    -## Finally also got the CSV from https://moz.com/top500 and added it to the list and added them in.
    +# Finally also added https://moz.com/top500 by downloading its CSV file and
    +# adding its URLs to the existing listing here from alexa/wiki.
     # Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
    -# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext to keep just <site>.ext
    -# And resorted alphabetically
    +# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
    +# just <site>.ext
    +# And finally, re-sorted the reduced list alphabetically and pasted into here.
     
     
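
The FORMAT comments added in this changeset describe four matching modes (prefix, suffix, exact, and substring). The following is a minimal sketch of how a reader of this file might apply those rules. It is not the crawler's actual code: the function name is_blacklisted and the example.com URLs are hypothetical, and skipping blank lines and lines starting with # is an assumption based on the file's comment style.

```python
def is_blacklisted(url: str, patterns: list[str]) -> bool:
    """Return True if url matches any pattern under the FORMAT rules above."""
    for raw in patterns:
        pattern = raw.strip()
        # Assumption: blank lines and '#' comment lines are not patterns.
        if not pattern or pattern.startswith("#"):
            continue
        if pattern.startswith("^") and pattern.endswith("$"):
            # ^url$ : the whole URL must match exactly.
            if url == pattern[1:-1]:
                return True
        elif pattern.startswith("^"):
            # ^url : the URL must start with the given prefix.
            if url.startswith(pattern[1:]):
                return True
        elif pattern.endswith("$"):
            # url$ : the URL must end with the given suffix.
            if url.endswith(pattern[:-1]):
                return True
        elif pattern in url:
            # No marker: the URL only needs to contain the text.
            return True
    return False


if __name__ == "__main__":
    rules = ["# hypothetical entries", "^https://example.com", ".pdf$"]
    print(is_blacklisted("https://example.com/page", rules))
    print(is_blacklisted("http://example.org/index.html", rules))
```

Plain string comparison is used here rather than regular expressions, since the comments describe ^ and $ only as prefix and suffix markers; whether the real consumer treats entries as full regexes is not stated in this changeset.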
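
The comments in the diff also describe how the list was prepared by hand: sort, remove duplicates, reduce <subdomain>.<site>.ext entries to <site>.ext, then re-sort. A rough scripted equivalent might look like the sketch below. The author used LibreOffice Calc and Gedit, not this code; the function names are hypothetical, and treating <site>.ext as the last two dot-separated labels is a simplifying assumption that would mishandle suffixes such as co.nz. Lowercasing before comparison is likewise an assumption.

```python
def reduce_to_site(host: str) -> str:
    """Keep only the last two dot-separated labels, e.g. mail.example.org -> example.org.

    Assumption: <site>.ext is always exactly two labels.
    """
    labels = host.strip().lower().split(".")
    return ".".join(labels[-2:])


def clean_site_list(hosts: list[str]) -> list[str]:
    """Strip subdomains, drop duplicates, and sort alphabetically."""
    reduced = {reduce_to_site(h) for h in hosts if h.strip()}
    return sorted(reduced)


if __name__ == "__main__":
    # Hypothetical input, standing in for the merged alexa/wiki/moz listing.
    sample = ["www.example.com", "example.com", "mail.example.org", "Example.com"]
    for site in clean_site_list(sample):
        print(site)
```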