Changeset 33553

Timestamp: 04.10.2019 22:19:20 (13 days ago)
Author: ak19
Message: Comments
Files: 1 modified

  • gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt

--- gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33551)
+++ gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt (r33553)
@@ -1,9 +1,17 @@
+# URL blacklist
+# FORMAT:
+# precede URL by ^ to blacklist urls that match the given prefix
+# succeed URL by $ to blacklist urls that match the given suffix
+# ^url$ will blacklist urls that match the given url completely
+# Without either ^ or $ symbol, urls containing the given url will get blacklisted
 
-# Add alexa top sites (only 50 visible)
+# Contains alexa top sites (where only the first 50 were visible)
 # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
-## Finally also got the CSV from https://moz.com/top500 and added it to the list and added them in.
+# Finally also added https://moz.com/top500 by downloading its CSV file and
+# adding its URLs to the existing listing here from alexa/wiki.
 # Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
-# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext to keep just <site>.ext
-# And resorted alphabetically
+# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
+# just <site>.ext
+# And finally, re-sorted the reduced list alphabetically and pasted into here.
 
 
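
The four matching rules described in the FORMAT comments (prefix with ^, suffix with $, exact match with ^url$, substring otherwise) could be interpreted along the following lines. This is only a sketch, not the code that actually reads this file: the function names parse_rule, load_rules and is_blacklisted are hypothetical, and it assumes that rules are compared against the URL string exactly as written, that lines starting with # are comments, and that blank lines are skipped.

```python
def parse_rule(line):
    """Turn one line of the blacklist file into (kind, text), or None.

    Assumption: lines beginning with '#' are comments and blank lines
    carry no rule, as the header of the file suggests.
    """
    rule = line.strip()
    if not rule or rule.startswith("#"):
        return None
    starts = rule.startswith("^")
    ends = rule.endswith("$")
    if starts:
        rule = rule[1:]
    if ends:
        rule = rule[:-1]
    if not rule:
        return None  # a bare '^', '$' or '^$' matches nothing useful
    if starts and ends:
        kind = "exact"
    elif starts:
        kind = "prefix"
    elif ends:
        kind = "suffix"
    else:
        kind = "contains"
    return kind, rule


def load_rules(lines):
    """Parse an iterable of lines, dropping comments and blanks."""
    parsed = (parse_rule(line) for line in lines)
    return [rule for rule in parsed if rule is not None]


def is_blacklisted(url, rules):
    """Return True if any rule matches the URL."""
    for kind, text in rules:
        if kind == "exact" and url == text:
            return True
        if kind == "prefix" and url.startswith(text):
            return True
        if kind == "suffix" and url.endswith(text):
            return True
        if kind == "contains" and text in url:
            return True
    return False
```

With rules such as "^http://a.com", ".pdf$" and "facebook.com", a call like is_blacklisted("https://m.facebook.com/foo", rules) would return True through the substring rule, which is why bare <site>.ext entries also cover subdomain variants.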
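
The clean-up described in the comments (sort, remove duplicates, reduce <subdomain>.<site>.ext to <site>.ext, re-sort) was done by hand in LibreOffice Calc and Gedit, and the regex actually used is not shown in the changeset. A scripted approximation might look like the sketch below. The pattern LAST_TWO_LABELS and the function names are hypothetical, and the reduction naively keeps only the last two dot-separated labels, so a host such as bbc.co.uk would be reduced to co.uk and would need separate handling.

```python
import re

# Hypothetical pattern: capture the last two dot-separated labels of a host name.
LAST_TWO_LABELS = re.compile(r"^(?:[^.]+\.)*([^.]+\.[^.]+)$")


def reduce_to_site(host):
    """Reduce <subdomain>.<site>.ext to <site>.ext (naive two-label rule)."""
    host = host.strip().lower()
    match = LAST_TWO_LABELS.match(host)
    # Hosts without a dot (e.g. 'localhost') are returned unchanged.
    return match.group(1) if match else host


def clean_site_list(hosts):
    """Strip subdomains, drop blanks and duplicates, and sort alphabetically."""
    return sorted({reduce_to_site(h) for h in hosts if h.strip()})
```

For example, clean_site_list(["www.google.com", "google.com", "en.wikipedia.org"]) yields a two-entry list containing google.com and wikipedia.org.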