Changeset 33553
- Timestamp: 2019-10-04T22:19:20+13:00 (5 years ago)
- File: 1 edited
gs3-extensions/maori-lang-detection/conf/sites-too-big-to-exhaustively-crawl.txt
--- r33551
+++ r33553
@@ -1,9 +1,17 @@
+# URL blacklist
+# FORMAT:
+# precede URL by ^ to blacklist urls that match the given prefix
+# succeed URL by $ to blacklist urls that match the given suffix
+# ^url$ will blacklist urls that match the given url completely
+# Without either ^ or $ symbol, urls containing the given url will get blacklisted
 
-# Add alexa top sites (only 50visible)
+# Contains alexa top sites (where only the first 50 were visible)
 # Added further top sites from https://en.wikipedia.org/wiki/List_of_most_popular_websites
-## Finally also got the CSV from https://moz.com/top500 and added it to the list and added them in.
+# Finally also added https://moz.com/top500 by downloading its CSV file and
+# adding its URLs to the existing listing here from alexa/wiki.
 # Then used LibreOffice's Calc spreadsheet software to sort alphabetically and remove duplicates.
-# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext to keep just <site>.ext
-# And resorted alphabetically
+# Then in Gedit, used regex search and replace to remove <subdomain>.<site>.ext variants, keeping
+# just <site>.ext
+# And finally, re-sorted the reduced list alphabetically and pasted into here.
 
 
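The FORMAT comment added in this changeset describes four matching modes (prefix, suffix, exact, contains). The code that actually reads this file is not part of the changeset, so the following is only a minimal sketch of how those rules could be interpreted. The names `parse_rule`, `load_rules` and `is_blacklisted` are hypothetical, and it assumes that lines starting with `#` and blank lines are skipped and that matching is done on the plain URL string.

```python
def parse_rule(line):
    """Turn one list entry into a (mode, pattern) pair.

    Returns None for comment lines, blank lines and empty patterns
    (an assumption; the changeset does not say how these are handled).
    """
    entry = line.strip()
    if not entry or entry.startswith("#"):
        return None
    starts = entry.startswith("^")
    ends = entry.endswith("$")
    # Strip the ^ and $ markers, leaving only the URL text to compare against.
    pattern = entry[(1 if starts else 0):(-1 if ends else None)]
    if not pattern:
        return None
    if starts and ends:
        return ("exact", pattern)
    if starts:
        return ("prefix", pattern)
    if ends:
        return ("suffix", pattern)
    return ("contains", pattern)


def load_rules(lines):
    """Parse an iterable of lines, dropping comments and blanks."""
    rules = []
    for line in lines:
        rule = parse_rule(line)
        if rule is not None:
            rules.append(rule)
    return rules


def is_blacklisted(url, rules):
    """Return True if the URL matches any rule under the four documented modes."""
    for mode, pattern in rules:
        if mode == "exact" and url == pattern:
            return True
        if mode == "prefix" and url.startswith(pattern):
            return True
        if mode == "suffix" and url.endswith(pattern):
            return True
        if mode == "contains" and pattern in url:
            return True
    return False
```

With this sketch, an entry such as `^https://example.com/a` would match `https://example.com/a/b` but not `https://example.org/`, while a bare entry such as `tracker` would match any URL containing that text. The example URLs are made up for illustration and do not come from the list itself.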