source: other-projects/maori-lang-detection/feasibility.txt@ 33666

Last change on this file since 33666 was 33394, checked in by ak19, 5 years ago
  1. Started a file on feasibility with the data now available and some links that have interesting or useful information. 2. Minor simplification to get_commoncrawl_nz_urls.sh script. 3. config.props file to be used by Java. Can't find wget configuration settings to limit mirroring of a site to a certain number of pages, but can limit overall download by size (--quota or -Q).
File size: 761 bytes
Total URLs with TLD .nz at Common Crawl for the July 2019 crawl:
6862400

Total unique domains: 126990

About 54 pages on average *crawled* per site.
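
The average above can be checked with a short calculation. This is only a sketch of the arithmetic, taking the URL total as 6862400 and the unique domain count as 126990; the variable and function names are illustrative and do not come from the project's scripts.

```python
# Figures taken from the notes: .nz URLs in the July 2019 crawl
# and the number of unique domains among them.
total_nz_urls = 6862400
unique_domains = 126990


def average_pages_per_site(total_urls: int, domains: int) -> float:
    """Mean number of crawled URLs per unique domain (hypothetical helper)."""
    return total_urls / domains


avg = average_pages_per_site(total_nz_urls, unique_domains)
print(f"{avg:.1f} pages crawled per site on average")  # prints 54.0 ...
```

This counts only pages that were crawled, so it is a lower bound on how many pages each site actually has.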

https://www.quora.com/What-is-the-average-number-of-webpages-on-a-website
https://www.internetlivestats.com/total-number-of-websites/
https://www.webnots.com/3-ways-to-find-number-of-pages-on-a-website/
"There are over 1.5 billion websites on the world wide web today. Of these, less than 200 million are active."

"It must be noted that around 75% of websites today are not active, but parked domains or similar. [1]"

https://en.wikipedia.org/wiki/Domain_parking
"Domain parking is the registration of an Internet domain name without that domain being associated with any services such as e-mail or a website."