Last change on this file since 33556 was 33394, checked in by ak19, 5 years ago:
1. Started a file on feasibility with the data now available and some links that have interesting or useful information. 2. Minor simplification to get_commoncrawl_nz_urls.sh script. 3. config.props file to be used by Java. Can't find wget configuration settings to limit mirroring of a site to a certain number of pages, but can limit the overall download by size (--quota or -Q).
Total URLs with TLD .nz at Common Crawl for the July 2019 crawl:
6862400

Total unique domains: 126990

About 54 pages on average *crawled* per site (6862400 / 126990 ≈ 54.04).
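
The average above is just total URLs divided by unique domains. A minimal sketch of that arithmetic follows, using the two counts from these notes. The function and constant names are made up for illustration and are not part of get_commoncrawl_nz_urls.sh or the Java code.

```python
# Figures copied from these notes (Common Crawl, July 2019 crawl, .nz TLD).
TOTAL_NZ_URLS = 6862400
UNIQUE_NZ_DOMAINS = 126990


def average_pages_per_site(total_urls: int, unique_domains: int) -> float:
    """Mean number of crawled URLs per unique domain (hypothetical helper)."""
    if unique_domains <= 0:
        raise ValueError("unique_domains must be positive")
    return total_urls / unique_domains


average = average_pages_per_site(TOTAL_NZ_URLS, UNIQUE_NZ_DOMAINS)
print(f"{average:.2f} crawled pages per site on average")
```

Note this is a mean over crawled pages only. It says nothing about how many pages each site actually has, and a few very large sites could pull the mean well above the typical site.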

https://www.quora.com/What-is-the-average-number-of-webpages-on-a-website
https://www.internetlivestats.com/total-number-of-websites/
https://www.webnots.com/3-ways-to-find-number-of-pages-on-a-website/
"There are over 1.5 billion websites on the world wide web today. Of these, less than 200 million are active."

"It must be noted that around 75% of websites today are not active, but parked domains or similar. [1]"
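
The two quotes above do not obviously give the same inactive share, so a rough check may be useful. The sketch below treats the quoted bounds ("over 1.5 billion", "less than 200 million") as if they were exact counts, which is an assumption. The names are made up for illustration.

```python
# Rounded bounds taken from the first quote; treated as exact here (assumption).
TOTAL_WEBSITES = 1_500_000_000   # "over 1.5 billion websites"
ACTIVE_WEBSITES = 200_000_000    # "less than 200 million are active"


def inactive_share(total: int, active: int) -> float:
    """Fraction of websites that are not active (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 1 - active / total


share = inactive_share(TOTAL_WEBSITES, ACTIVE_WEBSITES)
print(f"Implied inactive share: {share:.1%}")
```

Under these assumptions the first quote implies roughly 87% inactive, which is higher than the "around 75%" in the second quote. The sources may count "active" differently or come from different dates, so both numbers are best read as rough indications only.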

https://en.wikipedia.org/wiki/Domain_parking
"Domain parking is the registration of an Internet domain name without that domain being associated with any services such as e-mail or a website."