Last change on this file since 33440 was 33394, checked in by ak19, 5 years ago:
1. Started a file on feasibility with the data now available and some links that have interesting or useful information. 2. Minor simplification to get_commoncrawl_nz_urls.sh script. 3. config.props file to be used by Java. Can't find wget configuration settings to limit mirroring of a site to a certain number of pages, but can limit the overall download by size (--quota or -Q).
File size: 761 bytes
Total URLs with TLD .nz at Common Crawl for the July 2019 crawl:
6862400

Total unique domains: 126990

About 54 pages on average *crawled* per site (6862400 / 126990 ≈ 54.04).
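The per-site average above is the URL total divided by the number of unique domains. A minimal sketch of that arithmetic follows, with a hypothetical helper `crawl_stats` that derives the same three numbers from a list of URLs. The helper name and grouping by hostname are assumptions for illustration; the notes do not say how domains were counted.

```python
from urllib.parse import urlsplit


def crawl_stats(urls):
    """Return (total_urls, unique_hosts, average_pages_per_host).

    Hypothetical helper: groups URLs by hostname, which is an assumption
    about what "domain" means in these notes.
    """
    total = 0
    hosts = set()
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # skip lines that are not absolute URLs
        total += 1
        hosts.add(host)
    average = total / len(hosts) if hosts else 0.0
    return total, len(hosts), average


# Figures recorded in these notes for the July 2019 crawl.
TOTAL_NZ_URLS = 6862400
UNIQUE_NZ_DOMAINS = 126990

print(round(TOTAL_NZ_URLS / UNIQUE_NZ_DOMAINS, 2))  # 54.04
```

Counting by registered domain instead of hostname (so that www.example.co.nz and shop.example.co.nz count once) would lower the domain count and raise the average.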
---|
| 7 |
|
---|
| 8 |
|
---|
| 9 | https://www.quora.com/What-is-the-average-number-of-webpages-on-a-website
|
---|
| 10 | https://www.internetlivestats.com/total-number-of-websites/
|
---|
| 11 | https://www.webnots.com/3-ways-to-find-number-of-pages-on-a-website/
|
---|
| 12 | "There are over 1.5 billion websites on the world wide web today. Of these, less than 200 million are active."
|
---|
| 13 |
|
---|
| 14 | "It must be noted that around 75% of websites today are not active, but parked domains or similar. [1]"
|
---|
| 15 |
|
---|
| 16 | https://en.wikipedia.org/wiki/Domain_parking
|
---|
| 17 | "Domain parking is the registration of an Internet domain name without that domain being associated with any services such as e-mail or a website."
|
---|