Timestamp:
2020-03-10T20:45:18+13:00
Author:
ak19
Message:

Piechart data for sites prepared for crawling and the piecharts for these

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt

    r34007 r34011  
    07. 11316 empty status unknown



    ============


    1463 sites prepared for crawling
    1447 sites crawled (16 were autotranslated or otherwise irrelevant)
    1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in MongoDB
    619 sites not finished crawling
    1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
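The derived figures above follow from the quoted counts by subtraction. A minimal sketch of that arithmetic, assuming the variable names below (they are illustrative only and do not come from the project's scripts):

```python
# Counts quoted in the notes above.
prepared = 1463      # sites prepared for crawling
pruned = 16          # autotranslated or otherwise irrelevant, never crawled
missing_dump = 1     # crawled, but dump.txt was missing
with_text = 1027     # dump.txt contained a text:start marker

crawled = prepared - pruned            # sites actually crawled
in_mongodb = crawled - missing_dump    # crawled sites that have a dump.txt
no_text = in_mongodb - with_text       # dump.txt present, but no text content

print(crawled, in_mongodb, no_text)    # 1447 1446 419
```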



    16 uncrawled irrelevant sites pruned away
    1 failed crawl of site (text dump missing)
    1446 crawled sites in MongoDB


    Graph title: Breakdown of the 1463 sites prepared for crawling
    * 16 uncrawled irrelevant sites pruned away
    * 1 site failed to properly crawl (text dump missing)
    * 619 incompletely crawled sites
    * 827 completely crawled sites
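To turn the four counts above into pie slices, each count is divided by the 1463 total. The sketch below is one possible way to compute the shares; the dictionary labels are shortened paraphrases, and this is not necessarily how the actual pie charts were produced.

```python
# Slice sizes for the first chart, copied from the list above.
slices = {
    "uncrawled irrelevant sites pruned away": 16,
    "failed to crawl properly (text dump missing)": 1,
    "incompletely crawled": 619,
    "completely crawled": 827,
}

total = sum(slices.values())
assert total == 1463, "slices should account for every prepared site"

# Share of the pie for each slice, as a percentage of all prepared sites.
shares = {label: 100 * count / total for label, count in slices.items()}
for label, pct in shares.items():
    print(f"{pct:5.1f}%  {label}")
```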


    Graph title: Breakdown of the 1463 sites prepared for crawling
    * 16 uncrawled irrelevant sites pruned away
    * 1 site failed to properly crawl (text dump missing)
    * 419 crawled sites with no text content
        - 150 crawled sites with 0-size (empty) dump.txt files; see below.
        - 269 crawled sites where dump.txt had no text content
    * 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)
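In this second breakdown one slice is subdivided, so there are two sums to check: the top-level slices against 1463, and the two sub-items against their 419 parent. A small sketch of that check, with a hypothetical helper `is_consistent` and shortened labels:

```python
# Second chart: top-level slices, with the "no text content" slice subdivided.
top_level = {
    "pruned": 16,
    "failed crawl": 1,
    "no text content": 419,
    "text content": 1027,
}
no_text_parts = {
    "0-size dump.txt": 150,
    "dump.txt without text content": 269,
}

def is_consistent(top, parts, parent, expected_total):
    """True if the slices sum to the total and the parts sum to their parent."""
    return (sum(top.values()) == expected_total
            and sum(parts.values()) == top[parent])

print(is_consistent(top_level, no_text_parts, "no text content", 1463))  # True
```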




    # All the dump.txt files that are 0 bytes (no content):
    # https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
    wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
        150     150    2550
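The first column of the `wc` output is the line count, i.e. the 150 empty dump.txt files. For reference, roughly the same count could be obtained in Python. This is a sketch only: `count_empty_dumps` is a hypothetical helper, and it assumes it is run from the same `crawled` directory as the `find` command above.

```python
import os

def count_empty_dumps(root, name="dump.txt"):
    """Count files called `name` under `root` whose size is 0 bytes."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        if name in filenames:
            if os.path.getsize(os.path.join(dirpath, name)) == 0:
                count += 1
    return count

if __name__ == "__main__":
    # Walks the current directory, like `find .` in the command above.
    print(count_empty_dumps("."))
```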