Changeset 34011 for other-projects


Timestamp:
2020-03-10T20:45:18+13:00 (4 years ago)
Author:
ak19
Message:

Piechart data for sites prepared for crawling and the piecharts for these

Location:
other-projects/maori-lang-detection/mongodb-data
Files:
6 added
2 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34007  r34011
    487     487     metadata _csh_ :        ^@^@^@^@
    488     488
            489
            490     ------------
            491
            492
            493     Would like to do something like:
            494     wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED | grep -l text:start */dump.txt | wc
            495
            496
            497     Would like to find how many and which of the unfinished websites had a dump.txt with no text content
            498     AND how many of the completely crawled websites had a dump.txt with no text content.
            499
            500
            501     --------------
            502
            503
            504
            505     wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt
            506
            507     wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
            508     wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
            509     wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
            510     wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
            511     wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt
            512
            513     # All the dump.txt files that are 0 bytes (no content):
            514     # https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
            515     wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
            516         150     150    2550
            517
            518
            519     Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort):
            520         wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
            521         wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
            522         wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
            523         wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt
            524
            525
            526
            527
            528     =======
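
The notes in piechart_data.txt ask how many unfinished and how many completely crawled sites have a dump.txt with no text content. The pipeline tried there cannot answer this, because grep ignores piped input when it is given file operands. One possible approach is sketched below. It assumes that every crawled site has its own directory containing dump.txt, and that the UNFINISHED marker found by `find . -name UNFINISHED` is a file sitting directly in that same directory; that layout is an assumption, not something the changeset confirms. The function names `classify_sites` and `count_sites` are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print "<site> <state> <text>" for each site directory.
# Assumes <root>/<site>/dump.txt per crawled site, and an UNFINISHED marker
# file in the same directory when the crawl did not complete.
classify_sites() {
    root=$1
    for dump in "$root"/*/dump.txt; do
        [ -f "$dump" ] || continue      # the glob matched nothing
        site=$(dirname "$dump")
        if [ -e "$site/UNFINISHED" ]; then
            state=unfinished
        else
            state=finished
        fi
        # -q: exit status only; -F: treat the pattern as a literal string
        if grep -qF 'text:start' "$dump"; then
            text=text
        else
            text=notext
        fi
        printf '%s %s %s\n' "$(basename "$site")" "$state" "$text"
    done
}

# Hypothetical helper: count the sites in one state/text combination.
count_sites() {
    classify_sites "$1" |
        awk -v s="$2" -v t="$3" '$2 == s && $3 == t { n++ } END { print n + 0 }'
}
```

Run from the crawled directory, `count_sites . unfinished notext` and `count_sites . finished notext` would give the two counts, and `classify_sites . | grep ' notext$'` would list which sites they are. Zero-byte dump.txt files land in the `notext` group, since they cannot contain `text:start`.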
  • other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt

    r34007  r34011
    106     106     07. 11316 empty status unknown
    107     107
            108
            109
            110     ============
            111
            112
            113     1463 sites prepared for crawling
            114     1447 sites crawled (16 were autotranslated or otherwise irrelevant)
            115     1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
            116     619 sites not finished crawling
            117     1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
            118
            119
            120
            121     16 uncrawled irrelevant sites pruned away
            122     1 failed crawl of site (text dump missing)
            123     1446 crawled sites in MongoDB
            124
            125
            126     Graph title: Breakdown of the 1463 sites prepared for crawling
            127     * 16 uncrawled irrelevant sites pruned away
            128     * 1 site failed to crawl properly (text dump missing)
            129     * 619 incompletely crawled sites
            130     * 827 completely crawled sites
            131
            132
            133     Graph title: Breakdown of the 1463 sites prepared for crawling
            134     * 16 uncrawled irrelevant sites pruned away
            135     * 1 site failed to crawl properly (text dump missing)
            136     * 419 crawled sites with no text content
            137         - 150 crawled sites with 0-size dump.txt files [crawled sites with empty dump.txt files] See below.
            138         - 269 crawled sites where a non-empty dump.txt had no text content
            139     * 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)
            140
            141
            142
            143
            144     # All the dump.txt files that are 0 bytes (no content):
            145     # https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
            146     wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
            147         150     150    2550
            148
            149
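
The two piechart breakdowns in piechart_data2.txt should each add up to the 1463 sites prepared for crawling, and the two sub-items should add up to the 419 sites with no text content. A small sketch of that arithmetic check follows; the helper name `check_total` is hypothetical, and the figures are copied from the notes above.

```shell
#!/bin/sh
# Hypothetical helper: sum all arguments after the first and compare the
# result with the expected total given as the first argument.
check_total() {
    expected=$1
    shift
    sum=0
    for n in "$@"; do
        sum=$((sum + n))
    done
    if [ "$sum" -eq "$expected" ]; then
        echo "ok: $sum"
    else
        echo "mismatch: got $sum, expected $expected"
        return 1
    fi
}

check_total 1463 16 1 619 827     # breakdown by crawl completion
check_total 1463 16 1 419 1027    # breakdown by text content
check_total 419 150 269           # zero-size vs. non-empty dump.txt without text
```

All three checks print an `ok:` line, so the slices of both graphs account for every prepared site.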