Changeset 34011

Timestamp: 10.03.2020 20:45:18
Author: ak19
Message: Piechart data for sites prepared for crawling and the piecharts for these
Location: other-projects/maori-lang-detection/mongodb-data
Files: 6 added, 2 modified

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34007 r34011  
    487 487    metadata _csh_ :        ^@^@^@^@
    488 488
     489 
     490------------ 
     491 
     492 
     493Would like to do something like: 
     494wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED | grep -l text:start */dump.txt | wc 
     495 
     496 
     497Would like to find how many and which of the unfinished websites had a dump.txt with no text content 
     498AND how many of the completely crawled websites had a dump.txt with no text content. 
     499 
     500 
     501-------------- 
     502 
     503 
     504 
     505wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt 
     506 
     507wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt  
     508wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt  
     509wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt  
     510wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt  
     511wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt  
     512  
     513# All the dump.txt files that are 0 bytes (no content): 
     514# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories 
     515wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc 
     516    150     150    2550 
     517 
     518 
     519Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort): 
     520    wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt  
     521    wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt  
     522    wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt  
     523    wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt  
     524     
     525 
     526 
     527 
     528======= 
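
A minimal sketch of the wished-for query in the notes above, not the author's own method. It assumes each site folder under crawled/ holds its dump.txt and, when the crawl did not finish, a marker file named UNFINISHED directly inside that folder. The notes only show `find . -name UNFINISHED`, so the marker's exact location is an assumption. The function name `no_text_report` is hypothetical.

```shell
#!/bin/sh
# Sketch: list crawled sites whose dump.txt has no "text:start" marker,
# labelled by whether the crawl finished.
# ASSUMPTION: an unfinished site has a file named UNFINISHED at the top
# level of its site folder (e.g. 01461/UNFINISHED).

no_text_report() {
    crawled_dir=${1:-.}
    for dump in "$crawled_dir"/*/dump.txt; do
        # If the glob matched nothing, the literal pattern is skipped here.
        [ -f "$dump" ] || continue
        site=$(dirname "$dump")
        # Sites with text content are not of interest; skip them.
        grep -q "text:start" "$dump" && continue
        if [ -e "$site/UNFINISHED" ]; then
            echo "unfinished $(basename "$site")"
        else
            echo "finished $(basename "$site")"
        fi
    done
}

# Usage: print which sites fall in each group, or count the groups.
# no_text_report /Scratch/ak19/maori-lang-detection/crawled
# no_text_report /Scratch/ak19/maori-lang-detection/crawled | cut -d' ' -f1 | sort | uniq -c
```

Printing one labelled line per site answers both "how many" and "which" from a single pass; the 0-byte dump.txt files are included because they also lack text:start.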
  • other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt

    r34007 r34011  
    106 106    7. 11316 empty status unknown
    107 107
     108 
     109 
     110============ 
     111 
     112 
     113 1463 sites prepared for crawling
     114 1447 sites crawled (16 were autotranslated or otherwise irrelevant)
     115 1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in MongoDB
     116 619 sites not finished crawling
     117 1027 sites where dump.txt contained text:start, denoting text content, so 419 sites with no text content
     118 
     119 
     120 
     121 16 uncrawled irrelevant sites pruned away
     122 1 failed crawl of site (text dump missing)
     123 1446 crawled sites in MongoDB
     124 
     125 
     126Graph title: Breakdown of the 1463 sites prepared for crawling 
     127* 16 uncrawled irrelevant sites pruned away 
     128* 1 site failed to crawl properly (text dump missing)
     129* 619 incompletely crawled sites 
     130* 827 completely crawled sites 
     131 
     132 
     133Graph title: Breakdown of the 1463 sites prepared for crawling 
     134* 16 uncrawled irrelevant sites pruned away 
     135* 1 site failed to crawl properly (text dump missing)
     136* 419 crawled sites with no text content 
     137    - 150 crawled sites with 0-size (empty) dump.txt files. See below.
     138    - 269 crawled sites where a non-empty dump.txt had no text content
     139* 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites) 
     140 
     141 
     142 
     143  
     144# All the dump.txt files that are 0 bytes (no content): 
     145# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories 
     146wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc 
     147    150     150    2550 
     148 
     149
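
The pie chart figures above can be cross-checked with shell arithmetic. This is only a sanity-check sketch: the input numbers are copied from the notes, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Figures copied from the notes above.
prepared=1463       # sites prepared for crawling
pruned=16           # uncrawled irrelevant sites pruned away
missing_dump=1      # crawled site with no dump.txt
unfinished=619      # sites not finished crawling
with_text=1027      # dump.txt contains text:start
empty_dump=150      # 0-byte dump.txt files

# Derived figures for the two pie charts.
crawled=$((prepared - pruned))
in_mongodb=$((crawled - missing_dump))
finished=$((in_mongodb - unfinished))
no_text=$((in_mongodb - with_text))
nonempty_no_text=$((no_text - empty_dump))

echo "crawled=$crawled in_mongodb=$in_mongodb finished=$finished"
echo "no_text=$no_text nonempty_no_text=$nonempty_no_text"
```

Both breakdowns should sum back to the 1463 prepared sites: 16 + 1 + 619 + 827 for the first chart, and 16 + 1 + 419 + 1027 for the second.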