Timestamp:
2020-03-10T20:45:18+13:00
Author:
ak19
Message:

Piechart data for sites prepared for crawling and the piecharts for these

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34007 r34011  
    487 487    metadata _csh_ :        ^@^@^@^@
    488 488
     489
     490------------
     491
     492
     493Would like to do something like:
     494wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED | grep -l text:start */dump.txt | wc
     495
     496
     497Would like to find how many and which of the unfinished websites had a dump.txt with no text content
     498AND how many of the completely crawled websites had a dump.txt with no text content.
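
The command sketched above would not work as written, because grep ignores piped input when it is given file arguments. One possible way to get both lists is shown below. This is a minimal sketch under assumptions the notes do not confirm: each site has its own directory directly under the crawl root, an UNFINISHED file in that directory marks an incomplete crawl, and a dump.txt with no "text:start" line counts as having no text content. The function name report_textless_dumps is hypothetical.

```shell
#!/bin/sh
# Sketch: list site directories whose dump.txt has no "text:start" marker,
# labelled by crawl status.
# Assumptions: one directory per site directly under the crawl root, an
# UNFINISHED file marks an incomplete crawl, and a dump.txt without
# "text:start" has no text content.
report_textless_dumps() {
    root="${1:-.}"
    for dump in "$root"/*/dump.txt; do
        # Skip the unexpanded glob pattern when no dump.txt exists at all
        [ -f "$dump" ] || continue
        site=$(dirname "$dump")
        # grep -q exits 0 on the first match; sites with text are skipped
        if grep -q "text:start" "$dump"; then
            continue
        fi
        if [ -e "$site/UNFINISHED" ]; then
            echo "unfinished $(basename "$site")"
        else
            echo "finished $(basename "$site")"
        fi
    done
}
```

Run from the crawled directory, `report_textless_dumps . | cut -d' ' -f1 | sort | uniq -c` would then give the count for each of the two groups, while the unpiped output names the sites.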
     499
     500
     501--------------
     502
     503
     504
     505wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt
     506
     507wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
     508wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
     509wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
     510wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
     511wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt
     512 
     513# All the dump.txt files that are 0 bytes (no content):
     514# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
     515wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
     516    150     150    2550
     517
     518
     519Seed URLs for some of the sites with empty dump.txt files (sites listed with: find . -name "dump.txt" -size 0 | sort):
     520    wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
     521    wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
     522    wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
     523    wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt
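
The manual lookups above could be automated along the following lines. This is only a sketch: it assumes the site IDs under the crawl root match the directory names under the sites directory (../tmp/to_crawl.THE_VERSION_USED/sites in the commands above) and that the first line of seedURLs.txt is representative of the site. The function name list_empty_dump_seeds is hypothetical.

```shell
#!/bin/sh
# Sketch: print each site ID that has a zero-byte dump.txt, followed by the
# first line of that site's seedURLs.txt.
# Assumption: site directories under the crawl root use the same IDs as the
# directories under the sites root.
list_empty_dump_seeds() {
    crawl_root="$1"
    sites_root="$2"
    # -size 0 matches only files that are completely empty
    find "$crawl_root" -name "dump.txt" -size 0 | sort | while IFS= read -r dump; do
        site=$(basename "$(dirname "$dump")")
        seeds="$sites_root/$site/seedURLs.txt"
        if [ -f "$seeds" ]; then
            echo "$site $(head -n 1 "$seeds")"
        else
            echo "$site (no seedURLs.txt found)"
        fi
    done
}
```

From the crawled directory this might be called as `list_empty_dump_seeds . ../tmp/to_crawl.THE_VERSION_USED/sites`.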
     524   
     525
     526
     527
     528=======