Changeset 33986 for other-projects


Timestamp:
2020-02-28T22:07:29+13:00
Author:
ak19
Message:

Dr Bainbridge investigated the original data set more

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r33985 r33986
                1  "UPPER BOUND"
                2
         1      3  blacklisted
         2      4  greylisted

       173    175  119874
       174    176
              177  ---------------------------
       175    178
              179  #Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
              180  wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a  'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
              181     3276    9828  419259
              182
              183  #Number of dump.txt files (sites) that had text:start in them vs those that didn't:
              184  wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
              185     1027    1027   15405
              186  wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
              187     1027    4108   35945
              188
              189  # number of dump.txt files
              190  wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
              191     1446    1446   24582
              192  wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
              193
              194
              195  Look to see if commoncrawl has a field for how much text there is on the page.
              196  Else this is a useful feature for them to add.
              197
              198
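
The added lines record three separate counts (all dump.txt files, those containing text:start, and by implication those without). A minimal sketch of how the same tally could be produced in one step is shown here. The function name count_text_sites is hypothetical and not part of the changeset, and the sketch assumes one dump.txt per site directory under the crawl directory, as the commands in the diff suggest.

```shell
#!/bin/sh
# Hypothetical helper (not from the changeset): tally dump.txt files that
# do and do not contain a text:start marker under a crawl directory.
count_text_sites() {
    crawl_dir=$1

    # Total number of dump.txt files found (one per crawled site, by assumption).
    total=$(find "$crawl_dir" -name dump.txt | wc -l)

    # grep -l prints each matching file name once; -F treats the pattern
    # as a fixed string, like fgrep in the original commands.
    with_text=$(find "$crawl_dir" -name dump.txt \
        -exec grep -l -F 'text:start' {} + | wc -l)

    # Arithmetic expansion also strips any padding wc adds to its output.
    echo "total=$((total)) with_text=$((with_text)) without_text=$((total - with_text))"
}
```

Called as count_text_sites /Scratch/ak19/maori-lang-detection/crawled, this would print the three figures on one line. Taking the counts recorded in the diff at face value, 1446 dump.txt files of which 1027 contain text:start leaves 1446 - 1027 = 419 sites with no extracted text.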