Changeset 33986

Timestamp: 28.02.2020 22:07:29 (5 weeks ago)
Author: ak19
Message: Dr Bainbridge investigated the original data set more

Files:
1 modified

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r33985 r33986  
@@ -1,2 +1,4 @@
+"UPPER BOUND"
+
 blacklisted
 greylisted
@@ -173,3 +175,24 @@
 119874
 
+---------------------------
 
+#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
+wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a  'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
+   3276    9828  419259
+
+#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
+wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
+   1027    1027   15405
+wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
+   1027    4108   35945
+
+# number of dump.txt files
+wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
+   1446    1446   24582
+wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
+
+
+Look to see if commoncrawl has a field for how much text there is on the page.
+Else this is a useful feature for them to add.
+
+
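
The transcript above counts the dump.txt files that contain text:start (1027) and the total number of dump.txt files (1446), but it never prints the number of files without text:start. Assuming every dump.txt sits exactly one directory level below crawled/, so that the */dump.txt glob and the find command see the same files, that number would be 1446 - 1027 = 419. The sketch below is one possible way to get these figures, plus the NOTFOUND line count, from a single helper. The function name summarise_dumps and its output format are hypothetical and are not part of the committed file. Unlike the original command, which only looked at 0*/dump.txt, the sketch scans every dump.txt it finds.

```shell
#!/bin/sh
# Hypothetical helper (not from the changeset): summarise the dump.txt
# files found under a crawl directory.
# Usage: summarise_dumps /path/to/crawled
summarise_dumps() {
    crawl_dir=$1

    # Total number of dump.txt files (one per crawled site is assumed).
    total=$(find "$crawl_dir" -name dump.txt | wc -l | tr -d ' ')

    # Files that contain at least one text:start marker.
    # grep -l prints each matching file name once; -F matches a fixed string.
    with_text=$(find "$crawl_dir" -name dump.txt \
        -exec grep -lF 'text:start' {} + | wc -l | tr -d ' ')

    # Files with no text:start marker, obtained by subtraction.
    without_text=$((total - with_text))

    echo "total=$total with_text=$with_text without_text=$without_text"
}

# Hypothetical helper: count lines mentioning both NOTFOUND and
# protocolStatus, mirroring the fgrep | grep pipeline in the transcript.
# -a treats binary-looking files as text, as the original command did.
count_notfound() {
    crawl_dir=$1
    find "$crawl_dir" -name dump.txt \
        -exec grep -aF 'NOTFOUND' {} + | grep -c 'protocolStatus'
}
```

To list the sites without text rather than count them, GNU grep's -L option (files without a match) could be used in place of -l. The subtraction is used here only because it avoids relying on that option.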