Changeset 33986
- Timestamp: 2020-02-28T22:07:29+13:00
- Files: 1 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data.txt
r33985 r33986
+"UPPER BOUND"
+
 blacklisted
 greylisted
 …
 119874
 
+---------------------------
 
+#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
+wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
+   3276    9828  419259
+
+#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
+wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
+   1027    1027   15405
+wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
+   1027    4108   35945
+
+# number of dump.txt files
+wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
+   1446    1446   24582
+wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
+
+
+Look to see if commoncrawl has a field for how much text there is on the page.
+Else this is a useful feature for them to add.
+
+
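
The three numbers printed by each `wc` call are lines, words and bytes, so the first column is the count of interest. Assuming each site folder holds exactly one dump.txt (an assumption, not stated in the log), the counts imply 1446 - 1027 = 419 sites whose dump.txt has no text:start marker. The sketch below reproduces the same three counts on a small made-up directory; the folder names and file contents are hypothetical stand-ins, not the real crawl data, and `grep -F` is used in place of `fgrep`.

```shell
#!/bin/sh
# Toy stand-in for the crawled/ directory: three hypothetical site folders,
# two with a text:start marker and one that only records a NOTFOUND status.
tmp=$(mktemp -d)
mkdir -p "$tmp/00001" "$tmp/00002" "$tmp/00003"
printf 'text:start:\nkia ora\ntext:end:\n' > "$tmp/00001/dump.txt"
printf 'protocolStatus: NOTFOUND\n' > "$tmp/00002/dump.txt"
printf 'text:start:\nhello\ntext:end:\n' > "$tmp/00003/dump.txt"
cd "$tmp" || exit 1

# Same pipelines as in the log, with wc -l so only the line count is printed.
# (The log's -a flag only matters when dump.txt contains binary bytes.)
notfound=$(grep -F 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc -l)
with_text=$(grep -F -l 'text:start' */dump.txt | wc -l)
total=$(find . -name 'dump.txt' | wc -l)

# Sites without any text: total dump.txt files minus those with text:start.
without_text=$((total - with_text))
echo "notfound=$((notfound)) with_text=$((with_text)) total=$((total)) without_text=$without_text"

# The same subtraction applied to the counts recorded in the log.
logged_without_text=$((1446 - 1027))
echo "logged_without_text=$logged_without_text"

cd / && rm -rf "$tmp"
```

Wrapping the counts in `$(( ))` also strips the leading padding that some `wc` implementations emit.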