Timestamp: 2020-03-10T17:27:07+13:00
Files: 1 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data.txt
--- r34001
+++ r34004
@@ -1,2 +1,24 @@
+https://www.rapidtables.com/tools/pie-chart.html
+https://www.meta-chart.com/pie#/data
+
+"11.5 billion CC URLs"
+38724 CC URLs in "MRI"
+10290 URLs discarded (blacklisted and too little text)
+2751 URLs greylisted
+25683-4 URLs retained = 25679 seed URLs for crawling
+
+1463 sites prepared for crawling
+1447 sites crawled (16 were autotranslated or otherwise irrelevant)
+1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
+619 sites not finished crawling
+1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
+
+
+119874 crawled web pages in mongodb
+
+3276 crawled pages with no text content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
+
+----------
+
 The 12 month period CommonCrawl crawl data that we used:
 
@@ -39,5 +61,5 @@
 
 = 9100 million or 9.1 billion new URLs not contained in any crawl archive before
-+ taking the first crawl month's figure of 2.8 billion - 500 million new URL in first crawl month= 11.4 billion URLs? At least?
++ taking the first crawl month's figure of 2.8 billion - 500 million new URLs in 1st month crawled = 11.4 billion URLs? At least?
 ---------------------------------------------
 
@@ -139,5 +161,5 @@
 d. Of the keepURLs, 4 more webpages ultimately irrelevant sites at unprocessed-topsite-matches.txt.
 
-3 not in MRI but of the same domain, one is just a gallery of holiday pictures.
+3 are not in MRI but are of the same domain, one is just a gallery of holiday pictures.
 
 > less unprocessed-topsite-matches.txt
@@ -242,2 +264,82 @@
 
 
+wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
+589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
+
+- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages
+
+
+
+================================
+Inspecting the csv file:
+
+wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.txt
+587082 InfoOnEmptyPagesNotInMongoDB.txt
+-1 for column headings =
+587081 empty pages
+
+wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
+3441 21326 579499
+
+OF WHICH fetched but parseException:
+wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "ParseException" | wc
+939 9390 219818
+
+wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "ParseException" | wc
+2502 11936 359681
+
+ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
+wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "ParseException|SUCCESS" | wc
+0 0 0
+
+wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
+555167 1117894 60067623
+
+wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | wc
+5907 17929 1059096
+wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "NOTFOUND" | wc
+3276 9828 695839
+
+For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
+wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "NOTFOUND" | less
+wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
+
+
+wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt | wc
+291 873 51684
+wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "NOTMODIFIED" | wc
+291 873 51684
+
+
+========
+
+wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | wc
+1376 11001 289780
+wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "success/ok" | fgrep "ParseException" | wc
+0 0 0
+
+
+wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | less
+wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
+437 1611 69962
+
+- "success/ok"
+- "success/redirect"
+- "failed/exception" for ParseException
+All failed/exception are ParseExceptions:
+wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "failed/exception" | fgrep -v "ParseException" | wc
+0 0 0
+
+ALL THE status_fetched:
+wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
+3441 21326 579499
+wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | wc
+3154 20465 542771
+wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
+wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
+
+wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | lesswharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | wc
+287 861 36728
+
+(No equivalent info to success/ok, success/redirect, failed/exception)
+
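
The figures recorded in this changeset can be cross-checked mechanically. The following is a minimal Python sketch, not part of the committed file. It assumes the counts are copied correctly from the notes and that the fgrep / fgrep -v pipelines amount to counting lines that contain some fixed strings and lack others. The helper names (count_matching, failed_checks) and the sample rows are hypothetical; the real column layout of InfoOnEmptyPagesNotInMongoDB.txt is not shown in the notes.

```python
def count_matching(lines, include=(), exclude=()):
    """Count lines containing every `include` substring and no `exclude` substring.

    Roughly what `fgrep A file | fgrep -v B | wc -l` reports (fixed strings, line count only).
    """
    return sum(
        1
        for line in lines
        if all(s in line for s in include) and not any(s in line for s in exclude)
    )


def failed_checks(checks):
    """Return the labels of all (label, left, right) checks where left != right."""
    return [label for label, left, right in checks if left != right]


# Hypothetical sample rows, only to exercise count_matching.
sample = [
    "http://a.example/1,status_fetched,success/ok",
    "http://a.example/2,status_fetched,failed/exception,ParseException",
    "http://a.example/3,status_fetched,success/redirect",
    "http://a.example/4,status_gone,NOTFOUND",
    "http://a.example/5,status_unfetched",
]
fetched = count_matching(sample, include=["status_fetched"])
fetched_no_parse_error = count_matching(
    sample, include=["status_fetched"], exclude=["ParseException"]
)

# Arithmetic stated or implied in the notes; every number is copied from them.
checks = [
    ("URLs retained", 38724 - 10290 - 2751, 25683),
    ("seed URLs", 25683 - 4, 25679),
    ("sites crawled", 1463 - 16, 1447),
    ("sites with dump.txt", 1447 - 1, 1446),
    ("sites with no text content", 1446 - 1027, 419),
    ("empty pages, first file", 589179 - 17, 589162),
    ("empty pages, csv file", 587082 - 1, 587081),
    ("status_fetched split by ParseException", 939 + 2502, 3441),
    ("not success/ok and not ParseException", 1376 - 939, 437),
    ("status_fetched with and without outcome string", 3154 + 287, 3441),
]
inconsistent = failed_checks(checks)
print("inconsistent figures:", inconsistent)
```

The last check relies on an assumption the notes do not state outright: that the strings success/redirect, success/ok and failed/exception occur only on status_fetched lines, so that the whole-file egrep count of 3154 can be added to the 287 status_fetched lines lacking them.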