- Timestamp: 2020-03-10T18:51:05+13:00
- Files: 1 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data.txt
--- other-projects/maori-lang-detection/mongodb-data/piechart_data.txt	(revision 34004)
+++ other-projects/maori-lang-detection/mongodb-data/piechart_data.txt	(revision 34006)
@@ -1,4 +1,4 @@
 https://www.rapidtables.com/tools/pie-chart.html
-https://www.meta-chart.com/pie#/data
+https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels)
 
 "11.5 billion CC URLs"
@@ -264,6 +264,6 @@
 
 
-wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
-589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
+wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
+589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
 
 - 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages
@@ -274,39 +274,125 @@
 Inspecting the csv file:
 
-wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.txt
-587082 InfoOnEmptyPagesNotInMongoDB.txt
+
+wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv
+587082 InfoOnEmptyPagesNotInMongoDB.csv
 -1 for column headings =
 587081 empty pages
 
-wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
+
+# Listing of the nutch crawl status values:
+# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html
+# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified
+# Remainder are status (null). See examples in siteID 00154 later in this file.
+
+
+wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
+555167 1117894 60067623
+wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
+3441 21326 579499
+wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
+5907 17929 1059096
+wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
+291 873 51684
+wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
+10959 32941 1927067
+
+UNKNOWN STATUS (no status, protocolStatus or parseStatus info) forthe remainder:
+wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less
+
+wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
+11317-1 (column heading) 22633 874662
+
+=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause)
+=> 555167+3441+5907+291+10959+11316 = 587081 empty pages (CHECKED)
+
+wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 3441 21326 579499
 
-OF WHICH fetched but parseException:
-wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "ParseException" | wc
+wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc
+2065 10325 289719
+
+wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc
+150 750 33234
+
+wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc
+939 9390 219818
+[
+all status_fetched with failed/exception are parseExceptions:
+wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc
 939 9390 219818
-
-wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "ParseException" | wc
-2502 11936 359681
-
-ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
-wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "ParseException|SUCCESS" | wc
-0 0 0
-
-wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
+]
+
+All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages):
+wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc
+287 861 36728
+
+
+All status_fetched that are not parseExceptions were SUCCESS:
+
+wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc
+2502 11936 359681
+
+ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
+wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc
+0 0 0
+
+
+wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 555167 1117894 60067623
 
-wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | wc
+status_unfetched includes
+- EXCEPTIONs like http error code 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway)
+IOExceptions like unzipping issues (unzipBestEffort returned null)
+Unknown Host Exceptions, SocketTimeoutException, ConnectionException connection refused,
+SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
+(EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
+- (null)
+
+
+wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc
+1847 11254 381055
+
+
+
+status_redir_temp, status_redir_perm
+- MOVED
+- TEMP_MOVED
+
+wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
+10959 32941 1927067
+wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
+4872 14625 906162
+wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc
+6087 18316 1020905
+
+
+wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
 5907 17929 1059096
-wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "NOTFOUND" | wc
+
+[
+For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
+wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "NOTFOUND" | less
+wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
+
+wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc
+0 0 0
+]
+
+wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTFOUND" | wc
 3276 9828 695839
 
-For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
-wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "NOTFOUND" | less
-wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
-
-
-wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt | wc
+wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "GONE" | wc
+374 1322 93428
+wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ROBOTS_DENIED" | wc
+2253 6759 269069
+wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ACCESS_DENIED" | wc
+4 20 760
+
+= 5907
+
+wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
 291 873 51684
-wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "NOTMODIFIED" | wc
+wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTMODIFIED" | wc
 291 873 51684
 
@@ -314,12 +400,12 @@
 ========
 
-wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | wc
+wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc
 1376 11001 289780
-wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "success/ok" | fgrep "ParseException" | wc
+wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc
 0 0 0
 
 
-wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | less
-wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
+wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less
+wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
 437 1611 69962
 
@@ -328,18 +414,73 @@
 - "failed/exception" for ParseException
 All failed/exception are ParseExceptions:
-wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "failed/exception" | fgrep -v "ParseException" | wc
+wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc
 0 0 0
 
 ALL THE status_fetched:
-wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
+wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 3441 21326 579499
-wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | wc
+wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc
 3154 20465 542771
-wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
-wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
-
-wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | lesswharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | wc
+wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
+wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
+
+wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | lesswharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | wc
 287 861 36728
 
 (No equivalent info to success/ok, success/redirect, failed/exception)
 
+-----------------------------
+No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?):
+http://m.biblepub.com/bibles/mb/19/81 key: com.biblepub.m:http/bibles/mb/19/81
+baseUrl: null
+status: 2 (status_fetched)
+fetchTime: 1573978084279
+prevFetchTime: 1571385510616
+fetchInterval: 2592000
+retriesSinceFetch: 0
+modifiedTime: 0
+prevModifiedTime: 0
+protocolStatus: SUCCESS, args=[]
+signature: 3e214d69ab677a676e40c2b91901acc9
+parseStatus: success/ok (1/0), args=[]
+title: Psalm 81 - Maori Bible - Bibles - BiblePub Mobile
+score: 1.0
+marker _injmrk_ : y
+marker _updmrk_ : 1571386061-31026
+marker dist : 0
+reprUrl: null
+batchId: 1571386061-31026
+metadata CharEncodingForConversion : utf-8
+metadata OriginalCharEncoding : utf-8
+metadata _rs_ : ^@^@^By
+metadata _csh_ : ^@^@^@^@
+text:start:
+Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki
+te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko
+te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i
+tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu a
+hau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera. 8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi
+atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku i
+wi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo
+taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua n
+gohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Si
+te © 2013 BiblePub
+text:end:
+
+http://m.biblepub.com/bibles/mb/19/82 key: com.biblepub.m:http/bibles/mb/19/82
+baseUrl: null
+status: 1 (status_unfetched)
+fetchTime: 1571386117381
+prevFetchTime: 0
+fetchInterval: 2592000
+retriesSinceFetch: 0
+modifiedTime: 0
+prevModifiedTime: 0
+protocolStatus: (null)
+parseStatus: (null)
+title: null
+score: 0.0
+marker dist : 1
+reprUrl: null
+metadata _csh_ : ^@^@^@^@
+