Changeset 34006

Timestamp:
10.03.2020 18:51:05 (3 weeks ago)
Author:
ak19
Message:

Committing more data I've collected for generating pie charts, plus the pie charts for the first dataset (the one from which the seed URLs for crawling were obtained).

Location:
other-projects/maori-lang-detection/mongodb-data
Files:
4 added
1 modified

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34004 r34006  
    11https://www.rapidtables.com/tools/pie-chart.html 
    2 https://www.meta-chart.com/pie#/data 
     2https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels) 
    33 
    44"11.5 billion CC URLs"  
     
    264264 
    265265 
    266 wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt  
    267 589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt 
     266wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv  
     267589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv 
    268268 
    269269- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages 
     
    274274Inspecting the csv file: 
    275275 
    276 wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.txt  
    277 587082 InfoOnEmptyPagesNotInMongoDB.txt 
     276 
     277wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv  
     278587082 InfoOnEmptyPagesNotInMongoDB.csv 
    278279-1 for column headings =  
    279280587081 empty pages 
    280281 
    281 wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc 
     282 
     283# Listing of the nutch crawl status values: 
     284# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html 
     285# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified 
     286# Remainder are status (null). See examples in siteID 00154 later in this file. 
     287 
     288 
     289    wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     290     555167 1117894 60067623 
     291    wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     292       3441   21326  579499 
     293    wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     294       5907   17929 1059096 
     295    wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     296        291     873   51684 
     297    wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     298      10959   32941 1927067 
     299 
     300    UNKNOWN STATUS (no status, protocolStatus or parseStatus info) for the remainder: 
     301    wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less 
     302 
     303    wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     304      11317-1 (column heading)   22633  874662 
     305 
     306=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause) 
     307=> 555167+3441+5907+291+10959+11316 = 587081 empty pages (CHECKED) 
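The CHECKED sum above can be reproduced without running six separate grep pipelines. The following is a minimal sketch, not the method used in these notes: `tally_statuses` is a hypothetical helper that assumes each CSV row contains at most one of the status strings and matches by plain substring, as fgrep does.

```python
from collections import Counter

# Status substrings taken from the notes above. A row is labelled with the
# first one it contains (assumption: rows carry at most one status string).
STATUSES = ("status_unfetched", "status_fetched", "status_gone",
            "status_redir", "status_notmodified")

def tally_statuses(lines):
    """Count rows per crawl status by substring match; the rest are UNKNOWN."""
    counts = Counter()
    for line in lines:
        label = next((s for s in STATUSES if s in line), "UNKNOWN")
        counts[label] += 1
    return counts

# Line counts reported above; the UNKNOWN figure already excludes the
# column-heading row (11317 - 1).
reported = {
    "status_unfetched": 555167,
    "status_fetched": 3441,
    "status_gone": 5907,
    "status_notmodified": 291,
    "status_redir": 10959,
    "UNKNOWN": 11316,
}
total = sum(reported.values())
print(total)  # prints 587081
```

Note that when run over the real csv file, `tally_statuses` would count the heading row as UNKNOWN, so 1 would need to be subtracted, as in the notes.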
     308 
     309wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc 
    282310   3441   21326  579499 
    283311 
    284     OF WHICH fetched but parseException: 
    285         wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "ParseException" | wc 
     312    wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc 
     313       2065   10325  289719 
     314 
     315    wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc 
     316        150     750   33234 
     317 
     318    wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc 
     319        939    9390  219818 
     320[ 
     321    all status_fetched with failed/exception are parseExceptions: 
     322        wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc 
    286323            939    9390  219818 
    287  
    288     wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "ParseException" | wc 
    289         2502   11936  359681 
    290  
    291     ONLY OTHER OPTION FOR status_fetched IS SUCCESS: 
    292         wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "ParseException|SUCCESS" | wc 
    293               0       0       0 
    294  
    295 wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.txt | wc 
     324] 
     325 
     326All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages): 
     327    wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc 
     328        287     861   36728 
     329 
     330 
     331    All status_fetched that are not parseExceptions were SUCCESS: 
     332 
     333        wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc 
     334            2502   11936  359681 
     335 
     336        ONLY OTHER OPTION FOR status_fetched IS SUCCESS: 
     337            wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc 
     338                  0       0       0 
     339 
     340 
     341wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc 
    296342 555167 1117894 60067623 
    297343 
    298     wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | wc 
     344    status_unfetched includes 
     345    - EXCEPTIONs such as HTTP error codes 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway), 
     346    IOExceptions such as unzipping issues (unzipBestEffort returned null), 
     347    Unknown Host Exceptions, SocketTimeoutException, ConnectionException (connection refused), 
     348    SSL Exceptions such as fatal alert/internal error and SSLHandshakeException (SSL security issues / invalid certificate),  
     349    (EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target]) 
     350    - (null) 
     351 
     352     
     353    wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc 
     354       1847   11254  381055 
     355 
     356     
     357 
     358status_redir_temp, status_redir_perm 
     359    - MOVED 
     360    - TEMP_MOVED  
     361 
     362    wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     363      10959   32941 1927067 
     364    wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     365       4872   14625  906162 
     366    wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc 
     367       6087   18316 1020905 
     368 
     369 
     370wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc 
    299371       5907   17929 1059096 
    300     wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep "NOTFOUND" | wc 
     372 
     373[ 
     374For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED: 
     375    wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | fgrep -v "NOTFOUND" | less 
     376    wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less 
     377 
     378wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc 
     379      0       0       0 
     380] 
     381 
     382    wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | fgrep "NOTFOUND" | wc 
    301383       3276    9828  695839 
    302384 
    303 For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED: 
    304     wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep -v "NOTFOUND" | less 
    305     wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less 
    306  
    307  
    308 wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt  | wc 
     385    wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep "GONE" | wc 
     386        374    1322   93428 
     387    wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep "ROBOTS_DENIED" | wc 
     388       2253    6759  269069 
     389    wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep "ACCESS_DENIED" | wc 
     390          4      20     760 
     391 
     392= 5907 
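The four status_gone sub-counts add up to 5907 only because the matching is case-sensitive: the lowercase "status_gone" label never matches the uppercase keyword GONE. A minimal sketch of that classification follows; `gone_reason` is a hypothetical helper, and the sample row layout in the usage note is an assumption rather than the real csv format.

```python
# Keywords observed on status_gone rows in the notes above.
GONE_REASONS = ("NOTFOUND", "ROBOTS_DENIED", "ACCESS_DENIED", "GONE")

def gone_reason(line):
    """Return the first protocol-status keyword found on a row, or None.

    Matching is case-sensitive, so the lowercase "status_gone" label
    itself is never mistaken for the GONE keyword.
    """
    for reason in GONE_REASONS:
        if reason in line:
            return reason
    return None

# Line counts reported above for each keyword.
reported = {"NOTFOUND": 3276, "GONE": 374, "ROBOTS_DENIED": 2253, "ACCESS_DENIED": 4}
print(sum(reported.values()))  # prints 5907
```

For example, `gone_reason("url,status_gone,NOTFOUND, args=[]")` returns `"NOTFOUND"`, while a row with only the lowercase label returns `None`.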
     393 
     394wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv  | wc 
    309395    291     873   51684 
    310 wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep "NOTMODIFIED" | wc 
     396wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv  | fgrep "NOTMODIFIED" | wc 
    311397    291     873   51684 
    312398 
     
    314400======== 
    315401 
    316 wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | wc 
     402wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc 
    317403   1376   11001  289780 
    318 wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "success/ok" | fgrep "ParseException" | wc 
     404wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc 
    319405      0       0       0 
    320406 
    321407 
    322 wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | less 
    323 wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | wc 
     408wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less 
     409wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc 
    324410    437    1611   69962 
    325411 
     
    328414- "failed/exception" for ParseException 
    329415All failed/exception are ParseExceptions: 
    330 wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "failed/exception" | fgrep -v "ParseException" | wc 
     416wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc 
    331417      0       0       0 
    332418 
    333419ALL THE status_fetched: 
    334 wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc 
     420wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc 
    335421   3441   21326  579499 
    336 wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | wc 
     422wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc 
    337423   3154   20465  542771 
    338 wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less 
    339 wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less 
    340  
    341 wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | lesswharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | wc 
     424wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less 
     425wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less 
     426 
     427wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | lesswharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | wc 
    342428    287     861   36728 
    343429 
    344430(No equivalent info to success/ok, success/redirect, failed/exception) 
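The various status_fetched figures above can be cross-checked against each other. This is a small arithmetic sketch using only the line counts from the wc outputs; the variable names are illustrative.

```python
# Line counts copied from the wc outputs above.
fetched_total = 3441
by_parse_status = {"success/ok": 2065, "success/redirect": 150, "failed/exception": 939}
no_parse_status = 287  # rows with none of the three parseStatus values

# Rows that carry one of the three parseStatus values.
with_parse_status = sum(by_parse_status.values())
assert with_parse_status == 3154
assert with_parse_status + no_parse_status == fetched_total

# Rows that are not success/ok, and of those, the ones without ParseException.
not_ok = fetched_total - by_parse_status["success/ok"]
assert not_ok == 1376
assert not_ok - by_parse_status["failed/exception"] == 437

# Rows without ParseException (all failed/exception rows are ParseExceptions).
assert fetched_total - by_parse_status["failed/exception"] == 2502
print("status_fetched figures are consistent")
```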
    345431 
     432----------------------------- 
     433No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?): 
     434    http://m.biblepub.com/bibles/mb/19/81   key:    com.biblepub.m:http/bibles/mb/19/81 
     435    baseUrl:        null 
     436    status: 2 (status_fetched) 
     437    fetchTime:      1573978084279 
     438    prevFetchTime:  1571385510616 
     439    fetchInterval:  2592000 
     440    retriesSinceFetch:      0 
     441    modifiedTime:   0 
     442    prevModifiedTime:       0 
     443    protocolStatus: SUCCESS, args=[] 
     444    signature:      3e214d69ab677a676e40c2b91901acc9 
     445    parseStatus:    success/ok (1/0), args=[] 
     446    title:  Psalm 81 - Maori Bible - Bibles - BiblePub Mobile 
     447    score:  1.0 
     448    marker _injmrk_ :       y 
     449    marker _updmrk_ :       1571386061-31026 
     450    marker dist :   0 
     451    reprUrl:        null 
     452    batchId:        1571386061-31026 
     453    metadata CharEncodingForConversion :    utf-8 
     454    metadata OriginalCharEncoding :         utf-8 
     455    metadata _rs_ :         ^@^@^By 
     456    metadata _csh_ :        ^@^@^@^@ 
     457    text:start: 
     458    Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki  
     459    te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko  
     460    te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i 
     461     tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu a 
     462    hau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera. 8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi  
     463    atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku i 
     464    wi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo 
     465     taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua n 
     466    gohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Si 
     467    te © 2013 BiblePub 
     468    text:end: 
     469 
     470    http://m.biblepub.com/bibles/mb/19/82   key:    com.biblepub.m:http/bibles/mb/19/82 
     471    baseUrl:        null 
     472    status: 1 (status_unfetched) 
     473    fetchTime:      1571386117381 
     474    prevFetchTime:  0 
     475    fetchInterval:  2592000 
     476    retriesSinceFetch:      0 
     477    modifiedTime:   0 
     478    prevModifiedTime:       0 
     479    protocolStatus: (null) 
     480    parseStatus:    (null) 
     481    title:  null 
     482    score:  0.0 
     483    marker dist :   1 
     484    reprUrl:        null 
     485    metadata _csh_ :        ^@^@^@^@ 
     486