Timestamp:
2020-03-10T18:51:05+13:00
Author:
ak19
Message:

Committing more data I've collected for generating pie charts, along with the pie charts for the first dataset, from which the seed URLs for crawling were obtained.

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34004 r34006  
    11https://www.rapidtables.com/tools/pie-chart.html
    2 https://www.meta-chart.com/pie#/data
     2https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels)
    33
    44"11.5 billion CC URLs"
     
    264264
    265265
    266 wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
    267 589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
     266wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
     267589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
    268268
    269269- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages
     
    274274Inspecting the csv file:
    275275
    276 wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.txt
    277 587082 InfoOnEmptyPagesNotInMongoDB.txt
     276
     277wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv
     278587082 InfoOnEmptyPagesNotInMongoDB.csv
    278279-1 for column headings =
    279280587081 empty pages
    280281
    281 wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
     282
     283# Listing of the nutch crawl status values:
     284# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html
     285# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified
     286# Remainder are status (null). See examples in siteID 00154 later in this file.
     287
     288
     289    wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
     290     555167 1117894 60067623
     291    wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
     292       3441   21326  579499
     293    wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
     294       5907   17929 1059096
     295    wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
     296        291     873   51684
     297    wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
     298      10959   32941 1927067
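The five separate fgrep runs above could also be collapsed into a single pass over the file. The following is only a minimal sketch: it assumes a grep that supports the non-POSIX -o flag (GNU and BSD grep both do) and that each CSV row carries at most one status_* token. count_statuses is a hypothetical helper name, not something from the project.

```shell
# Hypothetical helper: tally every crawl status in one pass over the file.
# Caveat: grep -o counts matches, whereas `fgrep ... | wc` counts lines, so the
# numbers agree only if no row contains more than one status_* token.
count_statuses() {
    grep -oE 'status_(unfetched|fetched|gone|redir_temp|redir_perm|notmodified)' "$1" \
        | sort | uniq -c
}

# Usage (file name taken from the session above):
#   count_statuses InfoOnEmptyPagesNotInMongoDB.csv
```

Unlike the plain "status_redir" grep, this version reports status_redir_temp and status_redir_perm as separate rows.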
     299
     300    UNKNOWN STATUS (no status, protocolStatus or parseStatus info) for the remainder:
     301    wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less
     302
     303    wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
     304      11317-1 (column heading)   22633  874662
     305
     306=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause)
     307=> 555167+3441+5907+291+10959+11316 = 587081 empty pages (CHECKED)
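The tally above can be re-checked mechanically with shell arithmetic. This is just a sketch with the counts copied by hand from the wc output earlier in these notes; the variable names are made up for illustration.

```shell
# Line counts copied from the wc output above.
unfetched=555167
fetched=3441
gone=5907
notmodified=291
redir=10959
unknown=$((11317 - 1))   # egrep -v count minus the column-heading line
rows=$((587082 - 1))     # wc -l of the csv minus the column-heading line

total=$((unfetched + fetched + gone + notmodified + redir + unknown))
echo "sum of categories: $total, csv rows: $rows"
```

If the two printed numbers match, every empty page has been assigned to exactly one category.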
     308
     309wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
    282310   3441   21326  579499
    283311
    284     OF WHICH fetched but parseException:
    285         wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "ParseException" | wc
     312    wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc
     313       2065   10325  289719
     314
     315    wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc
     316        150     750   33234
     317
     318    wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc
     319        939    9390  219818
     320[
     321    all status_fetched with failed/exception are parseExceptions:
     322        wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc
    286323            939    9390  219818
    287 
    288     wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "ParseException" | wc
    289         2502   11936  359681
    290 
    291     ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
    292         wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "ParseException|SUCCESS" | wc
    293               0       0       0
    294 
    295 wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
     324]
     325
     326All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages):
     327    wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc
     328        287     861   36728
     329
     330
     331    All status_fetched that are not parseExceptions were SUCCESS:
     332
     333        wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc
     334            2502   11936  359681
     335
     336        ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
     337            wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc
     338                  0       0       0
     339
     340
     341wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
    296342 555167 1117894 60067623
    297343
    298     wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | wc
     344    status_unfetched includes
     345    - EXCEPTIONs like http error code 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway)
     346    IOExceptions like unzipping issues (unzipBestEffort returned null)
     347    Unknown Host Exceptions, SocketTimeoutException, ConnectionException connection refused,
     348    SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
     349    (EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
     350    - (null)
     351
     352   
     353    wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc
     354       1847   11254  381055
     355
     356   
     357
     358status_redir_temp, status_redir_perm
     359    - MOVED
     360    - TEMP_MOVED
     361
     362    wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
     363      10959   32941 1927067
     364    wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
     365       4872   14625  906162
     366    wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc
     367       6087   18316 1020905
     368
     369
     370wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
    299371       5907   17929 1059096
    300     wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep "NOTFOUND" | wc
     372
     373[
     374For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
     375    wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | fgrep -v "NOTFOUND" | less
     376    wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
     377
     378wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc
     379      0       0       0
     380]
     381
     382    wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | fgrep "NOTFOUND" | wc
    301383       3276    9828  695839
    302384
    303 For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
    304     wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep -v "NOTFOUND" | less
    305     wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
    306 
    307 
    308 wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt  | wc
     385    wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep "GONE" | wc
     386        374    1322   93428
     387    wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep "ROBOTS_DENIED" | wc
     388       2253    6759  269069
     389    wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv  | egrep "ACCESS_DENIED" | wc
     390          4      20     760
     391
     392= 5907
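The same kind of cross-check applies to each sub-breakdown recorded so far. A small sketch, using a hypothetical check_breakdown helper (not part of the project) and the counts copied from the wc output above:

```shell
# Hypothetical helper: check that the parts of a breakdown add up to the total.
# Usage: check_breakdown LABEL TOTAL PART...
check_breakdown() {
    label=$1; expected=$2; shift 2
    sum=0
    for part in "$@"; do
        sum=$((sum + part))
    done
    if [ "$sum" -eq "$expected" ]; then
        echo "$label: OK ($sum)"
    else
        echo "$label: MISMATCH (parts sum to $sum, expected $expected)"
        return 1
    fi
}

# status_fetched: success/ok + success/redirect + failed/exception + no parse info
check_breakdown status_fetched 3441 2065 150 939 287
# status_redir: status_redir_temp + status_redir_perm
check_breakdown status_redir 10959 4872 6087
# status_gone: NOTFOUND + GONE + ROBOTS_DENIED + ACCESS_DENIED
check_breakdown status_gone 5907 3276 374 2253 4
```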
     393
     394wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv  | wc
    309395    291     873   51684
    310 wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep "NOTMODIFIED" | wc
     396wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv  | fgrep "NOTMODIFIED" | wc
    311397    291     873   51684
    312398
     
    314400========
    315401
    316 wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | wc
     402wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc
    317403   1376   11001  289780
    318 wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "success/ok" | fgrep "ParseException" | wc
     404wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc
    319405      0       0       0
    320406
    321407
    322 wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | less
    323 wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
     408wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less
     409wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
    324410    437    1611   69962
    325411
     
    328414- "failed/exception" for ParseException
    329415All failed/exception are ParseExceptions:
    330 wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "failed/exception" | fgrep -v "ParseException" | wc
     416wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc
    331417      0       0       0
    332418
    333419ALL THE status_fetched:
    334 wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
     420wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
    335421   3441   21326  579499
    336 wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | wc
     422wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc
    337423   3154   20465  542771
    338 wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
    339 wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
    340 
    341 wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | lesswharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | wc
     424wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
     425wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
     426
     427wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | less
        wharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | wc
    342428    287     861   36728
    343429
    344430(No equivalent info to success/ok, success/redirect, failed/exception)
    345431
     432-----------------------------
     433No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?):
     434    http://m.biblepub.com/bibles/mb/19/81   key:    com.biblepub.m:http/bibles/mb/19/81
     435    baseUrl:        null
     436    status: 2 (status_fetched)
     437    fetchTime:      1573978084279
     438    prevFetchTime:  1571385510616
     439    fetchInterval:  2592000
     440    retriesSinceFetch:      0
     441    modifiedTime:   0
     442    prevModifiedTime:       0
     443    protocolStatus: SUCCESS, args=[]
     444    signature:      3e214d69ab677a676e40c2b91901acc9
     445    parseStatus:    success/ok (1/0), args=[]
     446    title:  Psalm 81 - Maori Bible - Bibles - BiblePub Mobile
     447    score:  1.0
     448    marker _injmrk_ :       y
     449    marker _updmrk_ :       1571386061-31026
     450    marker dist :   0
     451    reprUrl:        null
     452    batchId:        1571386061-31026
     453    metadata CharEncodingForConversion :    utf-8
     454    metadata OriginalCharEncoding :         utf-8
     455    metadata _rs_ :         ^@^@^By
     456    metadata _csh_ :        ^@^@^@^@
     457    text:start:
     458    Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki
     459    te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko
     460    te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i
     461     tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu a
     462    hau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera. 8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi
     463    atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku i
     464    wi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo
     465     taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua n
     466    gohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Si
     467    te © 2013 BiblePub
     468    text:end:
     469
     470    http://m.biblepub.com/bibles/mb/19/82   key:    com.biblepub.m:http/bibles/mb/19/82
     471    baseUrl:        null
     472    status: 1 (status_unfetched)
     473    fetchTime:      1571386117381
     474    prevFetchTime:  0
     475    fetchInterval:  2592000
     476    retriesSinceFetch:      0
     477    modifiedTime:   0
     478    prevModifiedTime:       0
     479    protocolStatus: (null)
     480    parseStatus:    (null)
     481    title:  null
     482    score:  0.0
     483    marker dist :   1
     484    reprUrl:        null
     485    metadata _csh_ :        ^@^@^@^@
     486
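The time fields in the first record above can be sanity-checked with shell arithmetic. This sketch assumes that fetchTime and prevFetchTime are milliseconds since the Unix epoch and that fetchInterval is in seconds. That reading of the units is an assumption; the dump itself does not state them.

```shell
# Values copied from the status_fetched record for .../bibles/mb/19/81.
fetch_time_ms=1573978084279
prev_fetch_time_ms=1571385510616
fetch_interval_s=2592000

# Assumed units: milliseconds for the timestamps, seconds for the interval.
interval_days=$((fetch_interval_s / 86400))
gap_s=$(( (fetch_time_ms - prev_fetch_time_ms) / 1000 ))
gap_days=$((gap_s / 86400))

echo "fetchInterval = $interval_days days"
echo "fetchTime - prevFetchTime = $gap_s s (about $gap_days days)"
```

Under those assumptions the gap between the two timestamps is just over one fetch interval, which would be consistent with fetchTime holding the next scheduled fetch rather than the most recent one.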