Timestamp:
2020-03-10T17:27:07+13:00
Author:
ak19
Message:

Renaming csv file to have csv extension

File:
1 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34001 r34004  
     1https://www.rapidtables.com/tools/pie-chart.html
     2https://www.meta-chart.com/pie#/data
     3
     4"11.5 billion CC URLs"
     538724 CC URLs in "MRI"
     610290 URLs discarded (blacklisted and too little text)
     72751 URLs greylisted
     825683-4 URLs retained = 25679 seed URLs for crawling
     9
     101463 sites prepared for crawling
     111447 sites crawled (16 were autotranslated or otherwise irrelevant)
     121446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
     13619 sites not finished crawling
     141027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
     15
     16
     17119874 crawled web pages in mongodb
     18
     193276 crawled pages with no text content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
     20
     21----------
     22
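The pie chart figures above follow from one another by simple subtraction. A minimal sketch that re-derives them, using only the numbers quoted in these notes (the variable names are illustrative and do not appear in the project):

```shell
# Figures copied from the notes above; variable names are hypothetical.
cc_mri=38724          # CC URLs in "MRI"
discarded=10290       # blacklisted or too little text
greylisted=2751
dropped_later=4       # the "-4" in "25683-4 URLs retained"

retained=$(( cc_mri - discarded - greylisted ))
seed=$(( retained - dropped_later ))

sites_prepared=1463
sites_irrelevant=16   # autotranslated or otherwise irrelevant
sites_crawled=$(( sites_prepared - sites_irrelevant ))
sites_with_dump=$(( sites_crawled - 1 ))   # 1 site was missing dump.txt
sites_with_text=1027
sites_no_text=$(( sites_with_dump - sites_with_text ))

echo "retained=$retained seed=$seed"
echo "crawled=$sites_crawled with_dump=$sites_with_dump no_text=$sites_no_text"
```

This prints retained=25683 seed=25679 and crawled=1447 with_dump=1446 no_text=419, matching the figures listed.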
    123The 12 month period CommonCrawl crawl data that we used:
    224
     
    3961
    4062= 9100 million or 9.1 billion new URLs not contained in any crawl archive before
    41 + taking the first crawl month's figure of 2.8 billion - 500 million new URL in first crawl month = 11.4 billion URLs? At least?
     63+ taking the first crawl month's figure of 2.8 billion - 500 million new URLs in 1st month crawled = 11.4 billion URLs? At least?
    4264---------------------------------------------
    4365
     
    139161d. Of the keepURLs, 4 more webpages ultimately irrelevant sites at unprocessed-topsite-matches.txt.
    140162
    141 3 not in MRI but of the same domain, one is just a gallery of holiday pictures.
     1633 are not in MRI but are of the same domain, one is just a gallery of holiday pictures.
    142164
    143165> less unprocessed-topsite-matches.txt
     
    242264
    243265
     266wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
     267589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.txt
     268
     269- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages
     270
     271
     272
     273================================
     274Inspecting the csv file:
     275
     276wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.txt
     277587082 InfoOnEmptyPagesNotInMongoDB.txt
     278-1 for column headings =
     279587081 empty pages
     280
     281wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
     282   3441   21326  579499
     283
     284    OF WHICH fetched but parseException:
     285        wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "ParseException" | wc
     286            939    9390  219818
     287
     288    wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "ParseException" | wc
     289        2502   11936  359681
     290
     291    ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
     292        wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "ParseException|SUCCESS" | wc
     293              0       0       0
     294
     295wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
     296 555167 1117894 60067623
     297
     298    wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt | wc
     299       5907   17929 1059096
     300    wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep "NOTFOUND" | wc
     301       3276    9828  695839
     302
     303For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
     304    wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep -v "NOTFOUND" | less
     305    wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.txt  | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
     306
     307
     308wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt  | wc
     309    291     873   51684
     310wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.txt  | fgrep "NOTMODIFIED" | wc
     311    291     873   51684
     312
     313
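The per-status counts above were gathered with one fgrep | wc run per status. As a sketch only, the same tally could be produced in a single pass. This assumes every data row contains exactly one token of the form status_<name> and that the first row holds the column headings; the column layout of the file is not shown in these notes, so the token is located by pattern rather than by field position. The function name tally_statuses is hypothetical.

```shell
# tally_statuses FILE
# Count rows per status_* token, skipping the column-heading row.
# Assumption: each data row contains exactly one status_<name> token.
tally_statuses() {
    awk 'NR > 1 && match($0, /status_[a-z_]+/) {
             count[substr($0, RSTART, RLENGTH)]++
         }
         END { for (s in count) print count[s], s }' "$1" | sort -rn
}

# Usage (file name as in the transcript above):
#   tally_statuses InfoOnEmptyPagesNotInMongoDB.txt
```

Unlike fgrep, this counts only the first status_* token on a row, so the totals would differ from the fgrep counts if a row happened to contain such a string twice, for example inside a URL.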
     314========
     315
     316wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | wc
     317   1376   11001  289780
     318wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "success/ok" | fgrep "ParseException" | wc
     319      0       0       0
     320
     321
     322wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | less
     323wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
     324    437    1611   69962
     325
     326- "success/ok"
     327- "success/redirect"
     328- "failed/exception" for ParseException
     329All failed/exception are ParseExceptions:
     330wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | fgrep "failed/exception" | fgrep -v "ParseException" | wc
     331      0       0       0
     332
     333ALL THE status_fetched:
     334wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | wc
     335   3441   21326  579499
     336wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | wc
     337   3154   20465  542771
     338wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
     339wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.txt | less
     340
     341wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | less
        wharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.txt | egrep -v "success/redirect|success/ok|failed/exception" | wc
     342    287     861   36728
     343
     344(No equivalent info to success/ok, success/redirect, failed/exception)
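The status_fetched counts in this section can be cross-checked against each other. The sketch below uses only the line counts reported by wc above; the variable names and the check helper are illustrative. Note that the 3154 figure came from an egrep over the whole file rather than over status_fetched rows only, so the last check passing suggests, but does not prove, that those three outcome strings occur only on status_fetched rows.

```shell
# Line counts copied from the wc output above; names are hypothetical.
fetched=3441            # all status_fetched rows
parse_exception=939     # status_fetched with ParseException
not_parse_exception=2502
not_ok=1376             # status_fetched without "success/ok"
not_ok_not_parse=437    # neither "success/ok" nor ParseException
known_outcome=3154      # success/redirect, success/ok or failed/exception (whole file)
no_outcome=287          # status_fetched rows with none of those three strings

# check ACTUAL EXPECTED LABEL: report whether two integers agree.
check() {
    if [ "$1" -eq "$2" ]; then
        echo "ok: $3"
    else
        echo "MISMATCH: $3 ($1 != $2)"
    fi
}

check $(( parse_exception + not_parse_exception )) "$fetched" "ParseException split"
check $(( not_ok - parse_exception )) "$not_ok_not_parse" "not success/ok, minus ParseException"
check $(( fetched - no_outcome )) "$known_outcome" "rows with a known outcome string"
```

All three checks report ok with the figures recorded here.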
     345