Changeset 34007 for other-projects


Timestamp:
2020-03-10T19:56:01+13:00 (4 years ago)
Author:
ak19
Message:

Prepared more data for the piecharts. This time for empty web pages vs non-empty web pages that were crawled. Piecharts for these.

Location:
other-projects/maori-lang-detection/mongodb-data
Files:
5 added
2 edited

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34006 r34007  
      348 348      SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
      349 349      (EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
    - 350          - (null)
    +     350      - (null): 553320 URLs - all status_unfetched without EXCEPTION
      351 351
      352 352

      354 354         1847   11254  381055
      355 355
    - 356
    +     356
      357 357
      358 358  status_redir_temp, status_redir_perm

      360 360      - TEMP_MOVED
      361 361
    +     362      TOTAL:
      362 363  wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
      363 364        10959   32941 1927067
    +     365
      364 366  wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
      365 367         4872   14625  906162
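The `fgrep ... | wc` counts in the notes above can also be reproduced without the shell. The following is a minimal sketch, assuming the CSV is plain text with one record per line; the function name `count_matching_lines` is hypothetical and not part of the repository. Subtracting the `status_redir_temp` count (4872) from the `status_redir` count (10959) gives 6087 for the permanent redirects.

```python
from pathlib import Path


def count_matching_lines(path, needle):
    """Count lines containing `needle` as a fixed string.

    Roughly equivalent to `fgrep needle path | wc -l`
    (hypothetical helper, for illustration only).
    """
    with Path(path).open(encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if needle in line)


if __name__ == "__main__":
    # File name taken from the notes; it must exist in the working directory.
    csv_path = "InfoOnEmptyPagesNotInMongoDB.csv"
    if Path(csv_path).exists():
        total = count_matching_lines(csv_path, "status_redir")
        temp = count_matching_lines(csv_path, "status_redir_temp")
        # Assumes every status_redir line is either _temp or _perm.
        print(f"status_redir: {total}, temp: {temp}, perm: {total - temp}")
```

Note that `"status_redir"` also matches `status_redir_temp` and `status_redir_perm` lines, which is why the first count is the combined total.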
  • other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt

    r34006 r34007  
       10  10  https://www.meta-chart.com/pie#/data
       11  11
    -  12      Number of slices -> 4
    -  13      Series Unit: URLs
    +      12  * Select "Number of slices"
    +      13  * Number of Slices: 4
    +      14  * Series Unit: URLs
       14  15
    -  15      Slice 1: discarded (red) 10290
    -  16      Slice 2: greyListed (grey) 2751
    -  17      Slice 3: further pruned away (yellow) 4
    -  18      Slice 4: final crawl seeds (green) 25679
    +      16  * Slice 1: discarded (red) 10290
    +      17  * Slice 2: greyListed (grey) 2751
    +      18  * Slice 3: further pruned away (yellow) 4
    +      19  * Slice 4: final crawl seeds (green) 25679
       19  20
       20  21  https://www.meta-chart.com/pie#/labels
    -  21      Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
    -  22      Slice Display data label display setting: Name, Value and Percent
    +      22  * Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
    +      23  * Slice Display data label display setting: Name, Value and Percent
       23  24
       24  25  https://www.meta-chart.com/pie#/display
    -  25      Export as SVG and PNG
    +      26  Export as both SVG and PNG
       26  27  Leave Sort setting at bottom to "ORIG (default)"
    +      28
    +      29  ======================================================================================================
    +      30
    +      31  1463 sites to crawl, 16 left out, 1 failed to produce output
    +      32  619 out of remaining 1446 sites not crawled to completion at depth=10
    +      33
    +      34
    +      35  Non-empty crawled web pages stored in MongoDB vs empty crawled web pages
    +      36
    +      37  119874 non-empty crawled pages stored in MongoDB
    +      38  587081 crawled pages left out of DB for being empty:
    +      39
    +      40      status_fetched:
    +      41      2502 empty pages fetched_SUCCESS
    +      42      939 empty pages fetched_failed_parseException
    +      43
    +      44      status_unfetched:
    +      45      1847 empty pages unfetched_due_to_EXCEPTION
    +      46      553320 empty pages unfetched_unknown_cause
    +      47
    +      48      status_redir_(perm/temp):
    +      49      6087 empty pages permanently_moved
    +      50      4872 empty pages temporarily_moved
    +      51
    +      52      status_gone:
    +      53      3276 empty pages gone_NOTFOUND
    +      54      374 empty pages gone_GONE
    +      55      2253 empty pages gone_ROBOTS_DENIED
    +      56      4 empty pages gone_ACCESS_DENIED
    +      57
    +      58      status_notmodified:
    +      59      291 empty pages notmodified
    +      60
    +      61      ?status (null):
    +      62      11316 empty pages UNKNOWN cause
    +      63
    +      64  = 587081 empty pages.
    +      65
    +      66
    +      67  https://www.meta-chart.com/pie#/
    +      68  Graph title:
    +      69  * 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
    +      70  OR:
    +      71  * Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
    +      72
    +      73  13 SLICES:
    +      74  01. 119874 non-empty pages in MongoDB (green)
    +      75  02. 2502 empty pages fetched_SUCCESS (orange)
    +      76  03. 939 empty pages fetched failed_parseException (pink)
    +      77  04. 1847 empty pages unfetched due to Exception (magenta)
    +      78  05. 553320 empty pages unfetched unknown cause (red)
    +      79  06. 6087 empty pages permanently moved (yellow-orange)
    +      80  07. 4872 empty pages temporarily moved (brown)
    +      81  08. 3276 empty pages gone NOTFOUND (light blue)
    +      82  09. 374 empty pages gone GONE (Dark blue)
    +      83  10. 2253 empty pages gone ROBOTS_DENIED (Dark purple)
    +      84  11. 4 empty pages gone ACCESS_DENIED (violet)
    +      85  12. 291 empty pages notmodified (yellow)
    +      86  13. 11316 empty pages due to UNKNOWN cause (grey)
    +      87
    +      88
    +      89
    +      90  Graph title:
    +      91  * 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
    +      92  OR:
    +      93  * Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
    +      94
    +      95  9 SLICES:
    +      96  01. 119874 non-empty pages in MongoDB (green)
    +      97  02. 555167 empty status_unfetched
    +      98    a. 553320 empty pages unfetched unknown cause
    +      99    b. 1847 empty pages unfetched due to Exception
    +     100  03. 3441 empty status_fetched
    +     101    a. 2502 empty pages fetched_SUCCESS
    +     102    b. 939 empty pages fetched failed_parseException
    +     103  04. 5907 empty status_gone
    +     104  05. 291 empty status_notmodified
    +     105  06. 10959 empty status_redir
    +     106  07. 11316 empty status unknown
    +     107
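The grouped slice values in the notes above (555167, 3441, 5907, 10959) and the 587081 total are sums of the per-status counts. The following is a minimal sketch that recomputes them and prints each slice's share, assuming the counts are copied correctly from piechart_data2.txt; the names `EMPTY_BY_STATUS`, `group_totals` and `percentages` are hypothetical and not part of the repository.

```python
# Counts copied from the notes (piechart_data2.txt, r34007).
NON_EMPTY = 119874
EMPTY_BY_STATUS = {
    "status_unfetched": {"unknown cause": 553320, "EXCEPTION": 1847},
    "status_fetched": {"SUCCESS": 2502, "failed_parseException": 939},
    "status_gone": {
        "NOTFOUND": 3276,
        "GONE": 374,
        "ROBOTS_DENIED": 2253,
        "ACCESS_DENIED": 4,
    },
    "status_notmodified": {"notmodified": 291},
    "status_redir": {"permanently_moved": 6087, "temporarily_moved": 4872},
    "status unknown (null)": {"UNKNOWN cause": 11316},
}


def group_totals(groups):
    """Sum the sub-counts of each status group."""
    return {name: sum(parts.values()) for name, parts in groups.items()}


def percentages(slices):
    """Return each slice's share of the overall total, in percent."""
    total = sum(slices.values())
    return {name: 100.0 * value / total for name, value in slices.items()}


totals = group_totals(EMPTY_BY_STATUS)
empty_total = sum(totals.values())

# Slices for a grouped chart: non-empty pages plus one slice per status group.
slices = {"non-empty in MongoDB": NON_EMPTY, **totals}
for name, pct in percentages(slices).items():
    print(f"{name}: {slices[name]} URLs ({pct:.2f}%)")
print(f"empty pages in total: {empty_total}")
```

Keeping the sub-counts nested under their status group means the same data can feed either the 13-slice or the grouped chart without retyping numbers.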