Changeset 34007

Timestamp:
10.03.2020 19:56:01
Author:
ak19
Message:

Prepared more data for the piecharts, this time for empty vs non-empty web pages that were crawled. Added piecharts for these.

Location:
other-projects/maori-lang-detection/mongodb-data
Files:
5 added
2 modified

  • other-projects/maori-lang-detection/mongodb-data/piechart_data.txt

    r34006 r34007
       348    348      SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
       349    349      (EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
       350             - (null)
              350      - (null): 553320 URLs - all status_unfetched without EXCEPTION
       351    351
       352    352

       354    354         1847   11254  381055
       355    355
       356
              356
       357    357
       358    358  status_redir_temp, status_redir_perm

       360    360      - TEMP_MOVED
       361    361
              362      TOTAL:
       362    363      wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
       363    364        10959   32941 1927067
              365
       364    366      wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
       365    367         4872   14625  906162
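The two `fgrep ... | wc` runs above count lines by fixed-string match. A minimal Python sketch of the same count follows, together with the subtraction that gives the permanently-moved figure. It assumes every line matching "status_redir" carries exactly one of status_redir_temp or status_redir_perm; the function name `count_lines_containing` and the sample rows are hypothetical and are not taken from InfoOnEmptyPagesNotInMongoDB.csv.

```python
from typing import Iterable


def count_lines_containing(lines: Iterable[str], needle: str) -> int:
    """Count lines containing `needle` as a fixed string (like `fgrep needle | wc -l`)."""
    return sum(1 for line in lines if needle in line)


# Line counts reported by the two `wc` runs in the diff above.
redir_total = 10959  # lines matching "status_redir" (temp and perm together)
redir_temp = 4872    # lines matching "status_redir_temp"

# Assumption: each matching line is either temp or perm, never both.
redir_perm = redir_total - redir_temp

if __name__ == "__main__":
    # Hypothetical rows, only to show how the counting function behaves.
    sample = [
        "http://example.org/a,status_redir_temp",
        "http://example.org/b,status_redir_perm",
        "http://example.org/c,status_gone",
    ]
    print(count_lines_containing(sample, "status_redir"))       # 2
    print(count_lines_containing(sample, "status_redir_temp"))  # 1
    print(redir_perm)                                           # 6087
```

The result, 6087, agrees with the permanently_moved count listed in piechart_data2.txt.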
  • other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt

    r34006 r34007
        10     10  https://www.meta-chart.com/pie#/data
        11     11
        12         Number of slices -> 4
        13         Series Unit: URLs
               12  * Select "Number of slices"
               13  * Number of Slices: 4
               14  * Series Unit: URLs
        14     15
        15         Slice 1: discarded (red) 10290
        16         Slice 2: greyListed (grey) 2751
        17         Slice 3: further pruned away (yellow) 4
        18         Slice 4: final crawl seeds (green) 25679
               16  * Slice 1: discarded (red) 10290
               17  * Slice 2: greyListed (grey) 2751
               18  * Slice 3: further pruned away (yellow) 4
               19  * Slice 4: final crawl seeds (green) 25679
        19     20
        20     21  https://www.meta-chart.com/pie#/labels
        21         Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
        22         Slice Display data label display setting: Name, Value and Percent
               22  * Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
               23  * Slice Display data label display setting: Name, Value and Percent
        23     24
        24     25  https://www.meta-chart.com/pie#/display
        25         Export as SVG and PNG
               26  Export as both SVG and PNG
        26     27  Leave Sort setting at bottom to "ORIG (default)"
               28
               29  ======================================================================================================
               30
               31  1463 sites to crawl, 16 left out, 1 failed to produce output
               32  619 out of remaining 1446 sites not crawled to completion at depth=10
               33
               34
               35  Non-empty crawled web pages stored in MongoDB vs empty crawled web pages
               36
               37  119874 non-empty crawled pages stored in MongoDB
               38  587081 crawled pages left out of DB for being empty:
               39
               40      status_fetched:
               41      2502 empty pages fetched_SUCCESS
               42      939 empty pages fetched_failed_parseException
               43
               44      status_unfetched:
               45      1847 empty pages unfetched_due_to_EXCEPTION
               46      553320 empty pages unfetched_unknown_cause
               47
               48      status_redir_(perm/temp):
               49      6087 empty pages permanently_moved
               50      4872 empty pages temporarily_moved
               51
               52      status_gone:
               53      3276 empty pages gone_NOTFOUND
               54      374 empty pages gone_GONE
               55      2253 empty pages gone_ROBOTS_DENIED
               56      4 empty pages gone_ACCESS_DENIED
               57
               58      status_notmodified:
               59      291 empty pages notmodified
               60
               61      ?status (null):
               62      11316 empty pages UNKNOWN cause
               63
               64  = 587081 empty pages.
               65
               66
               67  https://www.meta-chart.com/pie#/
               68  Graph title:
               69  * 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
               70  OR:
               71  * Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
               72
               73  13 SLICES:
               74  01. 119874 non-empty pages in MongoDB (green)
               75  02. 2502 empty pages fetched_SUCCESS (orange)
               76  03. 939 empty pages fetched failed_parseException (pink)
               77  04. 1847 empty pages unfetched due to Exception (magenta)
               78  05. 553320 empty pages unfetched unknown cause (red)
               79  06. 6087 empty pages permanently moved (yellow-orange)
               80  07. 4872 empty pages temporarily moved (brown)
               81  08. 3276 empty pages gone NOTFOUND (light blue)
               82  09. 374 empty pages gone GONE (Dark blue)
               83  10. 2253 empty pages gone ROBOTS_DENIED (Dark purple)
               84  11. 4 empty pages gone ACCESS_DENIED (violet)
               85  12. 291 empty pages notmodified (yellow)
               86  13. 11316 empty pages due to UNKNOWN cause (grey)
               87
               88
               89
               90  Graph title:
               91  * 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
               92  OR:
               93  * Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
               94
               95  9 SLICES:
               96  01. 119874 non-empty pages in MongoDB (green)
               97  02. 555167 empty status_unfetched
               98    a. 553320 empty pages unfetched unknown cause
               99    b. 1847 empty pages unfetched due to Exception
              100  03. 3441 empty status_fetched
              101    a. 2502 empty pages fetched_SUCCESS
              102    b. 939 empty pages fetched failed_parseException
              103  04. 5907 empty status_gone
              104  05. 291 empty status_notmodified
              105  06. 10959 empty status_redir
              106  07. 11316 empty status unknown
              107
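The category counts added to piechart_data2.txt can be cross-checked against the stated totals. The sketch below hard-codes the counts from the file, confirms that they sum to 587081, derives the grouped totals used for the 9-slice chart, and computes each slice's share of all crawled pages. The dictionary layout, the key names and the function names are illustrative assumptions and are not part of the changeset.

```python
# Counts copied from piechart_data2.txt (r34007), grouped by crawl status.
NON_EMPTY_PAGES = 119874

EMPTY_PAGES = {
    "status_fetched": {"fetched_SUCCESS": 2502, "fetched_failed_parseException": 939},
    "status_unfetched": {"unfetched_due_to_EXCEPTION": 1847, "unfetched_unknown_cause": 553320},
    "status_redir": {"permanently_moved": 6087, "temporarily_moved": 4872},
    "status_gone": {
        "gone_NOTFOUND": 3276,
        "gone_GONE": 374,
        "gone_ROBOTS_DENIED": 2253,
        "gone_ACCESS_DENIED": 4,
    },
    "status_notmodified": {"notmodified": 291},
    "status_unknown": {"UNKNOWN_cause": 11316},
}


def group_totals(groups):
    """Sum the per-category counts inside each status group."""
    return {status: sum(counts.values()) for status, counts in groups.items()}


def percentages(slices):
    """Return each slice's share of the total, in percent."""
    total = sum(slices.values())
    return {name: 100.0 * count / total for name, count in slices.items()}


totals = group_totals(EMPTY_PAGES)
empty_total = sum(totals.values())

if __name__ == "__main__":
    print("empty pages:", empty_total)
    # One slice for the non-empty pages plus one per status group.
    slices = {"non-empty pages in MongoDB": NON_EMPTY_PAGES, **totals}
    for name, pct in percentages(slices).items():
        print(f"{name}: {slices[name]} ({pct:.2f}%)")
```

Keeping the per-category counts nested under their status group means the 13-slice and 9-slice charts can both be derived from one table, so the two sets of figures cannot drift apart.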