Changeset 34007 for other-projects
- Timestamp: 2020-03-10T19:56:01+13:00
- Location: other-projects/maori-lang-detection/mongodb-data
- Files: 5 added, 2 edited
other-projects/maori-lang-detection/mongodb-data/piechart_data.txt
r34006 r34007
 348  348  SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
 349  349  (EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
 350       - (null)
      350  - (null): 553320 URLs - all status_unfetched without EXCEPTION
 351  351
 352  352
   …    …
 354  354  1847 11254 381055
 355  355
 356  356
 357  357
 358  358  status_redir_temp, status_redir_perm
   …    …
 360  360  - TEMP_MOVED
 361  361
      362  TOTAL:
 362  363  wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
 363  364  10959 32941 1927067
      365
 364  366  wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
 365  367  4872 14625 906162
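The `fgrep ... | wc` commands in the diff above count the CSV rows that mention a given status string. A minimal Python sketch of the same fixed-substring line count follows. The function name `count_lines_containing` and the sample rows are hypothetical, since the real column layout of InfoOnEmptyPagesNotInMongoDB.csv is not shown in this changeset.

```python
from typing import Iterable


def count_lines_containing(lines: Iterable[str], needle: str) -> int:
    """Count lines containing needle as a fixed substring (like `fgrep needle | wc -l`)."""
    return sum(1 for line in lines if needle in line)


# Hypothetical sample rows; the real CSV layout is an assumption.
sample = [
    "http://example.org/a,status_redir_temp",
    "http://example.org/b,status_redir_perm",
    "http://example.org/c,status_gone",
]

# "status_redir" matches both the temp and perm rows, as in the first fgrep.
redir_total = count_lines_containing(sample, "status_redir")
# "status_redir_temp" matches only the temporary redirects.
redir_temp = count_lines_containing(sample, "status_redir_temp")
```

Passing an open file handle for the real CSV instead of `sample` would be expected to reproduce the first column of the `wc` output (10959 and 4872), assuming the file is read with the right text encoding.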
other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt
r34006 r34007
  10   10  https://www.meta-chart.com/pie#/data
  11   11
  12       Number of slices -> 4
  13       Series Unit: URLs
       12  * Select "Number of slices"
       13  * Number of Slices: 4
       14  * Series Unit: URLs
  14   15
  15       Slice 1: discarded (red) 10290
  16       Slice 2: greyListed (grey) 2751
  17       Slice 3: further pruned away (yellow) 4
  18       Slice 4: final crawl seeds (green) 25679
       16  * Slice 1: discarded (red) 10290
       17  * Slice 2: greyListed (grey) 2751
       18  * Slice 3: further pruned away (yellow) 4
       19  * Slice 4: final crawl seeds (green) 25679
  19   20
  20   21  https://www.meta-chart.com/pie#/labels
  21       Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
  22       Slice Display data label display setting: Name, Value and Percent
       22  * Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
       23  * Slice Display data label display setting: Name, Value and Percent
  23   24
  24   25  https://www.meta-chart.com/pie#/display
  25       Export as SVG and PNG
       26  Export as both SVG and PNG
  26   27  Leave Sort setting at botton to "ORIG (default)"
       28
       29  ======================================================================================================
       30
       31  1463 sites to crawl, 16 left out, 1 failed to produce output
       32  619 out of remaining 1446 sites not crawled to completion at depth=10
       33
       34
       35  Non-empty crawled web pages stored in MongoDB vs empty crawled web pages
       36
       37  119874 non-empty crawled pages stored in MongoDB
       38  587081 crawled pages left out of DB for being empty:
       39
       40  status_fetched:
       41  2502 empty pages fetched_SUCCESS
       42  939 empty pages fetched_failed_parseException
       43
       44  status_unfetched:
       45  1847 empty pages unfetched_due_to_EXCEPTION
       46  553320 empty pages unfetched_unknown_cause
       47
       48  status_redir_(perm/temp):
       49  6087 empty pages permanently_moved
       50  4872 empty pages temporarily_moved
       51
       52  status_gone:
       53  3276 empty pages gone_NOTFOUND
       54  374 empty pages gone_GONE
       55  2253 empty pages gone_ROBOTS_DENIED
       56  4 empty pages gone_ACCESS_DENIED
       57
       58  status_notmodified:
       59  291 empty pages notmodified
       60
       61  ?status (null):
       62  11316 empty pages UNKNOWN cause
       63
       64  = 587081 empty pages.
       65
       66
       67  https://www.meta-chart.com/pie#/
       68  Graph title:
       69  * 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
       70  OR:
       71  * Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
       72
       73  13 SLICES:
       74  01. 119874 non-empty pages in MongoDB (green)
       75  02. 2502 empty pages fetched_SUCCESS (orange)
       76  03. 939 empty pages fetched failed_parseException (pink)
       77  04. 1847 empty pages unfetched due to Exception (magenta)
       78  05. 553320 empty pages unfetched unknown cause (red)
       79  06. 6087 empty pages permanently moved (yellow-orange)
       80  07. 4872 empty pages temporarily moved (brown)
       81  08. 3276 empty pages gone NOTFOUND (light blue)
       82  09. 374 empty pages gone GONE (Dark blue)
       83  10. 2253 empty pages gone ROBOTS_DENIED (Dark purple)
       84  11. 4 empty pages gone ACCESS_DENIED (violet)
       85  12. 291 empty pages notmodified (yellow)
       86  13. 11316 empty pages due to UNKNOWN cause (grey)
       87
       88
       89
       90  Graph title:
       91  * 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
       92  OR:
       93  * Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty
       94
       95  9 SLICES:
       96  01. 119874 non-empty pages in MongoDB (green)
       97  02. 555167 empty status_unfetched
       98      a. 553320 empty pages unfetched unknown cause
       99      b. 1847 empty pages unfetched due to Exception
      100  03. 3441 empty status_fetched
      101      a. 2502 empty pages fetched_SUCCESS
      102      b. 939 empty pages fetched failed_parseException
      103  04. 5907 empty status_gone
      104  05. 291 empty status_notmodified
      105  06. 10959 empty status_redir
      106  07. 11316 empty status unknown
      107
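The slice counts recorded in piechart_data2.txt can be cross-checked against the totals the file states (38724 URLs, 587081 empty pages, and the grouped per-status figures). The sketch below is only an illustration of that arithmetic: the numbers are copied from the text above, while the dictionary names and the percentage calculation are assumptions and not part of the changeset.

```python
# Counts copied from the piechart_data2.txt text in this changeset.
seed_slices = {
    "discarded": 10290,
    "greyListed": 2751,
    "further pruned away": 4,
    "final crawl seeds": 25679,
}

empty_by_status = {
    "status_fetched": {"fetched_SUCCESS": 2502, "fetched_failed_parseException": 939},
    "status_unfetched": {"unfetched_due_to_EXCEPTION": 1847, "unfetched_unknown_cause": 553320},
    "status_redir": {"permanently_moved": 6087, "temporarily_moved": 4872},
    "status_gone": {
        "gone_NOTFOUND": 3276,
        "gone_GONE": 374,
        "gone_ROBOTS_DENIED": 2253,
        "gone_ACCESS_DENIED": 4,
    },
    "status_notmodified": {"notmodified": 291},
    "status (null)": {"UNKNOWN cause": 11316},
}

non_empty = 119874  # non-empty crawled pages stored in MongoDB

# Per-status subtotals, as used in the grouped slice list.
group_totals = {status: sum(parts.values()) for status, parts in empty_by_status.items()}
total_empty = sum(group_totals.values())
crawled = non_empty + total_empty

# Share of all crawled pages per slice, in percent (illustrative only).
shares = {"non-empty": 100 * non_empty / crawled}
shares.update({status: 100 * n / crawled for status, n in group_totals.items()})

print(sum(seed_slices.values()), total_empty, crawled)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")
```

With these inputs the seed slices sum to 38724 and the empty-page slices to 587081, matching the figures stated in the file, and the per-status subtotals match the grouped list (555167, 3441, 5907, 291, 10959, 11316).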