https://www.rapidtables.com/tools/pie-chart.html

Title: 38724 out of >11.4 billion URLs in 12-month CommonCrawl data had content_language=MRI
data names: discarded_10290 greylisted_2751 pruned_4 crawlSeeds_25679
data values: 10290 2751 4 25679
slice text: (Percentage)

------
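The four data values should add up to the 38724 URLs named in the title, and the percentages the chart tool displays can be cross-checked by hand. A minimal Python sketch of that check follows; the dictionary keys and the `percentages` helper are names chosen here for illustration, not part of either chart tool.

```python
# Slice values copied from the "data values" line above.
# The key names are illustrative labels, not tool settings.
slices = {
    "discarded": 10290,
    "greylisted": 2751,
    "pruned": 4,
    "crawlSeeds": 25679,
}

# Sum of all slices; this should equal the 38724 quoted in the chart title.
total = sum(slices.values())


def percentages(values):
    """Return each slice's share of the whole as a percentage (2 decimals)."""
    whole = sum(values.values())
    return {name: round(100 * count / whole, 2) for name, count in values.items()}


if __name__ == "__main__":
    print(total)
    for name, pct in percentages(slices).items():
        print(f"{name}: {pct}%")
```

Because each percentage is rounded separately, the printed shares may not add up to exactly 100.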

https://www.meta-chart.com/pie#/data

* Select "Number of slices"
* Number of Slices: 4
* Series Unit: URLs

* Slice 1: discarded (red) 10290
* Slice 2: greyListed (grey) 2751
* Slice 3: further pruned away (yellow) 4
* Slice 4: final crawl seeds (green) 25679

https://www.meta-chart.com/pie#/labels
* Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
* Slice Display data label display setting: Name, Value and Percent

https://www.meta-chart.com/pie#/display
Export as both SVG and PNG
Leave the Sort setting at the bottom as "ORIG (default)"

======================================================================================================

1463 sites to crawl, 16 left out, 1 failed to produce output
619 out of remaining 1446 sites not crawled to completion at depth=10


Non-empty crawled web pages stored in MongoDB vs empty crawled web pages

119874 non-empty crawled pages stored in MongoDB
587081 crawled pages left out of DB for being empty:

status_fetched:
    2502 empty pages fetched_SUCCESS
    939 empty pages fetched_failed_parseException

status_unfetched:
    1847 empty pages unfetched_due_to_EXCEPTION
    553320 empty pages unfetched_unknown_cause

status_redir_(perm/temp):
    6087 empty pages permanently_moved
    4872 empty pages temporarily_moved

status_gone:
    3276 empty pages gone_NOTFOUND
    374 empty pages gone_GONE
    2253 empty pages gone_ROBOTS_DENIED
    4 empty pages gone_ACCESS_DENIED

status_notmodified:
    291 empty pages notmodified

?status (null):
    11316 empty pages UNKNOWN cause

= 587081 empty pages.
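The twelve per-status counts above are meant to sum to the 587081 empty pages. A small Python sketch of that check, assuming the counts are transcribed correctly; the dictionary keys are shorthand labels picked for this example.

```python
# Empty-page counts per status, copied from the breakdown above.
# Key names are shorthand for this sketch only.
empty_pages = {
    "fetched_SUCCESS": 2502,
    "fetched_failed_parseException": 939,
    "unfetched_due_to_EXCEPTION": 1847,
    "unfetched_unknown_cause": 553320,
    "permanently_moved": 6087,
    "temporarily_moved": 4872,
    "gone_NOTFOUND": 3276,
    "gone_GONE": 374,
    "gone_ROBOTS_DENIED": 2253,
    "gone_ACCESS_DENIED": 4,
    "notmodified": 291,
    "null_status_UNKNOWN": 11316,
}

NON_EMPTY = 119874  # non-empty pages stored in MongoDB

total_empty = sum(empty_pages.values())
total_crawled = NON_EMPTY + total_empty

# The status that accounts for the most empty pages.
largest = max(empty_pages, key=empty_pages.get)

if __name__ == "__main__":
    print(f"empty pages:   {total_empty}")
    print(f"all pages:     {total_crawled}")
    print(f"largest cause: {largest} ({empty_pages[largest]})")
```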

https://www.meta-chart.com/pie#/
Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

13 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 2502 empty pages fetched_SUCCESS (orange)
03. 939 empty pages fetched failed_parseException (pink)
04. 1847 empty pages unfetched due to Exception (magenta)
05. 553320 empty pages unfetched unknown cause (red)
06. 6087 empty pages permanently moved (yellow-orange)
07. 4872 empty pages temporarily moved (brown)
08. 3276 empty pages gone NOTFOUND (light blue)
09. 374 empty pages gone GONE (dark blue)
10. 2253 empty pages gone ROBOTS_DENIED (dark purple)
11. 4 empty pages gone ACCESS_DENIED (violet)
12. 291 empty pages notmodified (yellow)
13. 11316 empty pages due to UNKNOWN cause (grey)


Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

9 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 555167 empty status_unfetched
    a. 553320 empty pages unfetched unknown cause
    b. 1847 empty pages unfetched due to Exception
03. 3441 empty status_fetched
    a. 2502 empty pages fetched_SUCCESS
    b. 939 empty pages fetched failed_parseException
04. 5907 empty status_gone
05. 291 empty status_notmodified
06. 10959 empty status_redir
07. 11316 empty status unknown
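The grouped figures in this second chart (555167, 3441, 5907, 291, 10959, 11316) are subtotals of the detailed statuses used in the 13-slice version. The roll-up can be sketched in Python as below; the `roll_up` helper and the `status_unknown` group name are assumptions made for this example.

```python
from collections import defaultdict

# (group, detailed status, count) rows taken from the 13-slice breakdown.
# "status_unknown" is a label invented here for the null-status pages.
detailed = [
    ("status_unfetched", "unfetched_unknown_cause", 553320),
    ("status_unfetched", "unfetched_due_to_EXCEPTION", 1847),
    ("status_fetched", "fetched_SUCCESS", 2502),
    ("status_fetched", "fetched_failed_parseException", 939),
    ("status_gone", "gone_NOTFOUND", 3276),
    ("status_gone", "gone_GONE", 374),
    ("status_gone", "gone_ROBOTS_DENIED", 2253),
    ("status_gone", "gone_ACCESS_DENIED", 4),
    ("status_notmodified", "notmodified", 291),
    ("status_redir", "permanently_moved", 6087),
    ("status_redir", "temporarily_moved", 4872),
    ("status_unknown", "null_status_UNKNOWN", 11316),
]


def roll_up(rows):
    """Sum the detailed counts into one subtotal per status group."""
    totals = defaultdict(int)
    for group, _status, count in rows:
        totals[group] += count
    return dict(totals)


if __name__ == "__main__":
    for group, subtotal in roll_up(detailed).items():
        print(f"{subtotal:>7} empty {group}")
```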


============


1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in MongoDB
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content


16 uncrawled irrelevant sites pruned away
1 failed crawl of site (text dump missing)
1446 crawled sites in MongoDB


Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 619 incompletely crawled sites
* 827 completely crawled sites


Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 419 crawled sites with no text content
    - 150 crawled sites with 0-size dump.txt files [crawled sites with empty dump.txt files] See below.
    - 269 crawled sites where dump.txt had no text content
* 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)
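Both breakdowns split the same 1463 prepared sites, so each should add back up to that figure. A short Python sketch of the consistency check; the variable names are illustrative only.

```python
PREPARED = 1463  # sites prepared for crawling
PRUNED = 16      # irrelevant sites never crawled
FAILED = 1       # crawl produced no dump.txt

# Sites that ended up in MongoDB.
in_mongodb = PREPARED - PRUNED - FAILED

# First chart: split by whether the crawl finished.
by_completion = {"incomplete": 619, "complete": 827}

# Second chart: split by whether dump.txt held text content.
by_text = {
    "no text: 0-size dump.txt": 150,
    "no text: dump.txt without text content": 269,
    "text content": 1027,
}

if __name__ == "__main__":
    print(in_mongodb)
    print(sum(by_completion.values()) == in_mongodb)
    print(sum(by_text.values()) == in_mongodb)
```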


# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
150 150 2550
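For anyone who wants the same count without the shell, a rough Python equivalent of the find command is sketched below. The `zero_byte_dumps` function is a name made up for this example; it walks a directory tree and keeps only the regular files called dump.txt whose size is exactly 0 bytes.

```python
from pathlib import Path


def zero_byte_dumps(root):
    """Return the sorted paths of all 0-byte dump.txt files under root."""
    return sorted(
        path
        for path in Path(root).rglob("dump.txt")
        if path.is_file() and path.stat().st_size == 0
    )


if __name__ == "__main__":
    # Run from the "crawled" directory, as with the find command above.
    empty = zero_byte_dumps(".")
    print(len(empty))
```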