https://www.rapidtables.com/tools/pie-chart.html

Title: 38724 out of >11.4 billion URLs in 12-month CommonCrawl data had content_language=MRI
data names: discarded_10290 greylisted_2751 pruned_4 crawlSeeds_25679
data values: 10290 2751 4 25679
slice text: (Percentage)

------

https://www.meta-chart.com/pie#/data
* Select "Number of slices"
* Number of Slices: 4
* Series Unit: URLs
* Slice 1: discarded (red) 10290
* Slice 2: greyListed (grey) 2751
* Slice 3: further pruned away (yellow) 4
* Slice 4: final crawl seeds (green) 25679

https://www.meta-chart.com/pie#/labels
* Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
* Slice Display data label display setting: Name, Value and Percent

https://www.meta-chart.com/pie#/display
Export as both SVG and PNG
Leave Sort setting at bottom to "ORIG (default)"

======================================================================================================

1463 sites to crawl, 16 left out, 1 failed to produce output
619 out of remaining 1446 sites not crawled to completion at depth=10

Non-empty crawled web pages stored in MongoDB vs empty crawled web pages

119874 non-empty crawled pages stored in MongoDB
587081 crawled pages left out of DB for being empty:

status_fetched:
    2502 empty pages fetched_SUCCESS
    939 empty pages fetched_failed_parseException
status_unfetched:
    1847 empty pages unfetched_due_to_EXCEPTION
    553320 empty pages unfetched_unknown_cause
status_redir_(perm/temp):
    6087 empty pages permanently_moved
    4872 empty pages temporarily_moved
status_gone:
    3276 empty pages gone_NOTFOUND
    374 empty pages gone_GONE
    2253 empty pages gone_ROBOTS_DENIED
    4 empty pages gone_ACCESS_DENIED
status_notmodified:
    291 empty pages notmodified
?status (null):
    11316 empty pages UNKNOWN cause
= 587081 empty pages.
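
The totals above can be cross-checked before the numbers are typed into a chart tool. The following is a minimal Python sketch and not part of the original workflow: the names `SEED_SLICES`, `EMPTY_PAGES`, `percentages` and `subtotals` are illustrative assumptions, and the counts are copied from the notes above.

```python
# Counts copied from the notes; all names here are illustrative assumptions.
SEED_SLICES = {
    "discarded": 10290,
    "greylisted": 2751,
    "further pruned away": 4,
    "final crawl seeds": 25679,
}

# Empty crawled pages, grouped by the status headings used in the notes.
EMPTY_PAGES = {
    "status_fetched": {
        "fetched_SUCCESS": 2502,
        "fetched_failed_parseException": 939,
    },
    "status_unfetched": {
        "unfetched_due_to_EXCEPTION": 1847,
        "unfetched_unknown_cause": 553320,
    },
    "status_redir": {
        "permanently_moved": 6087,
        "temporarily_moved": 4872,
    },
    "status_gone": {
        "gone_NOTFOUND": 3276,
        "gone_GONE": 374,
        "gone_ROBOTS_DENIED": 2253,
        "gone_ACCESS_DENIED": 4,
    },
    "status_notmodified": {"notmodified": 291},
    "status_unknown": {"UNKNOWN": 11316},
}


def percentages(slices):
    """Return each slice as a percentage of the total, rounded to 2 places."""
    total = sum(slices.values())
    return {name: round(100.0 * count / total, 2) for name, count in slices.items()}


def subtotals(grouped):
    """Sum the per-cause counts under each status heading."""
    return {status: sum(causes.values()) for status, causes in grouped.items()}


# The four seed slices should add up to the 38724 URLs in the chart title.
assert sum(SEED_SLICES.values()) == 38724

# The per-status subtotals should add up to the 587081 empty pages.
status_totals = subtotals(EMPTY_PAGES)
assert sum(status_totals.values()) == 587081

for name, pct in percentages(SEED_SLICES).items():
    print(f"{name}: {SEED_SLICES[name]} ({pct}%)")
for status, count in status_totals.items():
    print(f"{status}: {count}")
```

If either assertion fails after a re-crawl, a count was probably mistyped or a status category is missing from the breakdown.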
https://www.meta-chart.com/pie#/

Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

13 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 2502 empty pages fetched_SUCCESS (orange)
03. 939 empty pages fetched failed_parseException (pink)
04. 1847 empty pages unfetched due to Exception (magenta)
05. 553320 empty pages unfetched unknown cause (red)
06. 6087 empty pages permanently moved (yellow-orange)
07. 4872 empty pages temporarily moved (brown)
08. 3276 empty pages gone NOTFOUND (light blue)
09. 374 empty pages gone GONE (Dark blue)
10. 2253 empty pages gone ROBOTS_DENIED (Dark purple)
11. 4 empty pages gone ACCESS_DENIED (violet)
12. 291 empty pages notmodified (yellow)
13. 11316 empty pages due to UNKNOWN cause (grey)

Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

9 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 555167 empty status_unfetched
    a. 553320 empty pages unfetched unknown cause
    b. 1847 empty pages unfetched due to Exception
03. 3441 empty status_fetched
    a. 2502 empty pages fetched_SUCCESS
    b. 939 empty pages fetched failed_parseException
04. 5907 empty status_gone
05. 291 empty status_notmodified
06. 10959 empty status_redir
07. 11316 empty status unknown

============

1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt)
    - 1446 sites in mongodb
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content

16 uncrawled irrelevant sites pruned away
1 failed crawl of site (text dump missing)
1446 crawled sites in MongoDB

Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 619 incompletely crawled sites
* 827 completely crawled sites

Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 419 crawled sites with no text content
    - 150 crawled sites with 0-size dump.txt files [crawled sites with empty dump.txt files] See below.
    - 269 crawled sites where dump.txt had no text content
* 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)

# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
    150     150    2550
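
The site breakdown and the zero-byte `dump.txt` search above can also be reproduced in Python. This is a sketch under stated assumptions: `count_zero_byte_dumps` is a hypothetical helper, the constant names are illustrative, and the directory layout (a `dump.txt` somewhere under each crawled site folder) is inferred from the `find` command rather than confirmed.

```python
from pathlib import Path


def count_zero_byte_dumps(root):
    """Count 0-byte dump.txt files under root (hypothetical helper).

    Roughly mirrors: find . -name "dump.txt" -size 0 | wc -l
    """
    return sum(
        1
        for path in Path(root).rglob("dump.txt")
        if path.is_file() and path.stat().st_size == 0
    )


# Site counts copied from the notes above; constant names are illustrative.
PREPARED = 1463
PRUNED = 16
FAILED = 1
INCOMPLETE, COMPLETE = 619, 827
NO_TEXT, WITH_TEXT = 419, 1027
ZERO_SIZE_DUMP, DUMP_WITHOUT_TEXT = 150, 269

# Sites that were crawled and ended up in MongoDB.
crawled_in_db = PREPARED - PRUNED - FAILED
assert crawled_in_db == 1446

# Both chart breakdowns should partition the same 1446 crawled sites.
assert INCOMPLETE + COMPLETE == crawled_in_db
assert NO_TEXT + WITH_TEXT == crawled_in_db

# The no-text sites split into empty dump.txt files and dumps without text.
assert ZERO_SIZE_DUMP + DUMP_WITHOUT_TEXT == NO_TEXT
```

Calling `count_zero_byte_dumps(".")` from the `crawled` directory would be expected to agree with the 150 reported by `find`, provided the directory is in the same state as when these notes were written.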