https://www.rapidtables.com/tools/pie-chart.html

Title: 38724 out of >11.4 billion URLs in 12-month CommonCrawl data had content_language=MRI
data names: discarded_10290 greylisted_2751 pruned_4 crawlSeeds_25679
data values: 10290 2751 4 25679
slice text: (Percentage)


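Quick sanity check of the slice values before pasting them into a chart tool. This is only a sketch: the dictionary keys are labels chosen here for illustration, and the counts are copied from the "data values" line above. It confirms the four slices add up to the 38724 URLs in the title and prints each slice's percentage.

```python
# Slice counts copied from the "data names" / "data values" lines above.
# The key names are illustrative labels, not identifiers from any tool.
slices = {
    "discarded": 10290,
    "greylisted": 2751,
    "pruned": 4,
    "crawlSeeds": 25679,
}

total = sum(slices.values())

# Percentage share of each slice, as a pie chart would display it.
percentages = {name: 100.0 * count / total for name, count in slices.items()}

for name, count in slices.items():
    print(f"{name:12s} {count:6d} {percentages[name]:6.2f}%")
print(f"{'total':12s} {total:6d}")
```
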
------
https://www.meta-chart.com/pie#/data

* Select "Number of slices"
* Number of Slices: 4
* Series Unit: URLs

* Slice 1: discarded (red) 10290
* Slice 2: greyListed (grey) 2751
* Slice 3: further pruned away (yellow) 4
* Slice 4: final crawl seeds (green) 25679

https://www.meta-chart.com/pie#/labels
* Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
* Slice Display data label display setting: Name, Value and Percent

https://www.meta-chart.com/pie#/display
Export as both SVG and PNG
Leave the Sort setting at the bottom at "ORIG (default)"

======================================================================================================

1463 sites to crawl, 16 left out, 1 failed to produce output
619 out of remaining 1446 sites not crawled to completion at depth=10


Non-empty crawled web pages stored in MongoDB vs empty crawled web pages

119874 non-empty crawled pages stored in MongoDB
587081 crawled pages left out of DB for being empty:

status_fetched:
    2502 empty pages fetched_SUCCESS
    939 empty pages fetched_failed_parseException

status_unfetched:
    1847 empty pages unfetched_due_to_EXCEPTION
    553320 empty pages unfetched_unknown_cause

status_redir_(perm/temp):
    6087 empty pages permanently_moved
    4872 empty pages temporarily_moved

status_gone:
    3276 empty pages gone_NOTFOUND
    374 empty pages gone_GONE
    2253 empty pages gone_ROBOTS_DENIED
    4 empty pages gone_ACCESS_DENIED

status_notmodified:
    291 empty pages notmodified

?status (null):
    11316 empty pages UNKNOWN cause

= 587081 empty pages.


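The total above can be re-checked mechanically. A minimal sketch, assuming the per-status counts are exactly as listed. The dictionary keys are shorthand labels made up here, not field names from the crawler or from MongoDB.

```python
# Counts copied from the status breakdown above.
# Keys are illustrative shorthand labels only.
non_empty = 119874
empty_by_status = {
    "fetched_SUCCESS": 2502,
    "fetched_failed_parseException": 939,
    "unfetched_due_to_EXCEPTION": 1847,
    "unfetched_unknown_cause": 553320,
    "permanently_moved": 6087,
    "temporarily_moved": 4872,
    "gone_NOTFOUND": 3276,
    "gone_GONE": 374,
    "gone_ROBOTS_DENIED": 2253,
    "gone_ACCESS_DENIED": 4,
    "notmodified": 291,
    "null_status_UNKNOWN": 11316,
}

empty_total = sum(empty_by_status.values())
all_pages = non_empty + empty_total

# The notes state 587081 empty pages; fail loudly if the parts disagree.
assert empty_total == 587081

print(f"empty pages:       {empty_total}")
print(f"all crawled pages: {all_pages}")
print(f"non-empty share:   {100.0 * non_empty / all_pages:.2f}%")
```
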
https://www.meta-chart.com/pie#/
Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

13 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 2502 empty pages fetched_SUCCESS (orange)
03. 939 empty pages fetched failed_parseException (pink)
04. 1847 empty pages unfetched due to Exception (magenta)
05. 553320 empty pages unfetched unknown cause (red)
06. 6087 empty pages permanently moved (yellow-orange)
07. 4872 empty pages temporarily moved (brown)
08. 3276 empty pages gone NOTFOUND (light blue)
09. 374 empty pages gone GONE (dark blue)
10. 2253 empty pages gone ROBOTS_DENIED (dark purple)
11. 4 empty pages gone ACCESS_DENIED (violet)
12. 291 empty pages notmodified (yellow)
13. 11316 empty pages due to UNKNOWN cause (grey)



Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

9 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 555167 empty status_unfetched
    a. 553320 empty pages unfetched unknown cause
    b. 1847 empty pages unfetched due to Exception
03. 3441 empty status_fetched
    a. 2502 empty pages fetched_SUCCESS
    b. 939 empty pages fetched failed_parseException
04. 5907 empty status_gone
05. 291 empty status_notmodified
06. 10959 empty status_redir
07. 11316 empty status unknown



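The grouped subtotals in the list above come from adding up the fine-grained counts per status family. A small sketch of that roll-up, assuming the individual counts from the earlier breakdown. The group names mirror the labels used in these notes and are not taken from any API.

```python
# Fine-grained empty-page counts, grouped by status family.
# Group names follow the labels used in the notes above.
groups = {
    "status_unfetched": [553320, 1847],
    "status_fetched": [2502, 939],
    "status_gone": [3276, 374, 2253, 4],
    "status_notmodified": [291],
    "status_redir": [6087, 4872],
    "status unknown": [11316],
}

# One subtotal per status family, i.e. one grouped slice each.
subtotals = {name: sum(counts) for name, counts in groups.items()}

for name, subtotal in subtotals.items():
    print(f"{subtotal:7d} empty {name}")
print(f"{sum(subtotals.values()):7d} empty pages in total")
```
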
============


1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in MongoDB
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content



16 uncrawled irrelevant sites pruned away
1 failed crawl of site (text dump missing)
1446 crawled sites in MongoDB


Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 619 incompletely crawled sites
* 827 completely crawled sites


Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 419 crawled sites with no text content
    - 150 crawled sites with 0-size dump.txt files [crawled sites with empty dump.txt files] See below.
    - 269 crawled sites where dump.txt had no text content
* 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)




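Both site breakdowns above should account for the same 1463 prepared sites. A minimal consistency check, assuming the counts as written in these notes. The variable names are invented here for readability.

```python
# Site counts copied from the notes above; variable names are illustrative.
prepared = 1463
pruned_irrelevant = 16
missing_dump = 1
incomplete = 619
complete = 827
with_text = 1027
zero_size_dump = 150
dump_without_text = 269

crawled = prepared - pruned_irrelevant          # sites actually crawled
in_mongodb = crawled - missing_dump             # sites that produced a dump.txt
no_text = zero_size_dump + dump_without_text    # crawled sites without text content

# First chart: completeness of the crawl.
assert pruned_irrelevant + missing_dump + incomplete + complete == prepared
# Second chart: presence of text content.
assert pruned_irrelevant + missing_dump + no_text + with_text == prepared

print(crawled, in_mongodb, no_text)
```
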
# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
150 150 2550
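
For scripting the same count outside the shell, a rough Python equivalent of the find command above. This is a sketch under the assumption that each crawled site has its own subdirectory containing a dump.txt. The function name zero_byte_dumps is hypothetical, not part of the project.

```python
from pathlib import Path


def zero_byte_dumps(root):
    """Return the sorted paths of all dump.txt files under root that are 0 bytes."""
    return sorted(
        path
        for path in Path(root).rglob("dump.txt")
        if path.is_file() and path.stat().st_size == 0
    )
```

Run from the crawled directory, len(zero_byte_dumps(".")) should correspond to the first number that wc reports.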