source: other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt@ 34011

Last change on this file since 34011 was 34011, checked in by ak19, 4 years ago

Piechart data for sites prepared for crawling and the piecharts for these

https://www.rapidtables.com/tools/pie-chart.html

Title: 38724 out of >11.4 billion URLs in 12-month CommonCrawl data had content_language=MRI
data names: discarded_10290 greylisted_2751 pruned_4 crawlSeeds_25679
data values: 10290 2751 4 25679
slice text: (Percentage)


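The percentages that the pie-chart tool displays for these four slices can be cross-checked by hand. The following is a minimal sketch, not part of the original workflow; the dictionary keys and the helper name `slice_percentages` are made up for illustration, while the counts are copied from the "data values" line above.

```python
# Slice counts copied from the "data values" line above.
# The key names are illustrative labels, not identifiers used by the chart tools.
slices = {
    "discarded": 10290,
    "greylisted": 2751,
    "pruned": 4,
    "crawlSeeds": 25679,
}


def slice_percentages(counts):
    """Return each slice's share of the total in percent, rounded to 2 places."""
    total = sum(counts.values())
    return {name: round(100 * value / total, 2) for name, value in counts.items()}


total = sum(slices.values())
percentages = slice_percentages(slices)

print(f"total URLs: {total}")
for name, pct in percentages.items():
    print(f"{name}: {pct}%")
```

The total should match the 38724 URLs named in the chart title. Because each share is rounded separately, the rounded percentages need not add up to exactly 100.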
------
https://www.meta-chart.com/pie#/data

* Select "Number of slices"
* Number of Slices: 4
* Series Unit: URLs

* Slice 1: discarded (red) 10290
* Slice 2: greyListed (grey) 2751
* Slice 3: further pruned away (yellow) 4
* Slice 4: final crawl seeds (green) 25679

https://www.meta-chart.com/pie#/labels
* Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
* Slice Display data label display setting: Name, Value and Percent

https://www.meta-chart.com/pie#/display
Export as both SVG and PNG
Leave the Sort setting at the bottom on "ORIG (default)"

======================================================================================================

1463 sites to crawl, 16 left out, 1 failed to produce output
619 out of the remaining 1446 sites not crawled to completion at depth=10


Non-empty crawled web pages stored in MongoDB vs empty crawled web pages

119874 non-empty crawled pages stored in MongoDB
587081 crawled pages left out of DB for being empty:

  status_fetched:
    2502 empty pages fetched_SUCCESS
    939 empty pages fetched_failed_parseException

  status_unfetched:
    1847 empty pages unfetched_due_to_EXCEPTION
    553320 empty pages unfetched_unknown_cause

  status_redir_(perm/temp):
    6087 empty pages permanently_moved
    4872 empty pages temporarily_moved

  status_gone:
    3276 empty pages gone_NOTFOUND
    374 empty pages gone_GONE
    2253 empty pages gone_ROBOTS_DENIED
    4 empty pages gone_ACCESS_DENIED

  status_notmodified:
    291 empty pages notmodified

  ?status (null):
    11316 empty pages UNKNOWN cause

= 587081 empty pages.


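The per-status counts above are meant to add up to the 587081 empty pages. A short sketch of that check follows; the dictionary keys are shorthand labels chosen for this example (the last one, `status_null_UNKNOWN`, is invented here for the "?status (null)" row), and the counts are copied from the listing.

```python
# Non-empty page count and per-status empty page counts, copied from the listing above.
non_empty_pages = 119874
empty_pages_by_status = {
    "fetched_SUCCESS": 2502,
    "fetched_failed_parseException": 939,
    "unfetched_due_to_EXCEPTION": 1847,
    "unfetched_unknown_cause": 553320,
    "permanently_moved": 6087,
    "temporarily_moved": 4872,
    "gone_NOTFOUND": 3276,
    "gone_GONE": 374,
    "gone_ROBOTS_DENIED": 2253,
    "gone_ACCESS_DENIED": 4,
    "notmodified": 291,
    "status_null_UNKNOWN": 11316,  # hypothetical label for the "?status (null)" row
}

empty_total = sum(empty_pages_by_status.values())
crawled_total = non_empty_pages + empty_total

# The category that accounts for most of the empty pages.
largest_status = max(empty_pages_by_status, key=empty_pages_by_status.get)

print(f"empty pages: {empty_total}")
print(f"all crawled pages: {crawled_total}")
print(f"largest empty category: {largest_status}")
```

Together with the non-empty pages, the twelve empty categories give the 13 slices of the first pie chart.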
https://www.meta-chart.com/pie#/
Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

13 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 2502 empty pages fetched_SUCCESS (orange)
03. 939 empty pages fetched failed_parseException (pink)
04. 1847 empty pages unfetched due to Exception (magenta)
05. 553320 empty pages unfetched unknown cause (red)
06. 6087 empty pages permanently moved (yellow-orange)
07. 4872 empty pages temporarily moved (brown)
08. 3276 empty pages gone NOTFOUND (light blue)
09. 374 empty pages gone GONE (Dark blue)
10. 2253 empty pages gone ROBOTS_DENIED (Dark purple)
11. 4 empty pages gone ACCESS_DENIED (violet)
12. 291 empty pages notmodified (yellow)
13. 11316 empty pages due to UNKNOWN cause (grey)



Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

9 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 555167 empty status_unfetched
    a. 553320 empty pages unfetched unknown cause
    b. 1847 empty pages unfetched due to Exception
03. 3441 empty status_fetched
    a. 2502 empty pages fetched_SUCCESS
    b. 939 empty pages fetched failed_parseException
04. 5907 empty status_gone
05. 291 empty status_notmodified
06. 10959 empty status_redir
07. 11316 empty status unknown



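The 9-slice variant keeps the unfetched and fetched sub-categories as separate slices (02a, 02b, 03a, 03b) and rolls every other status group up into a single slice. The sketch below shows one way to derive those slices from the detailed counts; the tuple layout, the `KEEP_SPLIT` set and the group names are assumptions made for this example rather than anything taken from the crawl tooling.

```python
# (status group, detail label, count), with counts copied from the 13-slice breakdown.
detail = [
    ("status_unfetched", "unfetched unknown cause", 553320),
    ("status_unfetched", "unfetched due to Exception", 1847),
    ("status_fetched", "fetched_SUCCESS", 2502),
    ("status_fetched", "fetched failed_parseException", 939),
    ("status_gone", "gone NOTFOUND", 3276),
    ("status_gone", "gone GONE", 374),
    ("status_gone", "gone ROBOTS_DENIED", 2253),
    ("status_gone", "gone ACCESS_DENIED", 4),
    ("status_notmodified", "notmodified", 291),
    ("status_redir", "permanently moved", 6087),
    ("status_redir", "temporarily moved", 4872),
    ("status_unknown", "UNKNOWN cause", 11316),
]

# Groups whose sub-categories stay visible as separate slices (hypothetical name).
KEEP_SPLIT = {"status_unfetched", "status_fetched"}

# Total per status group.
group_totals = {}
for group, _label, count in detail:
    group_totals[group] = group_totals.get(group, 0) + count

# Build the slice list: split groups contribute one slice per detail row,
# every other group contributes one slice carrying the group total.
slices = [("non-empty pages in MongoDB", 119874)]
already_added = set()
for group, label, count in detail:
    if group in KEEP_SPLIT:
        slices.append((label, count))
    elif group not in already_added:
        already_added.add(group)
        slices.append((group, group_totals[group]))

for name, count in slices:
    print(f"{count:>7} {name}")
```

The group totals computed this way should agree with the 555167, 3441, 5907, 291, 10959 and 11316 figures listed above.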
============


1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content



16 uncrawled irrelevant sites pruned away
1 failed crawl of site (text dump missing)
1446 crawled sites in MongoDB


Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 619 incompletely crawled sites
* 827 completely crawled sites


Graph title: Breakdown of the 1463 sites prepared for crawling
* 16 uncrawled irrelevant sites pruned away
* 1 site failed to properly crawl (text dump missing)
* 419 crawled sites with no text content
   - 150 crawled sites with 0-size dump.txt files [crawled sites with empty dump.txt files] See below.
   - 269 crawled sites where dump.txt had no text content
* 1027 crawled sites with text content (WebPages collection in MongoDB will have webpage documents for these sites)




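Both site breakdowns follow from the same handful of counts by subtraction. The arithmetic can be spelled out as below; the variable names are invented for this sketch, and the input numbers are the ones given in the notes above.

```python
# Counts stated in the notes above.
sites_prepared = 1463
sites_pruned = 16            # autotranslated or otherwise irrelevant, never crawled
sites_failed = 1             # crawl produced no dump.txt
sites_incomplete = 619       # not crawled to completion
sites_with_text = 1027       # dump.txt contained text content
sites_zero_size_dump = 150   # dump.txt was 0 bytes
sites_dump_no_text = 269     # dump.txt non-empty but without text content

# Derived figures.
sites_crawled = sites_prepared - sites_pruned          # sites actually crawled
sites_in_mongodb = sites_crawled - sites_failed        # sites with a dump.txt
sites_complete = sites_in_mongodb - sites_incomplete   # crawled to completion
sites_no_text = sites_in_mongodb - sites_with_text     # no text content at all

print(sites_crawled, sites_in_mongodb, sites_complete, sites_no_text)

# Each chart's slices must add back up to the number of sites prepared.
chart_by_completion = [sites_pruned, sites_failed, sites_incomplete, sites_complete]
chart_by_content = [sites_pruned, sites_failed, sites_no_text, sites_with_text]
assert sum(chart_by_completion) == sites_prepared
assert sum(chart_by_content) == sites_prepared
assert sites_zero_size_dump + sites_dump_no_text == sites_no_text
```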
# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
    150     150    2550
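
The first column of the `wc` output is the line count, i.e. the 150 zero-byte dump.txt files. For reference, a rough Python equivalent of the `find` command is sketched here; the function name `find_empty_dumps` is made up for this example, and it assumes it is pointed at the same `crawled` directory.

```python
from pathlib import Path


def find_empty_dumps(root, name="dump.txt"):
    """Return the sorted paths of all files called `name` under `root` that are 0 bytes."""
    return sorted(
        path
        for path in Path(root).rglob(name)
        if path.is_file() and path.stat().st_size == 0
    )
```

Calling `len(find_empty_dumps("."))` from inside the `crawled` directory should give the same count as the first column of the `wc` output.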