source: other-projects/maori-lang-detection/mongodb-data/piechart_data2.txt@ 34007

Last change on this file since 34007 was 34007, checked in by ak19, 4 years ago

Prepared more data for the piecharts. This time for empty web pages vs non-empty web pages that were crawled. Piecharts for these.

https://www.rapidtables.com/tools/pie-chart.html

Title: 38724 out of >11.4 billion URLs in 12-month CommonCrawl data had content_language=MRI
data names: discarded_10290 greylisted_2751 pruned_4 crawlSeeds_25679
data values: 10290 2751 4 25679
slice text: (Percentage)
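
The "(Percentage)" slice text can be cross-checked offline. This is only a minimal Python sketch, assuming the four data values above; the variable names are illustrative and not part of either charting tool.

```python
# Slice names and values copied from the "data names" / "data values" lines.
slices = {
    "discarded": 10290,
    "greylisted": 2751,
    "pruned": 4,
    "crawlSeeds": 25679,
}

# The slices should add up to the URL count quoted in the chart title.
total = sum(slices.values())
assert total == 38724

# Percentage share of each slice, rounded to two decimal places.
percentages = {name: round(100 * value / total, 2) for name, value in slices.items()}

for name, pct in percentages.items():
    print(f"{name}: {slices[name]} URLs ({pct}%)")
```

If the assertion fails after editing the values, the chart title and the slice data no longer agree.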


------
https://www.meta-chart.com/pie#/data

* Select "Number of slices"
* Number of Slices: 4
* Series Unit: URLs

* Slice 1: discarded (red) 10290
* Slice 2: greyListed (grey) 2751
* Slice 3: further pruned away (yellow) 4
* Slice 4: final crawl seeds (green) 25679

https://www.meta-chart.com/pie#/labels
* Graph title: Processing the 38724 out of >11.4 billion URLs in the 12-month CommonCrawl data which had content_language=MRI
* Slice Display data label display setting: Name, Value and Percent

https://www.meta-chart.com/pie#/display
Export as both SVG and PNG
Leave the Sort setting at the bottom at "ORIG (default)"

======================================================================================================

1463 sites to crawl, 16 left out, 1 failed to produce output
619 out of remaining 1446 sites not crawled to completion at depth=10


Non-empty crawled web pages stored in MongoDB vs empty crawled web pages

119874 non-empty crawled pages stored in MongoDB
587081 crawled pages left out of DB for being empty:

  status_fetched:
    2502 empty pages fetched_SUCCESS
    939 empty pages fetched_failed_parseException

  status_unfetched:
    1847 empty pages unfetched_due_to_EXCEPTION
    553320 empty pages unfetched_unknown_cause

  status_redir_(perm/temp):
    6087 empty pages permanently_moved
    4872 empty pages temporarily_moved

  status_gone:
    3276 empty pages gone_NOTFOUND
    374 empty pages gone_GONE
    2253 empty pages gone_ROBOTS_DENIED
    4 empty pages gone_ACCESS_DENIED

  status_notmodified:
    291 empty pages notmodified

  ?status (null):
    11316 empty pages UNKNOWN cause

= 587081 empty pages.
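
The per-status subtotals and the 587081 grand total can be re-derived from the breakdown above. This is a rough Python sketch, assuming the counts are as listed; the dictionary layout and key names are illustrative only.

```python
# Empty-page counts grouped by status, copied from the breakdown above.
empty_pages = {
    "status_fetched": {
        "fetched_SUCCESS": 2502,
        "fetched_failed_parseException": 939,
    },
    "status_unfetched": {
        "unfetched_due_to_EXCEPTION": 1847,
        "unfetched_unknown_cause": 553320,
    },
    "status_redir": {
        "permanently_moved": 6087,
        "temporarily_moved": 4872,
    },
    "status_gone": {
        "gone_NOTFOUND": 3276,
        "gone_GONE": 374,
        "gone_ROBOTS_DENIED": 2253,
        "gone_ACCESS_DENIED": 4,
    },
    "status_notmodified": {"notmodified": 291},
    "status_null": {"UNKNOWN_cause": 11316},
}

# Subtotal per status group, then the grand total of empty pages.
group_totals = {status: sum(causes.values()) for status, causes in empty_pages.items()}
total_empty = sum(group_totals.values())

assert total_empty == 587081  # matches the "= 587081 empty pages." line

for status, subtotal in group_totals.items():
    print(f"{status}: {subtotal}")
```

The subtotals computed this way are the figures to use when several causes are merged into one slice per status.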


https://www.meta-chart.com/pie#/
Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

13 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 2502 empty pages fetched_SUCCESS (orange)
03. 939 empty pages fetched failed_parseException (pink)
04. 1847 empty pages unfetched due to Exception (magenta)
05. 553320 empty pages unfetched unknown cause (red)
06. 6087 empty pages permanently moved (yellow-orange)
07. 4872 empty pages temporarily moved (brown)
08. 3276 empty pages gone NOTFOUND (light blue)
09. 374 empty pages gone GONE (Dark blue)
10. 2253 empty pages gone ROBOTS_DENIED (Dark purple)
11. 4 empty pages gone ACCESS_DENIED (violet)
12. 291 empty pages notmodified (yellow)
13. 11316 empty pages due to UNKNOWN cause (grey)



Graph title:
* 119874 non-empty crawled web pages stored in MongoDB vs 587081 empty crawled web pages
OR:
* Crawled web pages: 119874 non-empty stored in MongoDB vs 587081 empty

9 SLICES:
01. 119874 non-empty pages in MongoDB (green)
02. 555167 empty status_unfetched
    a. 553320 empty pages unfetched unknown cause
    b. 1847 empty pages unfetched due to Exception
03. 3441 empty status_fetched
    a. 2502 empty pages fetched_SUCCESS
    b. 939 empty pages fetched failed_parseException
04. 5907 empty status_gone
05. 291 empty status_notmodified
06. 10959 empty status_redir
07. 11316 empty status unknown
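
The 9-slice layout appears to come from splitting items 02 and 03 into their a/b parts and keeping the other statuses merged. A small Python sketch under that assumption (the labels and variable names are illustrative) checks the slice count and the overall page total:

```python
# One (label, count) pair per slice: items 02 and 03 are split into a/b,
# the remaining statuses stay merged. Reading the list this way is an assumption.
slices = [
    ("non-empty pages in MongoDB", 119874),
    ("unfetched unknown cause", 553320),
    ("unfetched due to Exception", 1847),
    ("fetched_SUCCESS", 2502),
    ("fetched failed_parseException", 939),
    ("status_gone", 5907),
    ("status_notmodified", 291),
    ("status_redir", 10959),
    ("status unknown", 11316),
]

assert len(slices) == 9

# All crawled pages: non-empty plus empty.
total_pages = sum(count for _, count in slices)
assert total_pages == 119874 + 587081

# Share of each slice as a percentage of all crawled pages.
shares = [(label, count, 100 * count / total_pages) for label, count in slices]

for label, count, pct in shares:
    print(f"{label}: {count} ({pct:.2f}%)")
```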