source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 34001

Last change on this file since 34001 was 34001, checked in by ak19, 4 years ago

Tentative total URLs from the CommonCrawl 12-month crawl data.

The 12-month period of CommonCrawl crawl data that we used:

https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before

= 9100 million, i.e. 9.1 billion, new URLs not contained in any crawl archive before
+ the first crawl month's total of 2.8 billion minus its 500 million new URLs (2.3 billion pre-existing URLs) = at least 11.4 billion URLs?
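A quick arithmetic cross-check of the totals above (monthly new-URL figures transcribed from the announcements, in millions):

```python
# Cross-check of the monthly CommonCrawl figures listed above.
# "New URLs" per month, in millions, Sept 2018 - Aug 2019:
new_urls_millions = [500, 600, 640, 735, 850, 750, 660, 750, 825, 880, 810, 1100]

total_new = sum(new_urls_millions)  # 9100 million = 9.1 billion

# The first month's crawl also contained URLs that were not new:
# its 2.8 billion pages minus its 500 million new URLs.
first_month_old = 2800 - 500  # 2300 million

lower_bound = total_new + first_month_old  # 11400 million = 11.4 billion
print(total_new, lower_bound)  # 9100 11400
```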
---------------------------------------------

"UPPER BOUND"

blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted

Not included: for sites otherwise too big to crawl exhaustively, only the areas of interest were crawled, not the rest. For example, not all of wikipedia, only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by the common crawl results for MRI; not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.

1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
 Count of unique domains: 3074
 Count of unique basic domains (stripped of protocol and www): 2791
 Line count: 75559
 Actual unique URL count: 38717
 Unique basic URL count (stripped of protocol and www): 32827
******************************************************

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

Line count above is correct and consistent with the following: 23794+4485+47280=75559

But note that the per-file sums differ from the union (unique) counts above. Summing the three files' domain/unique-domain and URL/unique-URL counts instead gives:
- domains, summed across the three files: 1588+288+1462 = 3338
- unique basic domains (stripped of protocol and www), summed: 1415+277+1362 = 3054
- basic URL count = 10290 + 2751 + 25683 = 38724
- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
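The union counts above (e.g. 3074 unique domains) come out smaller than the per-file sums (3338) because a domain occurring in more than one of the three files is counted once in the union but once per file in the sum. A toy illustration with made-up domains, not the real data:

```python
# Why the union count is smaller than the sum of per-file counts:
# items occurring in more than one file are deduplicated by the union
# but counted once per file by the sum. Hypothetical example data.
discard = {"a.nz", "b.nz", "c.nz"}
greylist = {"b.nz", "d.nz"}
keep = {"c.nz", "d.nz", "e.nz"}

sum_of_counts = len(discard) + len(greylist) + len(keep)  # 8
union_count = len(discard | greylist | keep)              # 5 (b.nz, c.nz, d.nz shared)
print(sum_of_counts, union_count)  # 8 5
```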

wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
 Count of unique domains: 1588
 Count of unique basic domains (stripped of protocol and www): 1415
 Line count: 23794
 Actual unique URL count: 10290
 Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
 Count of unique domains: 288
 Count of unique basic domains (stripped of protocol and www): 277
 Line count: 4485
 Actual unique URL count: 2751
 Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
 Count of unique domains: 1464
 Count of unique basic domains (stripped of protocol and www): 1362
 Line count: 47280
 Actual unique URL count: 25683
 Unique basic URL count (stripped of protocol and www): 20451
******************************************************


XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
 Count of unique domains: 1462
 Count of unique basic domains (stripped of protocol and www): 1360
 Line count: 25679
 Actual unique URL count: 25679
 Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.
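A sketch of how the subset claim can be checked; the read_urls helper and the inline toy data are illustrative, assuming the files hold one URL per line:

```python
# Hypothetical check that every seed URL also appears in keepURLs,
# assuming both files contain one URL per line.
def read_urls(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

# seeds = read_urls("seedURLs.txt")
# keeps = read_urls("keepURLs.txt")
# print(seeds <= keeps)  # True if seedURLs really is a subset of keepURLs

# Inline demonstration with toy data instead of the real files:
seeds = {"https://x.nz/1", "https://x.nz/2"}
keeps = seeds | {"https://y.nz/3"}
print(seeds <= keeps)  # True
```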

2a. DISCARDED URLS:
URLs that are blacklisted + pages with too little text content (below an arbitrary minimum threshold)

> wc -l discardURLs.txt
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt

d. Of the keepURLs, 4 more webpages turned out to be from ultimately irrelevant sites, listed in unprocessed-topsite-matches.txt.

3 are from the same domain but not in MRI; one is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
    The following domain with seedURLs are on a major/top 500 site
    for which no allowed URL pattern regex has been specified.
    Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
    http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
    http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
    http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
    https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl

e. After duplicates were further pruned from what remained of keepURLs, these are the seedURLs for Nutch:

wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
 Count of domains: 1462
 Count of unique domains: 1360


But anglican.org was wrongly greylisted and so was added back in
-> 1463 domains.
3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):

wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
 1 1463 10241

(2nd number)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463

/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463

[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]
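As a cross-check of the directory-counting one-liners above (`echo */ | wc`, whose 2nd number is the directory count, and `find -mindepth 1 -maxdepth 1 -type d | wc -l`), a Python sketch; the throwaway temporary tree here stands in for the real sites/ folder:

```python
# Count immediate subdirectories, mirroring:
#   find . -mindepth 1 -maxdepth 1 -type d | wc -l
import os
import tempfile

def count_subdirs(path):
    return sum(1 for entry in os.scandir(path) if entry.is_dir())

# Demonstration on a throwaway directory tree, not the real sites/ folder:
with tempfile.TemporaryDirectory() as root:
    for i in range(5):
        os.mkdir(os.path.join(root, "%05d" % (i + 1)))  # 00001..00005
    open(os.path.join(root, "notes.txt"), "w").close()  # plain files not counted
    print(count_subdirs(root))  # 5
```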

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
 1 1447 10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619


6. Number of sites in MongoDB:
1446

Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)

* 01067 is listed under sites crawled, but not ingested into mongodb.

The siteID ranges of the 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in MongoDB:
99, 88, 97 and 99 respectively,

and 64/64 sites in siteIDs 1400-1463.

=> 1+12+3+1 = 17 of the 1463 sites are not in MongoDB => 1446 sites ingested into MongoDB.
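The 17-site tally can be recomputed from the missing siteID ranges listed above:

```python
# Site IDs noted above as absent from MongoDB:
# 00179, 00485-00495, 00499-00502 and 01067.
missing = {179} | set(range(485, 496)) | set(range(499, 503)) | {1067}

# Missing IDs per hundreds-range: 00100s, 00400s, 00500s, 01000s.
per_range = {h: sum(1 for m in missing if m // 100 == h) for h in (1, 4, 5, 10)}

print(len(missing))         # 17
print(per_range)            # {1: 1, 4: 12, 5: 3, 10: 1} -> 99, 88, 97, 99 present
print(1463 - len(missing))  # 1446 sites ingested
```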


7. Despite the 25679 non-duplicate seedURLs, and many more pages crawled from those seeds,
the number of web pages ingested into MongoDB is less than about 5 times that figure,
because only crawled web pages with non-empty text were ingested into MongoDB.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874
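A quick check of the less-than-five-times figure:

```python
# Ratio of ingested pages to seed URLs: pages with empty text were dropped,
# so the multiplier stays under 5.
pages_in_mongodb = 119874
seed_urls = 25679
print(round(pages_in_mongodb / seed_urls, 2))  # 4.67
```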

---------------------------

# Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
 3276 9828 419259

# Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
 1027 1027 15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
 1027 4108 35945

# Number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
 1446 1446 24582
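The first fgrep pipeline above can be mirrored in Python as a sketch; the count_notfound helper and its glob pattern are illustrative, assuming dump.txt files with one Nutch field per line:

```python
# Count lines mentioning both NOTFOUND and protocolStatus across dump.txt
# files, mirroring: fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc -l
import glob

def count_notfound(pattern="0*/dump.txt"):
    total = 0
    for path in glob.glob(pattern):
        with open(path, errors="replace") as f:  # like -a: tolerate binary bytes
            total += sum(1 for line in f
                         if "NOTFOUND" in line and "protocolStatus" in line)
    return total

# Toy demonstration instead of the real crawl output:
lines = ["protocolStatus: NOTFOUND", "text:start", "protocolStatus: SUCCESS"]
print(sum(1 for l in lines if "NOTFOUND" in l and "protocolStatus" in l))  # 1
```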

Look to see if CommonCrawl has a field for how much text there is on a page.
If not, this would be a useful feature for them to add.