
piechart_data.txt@ 34089

Last change on this file since 34089 was 34089, checked in by ak19, 4 years ago
So far accumulated URLs to docs on Google scholar about or somewhat related to finding low-resource languages on the web, as Dr Bainbridge had suggested.
File size: 25.3 KB

Line
1	https://www.rapidtables.com/tools/pie-chart.html
2	https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels)
3
4	"11.5 billion CC URLs"
5	38724 CC URLs in "MRI"
6	10290 URLs discarded (blacklisted and too little text)
7	2751 URLs greylisted
8	25683-4 URLs retained = 25679 seed URLs for crawling
9
10	1463 sites prepared for crawling
11	1447 sites crawled (16 were autotranslated or otherwise irrelevant)
12	1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
13	619 sites not finished crawling
14	1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content
15
16
17	119874 crawled web pages in mongodb
18
19	3276 crawled pages with no text content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
20
21	----------
22
23	The 12 month period CommonCrawl crawl data that we used:
24
25	https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
26	- contains 2.8 billion web pages and 220 TiB of uncompressed content
27	- contains 500 million new URLs, not contained in any crawl archive before
28	https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
29	- 3.0 billion web pages and 240 TiB of uncompressed content
30	- 600 million new URLs, not contained in any crawl archive before
31	https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
32	- 2.6 billion web pages or 220 TiB of uncompressed content
33	- 640 million new URLs, not contained in any crawl archive before
34	https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
35	- 3.1 billion web pages or 250 TiB of uncompressed content,
36	- 735 million URLs not contained in any crawl archive before
37	https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
38	- 2.85 billion web pages or 240 TiB of uncompressed content
39	- 850 million URLs not contained in any crawl archive before.
40	https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
41	- 2.9 billion web pages or 225 TiB of uncompressed content
42	- 750 million URLs not contained in any crawl archive before
43	https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
44	- 2.55 billion web pages or 210 TiB of uncompressed content
45	- 660 million URLs not contained in any crawl archive before
46	https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
47	- 2.5 billion web pages or 198 TiB of uncompressed content
48	- 750 million URLs not contained in any crawl archive before
49	https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
50	- 2.65 billion web pages or 220 TiB of uncompressed content
51	- 825 million URLs not contained in any crawl archive before
52	https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
53	- 2.6 billion web pages or 220 TiB of uncompressed content
54	- 880 million URLs not contained in any crawl archive before
55	https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
56	- 2.6 billion web pages or 220 TiB of uncompressed content
57	- 810 million URLs not contained in any crawl archive before
58	https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
59	- 2.95 billion web pages or 260 TiB of uncompressed content
60	- 1.1 billion URLs not contained in any crawl archive before
61
62	= 9100 million or 9.1 billion new URLs not contained in any crawl archive before
63	+ the first crawl month's total of 2.8 billion pages, minus that month's 500 million new URLs (already counted above): 9.1 + 2.8 - 0.5 = 11.4 billion URLs? At least?
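The running total above can be re-added mechanically. What follows is a minimal sketch in POSIX shell, with the twelve monthly "new URLs" figures copied from the list above in millions. The helper name `sum_list` and the variable names are made up for illustration.

```shell
# New-URL counts per monthly crawl, in millions (Sep 2018 .. Aug 2019),
# copied from the list above.
NEW_URLS_MILLIONS="500 600 640 735 850 750 660 750 825 880 810 1100"

# Sum a whitespace-separated list of integers (hypothetical helper).
sum_list() {
    total=0
    for n in $1; do
        total=$((total + n))
    done
    echo "$total"
}

# New URLs over the 12 months, in millions.
new_total=$(sum_list "$NEW_URLS_MILLIONS")

# Add the first month's full 2800 million pages, minus the 500 million
# of them already counted as "new" for that month.
estimate=$((new_total + 2800 - 500))

echo "new URLs: ${new_total} million; estimate: ${estimate} million"
```

This gives 9100 million new URLs and an estimate of 11400 million, i.e. the 9.1 and 11.4 billion worked out by hand above.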
64	---------------------------------------------
65
66	"UPPER BOUND"
67
68	blacklisted
69	greylisted
70	skipped crawling
71	unfinished (crawling)
72
73	Sites crawled and ingested into mongodb:
74	- domains shortlisted
75	- not shortlisted
76
77
78	Not included: only areas of interest of sites otherwise too big to exhaustively crawl were crawled. Not the rest. For example, not all of wikipedia but only mi.wikipedia.org. Not all of blogspot, only blogspot blogs indicated by common crawl results for MRI. Not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.
79
80
81	1. ALL DOMAINS FROM CC-CRAWL:
82
83	Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)
84
85	wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
86	Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
87	Count of unique domains: 3074
88	Count of unique basic domains (stripped of protocol and www): 2791
89	Line count: 75559
90	Actual unique URL count: 38717
91	Unique basic URL count (stripped of protocol and www): 32827
92	******************************************************
93
94	[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
95
96	Line count above is correct and consistent with the following: 23794+4485+47280=75559
97
98	But the domain/unique domain and URL/basic unique URL counts above are counted over the union of the three files. Summing the per-file counts instead gives:
99	- domains of the following: 1588+288+1462 = 3338
100	- unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
101	- basic URL count = 10290 + 2751 + 25683 = 38724
102	- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
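To see how far the per-file sums overshoot the counts reported for the three files combined, the two sets of numbers can be subtracted. This is a sketch in POSIX shell using only figures printed in this file. The `overlap` helper is hypothetical, and reading each difference as entries shared between files is an assumption.

```shell
# overlap SUM_OF_PER_FILE_COUNTS COMBINED_COUNT
# Prints how many entries the per-file sum counts more than once
# (assumed to be entries that occur in more than one file).
overlap() {
    echo $(( $1 - $2 ))
}

domains=$(overlap $((1588 + 288 + 1462)) 3074)
basic_domains=$(overlap $((1415 + 277 + 1362)) 2791)
urls=$(overlap $((10290 + 2751 + 25683)) 38717)
basic_urls=$(overlap $((9656 + 2727 + 20451)) 32827)

echo "domains: $domains, basic domains: $basic_domains"
echo "URLs: $urls, basic URLs: $basic_urls"
```

The differences come out as 264 domains, 263 basic domains, 7 URLs and 7 basic URLs.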
103
104
105	wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
106	Counting all domains and urls in discardURLs.txt
107	Count of unique domains: 1588
108	Count of unique basic domains (stripped of protocol and www): 1415
109	Line count: 23794
110	Actual unique URL count: 10290
111	Unique basic URL count (stripped of protocol and www): 9656
112	******************************************************
113	wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
114	Counting all domains and urls in greyListed.txt
115	Count of unique domains: 288
116	Count of unique basic domains (stripped of protocol and www): 277
117	Line count: 4485
118	Actual unique URL count: 2751
119	Unique basic URL count (stripped of protocol and www): 2727
120	******************************************************
121	wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
122	Counting all domains and urls in keepURLs.txt
123	Count of unique domains: 1464
124	Count of unique basic domains (stripped of protocol and www): 1362
125	Line count: 47280
126	Actual unique URL count: 25683
127	Unique basic URL count (stripped of protocol and www): 20451
128	******************************************************
129
130
131	XXXXXXXXXX
132	wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
133	Counting all domains and urls in seedURLs.txt
134	Count of unique domains: 1462
135	Count of unique basic domains (stripped of protocol and www): 1360
136	Line count: 25679
137	Actual unique URL count: 25679
138	Unique basic URL count (stripped of protocol and www): 20447
139	******************************************************
140	XXXXXXXXXX
141
142	seedURLs is a subset of keepURLs.
143
144
145	2a. DISCARDED URLS:
146	URLS that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
147
148	> wc -l discardURLs.txt
149	23794
150
151	b. GREYLISTED URLS:
152	> wc -l greyListed.txt
153	4485
154
155
156	c. keepURLs (the URLs we kept for further processing):
157	wc -l keepURLs.txt
158	47280 keepURLs.txt
159
160
161	d. Of the keepURLs, 4 more web pages turned out to be on ultimately irrelevant sites; they are listed in unprocessed-topsite-matches.txt.
162
163	3 are not in MRI but are all of the same domain; the fourth is just a gallery of holiday pictures.
164
165	> less unprocessed-topsite-matches.txt
166	The following domain with seedURLs are on a major/top 500 site
167	for which no allowed URL pattern regex has been specified.
168	Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
169	http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
170	http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
171	http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
172	https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl
173
174
175	e. After duplicates were further pruned from what remained of keepURLs, the result is the seedURLs for Nutch:
176
177	wc -l seedURLs.txt
178	25679 seedURLs.txt
179
180	wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
181	In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
182	Count of domains: 1462
183	Count of unique domains: 1360
184
185
186	But anglican.org was wrongly greylisted and added back in
187	-> 1463 domains.
188
189	3a. Num URLs prepared for crawling:
190	wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
191	25679 seedURLs.txt
192
193	b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):
194
195
196	wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
197	1 1463 10241
198
199	(2nd number)
200	OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
201	1463
202
203
204	/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463
205
206
207	[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
208	1462+1 (for the greylisted anglican.org) = 1463]
209
210	4. Num sites crawled:
211	wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
212	1447
213	wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
214	1 1447 10129
215
216	5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
217	wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
218	619
219
220
221	6. Number of sites in MongoDB:
222	1446
223
224	Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)
225
226	* 01067 is listed under sites crawled, but not ingested into mongodb.
227
228	The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
229	99, 88, 97, 99
230
231	and 64/64 sites in siteIDs 1400-1463.
232
233	=> 1+12+3+1 = 17 sites of the 1463 not ingested (16 not crawled, plus 01067 which has no dump.txt), so 1463 - 17 = 1446 sites ingested in mongodb.
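The tally of missing sites can be recomputed from the siteID ranges in the "Not:" line above. This is a sketch in POSIX shell; `range_size` is a hypothetical helper, and the ranges are taken as inclusive.

```shell
# range_size FIRST LAST: number of siteIDs in an inclusive range.
# IDs are written without leading zeros so that the shell does not
# try to read them as octal numbers.
range_size() {
    echo $(( $2 - $1 + 1 ))
}

a=$(range_size 179 179)     # 00179
b=$(range_size 485 495)     # 00485-00495
c=$(range_size 499 502)     # 00499-00502
d=$(range_size 1067 1067)   # 01067 (crawled, but no dump.txt)

missing=$((a + b + c + d))
ingested=$((1463 - missing))

echo "missing: $missing, ingested: $ingested"
```

Grouped by range this is 1+11+4+1 rather than the per-hundred grouping 1+12+3+1 used above, but both add up to 17 missing and 1446 ingested sites.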
234
235
236	7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
237	the number of web pages ingested into mongodb is less than 5 times the number of seedURLs (119874 / 25679 is about 4.67),
238	because only crawled web pages with non-empty text were ingested into mongodb.
239
240	Num pages in MongoDB:
241	db.getCollection('Webpages').find({}).count()
242	119874
243
244	---------------------------
245
246	#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
247	wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
248	3276 9828 419259
249
250	#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
251	wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
252	1027 1027 15405
253	wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
254	1027 4108 35945
255
256	# number of dump.txt files
257	wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
258	1446 1446 24582
259	wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
260
261
262	Look to see if commoncrawl has a field for how much text there is on the page.
263	Else this is a useful feature for them to add.
264
265
266	wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
267	589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
268
269	- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages
270
271
272
273	================================
274	Inspecting the csv file:
275
276
277	wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv
278	587082 InfoOnEmptyPagesNotInMongoDB.csv
279	-1 for column headings =
280	587081 empty pages
281
282
283	# Listing of the nutch crawl status values:
284	# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html
285	# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified
286	# Remainder are status (null). See examples in siteID 00154 later in this file.
287
288
289	wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
290	555167 1117894 60067623
291	wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
292	3441 21326 579499
293	wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
294	5907 17929 1059096
295	wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
296	291 873 51684
297	wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
298	10959 32941 1927067
299
300	UNKNOWN STATUS (no status, protocolStatus or parseStatus info) for the remainder:
301	wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less
302
303	wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
304	11317-1 (column heading) 22633 874662
305
306	=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause)
307	=> 555167+3441+5907+291+10959+11316 = 587081 empty pages (CHECKED)
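The five separate fgrep passes and the final egrep -v can be collapsed into a single pass over the file. This is a minimal sketch using awk. It assumes, as the checked sum above suggests, that each row contains at most one of the five status strings, and that the column heading is the first line of the CSV. The function name `tally_statuses` is made up for illustration.

```shell
# tally_statuses CSV_FILE
# Counts data rows by the first Nutch status substring they contain.
# Line 1 is skipped on the assumption that it is the column heading.
tally_statuses() {
    awk '
        NR == 1 { next }
        index($0, "status_unfetched")   { n["unfetched"]++;   next }
        index($0, "status_fetched")     { n["fetched"]++;     next }
        index($0, "status_gone")        { n["gone"]++;        next }
        index($0, "status_notmodified") { n["notmodified"]++; next }
        index($0, "status_redir")       { n["redir"]++;       next }
        { n["unknown"]++ }
        END {
            split("unfetched fetched gone notmodified redir unknown", keys, " ")
            for (i = 1; i <= 6; i++) {
                printf "%s %d\n", keys[i], n[keys[i]]
                total += n[keys[i]]
            }
            printf "total %d\n", total
        }
    ' "$1"
}

# Usage (file name as in the session above):
# tally_statuses InfoOnEmptyPagesNotInMongoDB.csv
```

Like the fgrep "status_redir" count above, the "redir" row covers both status_redir_temp and status_redir_perm. If the assumptions hold, the per-status lines should match the separate fgrep counts and the total should match the checked sum.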
308
309	wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
310	3441 21326 579499
311
312	wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc
313	2065 10325 289719
314
315	wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc
316	150 750 33234
317
318	wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc
319	939 9390 219818
320	[
321	all status_fetched with failed/exception are parseExceptions:
322	wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc
323	939 9390 219818
324	]
325
326	All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages):
327	wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc
328	287 861 36728
329
330
331	All status_fetched that are not parseExceptions were SUCCESS:
332
333	wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc
334	2502 11936 359681
335
336	ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
337	wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc
338	0 0 0
339
340
341	wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
342	555167 1117894 60067623
343
344	status_unfetched includes
345	- EXCEPTIONs like http error code 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway)
346	IOExceptions like unzipping issues (unzipBestEffort returned null)
347	Unknown Host Exceptions, SocketTimeoutException, ConnectionException connection refused,
348	SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
349	(EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
350	- (null): 553320 URLs - all status_unfetched without EXCEPTION
351
352
353	wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc
354	1847 11254 381055
355
356
357
358	status_redir_temp, status_redir_perm
359	- MOVED
360	- TEMP_MOVED
361
362	TOTAL:
363	wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
364	10959 32941 1927067
365
366	wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
367	4872 14625 906162
368	wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc
369	6087 18316 1020905
370
371
372	wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
373	5907 17929 1059096
374
375	[
376	For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
377	wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "NOTFOUND" | less
378	wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less
379
380	wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc
381	0 0 0
382	]
383
384	wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTFOUND" | wc
385	3276 9828 695839
386
387	wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "GONE" | wc
388	374 1322 93428
389	wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ROBOTS_DENIED" | wc
390	2253 6759 269069
391	wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ACCESS_DENIED" | wc
392	4 20 760
393
394	= 5907
395
396	wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
397	291 873 51684
398	wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTMODIFIED" | wc
399	291 873 51684
400
401
402	========
403
404	wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc
405	1376 11001 289780
406	wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc
407	0 0 0
408
409
410	wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less
411	wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
412	437 1611 69962
413
414	- "success/ok"
415	- "success/redirect"
416	- "failed/exception" for ParseException
417	All failed/exception are ParseExceptions:
418	wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc
419	0 0 0
420
421	ALL THE status_fetched:
422	wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
423	3441 21326 579499
424	wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc
425	3154 20465 542771
426	wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
427	wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
428
429	wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | less
	wharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | wc
430	287 861 36728
431
432	(No equivalent info to success/ok, success/redirect, failed/exception)
433
434	-----------------------------
435	No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?):
436	http://m.biblepub.com/bibles/mb/19/81 key: com.biblepub.m:http/bibles/mb/19/81
437	baseUrl: null
438	status: 2 (status_fetched)
439	fetchTime: 1573978084279
440	prevFetchTime: 1571385510616
441	fetchInterval: 2592000
442	retriesSinceFetch: 0
443	modifiedTime: 0
444	prevModifiedTime: 0
445	protocolStatus: SUCCESS, args=[]
446	signature: 3e214d69ab677a676e40c2b91901acc9
447	parseStatus: success/ok (1/0), args=[]
448	title: Psalm 81 - Maori Bible - Bibles - BiblePub Mobile
449	score: 1.0
450	marker _injmrk_ : y
451	marker _updmrk_ : 1571386061-31026
452	marker dist : 0
453	reprUrl: null
454	batchId: 1571386061-31026
455	metadata CharEncodingForConversion : utf-8
456	metadata OriginalCharEncoding : utf-8
457	metadata _rs_ : ^@^@^By
458	metadata _csh_ : ^@^@^@^@
459	text:start:
460	Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki
461	te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko
462	te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i
463	tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu a
464	hau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera. 8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi
465	atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku i
466	wi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo
467	taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua n
468	gohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Si
469	te © 2013 BiblePub
470	text:end:
471
472	http://m.biblepub.com/bibles/mb/19/82 key: com.biblepub.m:http/bibles/mb/19/82
473	baseUrl: null
474	status: 1 (status_unfetched)
475	fetchTime: 1571386117381
476	prevFetchTime: 0
477	fetchInterval: 2592000
478	retriesSinceFetch: 0
479	modifiedTime: 0
480	prevModifiedTime: 0
481	protocolStatus: (null)
482	parseStatus: (null)
483	title: null
484	score: 0.0
485	marker dist : 1
486	reprUrl: null
487	metadata _csh_ : ^@^@^@^@
488
489
490	------------
491
492
493	Would like to do something like:
494	wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED | grep -l text:start */dump.txt | wc
495
496
497	Would like to find how many and which of the unfinished websites had a dump.txt with no text content
498	AND how many of the completely crawled websites had a dump.txt with no text content.
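The pipeline sketched above cannot work as written, because grep -l with file arguments reads those files and ignores its standard input. One way to get both answers is to loop over the site directories and test each one. This is a sketch in POSIX shell with three assumptions: each site directory sits directly under the directory passed in, an unfinished site has a file named UNFINISHED somewhere beneath it, and text content is marked by text:start in dump.txt. The name `classify_sites` is hypothetical.

```shell
# classify_sites CRAWLED_DIR
# Prints one line per site that has a dump.txt:
#   <site dir> <finished|unfinished> <text|notext>
classify_sites() {
    for site in "$1"/*/; do
        dump="${site}dump.txt"
        [ -f "$dump" ] || continue          # skip sites without a dump.txt

        # Assumption: a file named UNFINISHED marks an incomplete crawl.
        if find "$site" -name UNFINISHED | grep -q .; then
            state=unfinished
        else
            state=finished
        fi

        # Assumption: text:start marks extracted text content.
        if grep -q "text:start" "$dump"; then
            content=text
        else
            content=notext
        fi

        echo "$(basename "$site") $state $content"
    done
}

# Usage: list the sites, then count each combination.
# classify_sites . | awk '{ c[$2 " " $3]++ } END { for (k in c) print c[k], k }'
```

Filtering the output for "unfinished notext" or "finished notext" lists which sites fall into each group, and the awk line in the usage comment counts them.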
499
500
501	--------------
502
503
504
505	wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt
506
507	wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
508	wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
509	wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
510	wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
511	wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt
512
513	# All the dump.txt files that are 0 bytes (no content):
514	# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
515	wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
516	150 150 2550
517
518
519	Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort):
520	wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
521	wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
522	wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
523	wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt
524
525
526	=======
527
528

source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 34089
