source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 34089

Last change on this file since 34089 was 34089, checked in by ak19, 4 years ago

So far accumulated URLs to docs on Google scholar about or somewhat related to finding low-resource languages on the web, as Dr Bainbridge had suggested.

https://www.rapidtables.com/tools/pie-chart.html
https://www.meta-chart.com/pie#/data (more powerful: can choose colours, display labels)

"11.5 billion CC URLs"
38724 CC URLs in "MRI"
10290 URLs discarded (blacklisted and too little text)
2751 URLs greylisted
25683-4 URLs retained = 25679 seed URLs for crawling

1463 sites prepared for crawling
1447 sites crawled (16 were autotranslated or otherwise irrelevant)
1446 crawled sites contained dump.txt files (1 site was missing dump.txt) - 1446 sites in mongodb
619 sites not finished crawling
1027 sites where dump.txt contained text:start denoting text content, so 419 sites with no text content


119874 crawled web pages in mongodb

3276 crawled pages with no text content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)

----------

The 12-month period of CommonCrawl crawl data that we used:

https://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
- contains 2.8 billion web pages and 220 TiB of uncompressed content
- contains 500 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/10/october-2018-crawl-archive-now-available/
- 3.0 billion web pages and 240 TiB of uncompressed content
- 600 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/11/november-2018-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 640 million new URLs, not contained in any crawl archive before
https://commoncrawl.org/2018/12/december-2018-crawl-archive-now-available/
- 3.1 billion web pages or 250 TiB of uncompressed content
- 735 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/01/january-2019-crawl-archive-now-available/
- 2.85 billion web pages or 240 TiB of uncompressed content
- 850 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/03/february-2019-crawl-archive-now-available/
- 2.9 billion web pages or 225 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/march-2019-crawl-archive-now-available/
- 2.55 billion web pages or 210 TiB of uncompressed content
- 660 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/04/april-2019-crawl-archive-now-available/
- 2.5 billion web pages or 198 TiB of uncompressed content
- 750 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/05/may-2019-crawl-archive-now-available/
- 2.65 billion web pages or 220 TiB of uncompressed content
- 825 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 880 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/07/july-2019-crawl-archive-now-available/
- 2.6 billion web pages or 220 TiB of uncompressed content
- 810 million URLs not contained in any crawl archive before
https://commoncrawl.org/2019/08/august-2019-crawl-archive-now-available/
- 2.95 billion web pages or 260 TiB of uncompressed content
- 1.1 billion URLs not contained in any crawl archive before

= 9100 million or 9.1 billion new URLs not contained in any crawl archive before
+ the first crawl month's figure of 2.8 billion pages - the 500 million new URLs already counted for that 1st month = 11.4 billion URLs? At least?
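
The estimate above can be re-checked with a few lines of Python. This is only a sketch of the arithmetic in these notes: the monthly figures are copied from the list above, and treating each of the first month's 2.8 billion pages as one URL is an assumption of the estimate itself, not something CommonCrawl states.

```python
# New URLs (in millions) reported for each monthly crawl, Sep 2018 to Aug 2019,
# copied from the list above.
new_urls_millions = [500, 600, 640, 735, 850, 750, 660, 750, 825, 880, 810, 1100]

total_new_millions = sum(new_urls_millions)

# Assumption: every page of the first month counts as one URL, and that month's
# 500 million new URLs are already included in its 2.8 billion pages.
first_month_pages_millions = 2800
estimate_millions = (
    total_new_millions + first_month_pages_millions - new_urls_millions[0]
)

print(total_new_millions, "million new URLs over the 12 months")
print(estimate_millions / 1000, "billion URLs as a rough lower-bound estimate")
```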
---------------------------------------------

"UPPER BOUND"

blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted


Not included: for sites otherwise too big to exhaustively crawl, only the areas of interest were crawled, not the rest. For example, not all of wikipedia but only mi.wikipedia.org. Not all of blogspot, only the blogspot blogs indicated by common crawl results for MRI. Not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.


1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
 Count of unique domains: 3074
 Count of unique basic domains (stripped of protocol and www): 2791
 Line count: 75559
 Actual unique URL count: 38717
 Unique basic URL count (stripped of protocol and www): 32827
******************************************************

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

Line count above is correct and consistent with the following: 23794+4485+47280=75559

But the domain and URL counts above are unique counts over the union of the three files. Simply adding up the per-file counts (which counts anything present in more than one file more than once) gives instead:
- domains: 1588+288+1462 = 3338
- unique basic domains (stripped of protocol and www): 1415+277+1362 = 3054
- unique URL count = 10290 + 2751 + 25683 = 38724
- unique basic URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834


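Why the summed per-file counts exceed the counts over all three files can be shown with a small Python sketch. Everything here is illustrative: `basic_domain` is a hypothetical helper that only approximates "stripped of protocol and www" (the actual AllDomainCount logic is not shown in these notes), and the URLs are made up.

```python
from urllib.parse import urlsplit


def basic_domain(url: str) -> str:
    """Hypothetical helper: host name without protocol and a leading 'www.'."""
    host = urlsplit(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host


# Made-up sample URLs standing in for the three files; not real data.
discard = ["http://www.example.org/a", "https://example.org/b"]
grey = ["http://example.org/c", "http://example.net/x"]
keep = ["https://www.example.net/y", "http://example.com/z"]

per_file = [{basic_domain(u) for u in urls} for urls in (discard, grey, keep)]

# Adding per-file unique counts counts shared domains once per file...
summed = sum(len(domains) for domains in per_file)
# ...whereas the union counts each domain exactly once.
union = len(set().union(*per_file))

print(summed, union)
```

With the real files the same effect gives 3054 summed versus 2791 unique basic domains.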
wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
 Count of unique domains: 1588
 Count of unique basic domains (stripped of protocol and www): 1415
 Line count: 23794
 Actual unique URL count: 10290
 Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
 Count of unique domains: 288
 Count of unique basic domains (stripped of protocol and www): 277
 Line count: 4485
 Actual unique URL count: 2751
 Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
 Count of unique domains: 1464
 Count of unique basic domains (stripped of protocol and www): 1362
 Line count: 47280
 Actual unique URL count: 25683
 Unique basic URL count (stripped of protocol and www): 20451
******************************************************


XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
 Count of unique domains: 1462
 Count of unique basic domains (stripped of protocol and www): 1360
 Line count: 25679
 Actual unique URL count: 25679
 Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.


2a. DISCARDED URLS:
URLs that are blacklisted + those pages with too little text content (under an arbitrary min threshold)

> wc -l discardURLs.txt
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt


d. Of the keepURLs, 4 more web pages were dropped: they belong to ultimately irrelevant sites and are listed in unprocessed-topsite-matches.txt.

3 are not in MRI but are of the same domain, one is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
 The following domain with seedURLs are on a major/top 500 site
 for which no allowed URL pattern regex has been specified.
 Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
 http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
 http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
 http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
 https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl


e. After duplicates were further pruned out from what remained of keepURLs, we get the seedURLs for Nutch:

wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
 Count of domains: 1462
 Count of unique domains: 1360


But anglican.org was wrongly greylisted and added back in
-> 1463 domains.

3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):


wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
 1 1463 10241

(2nd number)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463


/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463


[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
 1 1447 10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619


6. Number of sites in MongoDB:
1446

Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)

* 01067 is listed under sites crawled, but not ingested into mongodb.

The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb, respectively:
99, 88, 97, 99

and 64/64 sites in siteIDs 1400-1463.

=> 1+12+3+1 = 17 of the 1463 sites are missing from mongodb, so 1463 - 17 = 1446 sites ingested in mongodb.


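The missing-site tally above can be cross-checked with a short Python sketch. The siteIDs are the ones listed under "Not:" above; grouping them by hundreds is only a convenient way to compare against the per-range counts of 99, 88, 97 and 99.

```python
from collections import Counter

# Site IDs listed above as absent from mongodb:
# 00179, 00485-00495, 00499-00502 and 01067.
missing = [179] + list(range(485, 496)) + list(range(499, 503)) + [1067]

# Group by hundreds block (e.g. 485 -> 400) to compare with the per-range counts.
per_block = Counter((site_id // 100) * 100 for site_id in missing)

total_sites = 1463
ingested = total_sites - len(missing)

for block, count in sorted(per_block.items()):
    print(f"{block:05d}s: {count} missing, {100 - count} in mongodb")
print(len(missing), "missing,", ingested, "ingested")
```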
7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
the number of web pages ingested into mongodb is less than 5 times that (119874 / 25679 is about 4.7),
because only crawled web pages with non-empty text were ingested into mongodb.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874

---------------------------

#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
 3276 9828 419259

#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
 1027 1027 15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
 1027 4108 35945

# number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
 1446 1446 24582
wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>


Look to see if commoncrawl has a field for how much text there is on the page.
If not, this would be a useful feature for them to add.


wharariki:[143]/Scratch/ak19/maori-lang-detection/src>wc -l ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv
589179 ../mongodb-data/InfoOnEmptyPagesNotInMongoDB.csv

- 17 lines at start that aren't about empty web pages in dump.txt = 589162 empty web pages



================================
Inspecting the csv file:


wharariki:[198]/Scratch/ak19/maori-lang-detection/src>wc -l InfoOnEmptyPagesNotInMongoDB.csv
587082 InfoOnEmptyPagesNotInMongoDB.csv
-1 for column headings =
587081 empty pages


# Listing of the nutch crawl status values:
# https://nutch.apache.org/apidocs/apidocs-2.0/org/apache/nutch/crawl/CrawlStatus.html
# But the only ones used are: status_unfetched|status_fetched|status_gone|status_redir|status_notmodified
# Remainder are status (null). See examples in siteID 00154 later in this file.


 wharariki:[298]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 555167 1117894 60067623
 wharariki:[299]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 3441 21326 579499
 wharariki:[300]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
 5907 17929 1059096
 wharariki:[301]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
 291 873 51684
 wharariki:[302]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
 10959 32941 1927067

 UNKNOWN STATUS (no status, protocolStatus or parseStatus info) for the remainder:
 wharariki:[291]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | less

 wharariki:[304]/Scratch/ak19/maori-lang-detection/mongodb-data>egrep -v "status_unfetched|status_fetched|status_gone|status_redir|status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
 11317-1 (column heading) 22633 874662

=> unfetched + fetched + gone + notmodified + redir + (UNKNOWN cause)
=> 555167+3441+5907+291+10959+11316 = 587081 empty pages (CHECKED)

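The per-status counts above could also be gathered in one pass rather than with one fgrep per status. The following Python sketch is an assumption-laden illustration, not the method used for these notes: it matches the same substrings as the commands above, the `tally` helper is hypothetical, and the sample rows are made up because the real CSV layout is not shown here.

```python
from collections import Counter

# Same substrings as the fgrep/egrep commands above. "status_redir" covers both
# status_redir_temp and status_redir_perm.
STATUSES = [
    "status_unfetched",
    "status_fetched",
    "status_gone",
    "status_redir",
    "status_notmodified",
]


def tally(lines):
    """Count lines per status substring; lines matching none count as 'unknown'."""
    counts = Counter()
    for line in lines:
        label = next((s for s in STATUSES if s in line), "unknown")
        counts[label] += 1
    return counts


# Made-up rows for illustration only (assumed layout, not the real CSV).
sample = [
    "00001,http://example.org/a,status_unfetched,(null)",
    "00001,http://example.org/b,status_fetched,SUCCESS",
    "00002,http://example.org/c,status_redir_perm,MOVED",
    "00002,http://example.org/d,,",
]
counts = tally(sample)
print(counts)

# Cross-check of the totals reported above.
reported = [555167, 3441, 5907, 291, 10959, 11316]
print(sum(reported))
```

Unlike separate fgrep runs, this assigns each line to exactly one status. When reading the real file, the column-heading line would need to be skipped first, otherwise it is counted as 'unknown'.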
wharariki:[183]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 3441 21326 579499

 wharariki:[315]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/ok" | wc
 2065 10325 289719

 wharariki:[317]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "success/redirect" | wc
 150 750 33234

 wharariki:[316]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "failed/exception" | wc
 939 9390 219818
[
 all status_fetched with failed/exception are parseExceptions:
 wharariki:[187]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "ParseException" | wc
 939 9390 219818
]

All other kinds of status_fetched have no information besides SUCCESS (despite resulting in empty pages):
 wharariki:[319]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/ok|success/redirect|failed/exception" | wc
 287 861 36728


 All status_fetched that are not parseExceptions were SUCCESS:

 wharariki:[214]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "ParseException" | wc
 2502 11936 359681

 ONLY OTHER OPTION FOR status_fetched IS SUCCESS:
 wharariki:[211]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "ParseException|SUCCESS" | wc
 0 0 0


wharariki:[188]/Scratch/ak19/maori-lang-detection/src>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 555167 1117894 60067623

 status_unfetched includes
 - EXCEPTIONs like http error code 403 (Forbidden), 402 (Payment Required), 429 (Too Many Requests), 502 (Bad Gateway)
 IOExceptions like unzipping issues (unzipBestEffort returned null)
 Unknown Host Exceptions, SocketTimeoutException, ConnectionException connection refused,
 SSL Exceptions like fatal alert/internal error, SSLHandshakeException (SSL security issues / invalid certificate),
 (EXCEPTION, args=[javax.net.ssl.SSLHandshakeException: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target])
 - (null): 553320 URLs - all status_unfetched without EXCEPTION


 wharariki:[309]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_unfetched" InfoOnEmptyPagesNotInMongoDB.csv | grep "EXCEPTION" | wc
 1847 11254 381055



status_redir_temp, status_redir_perm
 - MOVED
 - TEMP_MOVED

 TOTAL:
 wharariki:[327]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir" InfoOnEmptyPagesNotInMongoDB.csv | wc
 10959 32941 1927067

 wharariki:[328]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_temp" InfoOnEmptyPagesNotInMongoDB.csv | wc
 4872 14625 906162
 wharariki:[329]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_redir_perm" InfoOnEmptyPagesNotInMongoDB.csv | wc
 6087 18316 1020905


wharariki:[191]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | wc
 5907 17929 1059096

[
For status_gone, alternative values to NOTFOUND are GONE and ROBOTS_DENIED and ACCESS_DENIED:
 wharariki:[200]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "NOTFOUND" | less
 wharariki:[204]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED" | less

wharariki:[342]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "NOTFOUND|GONE|ROBOTS_DENIED|ACCESS_DENIED" | wc
 0 0 0
]

 wharariki:[192]/Scratch/ak19/maori-lang-detection/src>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTFOUND" | wc
 3276 9828 695839

 wharariki:[337]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "GONE" | wc
 374 1322 93428
 wharariki:[338]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ROBOTS_DENIED" | wc
 2253 6759 269069
 wharariki:[339]/Scratch/ak19/maori-lang-detection/mongodb-data>fgrep "status_gone" InfoOnEmptyPagesNotInMongoDB.csv | egrep "ACCESS_DENIED" | wc
 4 20 760

= 5907

wharariki:[196]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | wc
 291 873 51684
wharariki:[197]/Scratch/ak19/maori-lang-detection/src>fgrep "status_notmodified" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "NOTMODIFIED" | wc
 291 873 51684


========

wharariki:[222]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | wc
 1376 11001 289780
wharariki:[223]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "success/ok" | fgrep "ParseException" | wc
 0 0 0


wharariki:[226]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | less
wharariki:[227]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep -v "success/ok" | fgrep -v "ParseException" | wc
 437 1611 69962

- "success/ok"
- "success/redirect"
- "failed/exception" for ParseException
All failed/exception are ParseExceptions:
wharariki:[233]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | fgrep "failed/exception" | fgrep -v "ParseException" | wc
 0 0 0

ALL THE status_fetched:
wharariki:[234]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | wc
 3441 21326 579499
wharariki:[244]/Scratch/ak19/maori-lang-detection/src>egrep "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | wc
 3154 20465 542771
wharariki:[245]/Scratch/ak19/maori-lang-detection/src>egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less
wharariki:[246]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" | egrep -v "success/redirect|success/ok|failed/exception" InfoOnEmptyPagesNotInMongoDB.csv | less

wharariki:[247]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | less
wharariki:[248]/Scratch/ak19/maori-lang-detection/src>fgrep "status_fetched" InfoOnEmptyPagesNotInMongoDB.csv | egrep -v "success/redirect|success/ok|failed/exception" | wc
 287 861 36728

(No equivalent info to success/ok, success/redirect, failed/exception)

-----------------------------
No status information for many pages on site 00154, from the following point onwards (crawled too much of the site?):
 http://m.biblepub.com/bibles/mb/19/81 key: com.biblepub.m:http/bibles/mb/19/81
 baseUrl: null
 status: 2 (status_fetched)
 fetchTime: 1573978084279
 prevFetchTime: 1571385510616
 fetchInterval: 2592000
 retriesSinceFetch: 0
 modifiedTime: 0
 prevModifiedTime: 0
 protocolStatus: SUCCESS, args=[]
 signature: 3e214d69ab677a676e40c2b91901acc9
 parseStatus: success/ok (1/0), args=[]
 title: Psalm 81 - Maori Bible - Bibles - BiblePub Mobile
 score: 1.0
 marker _injmrk_ : y
 marker _updmrk_ : 1571386061-31026
 marker dist : 0
 reprUrl: null
 batchId: 1571386061-31026
 metadata CharEncodingForConversion : utf-8
 metadata OriginalCharEncoding : utf-8
 metadata _rs_ : ^@^@^By
 metadata _csh_ : ^@^@^@^@
 text:start:
 Psalm 81 - Maori Bible - Bibles - BiblePub Mobile Maori Bible Books next back Psalm 81 1 Ki te tino kaiwhakatangi. Kititi. Na Ahapa. Kia kaha te waiata ki te Atua, ki to tatou kaha: kia hari te hamama ki
 te Atua o Hakopa. 2 Whakahuatia te himene, maua mai ki konei te timipera, te hapa reka me te hatere. 3 Whakatangihia te tetere i te kowhititanga marama, i te kinga o te marama, i to tatou ra hakari. 4 Ko
 te tikanga hoki tenei ma Iharaira, he mea whakarite na te Atua o Hakopa. 5 I whakatakotoria tenei e ia ma Hohepa hei whakaaturanga, i tona haerenga puta noa i te whenua o Ihipa: i rongo ai ahau ki reira i
 tetahi reo, kahore ahau i matau. 6 I tangohia mai e ahau tona pokohiwi i te pikaunga: whakarerea ake e ona ringa te kete. 7 I karanga koe ki ahau i te pouritanga, a kua ora koe i ahau; i whakahoki kupu a
 hau ki a koe i te wahi ngaro o te whatitiri; i whakamatau i a koe ki nga wai o Meripa. (Hera. 8 Whakarongo, e taku iwi, a ka whakaatu ahau ki a koe: e Iharaira, ki te whakarongo koe ki ahau; 9 Aua tetahi
 atua ke i roto i a koe; kaua ano e koropiko ki te atua ke. 10 Ko Ihowa ahau, ko tou Atua, i arahina mai ai koe i te whenua o Ihipa: kia nui te kowhera o tou mangai, a maku e whakaki. 11 Otiia kihai taku i
 wi i pai ki te whakarongo ki toku reo: kihai ano a Iharaira i aro ki ahau. 12 Na tukua atu ana ratou e ahau ki te maro o o ratou ngakau: a haere ana ratou i runga i o ratou whakaaro. 13 Aue, te whakarongo
 taku iwi ki ahau! Te haere a Iharaira i aku ara! 14 Penei e kore e aha kua whati i ahau te tara o o ratou hoariri: kua tahuri ano toku ringa ki o ratou hoariri. 15 Ko te hunga e kino ana ki a Ihowa kua n
 gohengohe ki a ia: ko to ratou taima ia kua mau tonu. 16 Kua whangainga hoki ratou e ia ki te witi pai rawa, kua whakamakonatia ano koe e ahau ki te honi i roto i te kohatu. next back Contact Us - Full Si
 te © 2013 BiblePub
 text:end:

 http://m.biblepub.com/bibles/mb/19/82 key: com.biblepub.m:http/bibles/mb/19/82
 baseUrl: null
 status: 1 (status_unfetched)
 fetchTime: 1571386117381
 prevFetchTime: 0
 fetchInterval: 2592000
 retriesSinceFetch: 0
 modifiedTime: 0
 prevModifiedTime: 0
 protocolStatus: (null)
 parseStatus: (null)
 title: null
 score: 0.0
 marker dist : 1
 reprUrl: null
 metadata _csh_ : ^@^@^@^@


------------


Would like to do something like:
wharariki:[378]/Scratch/ak19/maori-lang-detection/crawled>find . -name UNFINISHED | grep -l text:start */dump.txt | wc


Would like to find how many and which of the unfinished websites had a dump.txt with no text content
AND how many of the completely crawled websites had a dump.txt with no text content.


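One possible way to answer both questions is sketched below in Python. It rests on assumptions these notes do not confirm: that each site is a direct subfolder of crawled/, that an UNFINISHED marker file sits somewhere inside an unfinished site's folder, and that dump.txt sits at the top of the site folder. The function name `classify_sites` is hypothetical.

```python
from pathlib import Path


def classify_sites(crawled_dir):
    """Return (unfinished_no_text, finished_no_text) lists of site folder names.

    Assumptions: sites are direct subfolders of crawled_dir, an UNFINISHED
    marker file sits somewhere inside an unfinished site's folder, and
    dump.txt sits at the top of the site folder.
    """
    unfinished_no_text, finished_no_text = [], []
    for site in sorted(p for p in Path(crawled_dir).iterdir() if p.is_dir()):
        dump = site / "dump.txt"
        if not dump.is_file():
            continue  # no dump.txt at all, so nothing to classify
        # dump.txt may contain binary bytes (hence fgrep -a above), so the
        # file is read as bytes; this loads the whole file into memory.
        if b"text:start" in dump.read_bytes():
            continue  # site has text content
        unfinished = any(site.rglob("UNFINISHED"))
        (unfinished_no_text if unfinished else finished_no_text).append(site.name)
    return unfinished_no_text, finished_no_text
```

Called as `classify_sites(".")` from within the crawled folder, the lengths of the two returned lists would give the two counts, and the lists themselves name the sites.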
--------------



wharariki:[393]/Scratch/ak19/maori-lang-detection/crawled>grep -l "text:start" */dump.txt

wharariki:[388]/Scratch/ak19/maori-lang-detection/crawled>less 01461/dump.txt
wharariki:[389]/Scratch/ak19/maori-lang-detection/crawled>less 01453/dump.txt
wharariki:[390]/Scratch/ak19/maori-lang-detection/crawled>less 01447/dump.txt
wharariki:[391]/Scratch/ak19/maori-lang-detection/crawled>less 01446/dump.txt
wharariki:[392]/Scratch/ak19/maori-lang-detection/crawled>less 01445/dump.txt

# All the dump.txt files that are 0 bytes (no content):
# https://stackoverflow.com/questions/15703664/find-all-zero-byte-files-in-directory-and-subdirectories
wharariki:[396]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" -size 0 | sort | wc
 150 150 2550


Examples of empty dump.txt files (listed with: find . -name "dump.txt" -size 0 | sort):
 wharariki:[400]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/00014/seedURLs.txt
 wharariki:[401]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01461/seedURLs.txt
 wharariki:[402]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01447/seedURLs.txt
 wharariki:[403]/Scratch/ak19/maori-lang-detection/crawled>less ../tmp/to_crawl.THE_VERSION_USED/sites/01422/seedURLs.txt


=======
