source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 33985

Last change on this file since 33985 was 33985, checked in by ak19, 4 years ago

Data to back the piechart I need to make that will illustrate how we continuously filtered out the pool of sites and urls returned by commoncrawl for MRI text down to the final web domains and pages we worked with for our samples.

File size: 6.9 KB
blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted


Not included: only areas of interest of sites otherwise too big to exhaustively crawl were crawled. Not the rest. For example, not all of wikipedia but only mi.wikipedia.org. Not all of blogspot, only blogspot blogs indicated by common crawl results for MRI. Not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.


1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
 Count of unique domains: 3074
 Count of unique basic domains (stripped of protocol and www): 2791
 Line count: 75559
 Actual unique URL count: 38717
 Unique basic URL count (stripped of protocol and www): 32827
******************************************************
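
The "basic" counts in the output above are taken after stripping the protocol and www. The source of AllDomainCount is not reproduced in these notes, so the following is only a sketch of how such a normalisation might look. The helper names basic_url and basic_domain and the example.org URLs are hypothetical, and the sketch assumes that the plain "unique domains" count keeps the protocol and www while the "basic" one drops them.

```python
from urllib.parse import urlsplit


def _host(url):
    """Lower-cased host of a URL with a leading 'www.' removed (hypothetical helper)."""
    host = urlsplit(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def basic_domain(url):
    """Domain stripped of protocol and www. Assumes the URL has a protocol."""
    return _host(url)


def basic_url(url):
    """Whole URL stripped of protocol and www, keeping path and query."""
    parts = urlsplit(url.strip())
    rest = parts.path + ("?" + parts.query if parts.query else "")
    return _host(url) + rest


# Made-up URLs, not taken from keepURLs.txt or the other lists.
urls = [
    "http://www.example.org/a",
    "https://example.org/a",
    "http://example.org/b",
]

# Assumed meaning of the non-basic domain: protocol plus host, www kept.
domains = {urlsplit(u).scheme + "://" + urlsplit(u).netloc for u in urls}
basic_domains = {basic_domain(u) for u in urls}
basic_urls = {basic_url(u) for u in urls}

print(len(domains), len(basic_domains), len(set(urls)), len(basic_urls))
```

With these three made-up URLs there are 3 distinct domains and 3 distinct URLs, but only 1 basic domain and 2 basic URLs, which is the kind of drop seen between the plain and basic counts above.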

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

The line count above agrees with the per-file line counts: 23794+4485+47280=75559

But the domain, unique basic domain, URL and unique basic URL counts above are not the per-file sums, presumably because some domains and URLs occur in more than one file. The per-file sums are:
- domains of the following: 1588+288+1462 = 3338
- unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
- unique URL count = 10290 + 2751 + 25683 = 38724
- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
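
As a quick arithmetic check, the per-file sums can be compared against the combined run. The figures are the ones used in the sums above (including 1462 for the keepURLs domains). Reading the difference as the number of entries shared between files is my assumption, not something AllDomainCount reports.

```python
# Counts from the combined AllDomainCount run over all three files.
combined = {
    "lines": 75559,
    "domains": 3074,
    "basic domains": 2791,
    "URLs": 38717,
    "basic URLs": 32827,
}

# Per-file figures as used in the sums above,
# in the order discardURLs.txt, greyListed.txt, keepURLs.txt.
per_file = {
    "lines": (23794, 4485, 47280),
    "domains": (1588, 288, 1462),
    "basic domains": (1415, 277, 1362),
    "URLs": (10290, 2751, 25683),
    "basic URLs": (9656, 2727, 20451),
}

# Sum of the per-file counts minus the combined count.
# Assumption: a positive difference means entries counted in more than one file.
overlap = {name: sum(per_file[name]) - combined[name] for name in combined}

for name, extra in overlap.items():
    print(f"{name}: sum {sum(per_file[name])}, combined {combined[name]}, difference {extra}")
```

The differences come out as 0 lines, 264 domains, 263 basic domains, 7 URLs and 7 basic URLs.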


wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
 Count of unique domains: 1588
 Count of unique basic domains (stripped of protocol and www): 1415
 Line count: 23794
 Actual unique URL count: 10290
 Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
 Count of unique domains: 288
 Count of unique basic domains (stripped of protocol and www): 277
 Line count: 4485
 Actual unique URL count: 2751
 Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
 Count of unique domains: 1464
 Count of unique basic domains (stripped of protocol and www): 1362
 Line count: 47280
 Actual unique URL count: 25683
 Unique basic URL count (stripped of protocol and www): 20451
******************************************************


XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
 Count of unique domains: 1462
 Count of unique basic domains (stripped of protocol and www): 1360
 Line count: 25679
 Actual unique URL count: 25679
 Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.


2a. DISCARDED URLS:
URLs that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
> wc -l keepURLs.txt
47280 keepURLs.txt


d. Of the keepURLs, 4 more web pages, from ultimately irrelevant sites, are listed in unprocessed-topsite-matches.txt.

3 are not in MRI but are of the same domain; the fourth is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
 The following domain with seedURLs are on a major/top 500 site
 for which no allowed URL pattern regex has been specified.
 Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
 http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
 http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
 http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
 https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl


e. After duplicates were further pruned out from what remained of keepURLs, we get the seedURLs for Nutch:

> wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
 Count of domains: 1462
 Count of unique domains: 1360


But anglican.org had been wrongly greylisted, so it was added back in
-> 1463 domains.

3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):


wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
 1 1463 10241

(2nd number)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463
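
For reference, the same directory count can be done without the shell. This is a minimal sketch, not something that was run on wharariki: count_site_dirs is a hypothetical helper, and the demo uses a throwaway temp folder with made-up site folder names rather than the real sites/ directory. Unlike find -type d, Path.is_dir() follows symlinks, so the two could differ if the folder contained symlinked directories.

```python
import tempfile
from pathlib import Path


def count_site_dirs(parent):
    """Count the immediate subdirectories of parent (no recursion)."""
    return sum(1 for entry in Path(parent).iterdir() if entry.is_dir())


# Demo on a temporary folder: three made-up site folders and one plain file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00001", "00002", "00003"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "notes.txt").write_text("not a site folder\n")
    demo_count = count_site_dirs(tmp)

print(demo_count)
```

Only the three folders are counted; the plain file is ignored, as with find . -mindepth 1 -maxdepth 1 -type d.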


/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463


[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
 1 1447 10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619


6. Number of sites in MongoDB:
1446

Not ingested: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)

* 01067 is listed under sites crawled, but not ingested into mongodb.

The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
99, 88, 97, 99

and 64/64 sites in siteIDs 1400-1463.

=> 1+12+3+1 = 17 sites of the 1463 not ingested (16 not crawled, plus 01067 which was crawled but not ingested), leaving 1446 sites ingested in mongodb.
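
The tally above can be rechecked by expanding the missing siteID ranges. This is a small sketch for verification only; the expand helper is hypothetical and the IDs are the ones listed in the notes above.

```python
from collections import Counter


def expand(spec):
    """Expand '00485-00495' (or a single ID like '00179') into a list of integer site IDs."""
    first, _, last = spec.partition("-")
    return list(range(int(first), int(last or first) + 1))


# Site IDs listed above as missing from mongodb.
missing_specs = ["00179", "00485-00495", "00499-00502", "01067"]
missing = [site for spec in missing_specs for site in expand(spec)]

# Group the missing sites by their hundreds range (100 = the 00100s, and so on).
per_hundred = Counter(site // 100 * 100 for site in missing)

total_sites = 1463
ingested = total_sites - len(missing)

print(len(missing), dict(per_hundred), ingested)
```

This gives 17 missing sites, split 1, 12, 3 and 1 across the 00100s, 00400s, 00500s and 01000s, and 1463 - 17 = 1446 ingested sites, matching the figures above.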


7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
the number of web pages ingested into mongodb is only a little under 5 times as many (119874 / 25679 is about 4.67),
because only crawled web pages with non-empty text were ingested into mongodb.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874