source: other-projects/maori-lang-detection/mongodb-data/piechart_data.txt@ 33986

Last change on this file since 33986 was 33986, checked in by ak19, 4 years ago

Dr Bainbridge investigated the original data set more

"UPPER BOUND"

blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted


Not included: only areas of interest of sites otherwise too big to exhaustively crawl were crawled. Not the rest. For example, not all of wikipedia but only mi.wikipedia.org. Not all of blogspot, only blogspot blogs indicated by common crawl results for MRI. Not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.


1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
 Count of unique domains: 3074
 Count of unique basic domains (stripped of protocol and www): 2791
 Line count: 75559
 Actual unique URL count: 38717
 Unique basic URL count (stripped of protocol and www): 32827
******************************************************
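
A minimal Python sketch of how the five counts above could be derived from a list of URLs. This is not the AllDomainCount source: the definitions of "domain" (protocol plus host) and "basic" (protocol and a leading www. removed) are assumptions read off the output labels, and the names basic, domain_of and count_all are hypothetical.

```python
from urllib.parse import urlsplit


def basic(url_or_domain):
    """Strip the protocol and a leading 'www.' (assumed meaning of 'basic')."""
    s = url_or_domain.strip()
    if "://" in s:
        s = s.split("://", 1)[1]
    if s.startswith("www."):
        s = s[4:]
    return s


def domain_of(url):
    """Protocol plus host, e.g. 'https://www.example.org' (assumed definition)."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme}://{parts.netloc}"


def count_all(lines):
    """Return the five counts for an iterable of URL lines (blank lines ignored)."""
    urls = [ln.strip() for ln in lines if ln.strip()]
    domains = {domain_of(u) for u in urls}
    return {
        "unique_domains": len(domains),
        "unique_basic_domains": len({basic(d) for d in domains}),
        "line_count": len(urls),
        "unique_urls": len(set(urls)),
        "unique_basic_urls": len({basic(u) for u in urls}),
    }
```

For example, count_all(open("keepURLs.txt")) would return the counts for one file. Passing the lines of all three files together gives the counts of their union.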

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

The line count above agrees with the sum of the per-file line counts: 23794+4485+47280=75559

But the domain, unique basic domain, unique URL and unique basic URL counts above are counts over the union of the three files. Simply summing the per-file counts gives larger numbers:
- domains of the following: 1588+288+1462 = 3338
- unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
- unique URL count = 10290 + 2751 + 25683 = 38724
- basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
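
The arithmetic can be checked with a short Python sketch. The numbers are copied from the AllDomainCount runs in these notes. Reading the difference between a sum and the combined count as "entries that occur in more than one file" is an assumption, and the variable names are hypothetical.

```python
# Per-file counts as used in the sums in these notes.
# Note: the keepURLs.txt run itself prints 1464 unique domains;
# the sum in the notes uses 1462, which is kept here.
per_file = {
    "discardURLs.txt": {"domains": 1588, "basic_domains": 1415,
                        "lines": 23794, "urls": 10290, "basic_urls": 9656},
    "greyListed.txt": {"domains": 288, "basic_domains": 277,
                       "lines": 4485, "urls": 2751, "basic_urls": 2727},
    "keepURLs.txt": {"domains": 1462, "basic_domains": 1362,
                     "lines": 47280, "urls": 25683, "basic_urls": 20451},
}

# Counts printed by the combined run over all three files.
combined = {"domains": 3074, "basic_domains": 2791,
            "lines": 75559, "urls": 38717, "basic_urls": 32827}

# Sum of the per-file counts, and how far each sum exceeds the combined count.
sums = {k: sum(f[k] for f in per_file.values()) for k in combined}
excess = {k: sums[k] - combined[k] for k in combined}

for k in combined:
    print(f"{k}: sum={sums[k]} combined={combined[k]} excess={excess[k]}")
```

Only the line count has an excess of 0. The other sums exceed the combined counts by 264 (domains), 263 (basic domains), 7 (URLs) and 7 (basic URLs).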


wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
 Count of unique domains: 1588
 Count of unique basic domains (stripped of protocol and www): 1415
 Line count: 23794
 Actual unique URL count: 10290
 Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
 Count of unique domains: 288
 Count of unique basic domains (stripped of protocol and www): 277
 Line count: 4485
 Actual unique URL count: 2751
 Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
 Count of unique domains: 1464
 Count of unique basic domains (stripped of protocol and www): 1362
 Line count: 47280
 Actual unique URL count: 25683
 Unique basic URL count (stripped of protocol and www): 20451
******************************************************


XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
 Count of unique domains: 1462
 Count of unique basic domains (stripped of protocol and www): 1360
 Line count: 25679
 Actual unique URL count: 25679
 Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.


2a. DISCARDED URLS:
URLs that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt


d. Of the keepURLs, 4 more webpages, from ultimately irrelevant sites, are listed in unprocessed-topsite-matches.txt.

3 are not in MRI and are from the same domain; the other is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
 The following domain with seedURLs are on a major/top 500 site
 for which no allowed URL pattern regex has been specified.
 Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
 http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
 http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
 http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
 https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl


e. After duplicates were further pruned out from what remained of keepURLs, the seedURLs for Nutch:

wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
 Count of domains: 1462
 Count of unique domains: 1360


But anglican.org was wrongly greylisted and added back in
-> 1463 domains.

3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):


wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
 1 1463 10241

(2nd number)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463


/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463


[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]

4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
 1 1447 10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619
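
The two find commands used in steps 3b to 5 can be mirrored in Python, which may be handy where the shell tools are not available. This is only a sketch: the function names are hypothetical, and the root directory passed in is whatever sites/ or crawled/ folder is being inspected.

```python
from pathlib import Path


def count_site_dirs(root):
    """Count immediate subdirectories of root.

    Mirrors `find . -mindepth 1 -maxdepth 1 -type d | wc -l`
    (symlinks to directories are skipped, as find -type d does).
    """
    return sum(1 for p in Path(root).iterdir()
               if p.is_dir() and not p.is_symlink())


def count_named(root, name="UNFINISHED"):
    """Count entries called `name` at any depth under root.

    Mirrors `find . -name "UNFINISHED" | wc -l`.
    """
    return sum(1 for _ in Path(root).rglob(name))
```

Run against the crawled/ folder, count_site_dirs corresponds to the number in step 4 and count_named to the number in step 5.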


6. Number of sites in MongoDB:
1446

Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)

* 01067 is listed under sites crawled, but not ingested into mongodb.

The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
99, 88, 97, 99

and 64/64 sites in siteIDs 1400-1463.

=> 1+12+3+1 = 17 of the 1463 sites not ingested, leaving 1463 - 17 = 1446 sites ingested in mongodb.
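
The tally can be reproduced by expanding the list of missing siteIDs and grouping them by hundreds. A small Python sketch, using only the IDs and totals given in these notes (the helper name expand is hypothetical, and each of the four affected hundreds is assumed to hold 100 siteIDs):

```python
from collections import Counter

MISSING = "00179, 00485-00495, 00499-00502, 01067"
TOTAL_SITES = 1463


def expand(spec):
    """Expand a spec like '00179, 00485-00495' into a list of integer siteIDs."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


missing = expand(MISSING)

# Number of missing siteIDs per block of one hundred (100s, 400s, 500s, 1000s).
missing_per_hundred = Counter(i // 100 * 100 for i in missing)

# Sites present in each affected block, assuming 100 siteIDs per block.
present_per_hundred = {start: 100 - n
                       for start, n in sorted(missing_per_hundred.items())}

ingested = TOTAL_SITES - len(missing)

print(len(missing), present_per_hundred, ingested)
```

The range 00485-00495 spans 11 IDs and 00499-00502 spans 4, but grouped by hundreds they fall as 12 in the 00400s and 3 in the 00500s, which is where the 1+12+3+1 comes from.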


7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
the number of web pages ingested into mongodb is only a little under 5 times as many (119874 / 25679 = approx. 4.67),
because only crawled web pages with non-empty text were ingested into mongodb.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874

---------------------------

#Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
 3276 9828 419259

#Number of dump.txt files (sites) that had text:start in them vs those that didn't:
wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
 1027 1027 15405
wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
 1027 4108 35945

# number of dump.txt files
wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
 1446 1446 24582
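
From the numbers above, 1446 - 1027 = 419 dump.txt files contain no text:start marker. A Python sketch that lists which sites fall on each side, mirroring `fgrep -l text:start */dump.txt`. The function name is hypothetical, and the files are read as bytes on the assumption that dump.txt may contain non-text data (the -a flag in the NOTFOUND command suggests this).

```python
from pathlib import Path


def split_by_marker(crawled_root, marker=b"text:start"):
    """Return (sites_with_marker, sites_without_marker).

    Looks at <crawled_root>/<siteID>/dump.txt only, one level deep,
    and reports the siteID folder names in sorted order.
    """
    with_marker, without_marker = [], []
    for dump in sorted(Path(crawled_root).glob("*/dump.txt")):
        site = dump.parent.name
        if marker in dump.read_bytes():
            with_marker.append(site)
        else:
            without_marker.append(site)
    return with_marker, without_marker
```

The lengths of the two returned lists correspond to the 1027 and the remaining 419 respectively, and the second list shows which siteIDs produced no text at all.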
wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>


Look to see if commoncrawl has a field for how much text there is on the page.
Else this is a useful feature for them to add.
