1 | "UPPER BOUND"
|
---|
2 |
|
---|
3 | blacklisted
|
---|
4 | greylisted
|
---|
5 | skipped crawling
|
---|
6 | unfinished (crawling)
|
---|
7 |
|
---|
8 | Sites crawled and ingested into mongodb:
|
---|
9 | - domains shortlisted
|
---|
10 | - not shortlisted
|
---|
11 |
|
---|
12 |
|
---|
13 | Not included: for sites otherwise too big to crawl exhaustively, only the areas of interest were crawled, not the rest. For example, only mi.wikipedia.org rather than all of wikipedia; only the blogspot blogs indicated by common crawl results for MRI rather than all of blogspot; only the specific docs.google.com pages that turned up in common crawl for MRI rather than all of docs.google.com.
|
---|
14 |
|
---|
15 |
|
---|
16 | 1. ALL DOMAINS FROM CC-CRAWL:
|
---|
17 |
|
---|
18 | Total counts from CommonCrawl (i.e. unique domain count across discardURLs.txt + greyListed.txt + keepURLs.txt)
|
---|
19 |
|
---|
20 | wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
|
---|
21 | Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
|
---|
22 | Count of unique domains: 3074
|
---|
23 | Count of unique basic domains (stripped of protocol and www): 2791
|
---|
24 | Line count: 75559
|
---|
25 | Actual unique URL count: 38717
|
---|
26 | Unique basic URL count (stripped of protocol and www): 32827
|
---|
27 | ******************************************************
|
---|
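A rough sketch of the kind of counting reported above, for cross-checking on a small sample. It is not the AllDomainCount source: the function names are made up here, and it assumes a "domain" is protocol + host while a "basic domain" is the host with the protocol and a leading www. removed.

```python
from urllib.parse import urlsplit


def domain_of(url):
    """Protocol + host, e.g. 'http://www.example.org' (assumed meaning of 'domain')."""
    parts = urlsplit(url.strip())
    return f"{parts.scheme}://{parts.netloc.lower()}"


def basic_domain_of(url):
    """Host only, with the protocol and a leading 'www.' stripped."""
    host = urlsplit(url.strip()).netloc.lower()
    return host[4:] if host.startswith("www.") else host


def basic_url_of(url):
    """The URL with the protocol and a leading 'www.' stripped."""
    rest = url.strip().split("://", 1)[-1]
    return rest[4:] if rest.startswith("www.") else rest


def count_domains(lines):
    """Return the five counts for an iterable of URL lines (blank lines ignored)."""
    urls = [line.strip() for line in lines if line.strip()]
    return {
        "unique_domains": len({domain_of(u) for u in urls}),
        "unique_basic_domains": len({basic_domain_of(u) for u in urls}),
        "line_count": len(urls),
        "unique_urls": len(set(urls)),
        "unique_basic_urls": len({basic_url_of(u) for u in urls}),
    }
```

To run it on one of the real files, pass the open file object, e.g. count_domains(open("keepURLs.txt")). The result will only match the Java tool's output if the assumptions above match what the tool does.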
28 |
|
---|
29 | [X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]
|
---|
30 |
|
---|
31 | Line count above correct with the following: 23794+4485+47280=75559
|
---|
32 |
|
---|
33 | But the domain/unique domain/URL/basic unique URL counts above are not sums of the per-file counts; they are counts over the union of the three files. The per-file sums are:
|
---|
34 | - domains of the following: 1588+288+1462 = 3338
|
---|
35 | - unique basic domains of the following (stripped of protocol and www): 1415+277+1362 = 3054
|
---|
36 | - basic URL count = 10290 + 2751 + 25683 = 38724
|
---|
37 | - basic unique URL count (stripped of protocol and www) = 9656 + 2727 + 20451 = 32834
|
---|
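Why the combined counts are smaller than the per-file sums: anything that occurs in more than one of the three files is counted once in the union but once per file in the sum. A small sketch with made-up domains, followed by the differences worked out from the figures recorded above (the function name and the toy sets are illustrative only):

```python
def sum_and_union(*sets):
    """Return (sum of the individual set sizes, size of their union)."""
    return sum(len(s) for s in sets), len(set().union(*sets))


# Toy data: a.org and c.org each appear in two files.
discard = {"a.org", "b.org", "c.org"}
grey = {"c.org", "d.org"}
keep = {"a.org", "e.org"}
toy_sum, toy_union = sum_and_union(discard, grey, keep)  # 7 and 5

# Figures from the notes: (per-file sum, count over all three files together).
recorded = {
    "domains": (1588 + 288 + 1462, 3074),
    "basic domains": (1415 + 277 + 1362, 2791),
    "unique URLs": (10290 + 2751 + 25683, 38717),
    "unique basic URLs": (9656 + 2727 + 20451, 32827),
}

# Excess of the sum over the union = repeat occurrences across files.
excess = {name: total - union for name, (total, union) in recorded.items()}
```

With these figures the excess is 264 domains, 263 basic domains, 7 URLs and 7 basic URLs, i.e. very few URLs are shared between the files, but a few hundred domains are.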
38 |
|
---|
39 |
|
---|
40 | wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
|
---|
41 | Counting all domains and urls in discardURLs.txt
|
---|
42 | Count of unique domains: 1588
|
---|
43 | Count of unique basic domains (stripped of protocol and www): 1415
|
---|
44 | Line count: 23794
|
---|
45 | Actual unique URL count: 10290
|
---|
46 | Unique basic URL count (stripped of protocol and www): 9656
|
---|
47 | ******************************************************
|
---|
48 | wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
|
---|
49 | Counting all domains and urls in greyListed.txt
|
---|
50 | Count of unique domains: 288
|
---|
51 | Count of unique basic domains (stripped of protocol and www): 277
|
---|
52 | Line count: 4485
|
---|
53 | Actual unique URL count: 2751
|
---|
54 | Unique basic URL count (stripped of protocol and www): 2727
|
---|
55 | ******************************************************
|
---|
56 | wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
|
---|
57 | Counting all domains and urls in keepURLs.txt
|
---|
58 | Count of unique domains: 1464
|
---|
59 | Count of unique basic domains (stripped of protocol and www): 1362
|
---|
60 | Line count: 47280
|
---|
61 | Actual unique URL count: 25683
|
---|
62 | Unique basic URL count (stripped of protocol and www): 20451
|
---|
63 | ******************************************************
|
---|
64 |
|
---|
65 |
|
---|
66 | XXXXXXXXXX
|
---|
67 | wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
|
---|
68 | Counting all domains and urls in seedURLs.txt
|
---|
69 | Count of unique domains: 1462
|
---|
70 | Count of unique basic domains (stripped of protocol and www): 1360
|
---|
71 | Line count: 25679
|
---|
72 | Actual unique URL count: 25679
|
---|
73 | Unique basic URL count (stripped of protocol and www): 20447
|
---|
74 | ******************************************************
|
---|
75 | XXXXXXXXXX
|
---|
76 |
|
---|
77 | seedURLs is a subset of keepURLs.
|
---|
78 |
|
---|
79 |
|
---|
80 | 2a. DISCARDED URLS:
|
---|
81 | URLS that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
|
---|
82 | 23794
|
---|
83 |
|
---|
84 | b. GREYLISTED URLS:
|
---|
85 | > wc -l greyListed.txt
|
---|
86 | 4485
|
---|
87 |
|
---|
88 |
|
---|
89 | c. keepURLs (the URLs we kept for further processing):
|
---|
90 | wc -l keepURLs.txt
|
---|
91 | 47280 keepURLs.txt
|
---|
92 |
|
---|
93 |
|
---|
94 | d. Of the keepURLs, 4 more webpages from ultimately irrelevant sites are listed in unprocessed-topsite-matches.txt.
|
---|
95 |
|
---|
96 | 3 are not in MRI but are of the same domain; one is just a gallery of holiday pictures.
|
---|
97 |
|
---|
98 | > less unprocessed-topsite-matches.txt
|
---|
99 | The following domain with seedURLs are on a major/top 500 site
|
---|
100 | for which no allowed URL pattern regex has been specified.
|
---|
101 | Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
|
---|
102 | http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
|
---|
103 | http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
|
---|
104 | http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
|
---|
105 | https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl
|
---|
106 |
|
---|
107 |
|
---|
108 | e. After duplicates were further pruned out from what remained of keepURLs, the seedURLs for Nutch:
|
---|
109 |
|
---|
110 | wc -l seedURLs.txt
|
---|
111 | 25679 seedURLs.txt
|
---|
112 |
|
---|
113 | wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
|
---|
114 | In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
|
---|
115 | Count of domains: 1462
|
---|
116 | Count of unique domains: 1360
|
---|
117 |
|
---|
118 |
|
---|
119 | But anglican.org was wrongly greylisted and added back in
|
---|
120 | -> 1463 domains.
|
---|
121 |
|
---|
122 | 3a. Num URLs prepared for crawling:
|
---|
123 | wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
|
---|
124 | 25679 seedURLs.txt
|
---|
125 |
|
---|
126 | b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):
|
---|
127 |
|
---|
128 |
|
---|
129 | wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
|
---|
130 | 1 1463 10241
|
---|
131 |
|
---|
132 | (2nd number)
|
---|
133 | OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
|
---|
134 | 1463
|
---|
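For reference, the same count of immediate subfolders can be sketched in Python. The function name is made up here. Like find -type d, it counts hidden folders and does not count symlinks to folders, so it could differ from the echo */ variant if either is present.

```python
from pathlib import Path


def count_subdirs(parent):
    """Count the immediate subdirectories of parent (no recursion, no symlinks)."""
    return sum(
        1
        for entry in Path(parent).iterdir()
        if entry.is_dir() and not entry.is_symlink()
    )
```

Usage would be count_subdirs("sites") from within the to_crawl.THE_VERSION_USED folder.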
135 |
|
---|
136 |
|
---|
137 | /Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463
|
---|
138 |
|
---|
139 |
|
---|
140 | [maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
|
---|
141 | 1462+1 (for the greylisted anglican.org) = 1463]
|
---|
142 |
|
---|
143 | 4. Num sites crawled:
|
---|
144 | wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
|
---|
145 | 1447
|
---|
146 | wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
|
---|
147 | 1 1447 10129
|
---|
148 |
|
---|
149 | 5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
|
---|
150 | wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
|
---|
151 | 619
|
---|
152 |
|
---|
153 |
|
---|
154 | 6. Number of sites in MongoDB:
|
---|
155 | 1446
|
---|
156 |
|
---|
157 | Not: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)
|
---|
158 |
|
---|
159 | * 01067 is listed under sites crawled, but not ingested into mongodb.
|
---|
160 |
|
---|
161 | The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
|
---|
162 | 99, 88, 97, 99
|
---|
163 |
|
---|
164 | and 64/64 sites in siteIDs 1400-1463.
|
---|
165 |
|
---|
166 | => 1+12+3+1 = 17 of the 1463 sites are not in mongodb (16 not crawled, plus 01067 which was crawled but not ingested), leaving 1463 - 17 = 1446 sites ingested in mongodb.
|
---|
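The arithmetic above can be double-checked by expanding the list of missing siteIDs and grouping them by hundreds range. A small sketch (the helper name is made up; the siteID list and the totals are the ones recorded in these notes):

```python
from collections import Counter


def expand_site_ids(spec):
    """Expand e.g. '00179, 00485-00495' into a list of integer siteIDs."""
    ids = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            ids.extend(range(int(first), int(last) + 1))  # inclusive range
        else:
            ids.append(int(part))
    return ids


missing = expand_site_ids("00179, 00485-00495, 00499-00502, 01067")

# Missing sites per hundreds range: 00100s, 00400s, 00500s, 01000s.
missing_per_range = Counter(site_id // 100 * 100 for site_id in missing)

# Sites present in each of those (full) ranges of 100.
present_per_range = {start: 100 - n for start, n in sorted(missing_per_range.items())}

sites_prepared = 1463
sites_in_mongodb = sites_prepared - len(missing)
```

This gives 17 missing siteIDs, split 1 + 12 + 3 + 1 across the four ranges, so 99, 88, 97 and 99 sites present, and 1446 sites in total.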
167 |
|
---|
168 |
|
---|
169 | 7. Despite 25679 non-duplicate seedURLs and many more pages crawled from those seeds,
|
---|
170 | the number of web pages ingested into mongodb is only a little under 5 times that number (119874 / 25679 is about 4.67),
|
---|
171 | because only crawled web pages with non-empty text were ingested into mongodb.
|
---|
172 |
|
---|
173 | Num pages in MongoDB:
|
---|
174 | db.getCollection('Webpages').find({}).count()
|
---|
175 | 119874
|
---|
176 |
|
---|
177 | ---------------------------
|
---|
178 |
|
---|
179 | #Number of crawled pages with 0 content in dump.txt because the page was inaccessible when crawling (protocolStatus: NOTFOUND)
|
---|
180 | wharariki:[646]/Scratch/ak19/maori-lang-detection/crawled>fgrep -a 'NOTFOUND' 0*/dump.txt | grep protocolStatus | wc
|
---|
181 | 3276 9828 419259
|
---|
182 |
|
---|
183 | #Number of dump.txt files (sites) that had text:start in them vs those that didn't:
|
---|
184 | wharariki:[647]/Scratch/ak19/maori-lang-detection/crawled>fgrep -l text:start */dump.txt | wc
|
---|
185 | 1027 1027 15405
|
---|
186 | wharariki:[648]/Scratch/ak19/maori-lang-detection/crawled>fgrep text:start */dump.txt | wc
|
---|
187 | 1027 4108 35945
|
---|
188 |
|
---|
189 | # number of dump.txt files
|
---|
190 | wharariki:[652]/Scratch/ak19/maori-lang-detection/crawled>find . -name "dump.txt" | wc
|
---|
191 | 1446 1446 24582
|
---|
192 | wharariki:[653]/Scratch/ak19/maori-lang-detection/crawled>
|
---|
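From the figures above, 1446 - 1027 = 419 dump.txt files contain no text:start at all. A sketch of the same two counts in Python, for re-checking without fgrep. The function names are made up here, and it looks only at <site>/dump.txt directly under the given folder, as the */dump.txt pattern does.

```python
from pathlib import Path


def file_contains(path, marker):
    """True if any line of the file contains marker (read as bytes, like fgrep -a)."""
    with open(path, "rb") as handle:
        return any(marker in line for line in handle)


def dump_stats(crawled_dir, marker=b"text:start"):
    """Return (number of */dump.txt files, number of those containing marker)."""
    dumps = sorted(Path(crawled_dir).glob("*/dump.txt"))
    with_text = sum(1 for dump in dumps if file_contains(dump, marker))
    return len(dumps), with_text
```

Usage would be total, with_text = dump_stats("crawled"); total - with_text is then the number of sites whose dump.txt has no text:start.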
193 |
|
---|
194 |
|
---|
195 | Look to see if commoncrawl has a field for how much text there is on the page.
|
---|
196 | If not, this would be a useful feature for them to add.
|
---|
197 |
|
---|
198 |
|
---|