blacklisted
greylisted
skipped crawling
unfinished (crawling)

Sites crawled and ingested into mongodb:
- domains shortlisted
- not shortlisted


Not included: for sites otherwise too big to crawl exhaustively, only the areas of interest were crawled, not the rest. For example, not all of wikipedia but only mi.wikipedia.org; not all of blogspot, only the blogspot blogs indicated by common crawl results for MRI; not all of docs.google.com, only the specific pages that turned up in common crawl for MRI.

1. ALL DOMAINS FROM CC-CRAWL:

Total counts from CommonCrawl (i.e. unique domain count across discardURLs.txt + greyListed.txt + keepURLs.txt)

wharariki:[1153]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount
Counting all domains and urls in keepURLs.txt + discardURLs.txt + greyListed.txt
Count of unique domains: 3074
Count of unique basic domains (stripped of protocol and www): 2791
Line count: 75559
Actual unique URL count: 38717
Unique basic URL count (stripped of protocol and www): 32827
******************************************************

[X 1588 domains from discardURLs + 288 (-1) greylistedURLs + 1462 (+1) keepURLs = 3338 domains]

The line count above agrees with the sum of the per-file line counts: 23794 + 4485 + 47280 = 75559

The domain and URL counts above, however, are counted over the union of the three files, so they come out lower than the sums of the per-file counts:
- domains: 1588 + 288 + 1462 = 3338
- unique basic domains (stripped of protocol and www): 1415 + 277 + 1362 = 3054
- unique URL count: 10290 + 2751 + 25683 = 38724
- unique basic URL count (stripped of protocol and www): 9656 + 2727 + 20451 = 32834

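To make the sum-versus-union comparison concrete, here is a small arithmetic check in Python. All numbers are copied from the AllDomainCount output in these notes; the dictionary and function names are made up for illustration. A non-zero difference is how many entries are shared between at least two of the three files.

```python
# Per-file counts copied from the AllDomainCount runs in these notes,
# in the order discardURLs.txt, greyListed.txt, keepURLs.txt.
per_file = {
    "lines": (23794, 4485, 47280),
    "unique_basic_domains": (1415, 277, 1362),
    "unique_urls": (10290, 2751, 25683),
    "unique_basic_urls": (9656, 2727, 20451),
}

# Counts reported for the three files taken together.
combined = {
    "lines": 75559,
    "unique_basic_domains": 2791,
    "unique_urls": 38717,
    "unique_basic_urls": 32827,
}


def overlap(name):
    """Sum of the per-file counts minus the combined count for one measure."""
    return sum(per_file[name]) - combined[name]


overlaps = {name: overlap(name) for name in per_file}
for name, value in overlaps.items():
    print(f"{name}: sum={sum(per_file[name])} combined={combined[name]} overlap={value}")
```

Only the line count has zero overlap, as expected, since lines are simply concatenated while the other measures are deduplicated across files.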
wharariki:[1154]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/discardURLs.txt
Counting all domains and urls in discardURLs.txt
Count of unique domains: 1588
Count of unique basic domains (stripped of protocol and www): 1415
Line count: 23794
Actual unique URL count: 10290
Unique basic URL count (stripped of protocol and www): 9656
******************************************************
wharariki:[1155]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/greyListed.txt
Counting all domains and urls in greyListed.txt
Count of unique domains: 288
Count of unique basic domains (stripped of protocol and www): 277
Line count: 4485
Actual unique URL count: 2751
Unique basic URL count (stripped of protocol and www): 2727
******************************************************
wharariki:[1156]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/keepURLs.txt
Counting all domains and urls in keepURLs.txt
Count of unique domains: 1464
Count of unique basic domains (stripped of protocol and www): 1362
Line count: 47280
Actual unique URL count: 25683
Unique basic URL count (stripped of protocol and www): 20451
******************************************************


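For reference, a minimal Python sketch of how the five counts reported above could be computed from a list of URLs. This is not the code of org.greenstone.atea.AllDomainCount, which is not shown in these notes. The function names are hypothetical. It assumes that "domain" means protocol plus host, that "basic" means dropping the protocol and one leading "www.", and that blank lines are not counted.

```python
from urllib.parse import urlsplit


def domain(url):
    """Return protocol plus host, e.g. 'http://www.example.org'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def strip_www(text):
    """Drop one leading 'www.' if present."""
    return text[4:] if text.startswith("www.") else text


def basic_domain(url):
    """Return the host without protocol or leading 'www.'."""
    return strip_www(urlsplit(url).netloc)


def basic_url(url):
    """Return the whole URL without protocol or leading 'www.'."""
    return strip_www(url.split("://", 1)[-1])


def count_all(lines):
    """Compute the five counts for an iterable of URL lines (blank lines skipped)."""
    urls = [line.strip() for line in lines if line.strip()]
    return {
        "lines": len(urls),
        "unique_domains": len({domain(u) for u in urls}),
        "unique_basic_domains": len({basic_domain(u) for u in urls}),
        "unique_urls": len(set(urls)),
        "unique_basic_urls": len({basic_url(u) for u in urls}),
    }


# Made-up sample: the first three lines differ only by protocol, www or repetition.
sample = [
    "http://www.example.org/a",
    "https://example.org/a",
    "https://example.org/a",
    "http://mi.wikipedia.org/wiki/X",
]
counts = count_all(sample)
print(counts)
```

On the sample, the http/www and https variants count as two domains but one basic domain, which is the kind of gap seen between 1588 and 1415 for discardURLs.txt.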
XXXXXXXXXX
wharariki:[1159]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.AllDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
Counting all domains and urls in seedURLs.txt
Count of unique domains: 1462
Count of unique basic domains (stripped of protocol and www): 1360
Line count: 25679
Actual unique URL count: 25679
Unique basic URL count (stripped of protocol and www): 20447
******************************************************
XXXXXXXXXX

seedURLs is a subset of keepURLs.


2a. DISCARDED URLS:
URLs that are blacklisted + those pages with too little text content (under an arbitrary min threshold)
23794

b. GREYLISTED URLS:
> wc -l greyListed.txt
4485


c. keepURLs (the URLs we kept for further processing):
wc -l keepURLs.txt
47280 keepURLs.txt


d. Of the keepURLs, 4 more webpages, from ultimately irrelevant sites, ended up in unprocessed-topsite-matches.txt.

3 are not in MRI but are all from the same domain; one is just a gallery of holiday pictures.

> less unprocessed-topsite-matches.txt
The following domain with seedURLs are on a major/top 500 site
for which no allowed URL pattern regex has been specified.
Specify one for this domain in the tab-spaced sites-too-big-to-exhaustively-crawl.txt file
http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=250&offset=0
http://familypedia.wikia.com/wiki/Property:Mother?limit=500&offset=0
https://get.google.com/albumarchive/112997211423463224598/album/AF1QipM73RVcpCT2gpp5XhDUawnfyUDBbuJbeCEbVckl


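A rough Python sketch of the idea behind the message above: a URL on a too-big site is crawled only if it matches that site's allowed URL pattern, and it is reported as unprocessed if no pattern has been specified. The real format of sites-too-big-to-exhaustively-crawl.txt is not shown in these notes, so the dictionary below is a hypothetical stand-in. The patterns, function names and return labels are assumptions for illustration only.

```python
import re
from urllib.parse import urlsplit

# Hypothetical stand-in for sites-too-big-to-exhaustively-crawl.txt:
# top-site domain -> allowed URL regex, or None when no pattern is specified.
ALLOWED_PATTERNS = {
    "wikipedia.org": r"^https?://mi\.wikipedia\.org/",
    "wikia.com": None,
}


def top_site_for(url):
    """Return the top-site key whose domain matches the URL's host, if any."""
    host = urlsplit(url).hostname or ""
    for site in ALLOWED_PATTERNS:
        if host == site or host.endswith("." + site):
            return site
    return None


def classify(url):
    """Return 'crawl', 'skip' or 'unprocessed' for one seed URL."""
    site = top_site_for(url)
    if site is None:
        return "crawl"  # not on a too-big site
    pattern = ALLOWED_PATTERNS[site]
    if pattern is None:
        return "unprocessed"  # top site with no allowed pattern yet
    return "crawl" if re.search(pattern, url) else "skip"


for url in (
    "http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0",
    "https://mi.wikipedia.org/wiki/Aotearoa",
    "https://en.wikipedia.org/wiki/New_Zealand",
):
    print(classify(url), url)
```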
e. After duplicates were further pruned out from what remained of keepURLs - the seedURLs for Nutch:

wc -l seedURLs.txt
25679 seedURLs.txt

wharariki:[1111]/Scratch/ak19/maori-lang-detection/src>java -cp ".:../conf:../lib/*" org.greenstone.atea.UniqueDomainCount ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt
In file ../tmp/to_crawl.THE_VERSION_USED/seedURLs.txt:
Count of domains: 1462
Count of unique domains: 1360


But anglican.org was wrongly greylisted and added back in
-> 1463 domains.

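The counts in step e are consistent with exact-duplicate pruning plus removal of the 4 URLs in unprocessed-topsite-matches.txt: keepURLs.txt has 25683 unique URLs and 25683 - 4 = 25679, the seedURLs.txt line count. Likewise, 1464 - 2 = 1462 domains and 1362 - 2 = 1360 basic domains, since those 4 URLs sit on 2 domains. The notes do not say that this is how seedURLs.txt was actually produced, so the Python sketch below only illustrates such a pruning step; the function name and sample URLs are made up.

```python
def prune(keep_urls, excluded=()):
    """Drop exact duplicate URLs (keeping first occurrences) and excluded URLs."""
    excluded = set(excluded)
    seen = set()
    seeds = []
    for line in keep_urls:
        url = line.strip()
        if not url or url in seen or url in excluded:
            continue
        seen.add(url)
        seeds.append(url)
    return seeds


# Made-up sample: one exact duplicate and one URL that is to be excluded.
keep = [
    "http://example.org/a",
    "http://example.org/a",
    "http://example.org/b",
    "http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0",
]
unprocessed = ["http://familypedia.wikia.com/wiki/Property:Father?limit=500&offset=0"]

seeds = prune(keep, unprocessed)
print(seeds)
print("subset of keep:", set(seeds) <= set(keep))
print("expected seed count from the notes:", 25683 - 4)
```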
3a. Num URLs prepared for crawling:
wharariki:[119]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>wc -l seedURLs.txt
25679 seedURLs.txt

b. Num sites prepared for crawling (https://stackoverflow.com/questions/17648033/counting-number-of-directories-in-a-specific-directory):


wharariki:[147]/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites>echo */ | wc
1 1463 10241

(the 2nd number, the word count, is the number of directories)
OR: sites>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1463


/Scratch/ak19/maori-lang-detection/tmp/to_crawl.THE_VERSION_USED/sites/ also contains subfolders up to 01463


[maori-lang-detection/tmp/to_crawl.THE_VERSION_USED>emacs all-domain-urls.txt
1462+1 (for the greylisted anglican.org) = 1463]

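The same directory count can also be sketched in Python, which may be handy inside a script. The function name and the temporary demo folder are made up. Note that, unlike `echo */`, this also counts hidden directories, and unlike `find -type d` it follows symlinks to directories.

```python
import tempfile
from pathlib import Path


def count_site_dirs(root):
    """Count the immediate subdirectories of root; plain files are ignored."""
    return sum(1 for entry in Path(root).iterdir() if entry.is_dir())


# Demo on a throwaway folder with three site folders and one plain file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("00001", "00002", "00003"):
        (Path(tmp) / name).mkdir()
    (Path(tmp) / "all-domain-urls.txt").write_text("")
    demo_count = count_site_dirs(tmp)

print(demo_count)
```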
4. Num sites crawled:
wharariki:[155]/Scratch/ak19/maori-lang-detection/crawled>find . -mindepth 1 -maxdepth 1 -type d | wc -l
1447
wharariki:[156]/Scratch/ak19/maori-lang-detection/crawled>echo */ | wc
1 1447 10129

5. Number of sites not finished crawling (using Nutch at max crawl depth 10):
wharariki:[158]/Scratch/ak19/maori-lang-detection/crawled>find . -name "UNFINISHED" | wc -l
619


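A Python sketch of the UNFINISHED count from step 5, plus the share of crawled sites it represents (619 of 1447, from the numbers above). The function name and the demo folder layout are assumptions; the notes only show that a marker called UNFINISHED sits somewhere under each unfinished site's folder.

```python
import tempfile
from pathlib import Path


def count_unfinished(crawled_root, marker="UNFINISHED"):
    """Count entries named marker anywhere under crawled_root."""
    return sum(1 for _ in Path(crawled_root).rglob(marker))


# Demo on a throwaway folder: two of three site folders carry the marker.
with tempfile.TemporaryDirectory() as tmp:
    for name, unfinished in (("00001", True), ("00002", False), ("00003", True)):
        site = Path(tmp) / name
        site.mkdir()
        if unfinished:
            (site / "UNFINISHED").touch()
    demo_unfinished = count_unfinished(tmp)

# Share of crawled sites that did not finish, using the counts in these notes.
share = round(100 * 619 / 1447, 1)
print(demo_unfinished, share)
```

So roughly 43% of the crawled sites hit the depth limit before the crawl finished.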
6. Number of sites in MongoDB:
1446

Not in MongoDB: 00179, 00485-00495, 00499-00502, 01067* (No dump.txt, but website is repeated in 01408)

* 01067 is listed under sites crawled, but not ingested into mongodb.

The siteID ranges 00100s, 00400s, 00500s and 01000s each have fewer than 100 sites in mongodb:
99, 88, 97, 99

and 64/64 sites in siteIDs 1400-1463.

=> 1+12+3+1 = 17 sites of the 1463 were not ingested (16 not crawled, plus 01067, which was crawled but not ingested), leaving 1463 - 17 = 1446 sites ingested in mongodb.


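The bookkeeping in step 6 can be double-checked with a few lines of Python. The list of missing siteIDs is copied from the notes; the helper name is made up. It expands the ID ranges, groups the missing IDs by hundreds block, and subtracts from 100 and from 1463.

```python
from collections import Counter

# Missing siteIDs as listed in the notes (ranges are inclusive).
MISSING = "00179, 00485-00495, 00499-00502, 01067"


def expand(spec):
    """Expand a comma-separated list of IDs and inclusive ID ranges to ints."""
    ids = []
    for part in spec.split(","):
        lo, _, hi = part.strip().partition("-")
        ids.extend(range(int(lo), int(hi or lo) + 1))
    return ids


missing = expand(MISSING)

# Number of missing sites per block of 100 siteIDs (100 = the 00100s, etc.).
per_hundred = Counter(site_id // 100 * 100 for site_id in missing)
present = {block: 100 - n for block, n in sorted(per_hundred.items())}

print(len(missing))         # number of sites not ingested
print(present)              # sites present per affected block
print(1463 - len(missing))  # sites ingested
```

The 00400s block loses 12 rather than 11 because the range 00499-00502 straddles two blocks: 00499 counts towards the 00400s and 00500-00502 towards the 00500s.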
7. Although there were 25679 non-duplicate seedURLs and many more pages were crawled from those seeds,
the number of web pages ingested into mongodb is only a little under 5 times the number of seedURLs (119874 / 25679 ≈ 4.67),
because only crawled web pages with non-empty text were ingested into mongodb.

Num pages in MongoDB:
db.getCollection('Webpages').find({}).count()
119874
