source: other-projects/maori-lang-detection/mongodb-data/counts_sitesWithPagesContainingMRI.json@ 33813

Last change on this file since 33813 was 33813, checked in by ak19, 4 years ago

With yesterday's bugfix, the inclusion of http(s):mi.* type URLs when setting the Websites mongodb collection's urlContainsLangCodeInPath property, and the updated/improved mongodb queries and their results, I have now regenerated the latest geojson json data and maps.

File size: 2.3 KB
/*
Number of sites containing at least one sentence for which OpenNLP detected the best language = MRI

db.getCollection('Websites').find({numPagesContainingMRI: {$gt: 0}}).count()
868


Obviously, the following should be equal to that:

db.getCollection('Websites').find({ $or: [ { numPagesInMRI: { $gt: 0 } }, { numPagesContainingMRI: {$gt: 0} } ] } ).count()
868


Count of country codes for sites that have at least one page containing at least one sentence detected as MRI by OpenNLP:

db.Websites.aggregate([
    {
        $match: {
            numPagesContainingMRI: {$gt: 0}
        }
    },
    { $unwind: "$geoLocationCountryCode" },
    {
        $group: {
            _id: {$toLower: '$geoLocationCountryCode'},
            count: { $sum: 1 }
        }
    },
    { $sort : { count : -1} }
]);

*/
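
For readers without a MongoDB instance at hand, the aggregation in the comment above can be approximated in plain JavaScript. This is only a sketch of what the four stages do, not the project's code: the `websites` array and its values are made up for illustration, and `countByCountry` is a hypothetical helper name. Only the field names `numPagesContainingMRI` and `geoLocationCountryCode` come from the query itself.

```javascript
// Hypothetical sample documents; field names follow the query above,
// the values are invented for illustration only.
const websites = [
  { numPagesContainingMRI: 3, geoLocationCountryCode: "NZ" },
  { numPagesContainingMRI: 0, geoLocationCountryCode: "US" },
  { numPagesContainingMRI: 1, geoLocationCountryCode: "nz" },
  { numPagesContainingMRI: 7, geoLocationCountryCode: "US" },
];

// In-memory approximation of $match -> $unwind -> $group -> $sort.
function countByCountry(docs) {
  const counts = new Map();
  for (const doc of docs) {
    // $match: keep only sites with at least one page containing MRI
    if (!(doc.numPagesContainingMRI > 0)) continue;

    // $unwind: a scalar behaves like a one-element array;
    // missing, null or empty values produce no output document
    const raw = doc.geoLocationCountryCode;
    const codes = Array.isArray(raw) ? raw : raw == null ? [] : [raw];

    for (const code of codes) {
      // $group: key on the lower-cased code, add 1 per document ($sum: 1)
      const key = String(code).toLowerCase();
      counts.set(key, (counts.get(key) || 0) + 1);
    }
  }
  // $sort: descending by count
  return [...counts.entries()]
    .map(([_id, count]) => ({ _id, count }))
    .sort((a, b) => b.count - a.count);
}

const result = countByCountry(websites);
console.log(result);
```

With the sample data, the site with `numPagesContainingMRI: 0` is dropped by the match step, and "NZ" and "nz" fall into the same group because of the lower-casing.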

/* 1 */
{
    "_id" : "us",
    "count" : 486.0
}

/* 2 */
{
    "_id" : "cn",
    "count" : 114.0
}

/* 3 */
{
    "_id" : "nz",
    "count" : 89.0
}

/* 4 */
{
    "_id" : "fr",
    "count" : 36.0
}

/* 5 */
{
    "_id" : "de",
    "count" : 27.0
}

/* 6 */
{
    "_id" : "nl",
    "count" : 22.0
}

/* 7 */
{
    "_id" : "au",
    "count" : 21.0
}

/* 8 */
{
    "_id" : "ca",
    "count" : 12.0
}

/* 9 */
{
    "_id" : "dk",
    "count" : 8.0
}

/* 10 */
{
    "_id" : "es",
    "count" : 7.0
}

/* 11 */
{
    "_id" : "gb",
    "count" : 7.0
}

/* 12 */
{
    "_id" : "cz",
    "count" : 4.0
}

/* 13 */
{
    "_id" : "unknown",
    "count" : 3.0
}

/* 14 */
{
    "_id" : "at",
    "count" : 3.0
}

/* 15 */
{
    "_id" : "ro",
    "count" : 3.0
}

/* 16 */
{
    "_id" : "it",
    "count" : 3.0
}

/* 17 */
{
    "_id" : "sg",
    "count" : 2.0
}

/* 18 */
{
    "_id" : "jp",
    "count" : 2.0
}

/* 19 */
{
    "_id" : "ie",
    "count" : 2.0
}

/* 20 */
{
    "_id" : "hk",
    "count" : 2.0
}

/* 21 */
{
    "_id" : "ua",
    "count" : 2.0
}

/* 22 */
{
    "_id" : "ru",
    "count" : 2.0
}

/* 23 */
{
    "_id" : "ch",
    "count" : 2.0
}

/* 24 */
{
    "_id" : "il",
    "count" : 2.0
}

/* 25 */
{
    "_id" : "tr",
    "count" : 1.0
}

/* 26 */
{
    "_id" : "mx",
    "count" : 1.0
}

/* 27 */
{
    "_id" : "ir",
    "count" : 1.0
}

/* 28 */
{
    "_id" : "gr",
    "count" : 1.0
}

/* 29 */
{
    "_id" : "bg",
    "count" : 1.0
}

/* 30 */
{
    "_id" : "eu",
    "count" : 1.0
}

/* 31 */
{
    "_id" : "fi",
    "count" : 1.0
}
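
As a sanity check, the per-country counts listed above can be summed and compared with the 868 returned by the count() query in the header comment. The snippet below is a sketch in plain JavaScript, not part of the original file. The numbers are copied from the results above; the variable names are illustrative. The two figures should only agree if every matching site contributes exactly one country code after the $unwind stage, which is an assumption about the data rather than something the file states.

```javascript
// Counts copied from the aggregation results listed above.
const countsByCountry = {
  us: 486, cn: 114, nz: 89, fr: 36, de: 27, nl: 22, au: 21, ca: 12,
  dk: 8, es: 7, gb: 7, cz: 4, unknown: 3, at: 3, ro: 3, it: 3,
  sg: 2, jp: 2, ie: 2, hk: 2, ua: 2, ru: 2, ch: 2, il: 2,
  tr: 1, mx: 1, ir: 1, gr: 1, bg: 1, eu: 1, fi: 1,
};

// Value reported by find({numPagesContainingMRI: {$gt: 0}}).count()
const expectedSiteCount = 868;

const groupCount = Object.keys(countsByCountry).length;
const total = Object.values(countsByCountry).reduce((sum, n) => sum + n, 0);

console.log(`groups: ${groupCount}, total: ${total}`);
console.log(total === expectedSiteCount ? "totals match" : "totals differ");
```

Here the 31 groups do add up to 868, so no site appears to be lost or double-counted between the two queries.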