source: other-projects/maori-lang-detection/src/org/greenstone/atea/morphia/WebsiteInfo.java@ 33909

Last change on this file since 33909 was 33909, checked in by ak19, 4 years ago
  1. Implemented tables 3 to 5.
  2. Rolled back the introduction of the basicDomain field (domain stripped of http/https and www prefixes), as the code can create this field and sort it alphabetically, whereas it didn't sort properly in MongoDB.
  3. The code now sorts the domains stripped of protocol and www for the MongoDB queries producing domain results, and ensures the domain list is unique.
  4. Split the MongoDBAccess class in two, with the connection code in MongoDBAccess.java and the querying code in MongoDBQueryer (a subclass of MongoDBAccess), which is so far used exclusively by WebPageURLsListing.java.
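Item 3 of the change note describes sorting domains after stripping the protocol and www prefix, and keeping the list unique. The actual implementation lives in other files of this commit and is not shown here; the following is only a minimal sketch of that idea. The class name DomainSorter and both method names are hypothetical, and the sort is the default case-sensitive String ordering, which is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of the original source tree.
public class DomainSorter {

    /** Strips a leading http:// or https://, then a leading www., from a domain. */
    static String stripProtocolAndWww(String domain) {
        return domain.replaceFirst("^https?://", "").replaceFirst("^www\\.", "");
    }

    /** Returns the stripped domains, de-duplicated and in alphabetical order. */
    static List<String> uniqueSortedBasicDomains(List<String> domains) {
        // A TreeSet both removes duplicates and keeps its elements sorted.
        TreeSet<String> sorted = new TreeSet<>();
        for (String d : domains) {
            sorted.add(stripProtocolAndWww(d));
        }
        return new ArrayList<>(sorted);
    }

    public static void main(String[] args) {
        List<String> domains = Arrays.asList(
            "https://www.example.org",
            "http://example.org",
            "http://www.abc.example.com");
        // prints [abc.example.com, example.org]
        System.out.println(uniqueSortedBasicDomains(domains));
    }
}
```

Doing the stripping and sorting in Java, rather than storing a separate basicDomain field, matches the reason given in item 2 for rolling that field back.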
File size: 1.7 KB
package org.greenstone.atea.morphia;

import dev.morphia.annotations.*;

/** Morphia entity holding per-site crawl statistics, stored in the "Websites" collection. */
@Entity("Websites")
public class WebsiteInfo {
    //public final int id;
    @Id
    public final String siteFolderName;
    public final String domain;
    //public final String basicDomain; // domain without protocol and www. prefix

    public final int totalPages;
    public final int countOfWebPagesWithBodyText;

    // MRI is the ISO 639 language code for Maori
    public final int numPagesInMRI;
    public final int numPagesContainingMRI;

    public final long siteCrawledTimestamp;
    public final boolean siteCrawlUnfinished;
    public final boolean redoCrawl;

    public final String geoLocationCountryCode;
    public final boolean urlContainsLangCodeInPath;

    public WebsiteInfo(/*int siteCount,*/ String siteFolderName,
                       String domainOfSite, //String baseSiteDomain,
                       int totalPages, int countOfWebPagesWithBodyText,
                       int numPagesInMRI, int numPagesContainingMRI,
                       long siteCrawledTimestamp, boolean siteCrawlUnfinished, boolean redoCrawl,
                       String geoLocationCountryCode, boolean urlContainsLangCodeInPath)
    {
        //this.id = siteCount;
        this.siteFolderName = siteFolderName;
        this.domain = domainOfSite;
        //this.basicDomain = baseSiteDomain;

        this.totalPages = totalPages;
        this.countOfWebPagesWithBodyText = countOfWebPagesWithBodyText;

        this.numPagesInMRI = numPagesInMRI;
        this.numPagesContainingMRI = numPagesContainingMRI;

        this.siteCrawledTimestamp = siteCrawledTimestamp;
        this.siteCrawlUnfinished = siteCrawlUnfinished;
        this.redoCrawl = redoCrawl;

        this.geoLocationCountryCode = geoLocationCountryCode;
        this.urlContainsLangCodeInPath = urlContainsLangCodeInPath;
    }
}
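Because the constructor takes several adjacent parameters of the same type, the argument order is easy to get wrong. The sketch below shows one way the class might be populated. It uses a stand-in class, SiteInfo, that mirrors the fields and constructor order above but drops the Morphia annotations so it compiles on its own. SiteInfo, WebsiteInfoDemo, fractionInMRI and every sample value (including the timestamp, whose unit is assumed to be milliseconds) are hypothetical and not taken from the project.

```java
// Illustration only: a Morphia-free stand-in that mirrors WebsiteInfo's constructor order.
public class WebsiteInfoDemo {

    static class SiteInfo {
        final String siteFolderName;
        final String domain;
        final int totalPages;
        final int countOfWebPagesWithBodyText;
        final int numPagesInMRI;
        final int numPagesContainingMRI;
        final long siteCrawledTimestamp;
        final boolean siteCrawlUnfinished;
        final boolean redoCrawl;
        final String geoLocationCountryCode;
        final boolean urlContainsLangCodeInPath;

        SiteInfo(String siteFolderName, String domainOfSite,
                 int totalPages, int countOfWebPagesWithBodyText,
                 int numPagesInMRI, int numPagesContainingMRI,
                 long siteCrawledTimestamp, boolean siteCrawlUnfinished, boolean redoCrawl,
                 String geoLocationCountryCode, boolean urlContainsLangCodeInPath) {
            this.siteFolderName = siteFolderName;
            this.domain = domainOfSite;
            this.totalPages = totalPages;
            this.countOfWebPagesWithBodyText = countOfWebPagesWithBodyText;
            this.numPagesInMRI = numPagesInMRI;
            this.numPagesContainingMRI = numPagesContainingMRI;
            this.siteCrawledTimestamp = siteCrawledTimestamp;
            this.siteCrawlUnfinished = siteCrawlUnfinished;
            this.redoCrawl = redoCrawl;
            this.geoLocationCountryCode = geoLocationCountryCode;
            this.urlContainsLangCodeInPath = urlContainsLangCodeInPath;
        }
    }

    /** Hypothetical helper: share of pages with body text that are in MRI (0.0 if none). */
    static double fractionInMRI(SiteInfo s) {
        if (s.countOfWebPagesWithBodyText == 0) {
            return 0.0;
        }
        return (double) s.numPagesInMRI / s.countOfWebPagesWithBodyText;
    }

    public static void main(String[] args) {
        // All values below are made up for the example.
        SiteInfo site = new SiteInfo("00001", "https://www.example.org",
                40, 20,          // totalPages, countOfWebPagesWithBodyText
                5, 8,            // numPagesInMRI, numPagesContainingMRI
                1577836800000L,  // siteCrawledTimestamp (unit assumed)
                false, false,    // siteCrawlUnfinished, redoCrawl
                "NZ", false);    // geoLocationCountryCode, urlContainsLangCodeInPath
        System.out.println(site.domain + " -> " + fractionInMRI(site));
    }
}
```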