source: other-projects/maori-lang-detection/src/org/greenstone/atea/morphia/WebsiteInfo.java@ 33906

Last change on this file since 33906 was 33906, checked in by ak19, 4 years ago

Code is in an intermediate state. 1. Introduced a basicDomain field in MongoDB and recreated the MongoDB tables/collections. This helps discount domains duplicated under http and https, with and without www. Although webpage URLs may still be unique rather than duplicated across all 4 possible variants, I want them counted under the same base domain name. 2. Another issue noticed now is that some sites appear to be hosted on servers in multiple countries, so slightly different country code counts and domain listings are returned. 3. So added code modifications (untested) to sort the domains alphabetically after stripping the protocol and www. This allows comparing the old domainListing results from MongoDB's now renamed oldWebsites and oldWebpages collections against the new versions of these collections, and then updating the differences in the manual counts.
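The normalisation described in points 1 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code from this changeset: the class name BasicDomainSketch and both method names are hypothetical, and lower-casing the input is an added assumption that the commit message does not mention.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.TreeSet;

// Hypothetical helper for illustration only; not part of the repository.
final class BasicDomainSketch {
    private BasicDomainSketch() {}

    /**
     * Strips a leading "http://" or "https://" and then a leading "www."
     * so that all four variants of a domain map to the same string.
     * Lower-casing is an assumption made for this sketch.
     */
    static String toBasicDomain(String domain) {
        String result = domain.trim().toLowerCase(Locale.ROOT);
        if (result.startsWith("https://")) {
            result = result.substring("https://".length());
        } else if (result.startsWith("http://")) {
            result = result.substring("http://".length());
        }
        if (result.startsWith("www.")) {
            result = result.substring("www.".length());
        }
        return result;
    }

    /** Returns the distinct basic domains in alphabetical order. */
    static List<String> sortedBasicDomains(Collection<String> domains) {
        TreeSet<String> unique = new TreeSet<>(); // sorted, no duplicates
        for (String d : domains) {
            unique.add(toBasicDomain(d));
        }
        return new ArrayList<>(unique);
    }
}
```

Under these assumptions, http://www.example.org, https://www.example.org, http://example.org and https://example.org all reduce to example.org, so two domain listings can be compared line by line once both have been passed through sortedBasicDomains.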

File size: 1.6 KB
package org.greenstone.atea.morphia;

import dev.morphia.annotations.*;

// Morphia entity stored in the MongoDB collection named "Websites"
@Entity("Websites")
public class WebsiteInfo {
    //public final int id;
    // The site folder name serves as the document's id
    @Id
    public final String siteFolderName;
    public final String domain;
    public final String basicDomain; // domain without protocol and www. prefix

    public final int totalPages;
    public final int countOfWebPagesWithBodyText;

    // MRI is the ISO 639 language code for Maori
    public final int numPagesInMRI;
    public final int numPagesContainingMRI;

    public final long siteCrawledTimestamp;
    public final boolean siteCrawlUnfinished;
    public final boolean redoCrawl;

    public final String geoLocationCountryCode;
    public final boolean urlContainsLangCodeInPath;

    public WebsiteInfo(/*int siteCount,*/ String siteFolderName,
                       String domainOfSite, String baseSiteDomain,
                       int totalPages, int countOfWebPagesWithBodyText,
                       int numPagesInMRI, int numPagesContainingMRI,
                       long siteCrawledTimestamp, boolean siteCrawlUnfinished, boolean redoCrawl,
                       String geoLocationCountryCode, boolean urlContainsLangCodeInPath)
    {
        //this.id = siteCount;
        this.siteFolderName = siteFolderName;
        this.domain = domainOfSite;
        this.basicDomain = baseSiteDomain;

        this.totalPages = totalPages;
        this.countOfWebPagesWithBodyText = countOfWebPagesWithBodyText;

        this.numPagesInMRI = numPagesInMRI;
        this.numPagesContainingMRI = numPagesContainingMRI;

        this.siteCrawledTimestamp = siteCrawledTimestamp;
        this.siteCrawlUnfinished = siteCrawlUnfinished;
        this.redoCrawl = redoCrawl;

        this.geoLocationCountryCode = geoLocationCountryCode;
        this.urlContainsLangCodeInPath = urlContainsLangCodeInPath;
    }
}