Timestamp:
2020-02-05T23:36:37+13:00
Author:
ak19
Message:

Code is in an intermediate state.
1. Introduced a basicDomain field to MongoDB and recreated the MongoDB tables/collections. This will help discount duplicated domains that appear under http and https, with and without www. Although webpage URLs may still be unique rather than duplicated across all 4 possible variants, I want them counted under the same base domain name.
2. Another issue noticed now is that some of the sites appear to be hosted on servers in multiple countries, so slightly different country code counts and domain listings are returned.
3. So added code modifications (untested) to sort the domains alphabetically after stripping the protocol and www. This allows comparing the old domainListing results from MongoDB's now-renamed oldWebsites and oldWebpages collections against the new versions of these collections, and then updating the differences in the manual counts.
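The normalisation and comparison described in the message could look roughly like the sketch below. It is not the code from this changeset: the class name `BasicDomainSketch`, the helper names, and the sample domains are all hypothetical, and lowercasing the domain is an added assumption that the message does not mention.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Illustrative sketch only; none of these names come from the changeset.
public class BasicDomainSketch {

    // Hypothetical helper: strip a leading http:// or https://, then a leading "www."
    // Lowercasing is an assumption made here so that case variants also collapse.
    public static String toBasicDomain(String domain) {
        String d = domain.trim().toLowerCase(Locale.ROOT);
        d = d.replaceFirst("^https?://", "");
        d = d.replaceFirst("^www\\.", "");
        return d;
    }

    // A TreeSet keeps the basic domains in alphabetical order and drops duplicates,
    // so the up-to-four protocol/www variants of a site collapse into one entry.
    public static TreeSet<String> sortedBasicDomains(List<String> domains) {
        TreeSet<String> result = new TreeSet<>();
        for (String d : domains) {
            result.add(toBasicDomain(d));
        }
        return result;
    }

    // Basic domains present in 'first' but missing from 'second'.
    public static Set<String> onlyInFirst(Set<String> first, Set<String> second) {
        TreeSet<String> diff = new TreeSet<>(first);
        diff.removeAll(second);
        return diff;
    }

    public static void main(String[] args) {
        // Made-up listings standing in for the old and new domain listings.
        List<String> oldListing = Arrays.asList(
            "http://www.example.org", "https://example.org", "http://sample.co.nz");
        List<String> newListing = Arrays.asList(
            "https://www.example.org", "http://www.another.nz");

        TreeSet<String> oldDomains = sortedBasicDomains(oldListing);
        TreeSet<String> newDomains = sortedBasicDomains(newListing);

        System.out.println(oldDomains);                          // [example.org, sample.co.nz]
        System.out.println(newDomains);                          // [another.nz, example.org]
        System.out.println(onlyInFirst(oldDomains, newDomains)); // [sample.co.nz]
        System.out.println(onlyInFirst(newDomains, oldDomains)); // [another.nz]
    }
}
```

Comparing sorted, de-duplicated sets in both directions gives the domains that dropped out and the domains that are new, which is the kind of difference the manual counts would need to be updated for.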

File:
1 edited

  • other-projects/maori-lang-detection/src/org/greenstone/atea/morphia/WebsiteInfo.java

--- r33811
+++ r33906
@@ -9,4 +9,5 @@
     public final String siteFolderName;
     public final String domain;
+    public final String basicDomain; // domain without protocol and www. prefix
 
     public final int totalPages;
@@ -23,5 +24,6 @@
     public final boolean urlContainsLangCodeInPath;
 
-    public WebsiteInfo(/*int siteCount,*/ String siteFolderName, String domainOfSite,
+    public WebsiteInfo(/*int siteCount,*/ String siteFolderName,
+               String domainOfSite, String baseSiteDomain,
                int totalPages, int countOfWebPagesWithBodyText,
                int numPagesInMRI, int numPagesContainingMRI,
@@ -32,4 +34,5 @@
     this.siteFolderName = siteFolderName;
     this.domain = domainOfSite;
+    this.basicDomain = baseSiteDomain;
 
     this.totalPages = totalPages;