Timestamp:
2020-02-05T23:36:37+13:00
Author:
ak19
Message:

Code is in an intermediate state.
1. Introduced a basicDomain field to MongoDB and recreated the MongoDB tables/collections. This helps discount domains duplicated across http and https, with and without www. Although webpage URLs may still be unique rather than duplicated across all 4 possible variants, I want them counted under the same base domain name.
2. Another issue noticed now is that some of the sites appear to be hosted on servers in multiple countries, so slightly different country code counts and domain listings are returned.
3. Added (untested) code modifications to sort the domains alphabetically after stripping the protocol and www. This allows comparing the old domainListing results from MongoDB's now-renamed oldWebsites and oldWebpages collections against the new versions of these collections, and then updating the differences in the manual counts.
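
The normalisation and comparison described in the message can be sketched roughly as follows. This is only an illustration under stated assumptions: the changeset does not show the body of Utility.stripProtocolAndWWWFromURL, so stripProtocolAndWww below is a hypothetical stand-in, and BasicDomainSketch, sortedBasicDomains and missingFrom are names invented for this example. It assumes the protocol and a leading "www." are removed by plain, case-sensitive prefix checks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

final class BasicDomainSketch {

    // Hypothetical stand-in for Utility.stripProtocolAndWWWFromURL:
    // drops a leading http:// or https://, then a leading "www.".
    static String stripProtocolAndWww(String domain) {
        String d = domain.trim();
        if (d.startsWith("https://")) {
            d = d.substring("https://".length());
        } else if (d.startsWith("http://")) {
            d = d.substring("http://".length());
        }
        if (d.startsWith("www.")) {
            d = d.substring("www.".length());
        }
        return d;
    }

    // Collapses the http/https and www/non-www variants into one entry
    // per base domain and returns them in natural String order.
    static List<String> sortedBasicDomains(List<String> domains) {
        TreeSet<String> unique = new TreeSet<>();
        for (String domain : domains) {
            unique.add(stripProtocolAndWww(domain));
        }
        return new ArrayList<>(unique);
    }

    // Base domains present in the old listing but absent from the new one.
    static List<String> missingFrom(List<String> oldDomains, List<String> newDomains) {
        TreeSet<String> diff = new TreeSet<>(sortedBasicDomains(oldDomains));
        diff.removeAll(sortedBasicDomains(newDomains));
        return new ArrayList<>(diff);
    }

    public static void main(String[] args) {
        List<String> oldListing = List.of("http://www.a.example", "https://c.example");
        List<String> newListing = List.of("https://a.example");
        System.out.println(sortedBasicDomains(oldListing));
        System.out.println(missingFrom(oldListing, newListing));
    }
}
```

Sorting on the stripped form puts the variants of one site next to each other, which should make a manual comparison of the old and new listings easier. Natural String order sorts uppercase before lowercase, so a case-insensitive comparator may be preferable if the listings mix cases.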

File:
1 edited

  • other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java

--- r33811
+++ r33906
@@ -75,6 +75,7 @@
     private String geoLocationCountryCode = null; /** 2 letter country code */
     private boolean urlContainsLangCodeInPath = false; /** If any URL on this site contains a /mi(/) or http(s)://mi.* in its URL path */
-    
+
     private String domainOfSite;
+    private String baseSiteDomain; // domainOfSite stripped of any http(s)://www.
     private int numPagesInMRI = 0;
     private int numPagesContainingMRI = 0;
@@ -202,7 +203,9 @@
         String url = firstPage.getPageURL();
         this.domainOfSite = Utility.getDomainForURL(url, true);
+        this.baseSiteDomain = Utility.stripProtocolAndWWWFromURL(this.domainOfSite);
     }
     else {
         this.domainOfSite = "UNKNOWN";
+        this.baseSiteDomain = "UNKNOWN";
     }
 
@@ -339,5 +342,6 @@
     int totalPages = pages.size();
 
-    WebsiteInfo website = new WebsiteInfo(/*SITE_COUNTER,*/ this.siteID, this.domainOfSite,
+    WebsiteInfo website = new WebsiteInfo(/*SITE_COUNTER,*/ this.siteID,
+          this.domainOfSite, this.baseSiteDomain,
           totalPages, this.countOfWebPagesWithBodyText,
           this.numPagesInMRI, this.numPagesContainingMRI,