Timestamp:
2020-02-12T19:02:44+13:00 (4 years ago)
Author:
ak19
Message:
  1. Implemented tables 3 to 5.
  2. Rolled back the introduction of the basicDomain field (the domain stripped of its http/https and www prefixes): the code can create this field and sort on it alphabetically, whereas it did not sort properly in MongoDB.
  3. The code now sorts domains by their protocol- and www-stripped form for the MongoDB queries that produce domain results, and ensures the domain list is unique.
  4. Split the MongoDBAccess class in two: the connection code stays in MongoDBAccess.java, and the querying code moves to MongoDBQueryer (a subclass of MongoDBAccess), which is so far used exclusively by WebPageURLsListing.java.
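A minimal sketch of the in-code sorting described in point 3, not the changeset's actual implementation. The class DomainSortSketch and both of its methods are hypothetical; stripProtocolAndWWW only stands in for Utility.stripProtocolAndWWWFromURL, whose real behaviour is not shown in this changeset. The sketch also assumes that "unique" means exact duplicates are dropped, so http:// and https:// variants of the same site remain separate entries.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class DomainSortSketch {

    /** Hypothetical stand-in for Utility.stripProtocolAndWWWFromURL. */
    static String stripProtocolAndWWW(String domain) {
        String s = domain.trim();
        s = s.replaceFirst("(?i)^https?://", ""); // drop a leading http:// or https://
        s = s.replaceFirst("(?i)^www\\.", "");    // then drop a leading www.
        return s;
    }

    /**
     * Returns the domains without exact duplicates, ordered alphabetically
     * by their stripped form. Domains that share a stripped form are
     * ordered by their full string so that none of them is discarded.
     */
    static List<String> sortedUniqueDomains(List<String> domains) {
        Comparator<String> byStrippedForm = Comparator.comparing(
                DomainSortSketch::stripProtocolAndWWW, String.CASE_INSENSITIVE_ORDER);
        Comparator<String> order =
                byStrippedForm.thenComparing(Comparator.<String>naturalOrder());

        // A TreeSet keeps its elements sorted and rejects elements that
        // compare as equal, which here means exact duplicates only.
        TreeSet<String> unique = new TreeSet<>(order);
        unique.addAll(domains);
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> domains = Arrays.asList(
                "https://www.zebra.nz", "http://apple.nz",
                "https://www.apple.nz", "http://apple.nz", "www.mango.nz");
        System.out.println(sortedUniqueDomains(domains));
    }
}
```

Sorting on a key computed at comparison time means no stripped copy of the domain has to be stored with each record, which is consistent with rolling the extra field back.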
File:
1 edited

  • other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java

    r33906 r33909
    @@ -77,5 +77,5 @@
     
         private String domainOfSite;
    -    private String baseSiteDomain; // domainOfSite stripped of any http(s)://www.
    +    //private String baseSiteDomain; // domainOfSite stripped of any http(s)://www.
         private int numPagesInMRI = 0;
         private int numPagesContainingMRI = 0;
    @@ -203,9 +203,9 @@
             String url = firstPage.getPageURL();
             this.domainOfSite = Utility.getDomainForURL(url, true);
    -        this.baseSiteDomain = Utility.stripProtocolAndWWWFromURL(this.domainOfSite);
    +        //this.baseSiteDomain = Utility.stripProtocolAndWWWFromURL(this.domainOfSite);
         }
         else {
             this.domainOfSite = "UNKNOWN";
    -        this.baseSiteDomain = "UNKNOWN";
    +        //this.baseSiteDomain = "UNKNOWN";
         }
        
    @@ -343,5 +343,5 @@
     
         WebsiteInfo website = new WebsiteInfo(/*SITE_COUNTER,*/ this.siteID,
    -          this.domainOfSite, this.baseSiteDomain,
    +          this.domainOfSite, //this.baseSiteDomain,
               totalPages, this.countOfWebPagesWithBodyText,
               this.numPagesInMRI, this.numPagesContainingMRI,