Timestamp:
2020-02-05T23:36:37+13:00 (4 years ago)
Author:
ak19
Message:

Code is in an intermediate state.
1. Introduced a basicDomain field in MongoDB and recreated the MongoDB tables/collections. This helps avoid counting the same domain more than once when it appears under http and https, with and without www. Although webpage URLs may still be unique (that is, not duplicated across all 4 possible variants), I want them counted under the same base domain name.
2. Another issue noticed now is that some sites appear to be hosted on servers in multiple countries, so slightly different country code counts and domain listings are returned.
3. Added code modifications (untested) to sort the domains alphabetically after stripping the protocol and www. This allows comparing the old domainListing results from MongoDB's oldWebsites and oldWebpages collections (the renamed originals) with the new versions of these collections, and then updating the differences in the manual counts.
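The normalisation described in points 1 and 3 can be sketched roughly as follows. This is only an illustration, not the code from this changeset: the class name BasicDomainSketch, the methods toBasicDomain and sortedBasicDomains, and the example.org/example.com URLs are all hypothetical. Lowercasing the input and cutting off everything after the first slash are also assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical sketch: none of these names come from the changeset.
class BasicDomainSketch {

    /** Strips a leading http:// or https:// and a leading "www." so that
     *  all 4 variants of a site map to the same base domain name. */
    static String toBasicDomain(String url) {
        String s = url.trim().toLowerCase();       // assumption: compare case-insensitively
        s = s.replaceFirst("^https?://", "");      // drop the protocol
        s = s.replaceFirst("^www\\.", "");         // drop a leading www.
        int slash = s.indexOf('/');
        if (slash >= 0) {
            s = s.substring(0, slash);             // assumption: keep only the host part
        }
        return s;
    }

    /** Returns the distinct basic domains in alphabetical order,
     *  which makes an old and a new domain listing easy to compare. */
    static List<String> sortedBasicDomains(List<String> urls) {
        TreeSet<String> unique = new TreeSet<>();  // TreeSet sorts and removes duplicates
        for (String u : urls) {
            unique.add(toBasicDomain(u));
        }
        return new ArrayList<>(unique);
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        urls.add("https://www.example.org/some/page");
        urls.add("http://example.org");
        urls.add("http://www.example.com/");
        System.out.println(sortedBasicDomains(urls));
    }
}
```

With both listings reduced to the same sorted form, differences between the old and new collections should show up as lines present in one list but not the other.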

File:
1 edited

  • other-projects/maori-lang-detection/src/org/greenstone/atea/WebPageURLsListing.java

    r33887 r33906
    57     57
    58     58     public void produceURLsForPagesInMRI(File domainsFile) {
    59            ArrayList<Tuple> urlsList = getURLsForWebPages(MongoDBAccess.IS_MRI, domainsFile);
           59     ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBAccess.IS_MRI, domainsFile);
    60     60     File outFile = new File(outFolder, "isMRI_"+domainsFile.getName());
    61     61     writeURLsToFile(urlsList, outFile, urlsList.size());

    66     66
    67     67     public void produceURLsForPagesContainingMRI(File domainsFile) {
    68            ArrayList<Tuple> urlsList = getURLsForWebPages(MongoDBAccess.CONTAINS_MRI, domainsFile);
           68     ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBAccess.CONTAINS_MRI, domainsFile);
    69     69     File outFile = new File(outFolder, "containsMRI_"+domainsFile.getName());
    70     70     writeURLsToFile(urlsList, outFile, urlsList.size());

    74     74     }
    75     75
    76            private ArrayList<Tuple> getURLsForWebPages(int filterType, File domainsFile) {
           76     private ArrayList<Tuple> getURLsForAllWebPagesInSiteListing(int filterType, File domainsFile) {
    77     77     ArrayList<Tuple> urlsList = new ArrayList<Tuple>();
    78     78

    120    120    }
    121    121
    122           /** Given a hand curated list of NZ sites with positive numPagesContainingMRI,
    123            * get a listing of all their web pages IN_MRI (or CONTAINS_MRI?).
    124            * Total all these pages in MRI (N), then work out the correct sample size (n)
           122    /** Given a hand curated list of all sites with positive numPagesContainingMRI
           123     * determined by manual inspection, get a listing of all their web pages that
           124     * are IN_MRI (or CONTAINS_MRI?).
           125     * Total all these pages that are inMRI (N), then work out the correct sample size (n)
    125    126     * at 90% confidence with 5% margin of error. Then generate a random listing
    126    127     * of n of these pages in MRI of these trusted sites and output to a file
    127            * for manual inspection. */
           128     * for manual inspection of the sample webpage URLs at page-level. */
    128    129    /* OLD: Given a hand curated list of non-NZ sites that CONTAINS_MRI, get a listing
    129    130     * of all their web pages IN_MRI (or CONTAINS_MRI).

    138    139
    139    140    // 0. get a list of all the web pages in the given domain listing where isMRI = true
    140           ArrayList<Tuple> urlsList = getURLsForWebPages(MongoDBAccess.IS_MRI, domainsFile);
           141    ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBAccess.IS_MRI, domainsFile);
    141    142        // produceURLsForPagesInMRI(domainsFile);
    142    143
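
The Javadoc in the diff above mentions working out a sample size n from a total of N pages at 90% confidence with a 5% margin of error, then drawing a random listing of n pages. The changeset does not show the formula used, so the following is only a sketch of one common approach: Cochran's formula with a finite population correction, assuming z = 1.645 for 90% confidence and the most conservative proportion p = 0.5. The class and method names (SampleSizeSketch, sampleSize, randomSample) and the example.org URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: not the method used in WebPageURLsListing.java.
class SampleSizeSketch {
    static final double Z_90 = 1.645;   // assumption: z-score for 90% confidence
    static final double MARGIN = 0.05;  // 5% margin of error
    static final double P = 0.5;        // assumption: most conservative proportion

    /** Cochran's sample size with finite population correction, rounded up
     *  and capped at the population size N. */
    static int sampleSize(int populationSize) {
        if (populationSize <= 0) {
            return 0;
        }
        double n0 = (Z_90 * Z_90 * P * (1 - P)) / (MARGIN * MARGIN); // infinite-population size
        double n = n0 / (1 + (n0 - 1) / populationSize);             // correct for finite N
        return Math.min(populationSize, (int) Math.ceil(n));
    }

    /** Returns n items chosen uniformly at random without replacement. */
    static <T> List<T> randomSample(List<T> items, int n, Random rng) {
        List<T> copy = new ArrayList<>(items);   // leave the caller's list untouched
        Collections.shuffle(copy, rng);
        int count = Math.max(0, Math.min(n, copy.size()));
        return new ArrayList<>(copy.subList(0, count));
    }

    public static void main(String[] args) {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            urls.add("https://example.org/page" + i);
        }
        int n = sampleSize(urls.size());
        List<String> sample = randomSample(urls, n, new Random());
        System.out.println("N=" + urls.size() + ", n=" + n + ", sampled=" + sample.size());
    }
}
```

Under these assumptions the required sample approaches 271 pages as N grows, and is smaller for small site listings.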