Changeset 33909 for other-projects

Timestamp:

2020-02-12T19:02:44+13:00

Author:

ak19

Message:

1. Implemented tables 3 to 5.
2. Rolled back the introduction of the basicDomain field (the domain stripped of its http/https and www prefixes): the code can create this value and sort it alphabetically itself, whereas it didn't sort properly in mongodb.
3. The code now sorts the domains, stripped of protocol and www, for the mongodb queries that produce domain results, and ensures the domain list is unique.
4. Split the MongoDBAccess class in two, with the connection code in MongoDBAccess.java and the querying code in MongoDBQueryer (a subclass of MongoDBAccess), which so far is used exclusively by WebPageURLsListing.java.
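As an illustration of point 3, here is a minimal sketch of how a domain list can be sorted and de-duplicated in Java once each entry has lost its protocol and www prefix. It is not the committed code: the class name DomainSortSketch and the helper stripProtocolAndWWW() are hypothetical, the latter standing in for Utility.stripProtocolAndWWWFromURL(), whose body is not part of this changeset. The only grounded detail is that the commit adds an import of java.util.TreeSet, which keeps its elements unique and in natural (case-sensitive) order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

public class DomainSortSketch {

    // Hypothetical stand-in for Utility.stripProtocolAndWWWFromURL():
    // drops a leading http:// or https://, then a leading www.
    static String stripProtocolAndWWW(String url) {
        return url.replaceFirst("^https?://", "").replaceFirst("^www\\.", "");
    }

    // A TreeSet both sorts its elements (natural String order) and
    // silently discards duplicates, so one pass gives a sorted, unique list.
    static Set<String> sortedUniqueBasicDomains(List<String> domains) {
        Set<String> result = new TreeSet<>();
        for (String domain : domains) {
            result.add(stripProtocolAndWWW(domain));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> domains = Arrays.asList(
            "https://www.example.nz",
            "http://example.nz",      // same site once the prefixes are stripped
            "http://www.abc.org");
        System.out.println(sortedUniqueBasicDomains(domains)); // prints [abc.org, example.nz]
    }
}
```

Stripping before inserting matters here: sorting the raw domain strings would group entries by protocol and www prefix rather than by site name, which may be why sorting a stored field inside mongodb was less convenient than doing it in code.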

Location:

other-projects/maori-lang-detection/src/org/greenstone/atea

Files:

4 edited

MongoDBAccess.java (modified) (6 diffs)
NutchTextDumpToMongoDB.java (modified) (3 diffs)
WebPageURLsListing.java (modified) (14 diffs)
morphia/WebsiteInfo.java (modified) (3 diffs)

other-projects/maori-lang-detection/src/org/greenstone/atea/MongoDBAccess.java

-              r33906
+              r33909
 import org.bson.BsonArray;
 import org.bson.BsonString;
+import org.bson.BsonValue;
 import org.bson.Document;
 import org.bson.conversions.Bson;
 …
 import java.util.List;
 import java.util.Properties;
+import java.util.TreeSet;
 import java.util.regex.Pattern;
 …
     public static final String WEBSITES_COLLECTION = "Websites";
+    public static final String NEWLINE = System.getProperty("line.separator");
+    /** mongodb filter types to execute */
+    public static final int IS_MRI = 0;
+    public static final int CONTAINS_MRI = 1;
+    /** Some reused fieldnames in the Websites collection */
+    private static final String FILTER_NUMPAGES_IN_MRI = "numPagesInMRI";
+    private static final String FILTER_NUMPAGES_CONTAINING_MRI = "numPagesContainingMRI";
     // configuration details, some with fallback values
-    private String HOST = "localhost";
-    private int PORT = 27017; // mongodb port
-    private String USERNAME;
-    private String PASSWORD;
-    private String DB_NAME ="ateacrawldata";
-    private MongoClient mongo = null;
-    private MongoDatabase database = null;
+    protected String HOST = "localhost";
+    protected int PORT = 27017; // mongodb port
+    protected String USERNAME;
+    protected String PASSWORD;
+    protected String DB_NAME ="ateacrawldata";
+    protected MongoClient mongo = null;
+    protected MongoDatabase database = null;
     /**
 …
         System.err.println("coll: " + coll);
+    }
+    }
+    protected MongoCollection<Document> getWebpagesCollection() {
+    return this.database.getCollection(WEBPAGES_COLLECTION);
+    }
+    protected MongoCollection<Document> getWebsitesCollection() {
+    return this.database.getCollection(WEBSITES_COLLECTION);
+    }
 …
         .append("siteFolderName", website.siteFolderName)
         .append("domain", website.domain)
-        .append("basicDomain", website.basicDomain)
         .append("totalPages", website.totalPages)
         .append("numPagesWithBodyText", website.countOfWebPagesWithBodyText)
 …
+    }
     */
-    public ArrayList<String> queryAllMatchingIsMRIURLs(String domain) {
-    return queryAllMatchingURLsFilteredBy(domain, IS_MRI);
+    }
-    public ArrayList<String> queryAllMatchingcontainsMRIURLs(String domain) {
-    return queryAllMatchingURLsFilteredBy(domain, CONTAINS_MRI);
+    }
-    /**
-     * Java mongodb find: https://mongodb.github.io/mongo-java-driver/3.4/driver/getting-started/quick-start/
-     * Java mongodb find filters: https://mongodb.github.io/mongo-java-driver/3.4/javadoc/?com/mongodb/client/model/Filters.html
-     * Java mongodb projection: https://stackoverflow.com/questions/44894497/retrieving-data-with-mongodb-java-driver-3-4-using-find-method-with-projection
-     * mongodb projection: https://docs.mongodb.com/v3.2/reference/method/db.collection.find/#db.collection.find
+     *
-     * Parse MongoDB query into Java: https://stackoverflow.com/questions/17326747/parsing-strings-to-mongodb-query-documents-with-operators-in-java
-     * Maybe also https://stackoverflow.com/questions/48000891/parse-mongodb-json-query-in-java-with-multiple-criteria
-     * https://stackoverflow.com/questions/55029222/parse-mongodb-query-to-java
-     * http://pingax.com/trick-convert-mongo-shell-query-equivalent-java-objects/
-     */
-    public ArrayList<String> queryAllMatchingURLsFilteredBy(String domain, int filterType) {
-    final ArrayList<String> urlsList = new ArrayList<String>();
-    // remove any http(s)://(www.) from the start of URL first
-    // since it goes into a regex
-    domain = Utility.stripProtocolAndWWWFromURL(domain);
-    // load the "webpages" db table
-    // in mongodb, the equivalent of db tables are called 'collections'
-    MongoCollection<Document> collection = this.database.getCollection(WEBPAGES_COLLECTION);
-    // code we'll execute in Iterable.forEach() below
-    // see also https://www.baeldung.com/foreach-java
-    Block<Document> storeURL = new Block<Document>() {
-        @Override
-        public void apply(final Document document) {
-            //System.out.println(document.toJson());
-            String url = document.getString("URL");
-            // add to our urlsList
-            //System.out.println(url);
-            urlsList.add(url);
+        }
-        };
-    // Run the following mongodb query:
-    //    db.getCollection('Webpages').find({URL: /domain/, isMRI: true}, {URL: 1, _id: 0})
-    // 1. One way that works:
-    //collection.find(and(eq("isMRI", true), regex("URL", pattern))).projection(fields(include("URL"), excludeId())).forEach(storeURL);
-    // 2. Another way:
-    //String query = "{URL: /DOMAIN/, isMRI: true}";
-    String query = "{URL: /DOMAIN/, ";
-    if(filterType == IS_MRI) {
-        query += "isMRI: true}";
-    } else if(filterType == CONTAINS_MRI) {
-        query += "containsMRI: true}";
+    }
-    domain = domain.replace(".", "\\."); // escape dots in domain for regex
-    query = query.replace("DOMAIN", domain);
-    //System.err.println("Executing find query: " + query);
-    BasicDBObject findObj = BasicDBObject.parse(query);
-    BasicDBObject projectionObj = BasicDBObject.parse("{URL: 1, _id: 0}");
-    collection.find(findObj).projection(projectionObj).forEach(storeURL);
-    return urlsList;
+    }
-    /**
-     * RUNNING A MONGODB COLLECTION.AGGREGATE() in JAVA:
+     *
-     * https://stackoverflow.com/questions/31643109/mongodb-aggregation-with-java-driver
-     * https://stackoverflow.com/questions/48000891/parse-mongodb-json-query-in-java-with-multiple-criteria
-     * Not Java: https://stackoverflow.com/questions/39060221/a-pipeline-stage-specification-object-must-contain-exactly-one-field-with-php-mo
+     *
-     * (https://stackoverflow.com/questions/55029222/parse-mongodb-query-to-java)
-     * https://www.programcreek.com/java-api-examples/?api=com.mongodb.client.model.Aggregates
-     * On using group(TExpression) inside collection.aggregate().
+     *
-     *  For forEach lamba expressions, see also https://www.baeldung.com/foreach-java
-     *  and https://www.javatpoint.com/java-8-foreach
-     *  and https://stackoverflow.com/questions/47979978/ambiguous-reference-to-foreach-when-listing-mongodbs-database-in-java
+     *
-     * Count by country code of non-NZ websites containing a positive number of sentences in MRI,
-     * listing all the base domain strings (no protocol or www) in ALPHABETICAL ORDER
-     * and total counts of numPagesInMRI and numPagesContainingMRI across all these
-     * matching sites.
+     *
-     * The mongodb aggregate() we want to run this time:
+     *
-       db.Websites.aggregate([
+       {
-        $match: {
-            $and: [
-                {numPagesContainingMRI: {$gt: 0}},
-                {$or: [{geoLocationCountryCode:"NZ"},{domain: /\.nz/}]}
+            ]
+          }
-    },
-    { $unwind: "$geoLocationCountryCode" },
+    {
-          $group: {
-            _id: "nz",
-            count: { $sum: 1 },
-        domain: { $addToSet: '$basicDomain' } // domain: {$push: "$basicDomain" }
+          }
-    },
-    { $sort : { count : -1} }
-    ]);
-    */
-    public void aggregateContainsMRIForNZ(Writer writer, int filterType) throws IOException {
-    // working with the WebSites collection, not WebPages collection!
-    MongoCollection<Document> collection = this.database.getCollection(WEBSITES_COLLECTION);
-    String mriFilterString = (filterType == CONTAINS_MRI) ? "{numPagesContainingMRI: {$gt: 0}}" : "{numPagesInMRI: {$gt: 0}}";
-    Bson orQuery = or(
-              BasicDBObject.parse("{geoLocationCountryCode: \"NZ\"}"),
-              BasicDBObject.parse("{domain: /\\.nz/}")
-              );
-    Bson andQuery = and(
-        BasicDBObject.parse(mriFilterString),
-        orQuery);
-    // Hopefully the lambda expression (forEach()) at end means
-    // we write out each result Document as we get it
-    collection.aggregate(Arrays.asList(
-         match(andQuery),
-         unwind("$geoLocationCountryCode"),
-         group("NZ", Arrays.asList(sum("count", 1),
-                   addToSet("domain", "$basicDomain"))),
-         sort(BasicDBObject.parse("{count : -1}"))
-     )).forEach((Block<Document>)doc -> writeDoc(doc, writer));
-    // should only have one doc for NZ since it's a count by geolocation.
-    return;
+    }
-    /**
-     * Count of NZ (incl .nz TLD)  websites containing a positive number of sentences in MRI,
-     * listing all the base domain strings (no protocol or www) in ALPHABETICAL ORDER
-     * and total counts of numPagesInMRI and numPagesContainingMRI across all these
-     * matching sites.
+     *
-     * The aggregate() we want to run this time:
+     *
-       db.Websites.aggregate([
+       {
-         $match: {
-            $and: [
-                {geoLocationCountryCode: {$ne: "NZ"}},
-                {domain: {$not: /\.nz/}},
-                {numPagesContainingMRI: {$gt: 0}},
-                {$or: [{geoLocationCountryCode: "AU"}, {urlContainsLangCodeInPath: false}]}
+            ]
+      }
-    },
-    { $unwind: "$geoLocationCountryCode" },
+    {
-          $group: {
-            _id: {$toLower: '$geoLocationCountryCode'},
-            count: { $sum: 1 },
-        domain: { $addToSet: '$basicDomain' } // domain: {$push: "$basicDomain" }
+          }
-     },
-     { $sort : { count : -1} }
-    ]);
-    */
-    public void aggregateContainsMRIForOverseas(Writer writer, int filterType,
-                        boolean isMiInURLPath) throws UncheckedIOException
+    {
-    // working with the WebSites collection, not WebPages collection!
-    MongoCollection<Document> collection = this.database.getCollection(WEBSITES_COLLECTION);
-    String mriFilterString = (filterType == CONTAINS_MRI) ? "{numPagesContainingMRI: {$gt: 0}}" : "{numPagesInMRI: {$gt: 0}}";
-    Bson orQuery = or(
-              BasicDBObject.parse("{geoLocationCountryCode: \"AU\"}"),
-              BasicDBObject.parse("{urlContainsLangCodeInPath: "+ isMiInURLPath +"}")
-              // e.g. "{urlContainsLangCodeInPath: false}"
-              );
-    Bson andQuery = and(
-        BasicDBObject.parse("{geoLocationCountryCode: {$ne: \"NZ\"}}"),
-        BasicDBObject.parse("{domain: {$not: /\\.nz/}}"),
-        BasicDBObject.parse(mriFilterString),
-        orQuery);
-    collection.aggregate(Arrays.asList(
-         match(andQuery),  //match(BasicDBObject.parse(matchQuery))
-         // match((List<DBObject>)JSON.parse(matchQuery)),
-         unwind("$geoLocationCountryCode"),
-         group("$geoLocationCountryCode", Arrays.asList(sum("count", 1),
-                        addToSet("domain", "$basicDomain"))),
-         sort(BasicDBObject.parse("{count : -1}"))
-       )).forEach((Block<Document>)doc -> writeDoc(doc, writer));
-    // casting to Block<Document> necessary because otherwise we see the error at
-    // https://stackoverflow.com/questions/47979978/ambiguous-reference-to-foreach-when-listing-mongodbs-database-in-java
-    // Less efficient way is to keep all the results in memory and then
-    // write them out one at a time
-    /*
-    AggregateIterable<Document> output
-        = collection.aggregate(Arrays.asList(
-         match(andQuery),  //match(BasicDBObject.parse(matchQuery))
-         // match((List<DBObject>)JSON.parse(matchQuery)),
-         unwind("$geoLocationCountryCode"),
-         group("$geoLocationCountryCode", Arrays.asList(sum("count", 1), addToSet("domain", "$domain"))),
-         sort(BasicDBObject.parse("{count : -1}"))
-     ));
-    for (Document doc : output) {
-        //System.out.println(doc);
-        System.out.println(doc.toJson());
+    }
-    */
-    return;
+    }
-    /** Do the aggregates for writing out tables.
-       Table1:
-    */
-    public void writeTables(File outFolder) {
-    // In this function, we're always dealing with the Websites mongodb collection.
-    MongoCollection<Document> collection = this.database.getCollection(WEBSITES_COLLECTION);
-    String[] tableNames = { "", "1table_allCrawledSites", "2table_sitesWithPagesInMRI"};
-    for (int tableNum = 1; tableNum < tableNames.length; tableNum++) {
-        File outFile = new File(outFolder, tableNames[tableNum] + ".json");
-        File csvFile = new File(outFolder, tableNames[tableNum] + ".csv");
-        try (
-         Writer writer = new BufferedWriter(new FileWriter(outFile));
-         CSVPrinter csvWriter = new CSVPrinter(new FileWriter(csvFile), CSVFormat.DEFAULT);
-         ) {
-        // Write out the CSV column headings
-        // https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVPrinter.html
-        csvWriter.printRecord("countryCode", "siteCount",
-              "numPagesInMRI count","numPagesContainingMRICount"/*, "domain"*/);
-        AggregateIterable<Document> output = getTable(collection, tableNum); //doTable1().forEach((Block<Document>)doc -> writeDoc(doc, writer));
-        int docNum = 0;
-        for (Document doc : output) {
-            //System.out.println(doc);
-            writeDocAsJsonRecord(++docNum, doc, writer);
-            writeDocAsCSVRecord(++docNum, doc, csvWriter);
+        }
-        logger.info("@@@ Wrote out table into file: " + Utility.getFilePath(outFile) + " and .csv");
-        } catch(UncheckedIOException ioe) {
-        logger.error("Caught UncheckedIOException: " + ioe.getMessage(), ioe);
+        }
-        catch(Exception e) {
-        logger.error("Could not write table to file " + outFile + " or .csv equivalent" , e);
+        }
+    }
+    }
-    public AggregateIterable<Document> getTable(MongoCollection<Document> collection, int tableNum) {
-    AggregateIterable<Document> output = null;
-    switch(tableNum) {
-    case 1:
-        /* 1table_allCrawledSites -
-           db.Websites.aggregate([
-           { $unwind: "$geoLocationCountryCode" },
+           {
-           $group: {
-           _id: "$geoLocationCountryCode",
-           count: { $sum: 1 },
-           //domain: { $addToSet: '$domain' },
-           numPagesInMRICount: { $sum: '$numPagesInMRI' },
-           numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
+           }
-           },
-           { $sort : { count : -1} }
-           ]);
-         */
-        output = collection.aggregate(Arrays.asList(
-                           //match(BasicDBObject.parse("{urlContainsLangCodeInPath:true}")),
-         unwind("$geoLocationCountryCode"),
-         group("$geoLocationCountryCode", Arrays.asList(
-                                sum("count", 1),
-                                /*addToSet("domain", "$domain"),*/
-                                sum("numPagesInMRICount", "$numPagesInMRI"),
-                                sum("numPagesContainingMRICount", "$numPagesContainingMRI"))),
-         sort(BasicDBObject.parse("{count : -1}"))
-        ));
-        break;
-    case 2:
-        /*
-          db.Websites.aggregate([
-          { $match: { numPagesInMRI: {$gt: 0} } },
-          { $unwind: "$geoLocationCountryCode" },
+          {
-          $group: {
-          _id: {$toLower: '$geoLocationCountryCode'}, // ignore toLower
-          count: { $sum: 1 },
-          //domain: { $addToSet: '$domain' },
-          numPagesInMRICount: { $sum: '$numPagesInMRI' },
-          numPagesContainingMRICount: { $sum: '$numPagesContainingMRI' }
+          }
-          },
-          { $sort : { count : -1} }
-          ]);
-         */
-         output = collection.aggregate(Arrays.asList(
-                           match(BasicDBObject.parse("{ numPagesInMRI: {$gt: 0} }")),
-         unwind("$geoLocationCountryCode"),
-         group("$geoLocationCountryCode", Arrays.asList(
-                                sum("count", 1),
-                                /*addToSet("domain", "$domain"),*/
-                                sum("numPagesInMRICount", "$numPagesInMRI"),
-                                sum("numPagesContainingMRICount", "$numPagesContainingMRI"))),
-         sort(BasicDBObject.parse("{count : -1}"))
-        ));
-        break;
-    default: logger.error("Unknown table number: " + tableNum);
+    }
-     return output;
+    }
-    /**
-     * called by lambda forEach() call on Document objects to write them out to a file.
-     * Have to deal with unreported exceptions here that can't be dealt with when doing
-     * the actual forEach(). See
-     * https://stackoverflow.com/questions/39090292/how-to-cleanly-deal-with-unreported-exception-ioexception-in-stream-foreach
-     */
-    public void writeDoc(Document doc, Writer writer) throws UncheckedIOException {
-    // If there's a domain field in the json Doc, sort this domain listing alphabetically
-    Object domainList = doc.remove("domain");
-    if(domainList != null) {
-        doc.put("domain", sortAlphabetically(domainList));
+    }
-    //OLD WAY: writer.write(doc.toJson(new JsonWriterSettings(JsonMode.STRICT, true)) + NEWLINE);
-    // Can't control json output to add newlines after each array element,
-    // no matter which JsonMode is used.
-    // https://mongodb.github.io/mongo-java-driver/3.9/javadoc/index.html?org/bson/json/JsonWriterSettings.html
-    // Still can't control array element output,
-    // but this way uses newer mongo java driver 3.9(.1). Tried its various JsonModes too:
-    //JsonWriterSettings writeSettings = new JsonWriterSettings();
-    //writeSettings.builder().outputMode(JsonMode.SHELL).indent(true).build();
-    //writer.write(doc.toJson(writeSettings) + NEWLINE);
-    // Not the JsonWriter of mongodb java driver:
-    // https://stackoverflow.com/questions/54746814/jsonwriter-add-a-new-line
-    // Have to use gson's pretty print to produce a json string that contains
-    // newlines after every array element in the json:
-    String jsonStr = prettyPrintJson(doc.toJson());
-    //System.err.println(jsonStr);
-    try {
-        writer.write(jsonStr + NEWLINE);
-    } catch (IOException ex) {
-        //throw ex;
-        throw new UncheckedIOException(ex);
+    }
+    }
-    private List sortAlphabetically(Object list) {
-    BsonArray domainList = (BsonArray)list;
-    //for(String domain : domainList) {
-    for(int i = domainList.size() - 1; i >= 0; i--) {
-        BsonString domain = domainList.get(i).asString();
-        String domainStr = Utility.stripProtocolAndWWWFromURL(domain.toString());
-        domainList.set(i, new BsonString(domainStr));
+    }
-    return domainList;
+    }
-    public void writeDocAsJsonRecord(int docNum, Document doc, Writer writer) throws UncheckedIOException {
-    String jsonStr = prettyPrintJson(doc.toJson());
-    //System.err.println(jsonStr);
-    try {
-        writer.write("/* " + docNum + " */\n" + jsonStr + NEWLINE);
-    } catch (IOException ex) {
-        //throw ex;
-        throw new UncheckedIOException(ex);
+    }
+    }
-    // TODO
-    //public void writeDocToJsonAndCSV(int docNum, Document doc, Writer writer, CSVPrinter csvWriter) throws UncheckedIOException {
-    public void writeDocAsCSVRecord(int docNum, Document doc, CSVPrinter csvWriter) throws UncheckedIOException {
-    String jsonStr = doc.toJson();
-    JsonParser parser = new JsonParser();
-    JsonElement json = parser.parse(jsonStr);
-    JsonObject jsonObj = (JsonObject)json;
-    String countryCode = jsonObj.get("_id").getAsString();
-    int siteCount = jsonObj.get("count").getAsInt();
-    int numPagesInMRICount = jsonObj.get("numPagesInMRICount").getAsInt();
-    int numPagesContainingMRICount = jsonObj.get("numPagesContainingMRICount").getAsInt();
-    //System.err.println(jsonStr);
-    try {
-        //writer.write("/* " + docNum + " */\n" + prettyPrintJson(jsonStr) + NEWLINE);
-        csvWriter.printRecord(countryCode, siteCount, numPagesInMRICount, numPagesContainingMRICount);
-    } catch (IOException ex) {
-        //throw ex;
-        throw new UncheckedIOException(ex);
+    }
+    }
-    public String prettyPrintJson(String jsonStr) {
-    Gson gson = new GsonBuilder().setPrettyPrinting().create();
-    JsonParser jp = new JsonParser();
-    JsonElement je = jp.parse(jsonStr);
-    String prettyJsonString = gson.toJson(je);
-    return prettyJsonString;
+    }

other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java

-              r33906
+              r33909
     private String domainOfSite;
-    private String baseSiteDomain; // domainOfSite stripped of any http(s)://www.
+    //private String baseSiteDomain; // domainOfSite stripped of any http(s)://www.
     private int numPagesInMRI = 0;
     private int numPagesContainingMRI = 0;
 …
         String url = firstPage.getPageURL();
         this.domainOfSite = Utility.getDomainForURL(url, true);
-        this.baseSiteDomain = Utility.stripProtocolAndWWWFromURL(this.domainOfSite);
+        //this.baseSiteDomain = Utility.stripProtocolAndWWWFromURL(this.domainOfSite);
+    }
     else {
         this.domainOfSite = "UNKNOWN";
-        this.baseSiteDomain = "UNKNOWN";
+        //this.baseSiteDomain = "UNKNOWN";
+    }
 …
     WebsiteInfo website = new WebsiteInfo(/*SITE_COUNTER,*/ this.siteID,
-          this.domainOfSite, this.baseSiteDomain,
+          this.domainOfSite, //this.baseSiteDomain,
           totalPages, this.countOfWebPagesWithBodyText,
           this.numPagesInMRI, this.numPagesContainingMRI,

other-projects/maori-lang-detection/src/org/greenstone/atea/WebPageURLsListing.java

-              r33906
+              r33909
     static private final long FIXED_SEED = 1000;
-    private final MongoDBAccess mongodbAccess;
+    private final MongoDBQueryer mongodbQueryer;
     private File outFolder;
 …
-    public WebPageURLsListing(MongoDBAccess mongodbAccess, File outFolder)
+    public WebPageURLsListing(MongoDBQueryer mongodbQueryer, File outFolder)
+    {
-    this.mongodbAccess = mongodbAccess;
+    this.mongodbQueryer = mongodbQueryer;
     this.outFolder = outFolder;
+    }
 …
     public void produceURLsForPagesInMRI(File domainsFile) {
-    ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBAccess.IS_MRI, domainsFile);
+    ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBQueryer.IS_MRI, domainsFile);
     File outFile = new File(outFolder, "isMRI_"+domainsFile.getName());
     writeURLsToFile(urlsList, outFile, urlsList.size());
 …
     public void produceURLsForPagesContainingMRI(File domainsFile) {
-    ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBAccess.CONTAINS_MRI, domainsFile);
+    ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBQueryer.CONTAINS_MRI, domainsFile);
     File outFile = new File(outFolder, "containsMRI_"+domainsFile.getName());
     writeURLsToFile(urlsList, outFile, urlsList.size());
 …
             domain = domain.substring(0, index);
+            }
-            ArrayList<String> moreURLs = mongodbAccess.queryAllMatchingURLsFilteredBy(domain, filterType);
+            ArrayList<String> moreURLs = mongodbQueryer.queryAllMatchingURLsFilteredBy(domain, filterType);
             // Print out whether there were no isMRI pages for the domain (only containsMRI). A useful thing to know
-            if(moreURLs.size() == 0 && filterType == MongoDBAccess.IS_MRI) {
+            if(moreURLs.size() == 0 && filterType == MongoDBQueryer.IS_MRI) {
             System.out.println("   " + countryCode + " domain " + domain + " had no isMRI webpages - only containsMRI.");
+            }
 …
     public void mriWebPageListingForDomainListing(File domainsFile) {
-    int filterType = MongoDBAccess.IS_MRI;
+    int filterType = MongoDBQueryer.IS_MRI;
     // for overseas websites,
 …
     // 0. get a list of all the web pages in the given domain listing where isMRI = true
-    ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBAccess.IS_MRI, domainsFile);
+    ArrayList<Tuple> urlsList = getURLsForAllWebPagesInSiteListing(MongoDBQueryer.IS_MRI, domainsFile);
         // produceURLsForPagesInMRI(domainsFile);
 …
     // 2. write all the URLs in urlsList to a file
     //File outFolder = domainsFile.getParentFile();
-    String fileName = (filterType == MongoDBAccess.IS_MRI) ? "isMRI_" : "containsMRI_";
+    String fileName = (filterType == MongoDBQueryer.IS_MRI) ? "isMRI_" : "containsMRI_";
     File outFile = new File(outFolder, fileName+domainsFile.getName());
 …
     /* ---------------------------------------- */
     /**
-     * Create the file 5counts_tentativeNonAutotranslatedSites.json
+     * Create the file 5counts_containsMRISites_allNZGrouped.json
+     * that contains the count and domains for NZ sites (NZ origin or nz TLD) with pages
+     * that CONTAIN_MRI, followed by counts and domains listing for overseas sites
+     * that CONTAIN_MRI.
+     * @return full path of file generated
+     */
+    public String writeContainsMRISites_nzSitesAndTLDsGrouped() {
+    File outFile = new File(outFolder, "5counts_containsMRISites_allNZGrouped.json");
+    String filename = Utility.getFilePath(outFile);
+    try (
+         Writer writer = new BufferedWriter(new FileWriter(outFile));
+         ) {
+        // first write out NZ sites and .nz TLD count and domains
+        mongodbQueryer.aggregateContainsMRIForNZ(writer, MongoDBQueryer.CONTAINS_MRI);
+        // next write out all overseas sites (not NZ origin or .nz TLD)
+        // that have no "mi" in the URL path as mi.* or */mi
+        boolean isMiInURLPath = false;
+        mongodbQueryer.aggregateContainsMRIForOverseas(writer, MongoDBQueryer.CONTAINS_MRI);
+    } catch(Exception e) {
+        logger.error("Unable to write to file " + filename);
+        logger.error(e.getMessage(), e);
+    }
+    System.err.println("*** Wrote file: " + filename);
+    return filename;
+    }
+    /**
+     * Create the file 5a_counts_tentativeNonAutotranslatedSites.json
      * that contains the count and domains for NZ sites (NZ origin or nz TLD) that CONTAIN_MRI
      * followed by counts and domain listing for overseas sites that are either from Australia
 …
          ) {
         // first write out NZ sites and .nz TLD count and domains
-        mongodbAccess.aggregateContainsMRIForNZ(writer, MongoDBAccess.CONTAINS_MRI);
+        mongodbQueryer.aggregateContainsMRIForNZ(writer, MongoDBQueryer.CONTAINS_MRI);
         // next write out all overseas sites (not NZ origin or .nz TLD)
         // that have no "mi" in the URL path as mi.* or */mi
         boolean isMiInURLPath = false;
-        mongodbAccess.aggregateContainsMRIForOverseas(writer, MongoDBAccess.CONTAINS_MRI, isMiInURLPath);
+        mongodbQueryer.aggregateContainsMRIForOverseas(writer, MongoDBQueryer.CONTAINS_MRI, isMiInURLPath);
     } catch(Exception e) {
 …
     /**
+     * Create the file 5b_counts_overseasSitesWithMiInPath.json
      * Listing of the remainder of overseas sites that CONTAIN_MRI not included by
      * writeTentativeNonAutotranslatedSites(): those that have mi in their URL path.
 …
          ) {
         boolean isMiInURLPath = true;
-        mongodbAccess.aggregateContainsMRIForOverseas(writer, MongoDBAccess.CONTAINS_MRI, isMiInURLPath);
+        mongodbQueryer.aggregateContainsMRIForOverseas(writer, MongoDBQueryer.CONTAINS_MRI, isMiInURLPath);
     } catch(Exception e) {
 …
     try (
-         MongoDBAccess mongodb = new MongoDBAccess();
+         MongoDBQueryer mongodb = new MongoDBQueryer();
          ) {
 …
         //System.err.println("For N = " + 4360 + ", n = " + listing.calcSampleSize(4360));
         //System.err.println("For N = " + 681 + ", n = " + listing.calcSampleSize(681));
-        String filename = listing.writeTentativeNonAutotranslatedSites();
+        // get all sites where >0 pages have containsMRI=true
+        // grouping NZ sites and .nz TLDs together and remainder under overseas
+        // geolocations.
+        String filename = listing.writeContainsMRISites_nzSitesAndTLDsGrouped();
+        // separately:
+        // - all NZ containsMRI + overseas tentative non-product sites with containMRI
+        // - overseas tentative product sites with containMRI
+        filename = listing.writeTentativeNonAutotranslatedSites();
         filename = listing.writeOverseasSitesWithMiInURLPath();

other-projects/maori-lang-detection/src/org/greenstone/atea/morphia/WebsiteInfo.java

-              r33906
+              r33909
     public final String siteFolderName;
     public final String domain;
-    public final String basicDomain; // domain without protocol and www. prefix
+    //public final String basicDomain; // domain without protocol and www. prefix
     public final int totalPages;
 …
     public WebsiteInfo(/*int siteCount,*/ String siteFolderName,
-               String domainOfSite, String baseSiteDomain,
+               String domainOfSite, //String baseSiteDomain,
                int totalPages, int countOfWebPagesWithBodyText,
                int numPagesInMRI, int numPagesContainingMRI,
 …
     this.siteFolderName = siteFolderName;
     this.domain = domainOfSite;
-    this.basicDomain = baseSiteDomain;
+    //this.basicDomain = baseSiteDomain;
     this.totalPages = totalPages;
