NutchTextDumpToMongoDB.java

Timestamp:

2020-03-10T17:33:20+13:00

Author:

ak19

Message:

The information on empty pages (previously InfoOnEmptyPagesNotInMongoDB.txt) is now written directly to a file, instead of being captured by redirecting all of System.err into a file. The output is now a CSV file that records, besides the URL, each empty page's (fetch) status, protocolStatus and parseStatus.

File:

1 edited

other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java (modified) (7 diffs)

other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java

-              r33988
+              r34005
     static boolean DEBUG_MODE = true; // this is set to false in main() at the end of this class
     /** Counter for number of sites.
      * Should be equal to number of times NutchTextDumpToMongoDB constructor
 …
     public final boolean siteCrawlUnfinished;
     public final long siteCrawledTimestamp; /** When the crawl of the site terminated */
+    // private handle to a csv writer
+    private CSVPrinter emptyWebPageInfoCSVPrinter;
     private int countOfWebPagesWithBodyText = 0;
 …
     /** A NutchTextDumpToMongoDB processes the dump.txt for one site */
-    public NutchTextDumpToMongoDB(MongoDBAccess mongodbAccess,
+    public NutchTextDumpToMongoDB(MongoDBAccess mongodbAccess, CSVPrinter emptyWebPageInfoCSVPrinter,
                   MaoriTextDetector maoriTxtDetector, String siteID,
                   File txtDumpFile, long lastModified, boolean siteCrawlUnfinished)
 …
     // increment static counter of sites processed by a NutchTextDumpToMongoDB instance
     SITE_COUNTER++;
+    // keep a handle to the csv file writer
+    this.emptyWebPageInfoCSVPrinter = emptyWebPageInfoCSVPrinter;
     // siteID is of the form %5d (e.g. 00020) and is just the name of a site folder
 …
         if(text.equals("")) {
+        System.err.println("siteID: " + siteID + "- Empty page " + i + " - URL: "
+                   + page.getPageURL());
+        System.err.println(siteID + ",Empty page " + i + "," + page.getPageURL()
+                   + "," + page.get("status")
+                   + "," + page.get("protocolStatus")
+                   + "," + page.get("parseStatus"));
+        // write information about any empty web page into the emptyPage csv file
+        emptyWebPageInfoCSVPrinter.printRecord(siteID, i, page.getPageURL(),
+               page.get("status"), page.get("protocolStatus"),page.get("parseStatus"));
         // don't care about empty pages
 …
     NutchTextDumpToMongoDB.DEBUG_MODE = false;
     try (
          MongoDBAccess mongodb = new MongoDBAccess();
+         CSVPrinter emptyWebPageInfoCSVPrinter = new CSVPrinter(new FileWriter("InfoOnEmptyPagesNotInMongoDB.csv"), CSVFormat.DEFAULT.withQuoteMode(QuoteMode.MINIMAL));
          ) {
         mongodb.connectToDB();
         //mongodb.showCollections();
+        // write out csv column headings into the csv file on empty web pages
+        emptyWebPageInfoCSVPrinter.printRecord("siteID","pagenum","URL","(fetch)status","protocolStatus","parseStatus");
         // print out the column headers for the websites csv file
         // https://commons.apache.org/proper/commons-csv/apidocs/org/apache/commons/csv/CSVPrinter.html
 …
             logger.debug("@@@ Processing siteID: " + siteID);
             NutchTextDumpToMongoDB nutchTxtDump = new NutchTextDumpToMongoDB(
-                 mongodb, mriTxtDetector,
+                 mongodb, emptyWebPageInfoCSVPrinter, mriTxtDetector,
                  siteID, txtDumpFile, lastModified, UNFINISHED_FILE.exists());
             // now it's parsed all the web pages in the site's text dump

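The change replaces comma-concatenated System.err output with CSVPrinter.printRecord(...) calls using QuoteMode.MINIMAL. One likely benefit is that a URL containing a comma no longer splits into extra columns. The following is a minimal JDK-only sketch of that idea, not the Commons CSV implementation: the class EmptyPageCsvSketch, its helper methods, the output file name and the sample row values are all hypothetical, and the quoting rule is a simplified reading of "quote only when needed".

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration only; the real code uses org.apache.commons.csv.CSVPrinter.
class EmptyPageCsvSketch {

    /** Quote a field only if it contains a comma, a double quote, CR or LF; double any embedded quotes. */
    public static String quoteMinimal(Object value) {
        String s = (value == null) ? "" : value.toString();
        boolean needsQuotes = s.contains(",") || s.contains("\"")
                || s.contains("\n") || s.contains("\r");
        if (!needsQuotes) {
            return s;
        }
        return "\"" + s.replace("\"", "\"\"") + "\"";
    }

    /** Join the fields into one CSV record (without the line terminator). */
    public static String toRecord(Object... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(quoteMinimal(fields[i]));
        }
        return sb.toString();
    }

    /** Write one record followed by CRLF, loosely mirroring the role of printRecord. */
    public static void printRecord(Writer out, Object... fields) throws IOException {
        out.write(toRecord(fields));
        out.write("\r\n");
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output file name, chosen so as not to clash with the real csv file.
        Path csvFile = Paths.get("empty-pages-sketch.csv");
        // try-with-resources closes (and flushes) the writer even if a write fails.
        try (BufferedWriter out = Files.newBufferedWriter(csvFile, StandardCharsets.UTF_8)) {
            // Same column headings as in the changeset.
            printRecord(out, "siteID", "pagenum", "URL",
                    "(fetch)status", "protocolStatus", "parseStatus");
            // Placeholder row: the status values are made up for illustration.
            printRecord(out, "00020", 3, "http://example.org/a,b",
                    "placeholder-status", "placeholder-protocol", "placeholder-parse");
        }
    }
}
```

With plain string concatenation, the placeholder URL above would yield seven comma-separated values in a six-column file; with quoting, the URL stays in a single field.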
Changeset 34005 for other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java