NutchTextDumpProcessor.java

Timestamp:

2019-11-05T21:04:09+13:00

Author:

ak19

Message:

1. Incorporated Dr Nichols' earlier suggestion of storing page modified time and char-encoding metadata, if present, in the crawl dump output. This is done, but neither the modifiedTime nor the fetchTime metadata of the dump file appears to be a webpage's actual modified time, as both are from 2019 and set around the period we've been crawling. 2. Moved the getDomainForURL() function from CCWETProcessor.java to Utility.java since it's been reused. 3. The MongoDBAccess class successfully connects (at least, no exceptions are thrown) and uses the newly added properties in config.properties to make the connection.
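
For readers without the source tree at hand, a domain helper of the kind moved into Utility.java might look roughly like the sketch below. This is not the project's actual implementation: the class name DomainSketch, the meaning given to the boolean parameter (whether to keep the protocol prefix), and the fallback of returning the input unchanged are all assumptions made for illustration.

```java
import java.net.URI;
import java.net.URISyntaxException;

public class DomainSketch {
    // Hypothetical stand-in for Utility.getDomainForURL(url, withProtocol).
    // Assumption: withProtocol == true keeps the scheme, e.g. "https://host".
    public static String getDomainForURL(String url, boolean withProtocol) {
        try {
            URI uri = new URI(url);
            String host = uri.getHost();
            if (host == null) {
                // No host component (e.g. a relative path): return the input unchanged.
                return url;
            }
            if (withProtocol && uri.getScheme() != null) {
                return uri.getScheme() + "://" + host;
            }
            return host;
        } catch (URISyntaxException e) {
            // Malformed URL: return the input unchanged (an assumed fallback).
            return url;
        }
    }

    public static void main(String[] args) {
        // Placeholder URL, not taken from the crawl data.
        System.out.println(getDomainForURL("https://www.example.org/a/b.html", true));
    }
}
```

Keeping such a helper in a shared utility class lets both CCWETProcessor and NutchTextDumpProcessor call it without depending on each other.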

File:

1 edited

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified) (5 diffs)


gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java

-              r33615
+              r33623
         TextDumpPage firstPage = pages.get(0);
         String url = firstPage.getPageURL();
-        this.domainOfSite = CCWETProcessor.getDomainForURL(url, true);
+        this.domainOfSite = Utility.getDomainForURL(url, true);
     }
     else {
 …
         page.addMRILanguageStatus(isMRI);
         // Even if the entire page is not found to be overall in Māori,
         // let's still inspect the sentences of the page and count how many (if any)
 …
             webpageCSVPrinter.printRecord(WEBPAGE_COUNTER++,
                           SITE_COUNTER, /* alternative: this.siteID */
-                          url, isMRI, totalSentences, numSentencesInMRI);
+                          url,
+                          //"origCharEncoding", "modifiedTime", "fetchTime",
+                          page.getOriginalCharEncoding(),
+                          page.getModifiedTime(),
+                          page.getFetchTime(),
+                          isMRI, totalSentences, numSentencesInMRI);
             // Write the sentences that are in te reo into the mri-sentences CSV file
 …
            "domainURL","totalPagesInSite", "numPagesInMRI", "numOtherPagesContainingMRI",
            "nutchCrawlTimestamp", "crawlUnfinished", "redoCrawl");
-        webpagesCSVPrinter.printRecord("webpageID", "websiteID", "URL", "isMRI",
-                       "numSentences", "numSentencesInMRI");
+        webpagesCSVPrinter.printRecord("webpageID", "websiteID", "URL",
+                       "origCharEncoding", "modifiedTime", "fetchTime",
+                       "isMRI", "numSentences", "numSentencesInMRI");
         mriSentencesCSVPrinter.printRecord("sentenceID", "webpageID", "sentence");
 …
     } catch(Exception e) {
-        // can get an exception when instantiating CCWETProcessor instance
+        // can get an exception when instantiating NutchTextDumpProcessor instance
         // or with CSV file
         logger.error(e.getMessage(), e);
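
To illustrate the effect of the new columns on the webpages CSV, here is a small self-contained sketch that builds the header and one row in the new column order, leaving a field empty when a piece of metadata is absent. It deliberately uses a hand-rolled writer rather than the project's CSV printer, so the class WebpageRowSketch, the helper names, and all sample values are hypothetical placeholders.

```java
import java.util.ArrayList;
import java.util.List;

public class WebpageRowSketch {
    // Minimal CSV quoting: wrap in quotes if the value contains a comma,
    // quote or line break; a null value becomes an empty field.
    static String escape(Object value) {
        if (value == null) {
            return "";
        }
        String s = value.toString();
        if (s.contains(",") || s.contains("\"") || s.contains("\n") || s.contains("\r")) {
            return "\"" + s.replace("\"", "\"\"") + "\"";
        }
        return s;
    }

    // Joins the given fields into one CSV line.
    static String toCsvLine(Object... fields) {
        List<String> out = new ArrayList<>();
        for (Object f : fields) {
            out.add(escape(f));
        }
        return String.join(",", out);
    }

    public static void main(String[] args) {
        // Header in the column order introduced by this changeset.
        System.out.println(toCsvLine("webpageID", "websiteID", "URL",
                "origCharEncoding", "modifiedTime", "fetchTime",
                "isMRI", "numSentences", "numSentencesInMRI"));
        // Placeholder row: modifiedTime is missing, so its field stays empty.
        System.out.println(toCsvLine(1, 1, "https://example.org/a,b",
                "utf-8", null, "placeholder-fetch-time", true, 10, 7));
    }
}
```

Any consumer that reads the webpages CSV by column position would need updating, since the three metadata columns sit between URL and isMRI.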
