NutchTextDumpProcessor.java

Timestamp:

2019-10-17T19:31:53+13:00

Author:

ak19

Message:

Corrections for compiling the 2 new classes.

File:

1 edited

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java (modified) (6 diffs)


gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java

-              r33576
+              r33578
 import java.io.*;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.ArrayList;
+//import java.util.HashMap;
+//import java.util.Map;
 import java.lang.ArrayIndexOutOfBoundsException;
+import org.apache.log4j.Logger;
+/**
+ * Class to process the dump text files produced for each site (e.g. site "00001") that
+ * Nutch has finished crawling and whose text has been dumped out to a file called dump.txt.
+ * This reads in the dump.txt file contained in each site folder within the input folder.
+ * (e.g. input folder "crawled" could contain folders 00001 to 01465. Each contains a dump.txt)
+ * Each dump.txt could contain the text contents for an entire site, or for individual pages.
+ * This class then uses class TextDumpPage to parse each webpage within a dump.txt,
+ * which parses out the actual text body content of each webpage's section within a dump.txt.
+ * Finally, MaoriTextDetector is run over that to determine whether the full body text is
+ * likely to be in Maori or not.
+ *
+ * Potential issues: since a web page's text is dumped out by nutch with neither paragraph
+ * nor even newline separator, it's hard to be sure that the entire page is in a single language.
+ * If it's in multiple languages, there's no way to be sure there aren't promising Maori language
+ * paragraphs contained in a page, if the majority/the remainder happen to be in English.
+ *
+ * So if we're looking for any paragraphs in Maori to store in a DB, perhaps it's better to run
+ * the MaoriTextDetector.isTextInMaori(BufferedReader reader) over two "lines" at a time,
+ * instead of running it over the entire html body's text.
+ *
+ * TO COMPILE OR RUN, FIRST DO:
+ *    cd maori-lang-detection/apache-opennlp-1.9.1
+ *    export OPENNLP_HOME=`pwd`
+ *    cd maori-lang-detection/src
+ *
+ * TO COMPILE:
+ *    maori-lang-detection/src$
+ *       javac -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/*" org/greenstone/atea/NutchTextDumpProcessor.java
+ *
+ * TO RUN:
+ *    maori-lang-detection/src$
+ *       java -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpProcessor
+ *
+*/
 public class NutchTextDumpProcessor {
     private static Logger logger = Logger.getLogger(org.greenstone.atea.NutchTextDumpProcessor.class.getName());
-    private static MaoriTextDetector maoriTxtDetector = new MaoriTextDetector(false); // false: run non-silent
+    private final MaoriTextDetector maoriTxtDetector;
     public final String siteID; // is this necessary?
 …
-    public NutchTextDumpProcessor(String siteID, File txtDumpFile) {
+    public NutchTextDumpProcessor(MaoriTextDetector maoriTxtDetector, String siteID, File txtDumpFile) {
     // siteID is of the form %5d (e.g. 00020) and is just the name of a site folder
     this.siteID = siteID;
+    this.maoriTxtDetector = maoriTxtDetector;
     pages = new ArrayList<TextDumpPage>();
 …
             pageDump.append("\n");
         } else {
-            TextDumpPage page = new TextDumpPage(pageDump.toString());
+            TextDumpPage page = new TextDumpPage(siteID, pageDump.toString());
             // parses the fields and body text of a webpage in nutch's txt dump of entire site
             //page.parseFields();
 …
     String text = getTextForPage(pageID);
+    // QTODO: what to do when page body text is empty?
+    if(text.equals("")) return false;
     return maoriTxtDetector.isTextInMaori(text);
+    }
 …
     try {
+        MaoriTextDetector mriTxtDetector = new MaoriTextDetector(false); // false: run non-silent
         File[] sites = sitesDir.listFiles();
         for(File siteDir : sites) { // e.g. 00001
         // look for dump.txt
-        File txtDumpFile = new File(siteDir, dump.txt);
+        File txtDumpFile = new File(siteDir, "dump.txt");
         if(!txtDumpFile.exists()) {
             error("Text dump file " + txtDumpFile + " did not exist");
 …
         else {
             String siteID = siteDir.getName();
-            NutchTextDumpProcessor nutchTxtDump = NutchTextDumpProcessor(siteID, txtDumpFile);
+            NutchTextDumpProcessor nutchTxtDump = new NutchTextDumpProcessor(mriTxtDetector, siteID, txtDumpFile);
+        }
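
The main structural change in this diff is that the detector is no longer a static field: one MaoriTextDetector is created before the site loop and passed into each per-site processor's constructor, and an empty page body now short-circuits to false. The following is a minimal, self-contained sketch of that pattern only, not the actual class. TextDetector, CountingDetector, SiteProcessor and the keyword check are hypothetical stand-ins introduced for illustration; the real MaoriTextDetector relies on OpenNLP and is not reproduced here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class SharedDetectorSketch {

    /** Hypothetical stand-in for MaoriTextDetector (assumption, not the real API). */
    interface TextDetector {
        boolean isTextInMaori(String text);
    }

    /** Toy detector: a placeholder keyword check that also counts how often it is called. */
    static final class CountingDetector implements TextDetector {
        int calls = 0;

        @Override
        public boolean isTextInMaori(String text) {
            calls++;
            // Placeholder heuristic for the sketch only; not real language detection.
            return text.toLowerCase(Locale.ROOT).contains("kia ora");
        }
    }

    /** Simplified stand-in for NutchTextDumpProcessor: one instance per site folder. */
    static final class SiteProcessor {
        private final TextDetector detector; // injected instance field, not a static
        final String siteID;
        private final List<String> pageTexts;

        SiteProcessor(TextDetector detector, String siteID, List<String> pageTexts) {
            this.detector = detector;
            this.siteID = siteID;
            this.pageTexts = new ArrayList<>(pageTexts);
        }

        boolean isPageInMaori(int pageID) {
            String text = pageTexts.get(pageID);
            // Mirrors the guard added in the changeset: empty body text is treated as "not Maori".
            if (text.equals("")) return false;
            return detector.isTextInMaori(text);
        }

        int pageCount() {
            return pageTexts.size();
        }
    }

    public static void main(String[] args) {
        // Made-up sample data standing in for the parsed pages of each site's dump.txt.
        Map<String, List<String>> sites = new LinkedHashMap<>();
        sites.put("00001", Arrays.asList("Kia ora koutou katoa", ""));
        sites.put("00002", Arrays.asList("Hello world"));

        // Create the detector once, outside the loop, and share it across all sites.
        CountingDetector detector = new CountingDetector();
        for (Map.Entry<String, List<String>> site : sites.entrySet()) {
            SiteProcessor processor = new SiteProcessor(detector, site.getKey(), site.getValue());
            for (int pageID = 0; pageID < processor.pageCount(); pageID++) {
                System.out.println(processor.siteID + " page " + pageID
                        + " in Maori: " + processor.isPageInMaori(pageID));
            }
        }
        System.out.println("Detector calls: " + detector.calls);
    }
}
```

Passing the detector in through the constructor means a potentially expensive model is built once for the whole crawl rather than tied to class loading, and it lets a stub be substituted when testing the processor in isolation.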

Changeset 33578 for gs3-extensions/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpProcessor.java
