Timestamp:
2020-02-28T22:09:15+13:00
Author:
ak19
Message:
  1. Print out which web pages of which web site's dump.txt were empty. Then can run NutchTextDumpToMongoDB > outfile.txt 2>&1.
  2. More instructions for before running NutchTextDumpToMongoDB
File:
1 edited

  • other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java

    r33983  r33988
    40      40       *    cd maori-lang-detection/src
    41      41       *
            42       * MORE IMPORTANT PRELIMINARIES:
            43       * - Make sure the MongoDB is up and running and accessible.
            44       * - If you want to keep any existing MongoDB collections called Websites and Webpages, then
            45       * first renamed those collections in MongoDB (using Robo3T makes renaming easy) before
            46       * running this program.
            47       *
    42      48       * TO COMPILE:
    43      49       *    maori-lang-detection/src$
     
    245     251
    246     252            if(text.equals("")) {
            253            System.err.println("siteID: " + siteID + "- Empty page " + i + " - URL: "
            254                       + page.getPageURL());
            255
    247     256            // don't care about empty pages
    248     257            continue;
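
The second hunk reports each empty page on stderr before skipping it, which is why the commit message suggests running with `> outfile.txt 2>&1` to capture the report. The following is a minimal, self-contained sketch of that report-and-skip pattern, not the actual code of NutchTextDumpToMongoDB. The class name `EmptyPageReportSketch`, the method `keepNonEmpty`, the URL-to-text map standing in for the page objects, and the zero-based page index are all assumptions made for illustration; only the message format is copied from the diff above.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the page loop touched by this changeset.
// Pages are modelled as URL -> text entries instead of the real page objects.
final class EmptyPageReportSketch {

    // Returns the URLs of pages that have text; reports empty pages on 'err'.
    static List<String> keepNonEmpty(String siteID, Map<String, String> pages, PrintStream err) {
        List<Map.Entry<String, String>> entries = new ArrayList<>(pages.entrySet());
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < entries.size(); i++) {  // zero-based index is an assumption
            String url = entries.get(i).getKey();
            String text = entries.get(i).getValue();
            if (text.equals("")) {
                // Same message format as the lines added in r33988.
                err.println("siteID: " + siteID + "- Empty page " + i + " - URL: " + url);
                // don't care about empty pages
                continue;
            }
            kept.add(url);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();  // keeps insertion order
        pages.put("http://example.org/a", "Kia ora");
        pages.put("http://example.org/b", "");
        System.out.println(keepNonEmpty("00001", pages, System.err));
    }
}
```

Passing the `PrintStream` in as a parameter is a choice made for this sketch so the report can be captured in memory; the changeset itself writes straight to `System.err`.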