Changeset 33988

Timestamp: 28.02.2020 22:09:15
Author: ak19
Message:

1. Print out which web pages of which website's dump.txt were empty. The program can then be run as NutchTextDumpToMongoDB > outfile.txt 2>&1 to collect that output in a file. 2. Added more instructions on what to do before running NutchTextDumpToMongoDB.

Files:
1 modified

  • other-projects/maori-lang-detection/src/org/greenstone/atea/NutchTextDumpToMongoDB.java

    r33983  r33988
    40      40       *    cd maori-lang-detection/src
    41      41       *
            42       * MORE IMPORTANT PRELIMINARIES:
            43       * - Make sure the MongoDB is up and running and accessible.
            44       * - If you want to keep any existing MongoDB collections called Websites and Webpages, then
            45       * first renamed those collections in MongoDB (using Robo3T makes renaming easy) before
            46       * running this program.
            47       *
    42      48       * TO COMPILE:
    43      49       *    maori-lang-detection/src$
    …       …
    245     251
    246     252             if(text.equals("")) {
            253             System.err.println("siteID: " + siteID + "- Empty page " + i + " - URL: "
            254                        + page.getPageURL());
            255
    247     256             // don't care about empty pages
    248     257             continue;
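
The second hunk reports each empty page on standard error before skipping it. A minimal, self-contained sketch of that pattern follows. The `EmptyPageReport` class, the `Page` type, and the `skipEmptyPages` method are hypothetical stand-ins for illustration and are not taken from NutchTextDumpToMongoDB.java. The real loop works on the program's own page objects and calls `page.getPageURL()`.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the code in NutchTextDumpToMongoDB.java.
class EmptyPageReport {

    /** Hypothetical stand-in for a page read from a site's dump.txt. */
    static final class Page {
        final String url;
        final String text;

        Page(String url, String text) {
            this.url = url;
            this.text = text;
        }
    }

    /**
     * Returns the pages that contain text. Each empty page is reported on
     * the given error stream with its site ID, index, and URL, then skipped.
     */
    static List<Page> skipEmptyPages(String siteID, List<Page> pages, PrintStream err) {
        List<Page> kept = new ArrayList<>();
        for (int i = 0; i < pages.size(); i++) {
            Page page = pages.get(i);
            if (page.text.equals("")) {
                // Message format follows the changeset, with a space added
                // before the first hyphen for readability.
                err.println("siteID: " + siteID + " - Empty page " + i
                        + " - URL: " + page.url);
                continue; // don't care about empty pages
            }
            kept.add(page);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Page> pages = new ArrayList<>();
        pages.add(new Page("http://example.org/a", "kia ora"));
        pages.add(new Page("http://example.org/b", ""));

        // Reports go to System.err, the summary goes to System.out.
        List<Page> kept = skipEmptyPages("00001", pages, System.err);
        System.out.println("Kept " + kept.size() + " of " + pages.size() + " pages");
    }
}
```

Because the reports go to standard error, a redirect of the form `> outfile.txt 2>&1`, as given in the commit message, sends both the normal output and the empty-page reports to the same file.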