source: gs3-extensions/maori-lang-detection/src/org/greenstone/atea/MRIWebPageStats.java@ 33587

Last change on this file since 33587 was 33587, checked in by ak19, 5 years ago
  1. Better stats reporting on crawled sites: not just whether a page was in MRI or not, but for pages that contained any text, there's also reporting on how many sentences were detected as MRI (even if the overall text body of the page was not detected as being primarily MRI). This can be useful later when or if we want to store MRI language sentences/paragraphs. Currently only useful if I've implemented it sensibly.
  2. MaoriTextDetector.java::getAllSentencesInMaori() and TextLanguageDetector.java::getAllSentencesInLanguage() now store the total number of sentences in the text parameter as the first element in the ArrayList returned.
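A minimal sketch of how a caller might unpack a list that follows the convention in item 2. The commit message does not state the list's element type, so this assumes an ArrayList of String whose first element is the total sentence count written as a decimal string; the class and method names here are hypothetical and not part of the repository.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SentenceListSketch {
    // ASSUMPTION: element 0 holds the total sentence count as a decimal string.
    static int totalSentences(List<String> result) {
        return Integer.parseInt(result.get(0));
    }

    // Everything after the leading count element is a detected sentence.
    static List<String> detectedSentences(List<String> result) {
        return result.subList(1, result.size());
    }

    public static void main(String[] args) {
        // Hypothetical return value: 3 sentences in the text, 2 of them detected.
        ArrayList<String> result =
            new ArrayList<>(Arrays.asList("3", "Kia ora.", "Haere mai."));
        System.out.println(detectedSentences(result).size() + "/" + totalSentences(result));
    }
}
```

The two values read this way would correspond to the numSentencesInMRI and numSentences fields of MRIWebPageStats.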
File size: 1.4 KB
package org.greenstone.atea;

//import org.apache.log4j.Logger;

/**
 * Immutable per-page statistics for a crawled web page: whether the page
 * was detected as Māori (mri) and how many of its sentences were.
 */
public class MRIWebPageStats {
    //private static Logger logger = Logger.getLogger(org.greenstone.atea.MRIWebPageStats.class.getName());

    public final String siteID; // crawled site's folder name e.g. 00510
    public final String URL; // URL of webpage
    public final int pageID; // index into NutchTextDumpProcessor::pages ArrayList

    public final boolean isMRI; // whether the page's overall text body was detected as primarily Māori (mri)
    public final int numSentences; // count of all sentences in the webpage's body
    public final int numSentencesInMRI; // count of sentences in the webpage's body in Māori (mri)

    public MRIWebPageStats(String siteID, String url, int pageID, boolean isMRI,
                           int numSentences, int numSentencesInMRI)
    {
        this.siteID = siteID;
        this.URL = url;
        this.pageID = pageID;

        this.isMRI = isMRI;
        this.numSentences = numSentences;
        this.numSentencesInMRI = numSentencesInMRI;
    }

    @Override
    public String toString() {
        StringBuilder str = new StringBuilder();
        str.append("URL: ").append(this.URL);
        str.append("\nsiteID: ").append(this.siteID);
        str.append("\nnum sentences in MRI: ")
           .append(this.numSentencesInMRI).append("/").append(this.numSentences);
        if(this.isMRI && this.numSentencesInMRI <= 0) {
            // the page as a whole was detected as MRI, but it didn't contain proper sentences in MRI
            str.append(" (no PROPER sentences in MRI)");
        }
        return str.toString();
    }
}
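A short usage sketch showing what toString() reports in the ordinary case and in the edge case where the page is flagged as MRI but no sentence was counted. The nested Stats class is a trimmed stand-in for MRIWebPageStats, repeated only so the sketch compiles on its own; the URLs and counts are hypothetical.

```java
public class MRIWebPageStatsUsage {
    // Trimmed stand-in for MRIWebPageStats (same fields, same report format).
    static class Stats {
        final String siteID, url;
        final int pageID, numSentences, numSentencesInMRI;
        final boolean isMRI;

        Stats(String siteID, String url, int pageID, boolean isMRI,
              int numSentences, int numSentencesInMRI) {
            this.siteID = siteID;
            this.url = url;
            this.pageID = pageID;
            this.isMRI = isMRI;
            this.numSentences = numSentences;
            this.numSentencesInMRI = numSentencesInMRI;
        }

        @Override
        public String toString() {
            String s = "URL: " + url + "\nsiteID: " + siteID
                + "\nnum sentences in MRI: " + numSentencesInMRI + "/" + numSentences;
            if (isMRI && numSentencesInMRI <= 0) {
                s += " (no PROPER sentences in MRI)";
            }
            return s;
        }
    }

    public static void main(String[] args) {
        // Hypothetical page: not MRI overall, but 3 of its 10 sentences are.
        Stats mixed = new Stats("00510", "https://example.org/a", 0, false, 10, 3);
        // Hypothetical page: MRI overall, yet no sentence counted as MRI.
        Stats noSentences = new Stats("00510", "https://example.org/b", 1, true, 4, 0);
        System.out.println(mixed);
        System.out.println(noSentences);
    }
}
```

The first page's report ends with "3/10"; the second ends with "0/4 (no PROPER sentences in MRI)", because the suffix is appended only when isMRI is true and the MRI sentence count is zero or less.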