Changeset 29581

Show
Ignore:
Timestamp:
11.12.2014 14:34:58 (5 years ago)
Author:
kjdon
Message:

in gs2mgppdemo, a query for 'government' was coming back with totalMatchDocs 127, but in term info, it said 'government' was found in 108 docs. This is because when generating the list of word nums for 'government', it looks up the equivalent terms (due to casefolding, stemming etc.) and there are 2: 'government' and 'Government'. It gets the list of word positions for each one and merges the lists. When you get the list of word positions, you also get back the number of docs/secs that match the word. 'Government' had 42, and 'government' had 108. The merging code says that for total match docs we'll just take the larger number, i.e. 108. Later on, this figure is used as the total number of matching documents for the ranking calculation, and for the info in the query result.
I have added a new variable, actual_num_match_docs, which we increment as we go through the word position lists and generate doc/sec numbers. This is the point at which we actually know how many matches we have. For FragsToQueryResult, instead of calculating ranks as we generate each doc num, I am just storing the doc term freq; then, once we know the actual number, we can calculate the term weight and query term weight to generate the ranks. I still need to modify AndFragsToQueryResult similarly. AndFragsToQueryResult currently calculates actual_num_match_docs and uses it in the query result, but it doesn't yet use it for the rank generation.

Files:
1 modified

Legend:

Unmodified
Added
Removed
  • main/trunk/greenstone2/common-src/indexers/mgpp/text/Terms.cpp

    r26138 r29581  
    360360  outFragData.matchDocs = (f1.matchDocs > f2.matchDocs) ? 
    361361    f1.matchDocs : f2.matchDocs; 
    362  
    363362  // do or 
    364363  mg_u_long f1I = 0, f1Size = f1.fragNums.size(); 
    365364  mg_u_long f2I = 0, f2Size = f2.fragNums.size(); 
     365 
    366366  while (f1I < f1Size || f2I < f2Size) { 
    367367    if (f2I < f2Size && 
     
    484484  // log (N / ft) 
    485485  mg_u_long N = indexData.levels.levelInfo[indexData.curLevel].numEntries; 
    486   float wordLog = log((double)N / (double)termData.matchDocs); 
     486  // termData.matchDocs is not accurate - its just the largest docfreq out of the list of equiv terms. We'll delay calculating ranks until after we have worked out exactly how many docs we have 
     487  //float wordLog = log((double)N / (double)termData.matchDocs); 
    487488 
    488489  // Wqt = fqt * log (N / ft) 
    489490  // note: terms are allowed to have a weight of zero so 
    490491  // they can be excluded from the ranking 
    491   float Wqt = termWeight * wordLog; 
     492  //float Wqt = termWeight * wordLog; 
    492493 
    493494  // Wdt = fdt * log (N / ft) 
    494   float Wdt; 
    495    
     495  //float Wdt; 
     496  mg_u_long actual_num_match_docs = 0; 
     497  vector<mg_u_long> docFreqsArray; 
     498 
    496499  mg_u_long termDataI = 0; 
    497500  mg_u_long termDataSize = termData.fragNums.size(); 
     
    509512      // add this doc information 
    510513      if (needRanks) { 
    511         Wdt = termDocFreq * wordLog; 
    512         result.ranks.push_back (Wqt * Wdt); 
     514        //Wdt = termDocFreq * wordLog; 
     515        //result.ranks.push_back (Wqt * Wdt); 
     516        docFreqsArray.push_back(termDocFreq); 
    513517      } 
    514518      result.docs.push_back (lastLevelDocNum); 
     519      ++actual_num_match_docs; 
    515520    } 
    516521     
     
    530535    // add the last document information 
    531536    if (needRanks) { 
    532       Wdt = termDocFreq * wordLog; 
    533       result.ranks.push_back (Wqt * Wdt); 
     537      //Wdt = termDocFreq * wordLog; 
     538      //result.ranks.push_back (Wqt * Wdt); 
     539      docFreqsArray.push_back(termDocFreq); 
    534540    } 
    535541    result.docs.push_back (lastLevelDocNum); 
     542    ++actual_num_match_docs; 
     543  } 
     544  // Now that we know the actual number of docs containing this term, we can calculate ranks 
     545  float wordLog = log((double)N / (double)actual_num_match_docs); 
     546  float Wqt = termWeight * wordLog; 
     547  float factor = wordLog * Wqt; 
     548 
     549  mg_u_long docFreqI = 0; 
     550  mg_u_long docFreqSize = docFreqsArray.size(); 
     551   
     552  while (docFreqI < docFreqSize) { 
     553    result.ranks.push_back(docFreqsArray[docFreqI]*factor); 
     554    ++docFreqI; 
    536555  } 
    537556 
     
    543562    termFreqData.stemMethod = stemMethod; 
    544563    termFreqData.equivTerms = equivTerms; 
    545     termFreqData.matchDocs = termData.matchDocs; 
     564    //termFreqData.matchDocs = termData.matchDocs; 
     565    termFreqData.matchDocs = actual_num_match_docs; 
    546566    termFreqData.termFreq = overallwordfreq; // will be zero if needRankInfo  
    547567                                              //not true 
     
    585605  mg_u_long resultOutI = 0; 
    586606   
     607  mg_u_long actual_num_term_match_docs = 0; 
    587608   
    588609  while (termDataI < termDataSize) { 
     
    591612      if (levelDocNum != lastLevelDocNum) { 
    592613    if (lastLevelDocNum > 0) { 
    593       // add this doc information 
     614      ++actual_num_term_match_docs; 
     615 
    594616      Wdt = termDocFreq * wordLog; 
    595617       
     
    622644 
    623645  if (lastLevelDocNum > 0) { 
     646    ++actual_num_term_match_docs; 
    624647    // add the last document information 
    625648    Wdt = termDocFreq * wordLog; 
     
    654677    termFreqData.stemMethod = stemMethod; 
    655678    termFreqData.equivTerms = equivTerms; 
    656     termFreqData.matchDocs = termData.matchDocs; 
     679    //termFreqData.matchDocs = termData.matchDocs; 
     680    termFreqData.matchDocs = actual_num_term_match_docs; 
    657681    termFreqData.termFreq = overallwordfreq; 
    658682    result.termFreqs.push_back (termFreqData);