CCWETProcessor.java

Timestamp:

2019-10-24T22:04:37+13:00

Author:

ak19

Message:

Incorporating Dr Nichols' suggestion to help weed out product sites: if a seed URL contains /mi/ and the TLD of its domain is outside NZ, add it to the list in possible-product-sites.txt. This should hopefully be a smaller set than all URLs containing /mi and, because these sites are located outside NZ, they are more likely to be product sites than not.
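
The heuristic described in the message can be sketched in isolation as follows. This is a minimal illustration, not the committed code: the class name `ProductSiteHeuristic` and the `hostedInNZ` predicate are hypothetical, and the predicate merely stands in for the `Utility.isDomainInCountry(...)` lookup, whose implementation is not part of this changeset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Predicate;

// Hypothetical helper that mirrors the check added in this changeset.
class ProductSiteHeuristic {
    // Domains already reported, so each domain line is emitted only once.
    private final Set<String> flaggedDomains = new TreeSet<>();

    // Assumption: stand-in for the geolocation lookup; returns true if the
    // domain is judged to be hosted in New Zealand.
    private final Predicate<String> hostedInNZ;

    ProductSiteHeuristic(Predicate<String> hostedInNZ) {
        this.hostedInNZ = hostedInNZ;
    }

    // Cheap textual test: the URL has a /mi/ path segment or ends in /mi.
    static boolean hasMiPathSegment(String url) {
        return url.contains("/mi/") || url.endsWith("/mi");
    }

    // Returns the lines that would be appended to the report for this URL
    // (empty if the URL is not a candidate).
    List<String> check(String domainWithProtocol, String url) {
        List<String> lines = new ArrayList<>();
        if (domainWithProtocol.endsWith(".nz") || !hasMiPathSegment(url)) {
            return lines; // .nz domain or no /mi segment: not a candidate
        }
        if (flaggedDomains.contains(domainWithProtocol)) {
            lines.add("\t" + url); // domain already reported: URL only
        } else if (!hostedInNZ.test(domainWithProtocol)) {
            // The more expensive lookup runs only after the cheap tests pass.
            flaggedDomains.add(domainWithProtocol);
            lines.add(domainWithProtocol);
            lines.add("\t" + url);
        }
        return lines;
    }

    public static void main(String[] args) {
        // For the demo, pretend no domain is hosted in NZ.
        ProductSiteHeuristic heuristic = new ProductSiteHeuristic(domain -> false);
        String domain = "https://example.com";
        for (String url : new String[] { domain + "/mi/products", domain + "/mi" }) {
            for (String line : heuristic.check(domain, url)) {
                System.out.println(line);
            }
        }
    }
}
```

Passing the lookup in as a predicate keeps the sketch runnable without the GeoLiteCity.dat file; in the changeset itself the lookup is called directly.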

File:

1 edited

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) (6 diffs)

Legend:

(no prefix): Unmodified
+: Added
-: Removed

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-              r33582
+              r33603
      */
     public void createSeedURLsFiles(File seedURLsFile, File urlFilterFile,
-                    File domainURLsFile, File topSiteMatchesFile) {
+                    File domainURLsFile, File topSiteMatchesFile,
+                    File possibleProductSitesFile) {
     // Maintain a Map of unique domains mapped to seed urls at that domain
     // TreeSet: by default, "the elements are ordered using their natural ordering"
 …
     final String PROTOCOL_REGEX_PREFIX = "+^https?://";
     final String FILTER_REGEX_PREFIX = PROTOCOL_REGEX_PREFIX + "([a-z0-9-]+\\.)*"; // https?://([a-z0-9-]+\.)* for nutch's regex-urlfilter.txt
+    // keep an eye out on URLs we need to inspect later
+    Set<String> possibleProductDomains = new TreeSet<String>();
+    File geoLiteCityDatFile = new File(MY_CLASSLOADER.getResource("GeoLiteCity.dat").getFile());
     try (
          BufferedReader reader = new BufferedReader(new FileReader(this.keepURLsFile));
+         BufferedWriter possibleProductSitesWriter = new BufferedWriter(new FileWriter(possibleProductSitesFile));
          ) {
 …
+        }
+        // Dr Nichols said that a url that was located outside the country and
+        // which had /mi/ URLs was more likely to be an autotranslated (product) site.
+        // Following Dr Nichols' idea, let's keep a look out for more product sites:
+        // if any URL contains /mi AND the tld of its domain is outside of New Zealand
+        // then add that domain (if not already added) and that url into a file
+        // for later manual inspection
+        if(!domainWithProtocol.endsWith(".nz") && (url.contains("/mi/") || url.endsWith("/mi"))) {
+            if(!possibleProductDomains.contains(domainWithProtocol)) {
+                // more expensive test, so do this only if above conditions are true:
+                if(!Utility.isDomainInCountry(domainWithProtocol, "nz", geoLiteCityDatFile)) {
+                    possibleProductDomains.add(domainWithProtocol);
+                    // write both domain and URL out to file
+                    possibleProductSitesWriter.write(domainWithProtocol + "\n");
+                    possibleProductSitesWriter.write("\t" + url + "\n");
+                }
+            } else {
+                // already wrote out domain to file, write just the URL out to file
+                possibleProductSitesWriter.write("\t" + url + "\n");
+            }
+        }
+        }
     } catch (IOException ioe) {
 …
          BufferedWriter seedURLsWriter = new BufferedWriter(new FileWriter(seedURLsFile));
          BufferedWriter urlFilterWriter = new BufferedWriter(new FileWriter(urlFilterFile));
-         BufferedWriter topSiteMatchesWriter = new BufferedWriter(new FileWriter(topSiteMatchesFile))
+         BufferedWriter topSiteMatchesWriter = new BufferedWriter(new FileWriter(topSiteMatchesFile));
          ) {
 …
             siteURLsWriter.write(url + "\n");
+            }
             if(allowedURLPatternRegex == null) { // entire site can be crawled
 …
     File domainURLsFile = new File(outFolder, "all-domain-urls.txt");
     File topSitesMatchedFile = new File(outFolder, "unprocessed-topsite-matches.txt");
-    ccWETFilesProcessor.createSeedURLsFiles(seedURLsFile, urlFilterFile, domainURLsFile, topSitesMatchedFile);
+    File possibleProductSitesFile = new File(outFolder, "possible-product-sites.txt");
+    ccWETFilesProcessor.createSeedURLsFiles(seedURLsFile, urlFilterFile, domainURLsFile, topSitesMatchedFile, possibleProductSitesFile);
     info("\n*** Inspect urls in greylist at " + ccWETFilesProcessor.greyListedFile + "\n");
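
For the later manual inspection, the report written above (a domain line followed by tab-indented URL lines) could be read back roughly as sketched below. The class `ProductSitesReport` and its `parse` method are hypothetical and not part of the changeset. The sketch also assumes that each URL line belongs to the most recent domain line above it, which only holds if the input URLs arrive grouped by domain; the diff itself does not guarantee that ordering.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical reader for the "domain, then tab-indented URLs" layout.
class ProductSitesReport {
    // Groups URL lines under the nearest preceding domain line.
    // Assumption: URLs of one domain are contiguous in the file.
    static Map<String, List<String>> parse(Iterable<String> lines) {
        Map<String, List<String>> urlsByDomain = new LinkedHashMap<>();
        List<String> current = null;
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // ignore blank lines
            }
            if (line.startsWith("\t")) {
                if (current != null) {
                    current.add(line.substring(1)); // strip the leading tab
                }
            } else {
                // A non-indented line starts (or resumes) a domain entry.
                current = urlsByDomain.computeIfAbsent(line, k -> new ArrayList<>());
            }
        }
        return urlsByDomain;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            "https://example.com",
            "\thttps://example.com/mi/products",
            "\thttps://example.com/mi");
        parse(sample).forEach((domain, urls) ->
            System.out.println(domain + ": " + urls.size() + " url(s)"));
    }
}
```

In practice the lines could come from `java.nio.file.Files.readAllLines` on possible-product-sites.txt.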

Changeset 33603 for gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java
