CCWETProcessor.java

Timestamp:

2019-11-05T21:04:09+13:00 (4 years ago)

Author:

ak19

Message:

1. Incorporated Dr Nichols' earlier suggestion of storing page modified time and char-encoding metadata, if present, in the crawl dump output. Have done so, but neither the modifiedTime nor the fetchTime metadata of the dump file appears to be a webpage's actual modified time, as both are from 2019 and set around the period we've been crawling.
2. Moved the getDomainForURL() function from CCWETProcessor.java to Utility.java since it's been reused.
3. MongoDBAccess class successfully connects (at least, no exceptions are thrown) and uses the newly added properties in config.properties to make the connection.

File:

1 edited

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java (modified) (7 diffs)

Legend:

(no marker): Unmodified
+: Added
-: Removed

gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java

-              r33615
+              r33623
+    }
-    /** Work out the 'domain' for a given url.
-     * This retains any www. or subdomain prefix.
-     */
-    public static String getDomainForURL(String url, boolean withProtocol) {
-    int startIndex = startIndex = url.indexOf("//"); // for http:// or https:// prefix
-    startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
-    // the keep the URL around in case param withProtocol=true
-    String protocol = (startIndex == -1) ? "" : url.substring(0, startIndex);
-    String domain = url.substring(startIndex);
-    int endIndex = domain.indexOf("/");
-    if(endIndex == -1) endIndex = domain.length();
-    domain = domain.substring(0, endIndex);
-    if(withProtocol) {
-        // now that we have the domain (everything to the first / when there is no protocol)
-        // can glue the protocol back on
-        domain = protocol + domain;
+    }
-    return domain;
+    }
     /** Utility function to help escape regex characters in URL to go into regex-urlfilter.txt */
 …
         // work out domain. This retains any www. or subdomain prefix
         // passing true to further also retain the http(s) protocol
-        domainWithProtocol = getDomainForURL(url, true);
+        domainWithProtocol = Utility.getDomainForURL(url, true);
         Set<String> urlsSet;
 …
+        }
+        /*
         // Dr Nichols said that a url that was located outside the country and
         // which had /mi/ URLs was more likely to be an autotranslated (product) site.
 …
         // then add that domain (if not already added) and that url into a file
         // for later manual inspection
+        if(!domainWithProtocol.endsWith(".nz") && (url.contains("/mi/") || url.endsWith("/mi"))) {
+            /*
+        if(!domainWithProtocol.endsWith(".nz")
+           && (url.contains("/mi/") || url.endsWith("/mi"))) {
             if(!possibleProductDomains.contains(domainWithProtocol)) {
 …
             if(!isInNZ) {
                 possibleProductDomains.add(domainWithProtocol);
-                // write both domain and a sample URL on that site out to file
+                // write both domain and a sample seedURL on that site out to file
                 possibleProductSitesWriter.write(countryCode + " : " + domainWithProtocol + "\n");
                 possibleProductSitesWriter.write("\t" + url + "\n");
+            }
+            }*/ /*else {
+            // already wrote out domain to file at some point, write just the URL out to file
+            possibleProductSitesWriter.write("\t" + url + "\n");
+            }*/
+        }
+            }
+            //else {
+            // already wrote out domain to file at some point, write just the URL out to file
+            //possibleProductSitesWriter.write("\t" + url + "\n");
+            //}
+        }
+        */
+        }
     } catch (IOException ioe) {
 …
     // if any portion of the URL contains the word "livejasmin", or even "jasmin" actually,
     // then it's an adult site, so blacklist the entire domain if it wasn't already blacklisted
-    String domainWithoutProtocol = getDomainForURL(url, false); // remove protocol
+    String domainWithoutProtocol = Utility.getDomainForURL(url, false); // remove protocol
     if(!isBlackListed && url.contains("jasmin")) {
         logger.warn("### Blacklisting additional domain (likely an adult site): " + domainWithoutProtocol);
 …
     public static void printUsage() {
     System.err.println("Run this program as:");
-    System.err.println("\tCCWetProcessor <path to 'ccrawl-data' folder> <output folder path>");
+    System.err.println("\tCCWetProcessor <path to 'ccrawl-data' input folder> <output folder path>");
+    }
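
For reference, the getDomainForURL() logic removed above can be sketched as a standalone helper. This is only an illustration of the removed lines, not the actual contents of Utility.java, which this changeset view does not show; the class name DomainSketch is hypothetical. In the removed code, the second `startIndex == -1` test can never be true because startIndex has already been normalised to 0 or more, but `url.substring(0, 0)` yields an empty string anyway, so the sketch below should behave the same way while dropping the dead check and the doubled `startIndex = startIndex =` assignment.

```java
// Hypothetical standalone sketch of the domain extraction shown in the removed lines.
// The class name is an assumption; the real method now lives in Utility.java.
final class DomainSketch {
    private DomainSketch() {}

    /**
     * Work out the 'domain' for a given url.
     * This retains any www. or subdomain prefix.
     * When withProtocol is true, any http:// or https:// prefix is kept as well.
     */
    static String getDomainForURL(String url, boolean withProtocol) {
        // Position of the "//" in an http:// or https:// prefix, or -1 if absent
        int slashes = url.indexOf("//");
        // Skip past the protocol's // portion when there is one
        int startIndex = (slashes == -1) ? 0 : slashes + 2;

        // Empty string when the URL has no protocol prefix
        String protocol = url.substring(0, startIndex);
        String domain = url.substring(startIndex);

        // The domain is everything up to the first / after the protocol
        int endIndex = domain.indexOf('/');
        if (endIndex != -1) {
            domain = domain.substring(0, endIndex);
        }

        // Glue the protocol back on only when the caller asked for it
        return withProtocol ? protocol + domain : domain;
    }
}
```

With this sketch, `getDomainForURL("https://www.example.org/mi/page", true)` returns `https://www.example.org`, and passing `false` returns `www.example.org`. Like the removed code, it takes the first `//` anywhere in the string as the protocol separator, so a protocol-less URL that contains `//` later in its path would be split at that point.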

Changeset 33623 for gs3-extensions/maori-lang-detection/src/org/greenstone/atea/CCWETProcessor.java