package org.greenstone.atea;


import java.io.*;
import java.util.Properties;
import java.util.zip.GZIPInputStream;
import java.util.Iterator;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

import org.apache.log4j.Logger;

/**
 * The main() method of this class takes a folder of warc.wet(.gz) files and goes through
 * the WET records in each, putting each WET record into a file. Each file is put into a
 * keep or discard or greyListed folder, and its url written into a keep, discard
 * or greylisted text file, based on
 *
 * 1. whether it's whitelisted, else greylisted, else blacklisted
 * 2. and, if explicitly whitelisted or else not greylisted or blacklisted, whether there's
 * enough content. Formerly, content-length and number of lines were used to determine if
 * the content was sufficient. Now it's just the word count, with a MAXIMUM number of
 * characters (not a MINIMUM) determining whether a string counts as a word. These settings
 * can be adjusted in conf/config.properties.
 *
 * Put a url-blacklist-filter.txt and/or url-greylist-filter.txt and/or url-whitelist-filter.txt
 * into the conf folder to control any url patterns that are explicitly included or excluded or
 * set aside for inspecting later. These filter text files don't use regexes; instead their
 * format is:
 * - precede a URL by ^ to blacklist urls that match the given prefix
 * - follow a URL by $ to blacklist urls that match the given suffix
 * - ^url$ will blacklist urls that match the given url completely
 * - Without either ^ or $ symbol, urls containing the given url will get blacklisted
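 *
 * For example, some purely illustrative (hypothetical) filter file entries:
 *   ^https://example.com/private/       matches urls starting with this prefix
 *   .pdf$                               matches urls ending with this suffix
 *   ^https://example.com/landing.html$  matches exactly this url
 *   translate.example.com               matches any url containing this string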
 *
 * WETProcessor.java's current implementation is that explicit whitelisting takes precedence
 * over greylisting, which in turn takes precedence over blacklisting. However, even
 * explicitly whitelisted urls still need to have sufficient content to end up in keepURLs.txt
 * and in the seedURLs.txt file used for nutch, along with their domains in regex-urlfilter.txt,
 * also for nutch.
 *
 * A CCWETProcessor instance can be configured to process all the .warc.wet(.gz) files
 * in the given input folder. Then use a single instance of the WETProcessor class to process
 * each single unzipped warc.wet file.
 *
 * To compile, include the jars in lib/ on the classpath:
 * maori-lang-detection/src$ javac -cp ".:../lib/*" org/greenstone/atea/CCWETProcessor.java
 *
 * To run, passing the log4j and other properties files in the conf/ folder:
 * maori-lang-detection/src$ java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor <folder containing warc.wet(.gz) files> <outputFolder>
 *
 * e.g.
 * - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET
 * - java -cp ".:../conf:../lib/*" org.greenstone.atea.CCWETProcessor ../tmp/processWET /Scratch/ak19/gs3-extensions/maori-lang-detection/tmp/processedWET 2>&1 | less
 *
 */

public class CCWETProcessor {
    private static Logger logger = Logger.getLogger(org.greenstone.atea.CCWETProcessor.class.getName());

    // Properties shared across WETProcessor instances
    public final int MAX_WORD_LENGTH;
    public final int MIN_NUM_WORDS;
    public final int MAX_WORDS_CAMELCASE;

    // constants for the possible fixed values in the sites-too-big-to-exhaustively-crawl.txt file
    public final String SUBDOMAIN_COPY = "SUBDOMAIN-COPY";
    public final String SINGLEPAGE = "SINGLEPAGE";

    /**
     * Characters that need escaping if used as a string literal in a regex
     * https://stackoverflow.com/questions/399078/what-special-characters-must-be-escaped-in-regular-expressions
     * https://www.regular-expressions.info/refcharacters.html
     */
    //public final String[] ESCAPE_CHARS_FOR_RE = [".", "^", "$", "*", "+", "?", "(", ")", "[", "{", "\\", "|"];
    // put the \\ at the start so we don't escape the backslashes added for characters escaped earlier
    public final String ESCAPE_CHARS_FOR_RE = "\\.^$*+?()[{|";

    private Properties configProperties = new Properties();

    // File paths shared across WETProcessor instances
    public final File commoncrawlDir;
    public final File outputFolder;
    public final File discardFolder;
    public final File keepFolder;
    public final File greyListedFolder;
    public final File keepURLsFile;
    public final File discardURLsFile;
    public final File greyListedFile;

    /** Possible values stored in the blackList/whiteList/greyList Maps */
    private final Integer LIST_ENTRY_CONTAINS = new Integer(0);
    private final Integer LIST_ENTRY_STARTSWITH = new Integer(1);
    private final Integer LIST_ENTRY_ENDSWITH = new Integer(2);
    private final Integer LIST_ENTRY_MATCHES = new Integer(3);

    /**
     * Store url patterns as keys and values indicating whether a url should
     * match it exactly, start/end with it, or contain it
     */
    private HashMap<String, Integer> blackList;
    private HashMap<String, Integer> greyList;
    private HashMap<String, Integer> whiteList;

    /** Map of topsites with allowable regexes: sites too big to exhaustively crawl,
     * each with an optional regex defining allowed exceptions, like subdomains or url suffixes
     * off that top site. For example, wikipedia.org is a topsite, but mi.wikipedia.org
     * is relevant. Or blogspot.com is a top site, but someone's pages in Maori off blogspot
     * would be relevant.
     * The map stores the top site domain suffix and an optional regex string for allowable
     * url patterns.
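     *
     * Illustrative (hypothetical) entries for conf/sites-too-big-to-exhaustively-crawl.txt,
     * where the optional second column is separated from the domain by a tab:
     *   wikipedia.org   <tab>   mi.wikipedia.org
     *   blogspot.com    <tab>   SUBDOMAIN-COPY
     *   docs.google.com <tab>   SINGLEPAGE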
     */
    private HashMap<String, String> topSitesMap;

    /** Map of domains we keep and the full urls we're keeping that are of that domain.
     * There's no strict need for a TreeMap, which preserves natural (alphabetical) ordering
     * of keys, over a HashMap, which has no notion of ordering, because we just need to store
     * urls with their domains. Whether the domains are sorted or the urls per domain are sorted
     * becomes irrelevant. (Does it really? What if we have urls followed vs preceded by urls
     * with the same prefix, e.g. pinky.com/toto/index.html and pinky.com/toto/nono/file.html?
     * Is there any benefit to nutch when crawling if these seedURLs are ordered or not?)
     */
    private Map<String, Set<String>> domainsToURLsMap;

    // Keep a count of all the records that all WETProcessors instantiated
    // by our main method combined have processed
    private int totalRecordCount = 0;

    private int wetFileCount = 0;

    public CCWETProcessor(File inFolder, File outFolder) throws Exception {
        this.commoncrawlDir = inFolder;
        this.outputFolder = outFolder;

        // load up the properties from the config file
        try (InputStream infile = org.greenstone.atea.CCWETProcessor.class.getClassLoader().getResourceAsStream("config.properties")) {
            configProperties = new Properties();
            configProperties.load(infile);
            //infile.close(); // not explicitly called in examples of try-with-resources

        } catch(Exception e) {
            System.err.println("Exception attempting to read properties from config.properties.");
            logger.error("Exception attempting to read properties from config.properties.");
            e.printStackTrace();
        }

        if(configProperties.size() == 0) {
            System.err.println("*** Warning: no values read into config properties. Using defaults.");
        }

        MAX_WORD_LENGTH = Integer.parseInt(configProperties.getProperty("WETprocessor.max.word.length", "15"));
        MIN_NUM_WORDS = Integer.parseInt(configProperties.getProperty("WETprocessor.min.num.words", "20"));
        MAX_WORDS_CAMELCASE = Integer.parseInt(configProperties.getProperty("WETprocessor.max.words.camelcase", "10"));


        this.discardFolder = new File(outFolder, "discard");
        if(!discardFolder.exists()) {
            discardFolder.mkdir();
        }
        this.keepFolder = new File(outFolder, "keep");
        if(!keepFolder.exists()) {
            keepFolder.mkdir();
        }

        this.greyListedFolder = new File(outFolder, "greylisted");
        if(!greyListedFolder.exists()) {
            greyListedFolder.mkdir();
        }

        this.keepURLsFile = new File(outFolder, "keepURLs.txt");
        if(keepURLsFile.exists() && !keepURLsFile.delete()) {
            throw new Exception("Warning: Unable to delete " + this.keepURLsFile + ". Unable to proceed.");
        }
        this.discardURLsFile = new File(outFolder, "discardURLs.txt");
        if(discardURLsFile.exists() && !discardURLsFile.delete()) {
            throw new Exception("Warning: Unable to delete " + discardURLsFile + ". Unable to proceed.");
        }
        this.greyListedFile = new File(outFolder, "greyListed.txt");
        if(greyListedFile.exists() && !greyListedFile.delete()) {
            throw new Exception("Warning: Unable to delete " + greyListedFile + ". Unable to proceed.");
        }

        // prepare our blacklist, greylist (for inspection) and whitelist
        System.err.println("Loading blacklist.");
        blackList = new HashMap<String, Integer>();
        initURLFilterList(blackList, "url-blacklist-filter.txt");

        System.err.println("Loading greylist.");
        greyList = new HashMap<String, Integer>();
        initURLFilterList(greyList, "url-greylist-filter.txt");

        System.err.println("Loading whitelist.");
        whiteList = new HashMap<String, Integer>();
        initURLFilterList(whiteList, "url-whitelist-filter.txt");

        // Create the map of topSites
        System.err.println("Loading map of topsites with regex of allowable url patterns for each topsite.");
        topSitesMap = new HashMap<String, String>();
        //File topSitesFile = new File(outFolder, "sites-too-big-to-exhaustively-crawl.txt");

        try (
            BufferedReader reader = new BufferedReader(new InputStreamReader(org.greenstone.atea.CCWETProcessor.class.getClassLoader().getResourceAsStream("sites-too-big-to-exhaustively-crawl.txt"), "UTF-8"));
        ) {

            String str = null;
            while((str = reader.readLine()) != null) {
                str = str.trim();
                if(str.equals("") || str.startsWith("#")) {
                    continue;
                }

                int tabindex = str.indexOf("\t");
                if(tabindex == -1) {
                    topSitesMap.put(str, "");
                } else {
                    String topsite = str.substring(0, tabindex).trim();
                    String allowed_url_pattern = str.substring(tabindex+1).trim();
                    topSitesMap.put(topsite, allowed_url_pattern);
                }
            }
        } catch (IOException ioe) {
            ioe.printStackTrace();
            System.err.println("\n@@@@@@@@@ Error reading in from top sites file conf/sites-too-big-to-exhaustively-crawl.txt");
        }

        //System.err.println("Prematurely terminating for testing purposes.");
        //System.exit(-1);
    }

    /** Work out the 'domain' for a given url.
     * This retains any www. or subdomain prefix.
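     *
     * For example (illustrative): given "https://www.example.com/a/b.html", this returns
     * "https://www.example.com" when withProtocol is true, or "www.example.com" when false.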
     */
    private String getDomainForURL(String url, boolean withProtocol) {
        int startIndex = url.indexOf("//"); // for http:// or https:// prefix
        startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
        // keep the protocol portion around in case param withProtocol=true
        String protocol = (startIndex == -1) ? "" : url.substring(0, startIndex);

        String domain = url.substring(startIndex);
        int endIndex = domain.indexOf("/");
        if(endIndex == -1) endIndex = domain.length();
        domain = domain.substring(0, endIndex);

        if(withProtocol) {
            // now that we have the domain (everything to the first / when there is no protocol)
            // we can glue the protocol back on
            domain = protocol + domain;
        }

        return domain;
    }

    /** Utility function to help escape regex characters in a URL to go into regex-urlfilter.txt */
    private String escapeStringForRegex(String str) {
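        // e.g. (illustrative) escapeStringForRegex("nutch.apache.org") would return "nutch\.apache\.org"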
        for(int i = 0; i < ESCAPE_CHARS_FOR_RE.length(); i++) {
            char c = ESCAPE_CHARS_FOR_RE.charAt(i);
            str = str.replace(Character.toString(c), "\\"+c);
        }
        return str;
    }

    /**
     * Using the keepURLs.txt file generated by running WETProcessor instances, this produces
     * as output the URL seed list and regex-urlfilter text files required by nutch, see
     * https://cwiki.apache.org/confluence/display/nutch/NutchTutorial
     */
    public void createSeedURLsFiles(File seedURLsFile, File urlFilterFile,
                                    File domainURLsFile, File topSiteMatchesFile) {
        // Maintain a Map of unique domains mapped to seed urls at that domain
        // TreeSet: by default, "the elements are ordered using their natural ordering"
        // (or by a Comparator provided at set creation time).
        // Whereas HashSet doesn't guarantee ordering.
        // So we get alphabetic sorting for free. And guaranteed log(n) for basic operations.
        // Would be a similar distinction for Maps.
        domainsToURLsMap = new TreeMap<String, Set<String>>();

        final String PROTOCOL_REGEX_PREFIX = "+^https?://";
        final String FILTER_REGEX_PREFIX = PROTOCOL_REGEX_PREFIX + "([a-z0-9-]+\\.)*"; // https?://([a-z0-9-]+\.)* for nutch's regex-urlfilter.txt

        try (
            BufferedReader reader = new BufferedReader(new FileReader(this.keepURLsFile));
        ) {

            // read a URL at a time from urlsFile
            String url = null;
            String domainWithProtocol = null;
            while((url = reader.readLine()) != null) { // readLine removes newline separator

                // work out domain. This retains any www. or subdomain prefix
                // passing true to further also retain the http(s) protocol
                domainWithProtocol = getDomainForURL(url, true);

                Set<String> urlsSet;
                if(!domainsToURLsMap.containsKey(domainWithProtocol)) {
                    urlsSet = new TreeSet<String>();
                    urlsSet.add(url);
                    domainsToURLsMap.put(domainWithProtocol, urlsSet);
                } else {
                    urlsSet = domainsToURLsMap.get(domainWithProtocol);
                    urlsSet.add(url);
                }

            }
        } catch (IOException ioe) {
            ioe.printStackTrace();
            System.err.println("\n@@@@@@@@@ Error reading in urls from file " + this.keepURLsFile);
        }

        // We'd have pruned out duplicates by now and have a sorted list of domains,
        // each of which maps to seed URLs in the commoncrawl for that domain

        int domainCount = 0;
        File sitesFolder = new File(outputFolder, "sites");
        if(!sitesFolder.exists()) {
            sitesFolder.mkdir();
        }
        final String FORMATSTR = "%05d";

        // write out each domain followed in sequence by all urls we found in that domain
        // (urls with a tab up front)
        try (
            // global lists of all domains, seedURLs and regex-urlfilters across all wet files of all commoncrawls
            // Also a global file listing any urls that matched top sites that didn't specify
            // allowed regex patterns
            BufferedWriter domainURLsWriter = new BufferedWriter(new FileWriter(domainURLsFile));
            BufferedWriter seedURLsWriter = new BufferedWriter(new FileWriter(seedURLsFile));
            BufferedWriter urlFilterWriter = new BufferedWriter(new FileWriter(urlFilterFile));
            BufferedWriter topSiteMatchesWriter = new BufferedWriter(new FileWriter(topSiteMatchesFile))
        ) {

            // initialise topSiteMatchesFile with some instructional text.
            topSiteMatchesWriter.write("The following domains with seedURLs are on a major/top 500 site\n");
            topSiteMatchesWriter.write("for which no allowed URL pattern regex has been specified.\n");
            topSiteMatchesWriter.write("Specify one for each such domain in the tab-separated sites-too-big-to-exhaustively-crawl.txt file\n");

            //Set<Map.Entry<String, Set<String>>> domainsSet = domainsToURLsMap.keySet();
            Set<String> domainsSet = domainsToURLsMap.keySet();
            Iterator<String> domainIterator = domainsSet.iterator();

            /*
            // DEBUG
            String value = topSitesMap.get("wikipedia.org");
            if(value == null) {
                System.err.println("### wikipedia.org had null value");
            } else {
                System.err.println("### wikipedia.org had value: " + value);
            } // DEBUG
            */

            while(domainIterator.hasNext()) {
                String domainWithProtocol = domainIterator.next();
                int startIndex = domainWithProtocol.indexOf("//"); // http:// or https:// prefix
                startIndex = (startIndex == -1) ? 0 : (startIndex+2); // skip past the protocol's // portion
                String domain = domainWithProtocol.substring(startIndex);

                System.err.println("domain with protocol: " + domainWithProtocol);
                System.err.println("domain: " + domain);

                String allowedURLPatternRegex = isURLinTopSitesMap(domain);
                // If the domain is of a topsite for which no allowed URL pattern has been provided
                // in sites-too-big-to-exhaustively-crawl.txt,
                // then we don't know how to crawl the site. Warn the user by writing the affected
                // domain and seedURLs to the topSiteMatchesFile.
                if(allowedURLPatternRegex != null && allowedURLPatternRegex.equals("")) {

                    // topsite, but we don't (yet) know what portion can be crawled.
                    // Append the top site and url to a global/toplevel file that
                    // the user needs to check later and we're done with this domain as it
                    // won't go into any other file hereafter

                    Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
                    Iterator<String> urlIterator = urlsForDomainSet.iterator();
                    while(urlIterator.hasNext()) {
                        String url = urlIterator.next();
                        topSiteMatchesWriter.write("\t" + url + "\n");
                    }

                    continue; // done with this domain
                }

                // start counting the domains we're actually going to process
                domainCount++;

                String siteID = String.format(FORMATSTR, domainCount);
                File domainFolder = new File(sitesFolder, siteID);
                domainFolder.mkdir();

                // write out the domain
                //seedURLsWriter.write(domainWithProtocol + "\n");


                // for every domain, we need a sites/0000x/ folder, where x is the domain #, containing
                // its own INDIVIDUAL seedURLs.txt and regex-urlfilter.txt
                // We still have a global seedURLs.txt and regex-urlfilter.txt too.
                File siteSeedsFile = new File(domainFolder, "seedURLs.txt"); // e.g. sites/00001/seedURLs.txt
                File siteRegexFile = new File(domainFolder, "regex-urlfilter.txt"); // e.g. sites/00001/regex-urlfilter.txt
                try (
                    BufferedWriter siteURLsWriter = new BufferedWriter(new FileWriter(siteSeedsFile));
                    BufferedWriter siteRegexWriter = new BufferedWriter(new FileWriter(siteRegexFile));
                ) {

                    // write all sorted unique domains into the global domains file,
                    // using the domain without protocol since the global domains file is for
                    // informational purposes
                    domainURLsWriter.write(domain + "\n");

                    // Only write urls and no domain into the single global seedurls file,
                    // but write the domain and tabbed urls into the individual sites/0000#/seedURLs.txt
                    // files (and write the regexed domain into each sites/0000#/regex-urlfilter.txt).
                    // If we ever run nutch on a single seedURLs listing containing
                    // all seed pages to crawl sites from, the above two files will work for that.

                    if(allowedURLPatternRegex == null) { // entire site can be crawled
                        siteURLsWriter.write(domainWithProtocol + "\n");

                        // Write out filter in the following form for a site, e.g. for nutch.apache.org:
                        // nutch.apache.org => +^https?://([a-z0-9-]+\.)*nutch\.apache\.org/
                        String regexed_domain = FILTER_REGEX_PREFIX + escapeStringForRegex(domain) + "/";
                        //String regexed_domain = FILTER_REGEX_PREFIX + domain.replace(".", "\\.") + "/";
                        urlFilterWriter.write(regexed_domain + "\n"); // global file
                        siteRegexWriter.write(regexed_domain + "\n"); // site file
                    }
                    else { // domain belongs to a top site where only a portion of the site can be crawled

                        if(allowedURLPatternRegex.equals(SUBDOMAIN_COPY)) { // COPY existing domain as url-filter
                            siteURLsWriter.write(domainWithProtocol + "\n");
                            // e.g. pinky.blogspot.com will add a filter for pinky.blogspot.com
                            // and not for all of blogspot.com

                            String regexed_domain = PROTOCOL_REGEX_PREFIX + escapeStringForRegex(domain) + "/";
                            //String regexed_domain = PROTOCOL_REGEX_PREFIX + domain.replace(".", "\\.") + "/";
                            urlFilterWriter.write(regexed_domain + "\n");
                            siteRegexWriter.write(regexed_domain + "\n");

                        } else if(allowedURLPatternRegex.equals(SINGLEPAGE)) {
                            // don't write out the domain. We want individual pages
                            //DON'T DO THIS HERE: siteURLsWriter.write(domainWithProtocol + "\n");

                            // don't write out the domain as a regex expression url filter either;
                            // write out the individual seed urls for the domain instead
                            // since we will only be downloading the single page

                            Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
                            for(String urlInDomain : urlsForDomainSet) {
                                // don't append a slash to the end this time
                                String regexed_url = "+^" + escapeStringForRegex(urlInDomain);
                                //String regexed_url = "+^" + urlInDomain.replace(".", "\\.");
                                urlFilterWriter.write(regexed_url + "\n");
                                siteRegexWriter.write(regexed_url + "\n");
                            }
                        } else { // allowedURLPatternRegex is a url-form - convert to regex
                            if(!allowedURLPatternRegex.endsWith("/")) {
                                allowedURLPatternRegex += "/";
                            }
                            String regexed_pattern = PROTOCOL_REGEX_PREFIX + escapeStringForRegex(allowedURLPatternRegex);
                            //String regexed_pattern = PROTOCOL_REGEX_PREFIX + allowedURLPatternRegex.replace(".", "\\.");
                            siteURLsWriter.write(domainWithProtocol + "\n");
                            urlFilterWriter.write(regexed_pattern + "\n");
                            siteRegexWriter.write(regexed_pattern + "\n");

                        }
                    }


                    // next write out the urls for the domain into the sites/0000x/seedURLs.txt file
                    // also write into the global seeds file (with a tab prefixed to each?)
                    Set<String> urlsForDomainSet = domainsToURLsMap.get(domainWithProtocol);
                    Iterator<String> urlIterator = urlsForDomainSet.iterator();
                    while(urlIterator.hasNext()) {
                        String url = urlIterator.next();
                        seedURLsWriter.write(url + "\n"); // global seedURLs file
                        siteURLsWriter.write(url + "\n");
                    }
                } catch (IOException ioe) {
                    ioe.printStackTrace();
                    System.err.println("\n@@@@@@@@@ Error writing to one of:" + siteSeedsFile + " or " + siteRegexFile);
                }

            }

        } catch (IOException ioe) {
            ioe.printStackTrace();
            System.err.println("\n@@@@@@@@@ Error writing to one of: ");
            System.err.println("\t" + seedURLsFile);
            System.err.println("\t" + urlFilterFile);
            System.err.println("\t" + domainURLsFile);
            System.err.println("\t" + topSiteMatchesFile);
        }

        /*
        // BEGIN DEBUG
        System.err.println("@@@@ TopSitesMap contains: ");
        for(Map.Entry<String, String> entry : topSitesMap.entrySet()) {
            String topSite = entry.getKey();
            String urlPattern = entry.getValue();
            System.err.println(topSite + " - " + urlPattern);
        } // END DEBUG
        */
    }

    private String stripSubDomain(String url) {
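        // strips everything up to and including the first '.'
        // e.g. (illustrative) "mi.wikipedia.org" becomes "wikipedia.org"; applying it again gives "org"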
        int index = url.indexOf(".");
        if(index != -1) {
            url = url.substring(index+1);
        }
        return url;
    }


    /**
     * @return true when a seedURL's domain exactly matches a topsite such as blogspot.com,
     * with or without the www. prefix. This method tests for such a case, as it would be dangerous
     * to do a SUBDOMAIN-COPY on such a site and thereby crawl that entire domain.
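     *
     * For example (illustrative): isExactDomainMatch("www.blogspot.com", "blogspot.com") returns true,
     * while isExactDomainMatch("pinky.blogspot.com", "blogspot.com") returns false.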
     */
    private boolean isExactDomainMatch(String seedURLDomain, String domain) {
        // check for an exact match as-is
        if(seedURLDomain.equals(domain)) {
            return true;
        }

        // else check if with or without a www. prefix we have an exact match with domain
        if(seedURLDomain.startsWith("www.")) {
            if(seedURLDomain.substring(4).equals(domain)) {
                return true;
            }
        } else {
            if(domain.equals("www."+seedURLDomain)) {
                return true;
            }
        }

        return false;
    }


    /**
     * Check if the domain of the seedurl, either in its entirety or when stripped of
     * www/subdomains, is in the list of top sites.
     * If it is, and the given url matches the regex for that topsite, then add the url to the
     * whitelist and a regex disallowing the rest of the topsite to the url regex filter file.
     * @param fullSeedDomain: domain of the seedURL without the protocol. May include a www. prefix.
     * @return one of the following values:
     * - This function returns null if the seedURL's domain does not match any of the topsites.
     * - The empty String is returned if the seedURL's domain matched a topsite but no (allowed-
     * url-pattern) value was defined for it. The empty String is also returned if the seedURL's
     * domain exactly matched a topsite and had a value of SUBDOMAIN-COPY, because we still don't
     * want to blindly crawl a topsite (as would happen with SUBDOMAIN-COPY).
     * - A non-empty String is returned if the seedURL's domain matched a topsite and a value
     * was defined for it. (The value will be one of "SUBDOMAIN-COPY", "SINGLEPAGE" or an allowed
     * URL pattern.)
     */
    private String isURLinTopSitesMap(String fullSeedDomain) {
        boolean keepLooping = true;

        String domain = fullSeedDomain;

        // the domain parameter will have retained www or subdomains, but is stripped of the protocol

        // keep looping, stripping subdomains from the url and checking if it matches a topsite domain;
        // if it does, return the value for that topsite domain in the topSitesMap.
        // If there is no match at all, return null.
        do {

            String allowed_url_pattern = topSitesMap.get(domain);
            if(allowed_url_pattern != null) { // i.e. topSitesMap.containsKey(domain)
                // there's an entry for the URL in the topSitesMap
                System.err.println("##### A top site matches URL domain " + domain);

                // if we're dealing with SUBDOMAIN-COPY, then the fullSeedDomain, with or without
                // www prefix, should not exactly match the topSitesMap domain,
                // e.g. we don't want to crawl a seed URL with domain www.blogspot.com
                // despite it matching topsite blogspot.com with a value of SUBDOMAIN-COPY.

                if(allowed_url_pattern.equals(SUBDOMAIN_COPY) && isExactDomainMatch(fullSeedDomain, domain)) {
                    return ""; // means don't crawl site, write url into unprocessed-topsite-matches file
                }
                return allowed_url_pattern;
            }
            // else, no entry for the URL in the topSitesMap.
            // We're not done yet: strip the subdomain from the URL and check it against topSitesMap again

            String newDomain = stripSubDomain(domain);
            if(domain.equals(newDomain)) {
                keepLooping = false;
            } else {
                domain = newDomain;
            }
        } while(keepLooping);

        // url in its entirety or stripped of subdomains did not match any of the topsites
        return null;
    }

    private boolean isListedInFilterList(Map<String, Integer> filterListMap, String url) {
        //Set<Map.Entry<String,Integer>> entries = filterListMap.entrySet();
        //Iterator<Map.Entry<String, Integer>> i = entries.iterator();
        //while(i.hasNext()) {
        //    Map.Entry<String, Integer> entry = i.next();
        for(Map.Entry<String,Integer> entry : filterListMap.entrySet()) {
            String urlPattern = entry.getKey();
            Integer matchRule = entry.getValue();

            if(matchRule == LIST_ENTRY_CONTAINS && url.contains(urlPattern)) {
                return true;
            }
            else if(matchRule == LIST_ENTRY_STARTSWITH && url.startsWith(urlPattern)) {
                return true;
            }
            else if(matchRule == LIST_ENTRY_ENDSWITH && url.endsWith(urlPattern)) {
                return true;
            }
            else if(matchRule == LIST_ENTRY_MATCHES && url.equals(urlPattern)) {
                return true;
            }
            // else check the rest of the filter list against this url
            // before returning false to be certain it's not been listed in the filter list
        }

        return false;
    }

    /**
     * Returns true if the url or pattern is found in the blacklist file.
     * Note that if eventually the same url pattern is found in the greylist or whitelist too,
     * it won't get blacklisted after all. But that's not implemented here.
     */
    public boolean isBlacklisted(String url) {
        return isListedInFilterList(blackList, url);
    }

    /**
     * Returns true if the url or pattern is explicitly mentioned in the greylist file.
     * Will eventually take precedence over the same URL pattern being mentioned in the blacklist,
     * and will eventually be pre-empted if the pattern is also mentioned in the whitelist.
     */
    public boolean isGreylisted(String url) {
        // auto-translated product sites
        return isListedInFilterList(greyList, url);
    }

    /**
     * Returns true if the url or pattern is explicitly mentioned in the whitelist file.
     * Its mention in the whitelist moreover overrides any mention in the blacklist and greylist.
     */
    public boolean isWhitelisted(String url) {
        return isListedInFilterList(whiteList, url);
    }

    /**
     * Checks URL parameter against each line ("filter") of conf/url-black|grey|whitelist-filter.txt to decide
     * whether it is in the mentioned black|grey|white list.
     * Filters don't represent actual regex, just ^ and $ as start and end terminators.
     * By not having this method deal with actual regex for filters, this has the advantage that
     * we don't have to remember to escape or double escape each filter to turn it into a regex.
     */
    public void initURLFilterList(Map<String, Integer> list, String filterListFilename) {

        // if filterListFilename does not exist in the conf folder, just return
        if(org.greenstone.atea.CCWETProcessor.class.getClassLoader().getResource(filterListFilename) == null) {
            System.err.println(filterListFilename + " does not exist");
            return;
        }

        try (
            BufferedReader reader = new BufferedReader(new InputStreamReader(org.greenstone.atea.CCWETProcessor.class.getClassLoader().getResourceAsStream(filterListFilename), "UTF-8"));
        ) {
            String filter = null;
            while((filter = reader.readLine()) != null) {
                // skip comments and empty lines
                filter = filter.trim();
                if(filter.equals("") || filter.startsWith("#")) {
                    continue;
                }

                if(filter.startsWith("^") && filter.endsWith("$")) {
                    filter = filter.substring(1, filter.length()-1);
                    list.put(filter, LIST_ENTRY_MATCHES);
                }
                else if(filter.startsWith("^")) {
                    filter = filter.substring(1);
                    list.put(filter, LIST_ENTRY_STARTSWITH);
                    System.err.println("Match filter startswith: " + filter);
                }
                else if(filter.endsWith("$")) {
                    filter = filter.substring(0, filter.length()-1);
                    list.put(filter, LIST_ENTRY_ENDSWITH);
                }
                else {
                    list.put(filter, LIST_ENTRY_CONTAINS);
                }
                //System.err.println("Got filter: " + filter);
            }

        } catch (IOException ioe) {
            ioe.printStackTrace();
            System.err.println("\n@@@@@@@@@ Error reading into map from file " + filterListFilename);
        }

    }


    /** Maintain a count of all WET files processed. */
    public void setWETFileCount(int count) { this.wetFileCount = count; }

    /** Maintain a count of all WET records processed. */
    //public int getRecordCount() { return this.totalRecordCount; }
    //public void addToRecordCount(int count) { this.totalRecordCount += count; }
    public void setRecordCount(int count) { this.totalRecordCount = count; }

    public void processAllWETFilesOfCrawl(File ccrawlWETFileDir) {

        // Will list all the warc.wet files in the input directory or else their gzipped versions
        File[] WETFiles = ccrawlWETFileDir.listFiles(new WETFilenameFilter());

        int wetRecordCount = 0;
        int wetFileCount = 0;

        for(int i = 0; i < WETFiles.length; i++) {
            File WETFile = WETFiles[i];
            logger.debug("Processing WETfile: " + WETFile);

            // Any .gz files listed means they haven't been unzipped yet. So unzip.
            String WETFilename = WETFile.toString();
            if(WETFilename.endsWith(".gz")) {
                File GZippedWETFile = WETFile;
                String WETGZippedFilename = WETFilename;
                WETFilename = WETFilename.substring(0, WETFilename.lastIndexOf(".gz"));

                WETFile = new File(WETFilename);
                Utility.unzipFile(GZippedWETFile, WETFile);
            }
            // hereafter all WETFiles should refer to the unzipped version
            // Check the unzipped WETFile exists

            if(!WETFile.exists() || !WETFile.isFile()) {
                System.err.println("Error: " + WETFile + " does not exist (failure to unzip?)");
                logger.error("Error: " + WETFile + " does not exist (failure to unzip?)");
                return;
            }

            // Finally, we can process this WETFile's records into the keep and discard pile
            wetFileCount++;
            logger.debug("Off to process " + WETFile);
            String crawlID = ccrawlWETFileDir.getName(); // something like CC-MAIN-YYYY-##-wet-files
            crawlID = crawlID.substring("CC-MAIN-".length(), crawlID.indexOf("-wet-files")); // YYYY-##
            WETProcessor wetFileProcessor = new WETProcessor(WETFile, crawlID, this);
            wetFileProcessor.processWETFile();
            wetRecordCount += wetFileProcessor.getRecordCount();
        }

        // for information purposes
        this.setWETFileCount(wetFileCount);
        this.setRecordCount(wetRecordCount);
    }


    // --------------- STATIC METHODS AND INNER CLASSES USED BY MAIN -------------- //
    public static void printUsage() {
        System.err.println("Run this program as:");
        System.err.println("\tCCWETProcessor <folder containing wet(.gz) files> <output folder path>");
    }

    /** Filename filter to only list warc.wet files or else warc.wet.gz files
     * for which unzipped warc.wet equivalents don't yet exist.
     */
    private static class WETFilenameFilter implements FilenameFilter {

        public boolean accept(File dir, String name) {
            if(name.endsWith(".warc.wet")) {
                logger.debug("Will include " + name + " for processing.");
                return true;
            }

            if(name.endsWith(".warc.wet.gz")) {
                String nameWithoutGZext = name.substring(0, name.lastIndexOf(".gz"));
                File unzippedVersion = new File(dir, nameWithoutGZext);
                if(unzippedVersion.exists()) {
                    logger.debug("--- Unzipped version " + unzippedVersion + " exists.");
                    logger.debug("Skipping " + name);
                    return false; // don't count gzipped version if unzipped version exists.
                }
                else {
                    logger.debug("Only zipped version " + name + " exists.");
                    return true; // No unzipped version, so have to work with gzipped version
                }
            }

            // we're not even interested in any other file extensions
            logger.debug("Not a WET file. Skipping " + name);
            return false;
        }
    }


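    /** Filename filter to list only per-crawl folders, i.e. folders named like
     * CC-MAIN-YYYY-##-wet-files (the pattern matched by the regex in accept() below).
     */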
    private static class CCrawlWETFolderFilenameFilter implements FilenameFilter {

        public boolean accept(File dir, String name) {
            File f = new File(dir, name);
            if(f.isDirectory()) {
                if(name.matches("CC-MAIN-\\d{4}-\\d{2}-wet-files")) {
                    return true;
                }
            }
            else {
                System.err.println("File " + f + " is not a directory");
            }
            return false;
        }
    }

    public static void main(String[] args) {
        if(args.length != 2) {
            printUsage();
            return;
        }

        File commoncrawlDir = new File(args[0]);
        if(!commoncrawlDir.exists() || !commoncrawlDir.isDirectory()) {
            System.out.println("Error: " + args[0] + " does not exist or is not a directory");
            return;
        }

        File outFolder = new File(args[1]);
        if(!outFolder.exists() || !outFolder.isDirectory()) {
            System.out.println("Error: " + args[1] + " does not exist or is not a directory.");
            return;
        }

        try {
            CCWETProcessor ccWETFilesProcessor = new CCWETProcessor(commoncrawlDir, outFolder);

            File[] ccrawlFolders = commoncrawlDir.listFiles(new CCrawlWETFolderFilenameFilter());

            for(int i = 0; i < ccrawlFolders.length; i++) {
                File ccrawlFolder = ccrawlFolders[i];
                System.err.println("About to process commoncrawl WET files folder: " + ccrawlFolder);
                ccWETFilesProcessor.processAllWETFilesOfCrawl(ccrawlFolder);
            }


            // create the global files of all domains, seedURLs and regex-urlfilters across all wet files of all commoncrawls.
            // The former is the only unique one. seedURLs and regex-urlfilters are
            // repeated on a per site/domain basis too, stored in the sites folder
            File seedURLsFile = new File(outFolder, "seedURLs.txt");
            File urlFilterFile = new File(outFolder, "regex-urlfilter.txt");
            File domainURLsFile = new File(outFolder, "all-domain-urls.txt");
            File topSitesMatchedFile = new File(outFolder, "unprocessed-topsite-matches.txt");

            ccWETFilesProcessor.createSeedURLsFiles(seedURLsFile, urlFilterFile, domainURLsFile, topSitesMatchedFile);

            System.out.println("\n*** Inspect urls in greylist at " + ccWETFilesProcessor.greyListedFile + "\n");

            System.out.println("\n*** Check " + topSitesMatchedFile + " for sites not prepared for crawling because they matched top sites for which no regex of allowed url patterns was specified in sites-too-big-to-exhaustively-crawl.txt.\n");


        } catch(Exception e) {
            // can get an exception when instantiating the CCWETProcessor instance
            e.printStackTrace();
            System.err.println(e.getMessage());
        }

        return;

    }
}