source: gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt@ 33391

Last change on this file since 33391 was 33391, checked in by ak19, 5 years ago

Some rough bash scripting lines that work but aren't complete.

File size: 14.3 KB
NEXT PROBLEMS: prefixes to the basic domain should not be counted.
e.g. cs.waikato.ac.nz
and waikato.ac.nz should count as one?

    https://stackoverflow.com/questions/1915636/is-there-a-way-to-uniq-by-column
It's not enough to cut off http:// and then anything before the first ".", since some hosts won't have a prefix before the domain. How to detect which ones do and don't, and only attempt to remove the prefix from those URLs that have one?

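One hedged way to make prefixed hosts count as one (a rough sketch, not a complete answer: it assumes every host ends in a two-level suffix like ac.nz or co.nz, which fails for direct registrations such as 100health.nz; a proper solution would need a public-suffix list):

```shell
# Keep only the last three dot-separated labels of each host, so
# cs.waikato.ac.nz, www.waikato.ac.nz and waikato.ac.nz all collapse to
# waikato.ac.nz. ASSUMES two-level suffixes (ac.nz, co.nz, ...) throughout.
collapsed=$(printf 'cs.waikato.ac.nz\nwaikato.ac.nz\nwww.waikato.ac.nz\n' |
    awk -F. 'NF >= 3 { print $(NF-2) "." $(NF-1) "." $NF }' | sort -u)
echo "$collapsed"    # -> waikato.ac.nz
```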
* With -r, sed can leave out the \( escape for ( and the \? escape for the ? wildcard:

    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less

* Also want to get rid of the starting " and the ending ","
FINAL ONE:
    tikauka:[203]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | sed 's@^"@@' | sed 's@",$@@' | uniq | less

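The same normalisation written with extended regexes (sed -E, or -r on GNU sed), where ( ) and ? need no backslashes; the input here is an assumed sample of the quoted field that cut produces:

```shell
# Extended-regex version of the pipeline above: no \( \) \? escaping needed.
field='"https://www.waikato.ac.nz/study/",'
host=$(echo "$field" |
    sed -E 's@(https?://)(www\.)?([^/]*).*@http://\3@' |
    sed -e 's@\.$@@' -e 's@^"@@' -e 's@",$@@')
echo "$host"    # -> http://waikato.ac.nz
```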
uniq requires duplicates to be on consecutive lines in order to detect them.
So if there are 2 different lines followed by a 3rd line that duplicates the first, uniq won't detect it.
And this happens in our case because some URLs are http and some https, some have www and some don't. And Massey University's domain URL strangely ends with "." sometimes, though usually not.

    tikauka:[199]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | sed 's@\.$@@' | uniq | less

    tikauka:[194]/Scratch/anupama/maori-lang-detection>cat nz-only-TLDs-2019-08-xt | cut -d ' ' -f4 | sed 's@\(https\?://\)\(www\.\)\?\([^/]*\).*@http://\3@' | uniq | less

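Because uniq only collapses adjacent lines, sorting first (or just sort -u) catches the non-adjacent repeats; a minimal illustration with made-up input:

```shell
# Two "massey" lines separated by another line: uniq misses the repeat,
# sort -u does not.
printf 'http://massey.ac.nz\nhttp://waikato.ac.nz\nhttp://massey.ac.nz\n' > dup-demo.txt
n_uniq=$(uniq dup-demo.txt | wc -l)      # 3 lines: duplicates not adjacent
n_sortu=$(sort -u dup-demo.txt | wc -l)  # 2 lines: one per distinct URL
echo "$n_uniq $n_sortu"
```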
tikauka:[182]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\).*@\1@'
http://100health.nz


tikauka:[178]/Scratch/anupama/maori-lang-detection>echo "http://100health.nz/ProdList.asp?p=1&ClassID=196", | sed 's@\(https\?://[^/]*\)@boo@'
boo/ProdList.asp?p=1&ClassID=196,


maori-lang-detection>cat nz-only-TLDs-2019-08-07.txt | cut -d ' ' -f4 | less
    where
        cut -d ' ' -f4
    gets the 4th field (the URLs), where each field is separated by a space instead of the default tab delimiter.
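For example, on an assumed sample CDX index line (urlkey, timestamp, then a JSON blob), splitting on spaces makes the URL land in field 4, opening quote and trailing comma included, which is why the later sed steps strip a leading " and a trailing ",:

```shell
# Assumed sample CDX index line; space-split fields: 1=urlkey, 2=timestamp,
# 3='{"url":', 4=the quoted URL itself (with the comma from the JSON).
line='nz,ac,waikato)/ 20190620103648 {"url": "https://www.waikato.ac.nz/", "status": "200"}'
url=$(echo "$line" | cut -d ' ' -f4)
echo "$url"    # -> "https://www.waikato.ac.nz/",
```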


http://webdatacommons.org/

https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling

    Ran Geva 2017-04-09

    Excellent article! CommonCrawl is an amazing resource. You should also check out webdatacommons.org, which uses their data to extract structured data (using RDFa, Microdata..)

    If I may add a shameless plug here and tell you about Webhose.io [PAYWARE/SERVICES]. We provide an API to structured web data. The idea is the same as the one you presented. Instead of crawling the web, we already crawl millions of sites, download the data, structure and organize it so anyone can easily consume it and plug into their own system.


https://stackoverflow.com/questions/12097848/finding-all-domains-of-a-country

    -> http://urlsearch.commoncrawl.org/
    -> http://index.commoncrawl.org/
    -> INSTRUCTIONS: https://groups.google.com/forum/#!msg/common-crawl/3QmQjFA_3y4/vTbhGqIBBQAJ


Go to: http://index.commoncrawl.org/
Grab the newest gzipped archive file.
Then open it and find the cluster.idx file listed in it.
Copy its relative URL and prefix it with https://commoncrawl.s3.amazonaws.com/

THEN:

wharariki:[101]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>wget https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cluster.idx
    --2019-07-29 17:40:45-- https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cluster.idx
    Resolving commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)... 52.216.8.171
    Connecting to commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)|52.216.8.171|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 125059234 (119M) [binary/octet-stream]
    Saving to: ‘cluster.idx’

    cluster.idx 100%[============================================================>] 119.27M 8.51MB/s in 15s

    2019-07-29 17:41:01 (7.83 MB/s) - ‘cluster.idx’ saved [125059234/125059234]

wharariki:[102]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>grep '^nz,' cluster.idx | cut -f2 | uniq
cdx-00237.gz
cdx-00238.gz

Prefix "https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/" to the listed gz files and wget them:

    https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cdx-00237.gz
    https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cdx-00238.gz
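The two steps (grep the shard names out of cluster.idx, prefix, wget) can be scripted; a sketch using a tiny stand-in cluster.idx, with echo in place of the actual wget call:

```shell
# Stand-in for the real 119 MB cluster.idx (tab-separated; filename is field 2).
printf 'no,uio)/ 20190601\tcdx-00236.gz\nnz,ac,waikato)/ 20190620\tcdx-00237.gz\nnz,org,example)/ 20190621\tcdx-00238.gz\n' > demo-cluster.idx

PREFIX="https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/"
urls=$(grep '^nz,' demo-cluster.idx | cut -f2 | uniq | sed "s@^@${PREFIX}@")
echo "$urls"    # replace echo with: for u in $urls; do wget "$u"; done
```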


Unzip those, and we have all URLs with TLD .nz:
    wharariki:[131]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>gunzip cdx-00237.gz
    wharariki:[132]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>gunzip cdx-00238.gz

The first of these files also includes Norwegian TLDs (which start with "no,") and the second gz file also includes TLDs that start with "org,".
So extract just the lines that start with "^nz," [https://www.unix.com/shell-programming-and-scripting/176608-how-copy-lines-starts-either-3-4-into-new-file.html]:

    wharariki:[107]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>egrep "^nz," cdx-00237 > nz-only-TLDs-from-237-238.txt
    wharariki:[108]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>egrep "^nz," cdx-00238 >> nz-only-TLDs-from-237-238.txt


Checking that abacusinstitute.ac.nz is also in the current June 2019 list:
    egrep "ac,abacusinstitute" nz-only-TLDs-from-237-238.txt


OTHER:
https://www.tutorialspoint.com/hadoop/hadoop_mapreduce
http://stormcrawler.net/
http://storm.apache.org/getting-help.html


https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
Basically, each release is split into 100 segments. Each segment has three types of files: WARC, WAT, and WET. As explained on the Get Started page:

    WARC files store the raw crawl data.
    WAT files store computed metadata for the data stored in the WARC.
    WET files store extracted plaintext from the data stored in the WARC.

Note that WAT and WET are in the WARC format too! In fact, the WARC format is nothing more than an envelope with metadata and content. In the case of the WARC files, that content is the HTTP requests and responses, whereas for the WET files, it is simply the plain text extracted from the WARCs. The WAT files contain a JSON representation of metadata extracted from the WARCs, e.g. title, links, etc.
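Since WET files are just WARC envelopes around plain text, their per-record headers can be pulled out with ordinary line tools; a sketch on a toy record (not a real crawl file, and real files are gzipped, so you would zcat them first):

```shell
# Toy WET-style record, written locally so the commands are runnable.
cat > demo.wet <<'EOF'
WARC/1.0
WARC-Type: conversion
WARC-Target-URI: http://waikato.ac.nz/
Content-Type: text/plain

The extracted plain text of the page would appear here.
EOF
uri=$(grep '^WARC-Target-URI:' demo.wet | cut -d ' ' -f2)
echo "$uri"    # -> http://waikato.ac.nz/
```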


Resources

The Get Started page on the CommonCrawl website contains useful pointers to libraries and code in various programming languages to process the datasets. There is also a list of tutorials and presentations.

It is also worth noting that CommonCrawl provides an index per release, allowing you to search for URLs (including wildcards) and retrieve the segment and offset therein where the content of the URL is stored, e.g.:

    { "urlkey": "org,apache)/", "timestamp": "20170220105827", "status": "200", "url": "http://apache.org/", "filename": "crawl-data/CC-MAIN-2017-09/segments/1487501170521.30/warc/CC-MAIN-20170219104610-00206-ip-10-171-10-108.ec2.internal.warc.gz", "length": "13315", "mime": "text/html", "offset": "14131184", "digest": "KJREISJSKKGH6UX5FXGW46KROTC6MBEM" }
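The filename/offset/length triple in such a record is what lets you fetch a single page without downloading the whole segment: offset and length give the byte range of the gzipped record inside the named WARC file, commonly retrieved with an HTTP Range request. A sketch of computing that range from the record quoted above (plain sed instead of a JSON tool, so it runs anywhere):

```shell
# Pull offset and length out of the index record, then build the byte range
# a Range request would use (offset .. offset+length-1).
rec='{"offset": "14131184", "length": "13315"}'
offset=$(echo "$rec" | sed -E 's@.*"offset": "([0-9]+)".*@\1@')
length=$(echo "$rec" | sed -E 's@.*"length": "([0-9]+)".*@\1@')
range="${offset}-$((offset + length - 1))"
echo "$range"    # -> 14131184-14144498
# e.g. curl -s -r "$range" https://commoncrawl.s3.amazonaws.com/<filename> | gunzip
```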


This is useful, but only if you are interested in a limited number of URLs which you know in advance. In many cases, what you know in advance is what you want to extract, not where it will be extracted from. For situations such as these, you will need distributed batch processing using MapReduce in Apache Hadoop or Apache Spark.


https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/#7067d4313b83
One large web archive has bucked this trend and stood alone among its peers: Common Crawl. Similar to other large web archiving initiatives like the Internet Archive, Common Crawl conducts regular web-wide crawls of the open web and preserves all of the content it downloads in the standard WARC file format. Unlike many other archives, it focuses primarily on preserving HTML web pages and does not archive images, videos, JavaScript files, CSS stylesheets, etc. Its goal is not to preserve the exact look and feel of a website at a given snapshot in time, but rather to collect a vast cross section of HTML web pages from across the web in a single place to enable large-scale data mining at web scale.
...
The project excludes sites which have robots.txt exclusion policies, following the historical policy of many other web archives, though it is worth noting that the Internet Archive earlier this year began slowly phasing out its reliance on such files due to their detrimental effect on preservation completeness. Common Crawl also allows sites to request removal from their index. Other than these cases, Common Crawl attempts to crawl as much of the remaining web as possible, aiming for a representative sample of the open web.
...
Ms. Crouse [Director of Common Crawl] noted the risk-averse nature of the web archiving community as a whole (historically many adhered and still adhere to a strict “opt in” policy requiring prior approval before crawling a site) and the unwillingness of many archives to modernize their thinking on copyright and to engage more closely with the legal community in ways that could help them expand fair use horizons. In particular, she noted “since we [in the US] are beholden to the Copyright Act, while living in a digital age, many well-intentioned organizations devoted to web science, archiving, and information provision may benefit from a stronger understanding of how copyright is interpreted in present day, and its hard boundaries” and that “many talented legal advisers and groups are interested in the precedent-setting nature of this topic; some are willing to work Pro Bono.”
...
Returning to the difference between Common Crawl’s datasets and traditional preservation-focused web archiving, Ms. Crouse emphasized that they capture only HTML pages and exclude multimedia content like images, video and other dynamic content.

She noted that a key aspect of their approach to fair use is that web pages are intended for consumption by human beings one at a time using a web browser, while Common Crawl concatenates billions of pages together in the specialized WARC file format designed for machine data mining. Specifically, “Common Crawl does not offer separate/individual web pages for easy consumption. The three data formats that are provided include text, metadata, and raw data, and the data is concatenated” and “the format of the output is not a downloaded web page. The output is in WARC file format which contains the components of a page that are beneficial to machine-level analysis and make for space-efficient archiving (essentially: header, text, and some metadata).”

As Ms. Crouse put it, “this is big data intended for machine learning/readability. Further, our intention for its use is for public benefit i.e. to encourage research and innovation, not direct consumption.” She noted that “from the layperson’s perspective, it is not at all trivial at present to extract a specific website’s content (that is, text) from a Common Crawl dataset. This task generally requires one to know how to install and run a Hadoop cluster, among other things. This is not structured data. Further it is likely that not all pages of that website will be included (depending on the parameters for depth set for the specific crawl).” This means that “the bulk of [Common Crawl’s] users are from the noncommercial, educational, and research sectors. At a higher level, it’s important to note that we provide a broad and representative sample of the web, in the form of web crawl data, each month. No one really knows how big the web is, and at present, we limit our monthly data publication to approximately 3 billion pages.”


Common Crawl believes it addresses this through the fact that its archive represents only a sample of each website crawled, rather than striving for 100% coverage. Specifically, Ms. Crouse noted that “at present, [crawls are] in monthly increments that are discontinuous month-to-month. We do only what is reasonable, necessary, and economical to achieve a representative sample. For instance, we limit the number of pages crawled from any given domain so, for large content owners, it is highly probable that their content, if included in a certain month’s crawl data, is not wholly represented and thus not ideal for mining for comprehensive results ... if the content owner is not a large site, or in a niche market, their URL is less likely to be included in the seeds in the frontier, and, since we limit depth (# of links followed) for the sake of both economy and broader representative web coverage, 'niche' content may not even appear in a given month’s dataset.”

To put it another way, Common Crawl’s mission is to create a “representative sample” of the web at large by crawling a sampling of pages and limiting the number of pages from each site they capture. Thus, their capture of any given site will represent a discontinuous sampling of pages that can change from month to month. A researcher wishing to analyze a single web site in its entirety would therefore not be able to turn to Common Crawl and would instead have to conduct their own crawl of the site or turn to a commercial aggregator that partners with the content holder to license the complete contents of the site.

In Common Crawl’s view this is a critical distinction that sets it apart from both traditional web archiving and the commercial content aggregators that generate data mining revenue for content owners. By focusing on creating a “representative sample” of the web at large, rather than attempting to capture a single site in its entirety (and in fact ensuring that it does not include more than a certain number of pages per site), the crawl limits itself to being applicable only to macro-level research examining web-scale questions. Such “web scale” questions cannot be answered through any existing open dataset, and by incorporating specific design features Common Crawl ensures that more traditional research questions, like data mining the entirety of a single site, which might be viewed as redistribution of that site or competing with its owner’s ability to license its content for data mining, are simply not possible.
------------