To run firefox/anything graphical inside the VM run by vagrant, you have to ssh -Y first onto analytics and then from analytics onto the vagrant VM:
1. ssh analytics -Y
2. [anupama@analytics vagrant-hadoop-hive-spark]$ vagrant ssh -- -Y
or
vagrant ssh -- -Y node1
(the -- flag tells the vagrant command that the subsequent -Y flag should be passed to the ssh cmd that vagrant runs)

Only once ssh-ed with vagrant into the VM whose hostname is "node1" do you have access to node1's assigned IP: 10.211.55.101
- Machines connecting in, such as analytics, must either access node1 directly or use port forwarding to view the VM's servers on localhost. For example, on analytics you can view the Yarn pages at http://localhost:8088/
- If firefox is launched inside the VM (so inside node1), then you can access pages off their respective ports at any of localhost|10.211.55.101|node1.

WET example from https://github.com/commoncrawl/cc-warc-examples

vagrant@node1:~/cc-warc-examples$ hdfs dfs -mkdir /user/vagrant/data
vagrant@node1:~/cc-warc-examples$ hdfs dfs -put data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz hdfs:///user/vagrant/data/.
vagrant@node1:~/cc-warc-examples$ hdfs dfs -ls data
Found 1 items
-rw-r--r-- 1 vagrant supergroup 154823265 2019-08-19 08:23 data/CC-MAIN-20190715175205-20190715200159-00000.warc.wet.gz
vagrant@node1:~/cc-warc-examples$ hadoop jar target/cc-warc-examples-0.3-SNAPSHOT-jar-with-dependencies.jar org.commoncrawl.examples.mapreduce.WETWordCount

<ONCE FINISHED:>

vagrant@node1:~/cc-warc-examples$ hdfs dfs -cat /tmp/cc/part*


INFO ON HADOOP/HDFS:
https://www.bluedata.com/blog/2016/08/hadoop-hdfs-upgrades-painful/

---------------
More examples to try:
https://github.com/commoncrawl/cc-warc-examples

A bit outdated?
https://www.journaldev.com/20342/apache-spark-example-word-count-program-java
https://www.journaldev.com/20261/apache-spark

--------

sudo apt-get install maven
(or: sudo apt update
     sudo apt install maven)
git clone https://github.com/commoncrawl/cc-index-table.git
cd cc-index-table
mvn package
vagrant@node1:~/cc-index-table$ ./src/script/convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
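
Before querying it, it is worth checking what convert_url_index.sh actually wrote to HDFS. A minimal pyspark sketch, assuming the script's output landed as Parquet under the hdfs:///user/vagrant/cc-index-table path given above (if not, adjust the read call to the real path/format):

# Sketch: inspect the converted URL index table (assumed to be Parquet on HDFS).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-cc-index-table").getOrCreate()

# Path passed to convert_url_index.sh above; adjust if the script wrote elsewhere.
df = spark.read.parquet("hdfs:///user/vagrant/cc-index-table")

df.printSchema()              # which columns did the conversion produce?
print("rows:", df.count())    # one row per URL index entry
df.show(5, truncate=False)    # eyeball a few records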

spark:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/spark-shell.html

============
Dr Bainbridge found the following Vagrantfile, which will set up Hadoop and Spark, presumably for cluster computing:

https://github.com/martinprobson/vagrant-hadoop-hive-spark

Vagrant:
 * Guide: https://www.vagrantup.com/intro/getting-started/index.html
 * Common cmds: https://blog.ipswitch.com/5-vagrant-commands-you-need-to-know
 * vagrant reload = vagrant halt + vagrant up https://www.vagrantup.com/docs/cli/reload.html
 * https://stackoverflow.com/questions/46903623/how-to-use-firefox-ui-in-vagrant-box
 * https://stackoverflow.com/questions/22651399/how-to-install-firefox-in-precise64-vagrant-box
       sudo apt-get -y install firefox
 * vagrant install emacs: https://medium.com/@AnnaJS15/getting-started-with-virtualbox-and-vagrant-8d98aa271d2a

 * hadoop conf: sudo vi /usr/local/hadoop-2.7.6/etc/hadoop/mapred-site.xml
 * https://data-flair.training/forums/topic/mkdir-cannot-create-directory-data-name-node-is-in-safe-mode/
---
==> node1: Forwarding ports...
    node1: 8080 (guest) => 8081 (host) (adapter 1)
    node1: 8088 (guest) => 8089 (host) (adapter 1)
    node1: 9083 (guest) => 9084 (host) (adapter 1)
    node1: 4040 (guest) => 4041 (host) (adapter 1)
    node1: 18888 (guest) => 18889 (host) (adapter 1)
    node1: 16010 (guest) => 16011 (host) (adapter 1)
    node1: 22 (guest) => 2200 (host) (adapter 1)
==> node1: Running 'pre-boot' VM customizations...

==> node1: Checking for guest additions in VM...
    node1: The guest additions on this VM do not match the installed version of
    node1: VirtualBox! In most cases this is fine, but in rare cases it can
    node1: prevent things such as shared folders from working properly. If you see
    node1: shared folder errors, please make sure the guest additions within the
    node1: virtual machine match the version of VirtualBox you have installed on
    node1: your host and reload your VM.
    node1:
    node1: Guest Additions Version: 5.1.38
    node1: VirtualBox Version: 5.2

------------

At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says:
"The September crawl contains 500 million new URLs, not contained in any crawl archive before. New URLs stem from

    the continued seed donation of URLs from mixnode.com
    ..."

https://www.mixnode.com/
"The entire web, in your hands

Mixnode turns the web into a database that you can run queries against. Say goodbye to web crawling, forget about web scraping, never run a spider again: get all the web data that you need using simple SQL queries."

--------------
https://commoncrawl.github.io/cc-crawl-statistics/plots/languages
http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/

The JSON for the index files (that we downloaded for .nz) already contained a "languages" field. The above page mentions that this field shows the primary detected languages of the document (up to 3).

"Language Annotations

We now run the Compact Language Detector 2 (CLD2) on HTML pages to identify the language of a document. CLD2 is able to identify 160 different languages and up to 3 languages per document. The detected languages resp. the ISO-639-3 code are shown in the URL index as a new field, e.g., "languages": "zho,eng". The WARC metadata records contain the full CLD2 response including scores and text coverage:

languages-cld2: {"reliable":true,"text-bytes":3783,"languages":[{"code":"zh","code-iso-639-3":"zho","text-covered":0.93,"score":1943.0,"name":"Chinese"},{"code":"en","code-iso-639-3":"eng","text-covered":0.05,"score":523.0,"name":"ENGLISH"}]}

On github you'll find the Java bindings to the CLD2 native library and the distribution of the primary document languages as part of our crawl statistics.

Please note that the columnar index does not contain the detected languages for now."
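
Since each line of the downloaded cdx-*.gz index carries this "languages" field in its JSON part, the .nz index extract can be filtered for records whose detected languages include Maori. A minimal sketch, assuming each index line has the form "<SURT key> <timestamp> <JSON>", that "mri" is the ISO-639-3 code used for Maori, and using the nz-only-TLDs-from-237-238.txt extract produced further down in these notes.

# Minimal sketch: list index entries whose detected languages include Maori.
# Assumptions: index lines look like "<SURT key> <timestamp> <JSON>" and the
# "languages" field uses ISO-639-3 codes, so Maori appears as "mri".
import json

matches = []
with open("nz-only-TLDs-from-237-238.txt") as f:   # the .nz extract produced further below
    for line in f:
        try:
            _urlkey, _timestamp, payload = line.split(" ", 2)
            record = json.loads(payload)
        except ValueError:
            continue                                # skip any malformed lines
        langs = record.get("languages", "")
        if "mri" in langs.split(","):
            matches.append(record.get("url", ""))

print(len(matches), "URLs with Maori among the detected languages")
for url in matches[:10]:
    print(url)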

http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/
"the columnar index contains the content language of a web page as a new field. Please read the instructions below how to upgrade your tools to read newly added fields."

http://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/

SPARK (Spark SQL): https://github.com/commoncrawl/cc-index-table
    with example on selecting languages
https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cluster.idx

./convert_url_index.sh https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-30/indexes/cdx-00000.gz hdfs:///user/vagrant/cc-index-table
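
The "example on selecting languages" idea can be tried with Spark SQL against the table that convert_url_index.sh wrote above. A hedged sketch only: it assumes the table is Parquet at the HDFS path used above and has url, url_host_tld and content_languages columns as in the cc-index-table schema; run printSchema() first to confirm the real column names.

# Sketch: select .nz pages whose detected content language includes Maori ("mri").
# Assumes the Parquet table written by convert_url_index.sh above and the
# cc-index-table column names (url, url_host_tld, content_languages).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("select-languages").getOrCreate()

spark.read.parquet("hdfs:///user/vagrant/cc-index-table") \
     .createOrReplaceTempView("ccindex")

result = spark.sql("""
    SELECT url, content_languages
    FROM ccindex
    WHERE url_host_tld = 'nz'
      AND content_languages LIKE '%mri%'
""")

result.show(20, truncate=False)
result.write.csv("hdfs:///user/vagrant/nz-mri-urls", mode="overwrite")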

---

https://www.aclweb.org/anthology/L16-1443 (2016, as per https://pbn.nauka.gov.pl/sedno-webapp/getReport/38108)

https://dkpro.github.io/dkpro-c4corpus/
"DKPro C4CorpusTools is a collection of tools for processing CommonCrawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal."

https://zoidberg.ukp.informatik.tu-darmstadt.de/jenkins/job/DKPro%20C4Corpus/org.dkpro.c4corpus$dkpro-c4corpus-doc/doclinks/1/#_including_c4corpustools_in_your_java_projects
- Including C4CorpusTools in your Java projects
- Working with C4Corpus - Word count example

https://github.com/farhansiddiqui/webscale_nlp

https://github.com/commoncrawl/language-detection-cld2
---------
There's already python code for getting text:

https://spark-in.me/post/parsing-common-crawl-in-two-simple-commands
https://gist.github.com/Smerity/afe7430fdb4371015466

The spark-in.me post says:

"But it turns out - it is not. This can be attributed to the effort that has been made to make the CC more accessible. The killer feature for me was the presence of their index weighting only ~200Gb, that also features a language detection option, i.e. you do not need to analyze top-level-domains or do any significant data mining."

What does the "language detection option" mentioned above refer to?

------------
Skipping CrawlDiagnostics (see below) and robots.txt gz files:

http://commoncrawl.org/2018/08/august-2018-crawl-archive-now-available/

"HTTP 304 notmodified" responses are now stored as WARC revisit records in the "crawldiagnostics" subset along with 404s, redirects and other non-200 responses. For now the revisit records contain a payload digest although there is no payload sent together with HTTP 304 responses. The stupid reason is that the columnar index requires the digest field and we want to make sure that all tools continue to work as expected. The SHA-1 digest of an empty payload (zero bytes) is used for the revisit records.

http://iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/#revisit
‘revisit’
General

A ‘revisit’ record describes the revisitation of content already archived, and might include only an abbreviated content body which has to be interpreted relative to a previous record. Most typically, a ‘revisit’ record is used instead of a ‘response’ or ‘resource’ record to indicate that the content visited was either a complete or substantial duplicate of material previously archived.
...

-------

WET FILES:

https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965

http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
    File type    File list                       #Files    Total size compressed (TiB)
    WET files    CC-MAIN-2019-26/wet.paths.gz    56000     7.59
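
wet.paths.gz above is just a gzipped list of relative WET file paths, one per line; prefixing each with https://commoncrawl.s3.amazonaws.com/ gives a URL that can be fetched with wget. A small sketch of that, assuming the usual crawl-data/CC-MAIN-2019-26/wet.paths.gz location for this listing:

# Sketch: expand wet.paths.gz into full WET file URLs (location of the listing is assumed).
import gzip
import urllib.request

PREFIX = "https://commoncrawl.s3.amazonaws.com/"
paths_url = PREFIX + "crawl-data/CC-MAIN-2019-26/wet.paths.gz"

with urllib.request.urlopen(paths_url) as resp:
    listing = gzip.decompress(resp.read()).decode("utf-8")

wet_urls = [PREFIX + line for line in listing.splitlines() if line]
print(len(wet_urls), "WET files in this crawl")   # should match the #Files figure above
print(wet_urls[0])                                # first ...warc.wet.gz, fetchable with wget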

http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
(Instructions)

https://gist.github.com/svemir/4207353
(Hadoop related) A Common Crawl Experiment

https://gist.github.com/Smerity/afe7430fdb4371015466
    Extract just the text from Common Crawl WARC WET files

https://stackoverflow.com/tags/common-crawl/hot?filter=all

https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773

"The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."

https://dmorgan.info/posts/common-crawl-python/
https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ

Example:
WARC:
tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
WET:
tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
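
Once a WET file like the one above is on disk, the plain text can be pulled out programmatically instead of just gunzip-ing and eyeballing it. A minimal sketch, assuming the warcio package (pip install warcio); warcio reads the .gz directly, so the gunzip step above isn't needed for this. In WET files the extracted text sits in records of type 'conversion'.

# Sketch: print each page's URL and the start of its extracted text from a WET file.
# Assumes: pip install warcio, and the WET file wget-ed above (warcio handles the .gz as-is).
from warcio.archiveiterator import ArchiveIterator

wet_path = "CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz"

with open(wet_path, "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type != "conversion":   # WET text records have type 'conversion'
            continue
        url = record.rec_headers.get_header("WARC-Target-URI")
        text = record.content_stream().read().decode("utf-8", errors="replace")
        print(url)
        print(text[:200].replace("\n", " "), "...")
        print("-" * 60)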

--------------------------------------------
http://webdatacommons.org/

https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling

A comment on that article by Ran Geva (2017-04-09):

    Excellent article! CommonCrawl is an amazing resource. You should also check out webdatacommons.org that is using their data and extract structured data (using RDFa, Microdata..)

    If I may add a shameless plug here and tell you about Webhose.io [PAYWARE/SERVICES]. We provide an API to structured web data. The idea is the same as the one you presented. Instead of crawling the web, we already crawl millions of sites, download the data, structure and organize it so anyone can easily consume it and plug into their own system.

https://stackoverflow.com/questions/12097848/finding-all-domains-of-a-country

    -> http://urlsearch.commoncrawl.org/
    -> http://index.commoncrawl.org/
    -> INSTRUCTIONS: https://groups.google.com/forum/#!msg/common-crawl/3QmQjFA_3y4/vTbhGqIBBQAJ

Go to: http://index.commoncrawl.org/
Grab the newest crawl's gzipped index listing, open it, and find the cluster.idx file listed in it.
Copy its relative URL and prefix it with https://commoncrawl.s3.amazonaws.com/

THEN:

wharariki:[101]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>wget https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cluster.idx
    --2019-07-29 17:40:45--  https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cluster.idx
    Resolving commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)... 52.216.8.171
    Connecting to commoncrawl.s3.amazonaws.com (commoncrawl.s3.amazonaws.com)|52.216.8.171|:443... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 125059234 (119M) [binary/octet-stream]
    Saving to: ‘cluster.idx’

    cluster.idx              100%[============================================================>] 119.27M  8.51MB/s    in 15s

    2019-07-29 17:41:01 (7.83 MB/s) - ‘cluster.idx’ saved [125059234/125059234]

wharariki:[102]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>grep '^nz,' cluster.idx | cut -f2 | uniq
cdx-00237.gz
cdx-00238.gz

Prefix "https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/" to the listed gz files and wget them:

    https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cdx-00237.gz
    https://commoncrawl.s3.amazonaws.com/cc-index/collections/CC-MAIN-2019-26/indexes/cdx-00238.gz

Unzip those, and we have all URLs with TLD .nz:
    wharariki:[131]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>gunzip cdx-00237.gz
    wharariki:[132]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>gunzip cdx-00238.gz

The first of these files also contains Norwegian URLs (lines starting with "no,") and the second also contains lines starting with "org,".
So extract just the lines that start with "^nz," [https://www.unix.com/shell-programming-and-scripting/176608-how-copy-lines-starts-either-3-4-into-new-file.html]:

    wharariki:[107]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>egrep "^nz," cdx-00237 > nz-only-TLDs-from-237-238.txt
    wharariki:[108]/Scratch/ak19/heritrix/heritrix-3.4.0-SNAPSHOT>egrep "^nz," cdx-00238 >> nz-only-TLDs-from-237-238.txt

Checking that abacusinstitute.ac.nz is also in the current June 2019 list:
    egrep "ac,abacusinstitute" nz-only-TLDs-from-237-238.txt

OTHER:
https://www.tutorialspoint.com/hadoop/hadoop_mapreduce
http://stormcrawler.net/
http://storm.apache.org/getting-help.html

https://dzone.com/articles/need-billions-of-web-pages-dont-bother-crawling
Basically, each release is split into 100 segments. Each segment has three types of files: WARC, WAT, and WET. As explained on the Get Started page:

    WARC files store the raw crawl data.
    WAT files store computed metadata for the data stored in the WARC.
    WET files store extracted plaintext from the data stored in the WARC.

Note that WAT and WET are in the WARC format too! In fact, the WARC format is nothing more than an envelope with metadata and content. In the case of the WARC files, that content is the HTTP requests and responses, whereas, for the WET files, it is simply the plain text extracted from the WARCs. The WAT files contain a JSON representation of metadata extracted from the WARCs, e.g. title, links etc.


Resources

The Get Started page on the CommonCrawl website contains useful pointers to libraries and code in various programming languages to process the datasets. There is also a list of tutorials and presentations.

It is also worth noting that CommonCrawl provides an index per release, allowing you to search for URLs (including wildcards) and retrieve the segment and offset therein where the content of the URL is stored, e.g.:

    { "urlkey": "org,apache)/", "timestamp": "20170220105827", "status": "200", "url": "http://apache.org/", "filename": "crawl-data/CC-MAIN-2017-09/segments/1487501170521.30/warc/CC-MAIN-20170219104610-00206-ip-10-171-10-108.ec2.internal.warc.gz", "length": "13315", "mime": "text/html", "offset": "14131184", "digest": "KJREISJSKKGH6UX5FXGW46KROTC6MBEM" }
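
The "filename", "offset" and "length" fields in such an index entry are enough to pull just that one record out of the WARC file with an HTTP Range request, because each record is stored as its own gzip member. A minimal sketch using the example values above (standard-library Python only; this is a sketch of the usual technique, not code taken from any of the linked tools):

# Sketch: fetch a single WARC record using the filename/offset/length from the index entry above.
# Each Common Crawl WARC record is an independent gzip member, so a byte range decompresses cleanly.
import gzip
import urllib.request

PREFIX = "https://commoncrawl.s3.amazonaws.com/"
filename = ("crawl-data/CC-MAIN-2017-09/segments/1487501170521.30/warc/"
            "CC-MAIN-20170219104610-00206-ip-10-171-10-108.ec2.internal.warc.gz")
offset, length = 14131184, 13315          # values from the index record above

req = urllib.request.Request(
    PREFIX + filename,
    headers={"Range": f"bytes={offset}-{offset + length - 1}"})
with urllib.request.urlopen(req) as resp:
    compressed = resp.read()

record = gzip.decompress(compressed)      # WARC headers + HTTP headers + HTML payload
print(record.decode("utf-8", errors="replace")[:1000])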

This is useful but only if you are interested in a limited number of URLs which you know in advance. In many cases, what you know in advance is what you want to extract, not where it will be extracted from. For situations such as these, you will need distributed batch-processing using MapReduce in Apache Hadoop or Apache Spark.


https://www.forbes.com/sites/kalevleetaru/2017/09/28/common-crawl-and-unlocking-web-archives-for-research/#7067d4313b83
One large web archive has bucked this trend and stood alone among its peers: Common Crawl. Similar to other large web archiving initiatives like the Internet Archive, Common Crawl conducts regular web wide crawls of the open web and preserves all of the content it downloads in the standard WARC file format. Unlike many other archives, it focuses primarily on preserving HTML web pages and does not archive images, videos, JavaScript files, CSS stylesheets, etc. Its goal is not to preserve the exact look and feel of a website on a given snapshot in time, but rather to collect a vast cross section of HTML web pages from across the web in a single place to enable large-scale data mining at web scale.
...
The project excludes sites which have robots.txt exclusion policies, following the historical policy of many other web archives, though it is worth noting that the Internet Archive earlier this year began slowly phasing out its reliance on such files due to their detrimental effect on preservation completeness. Common Crawl also allows sites to request removal from their index. Other than these cases, Common Crawl attempts to crawl as much of the remaining web as possible, aiming for a representative sample of the open web.
...
Ms. Crouse [Director of Common Crawl] noted the risk-averse nature of the web archiving community as a whole (historically many adhered and still adhere to a strict “opt in” policy requiring prior approval before crawling a site) and the unwillingness of many archives to modernize their thinking on copyright and to engage more closely with the legal community in ways that could help them expand fair use horizons. In particular, she noted “since we [in the US] are beholden to the Copyright Act, while living in a digital age, many well-intentioned organizations devoted to web science, archiving, and information provision may benefit from a stronger understanding of how copyright is interpreted in present day, and its hard boundaries” and that “many talented legal advisers and groups are interested in the precedent-setting nature of this topic; some are willing to work Pro Bono.”
...
Returning to the difference between Common Crawl’s datasets and traditional preservation-focused web archiving, Ms. Crouse emphasized that they capture only HTML pages and exclude multimedia content like images, video and other dynamic content.

She noted that a key aspect of their approach to fair use is that web pages are intended for consumption by human beings one at a time using a web browser, while Common Crawl concatenates billions of pages together in the specialized WARC file format designed for machine data mining. Specifically, “Common Crawl does not offer separate/individual web pages for easy consumption. The three data formats that are provided include text, metadata, and raw data, and the data is concatenated” and “the format of the output is not a downloaded web page. The output is in WARC file format which contains the components of a page that are beneficial to machine-level analysis and make for space-efficient archiving (essentially: header, text, and some metadata).”

As Ms. Crouse put it, “this is big data intended for machine learning/readability. Further, our intention for its use is for public benefit i.e. to encourage research and innovation, not direct consumption.” She noted that “from the layperson’s perspective, it is not at all trivial at present to extract a specific website’s content (that is, text) from a Common Crawl dataset. This task generally requires one to know how to install and run a Hadoop cluster, among other things. This is not structured data. Further it is likely that not all pages of that website will be included (depending on the parameters for depth set for the specific crawl).” This means that “the bulk of [Common Crawl’s] users are from the noncommercial, educational, and research sectors. At a higher level, it’s important to note that we provide a broad and representative sample of the web, in the form of web crawl data, each month. No one really knows how big the web is, and at present, we limit our monthly data publication to approximately 3 billion pages.”

Common Crawl believes it addresses this through the fact that its archive represents only a sample of each website crawled, rather than striving for 100% coverage. Specifically, Ms. Crouse noted that “at present, [crawls are] in monthly increments that are discontinuous month-to-month. We do only what is reasonable, necessary, and economical to achieve a representative sample. For instance, we limit the number of pages crawled from any given domain so, for large content owners, it is highly probable that their content, if included in a certain month’s crawl data, is not wholly represented and thus not ideal for mining for comprehensive results … if the content owner is not a large site, or in a niche market, their URL is less likely to be included in the seeds in the frontier, and, since we limit depth (# of links followed) for the sake of both economy and broader representative web coverage, 'niche' content may not even appear in a given month’s dataset.”

To put it another way, Common Crawl’s mission is to create a “representative sample” of the web at large by crawling a sampling of pages and limiting the number of pages from each site they capture. Thus, their capture of any given site will represent a discontinuous sampling of pages that can change from month to month. A researcher wishing to analyze a single web site in its entirety would therefore not be able to turn to Common Crawl and would instead have to conduct their own crawl of the site or turn to a commercial aggregator that partners with the content holder to license the complete contents of the site.

In Common Crawl’s view this is a critical distinction that sets it apart from both traditional web archiving and the commercial content aggregators that generate data mining revenue for content owners. By focusing on creating a “representative sample” of the web at large, rather than attempting to capture a single site in its entirety (and in fact ensuring that it does not include more than a certain number of pages per site), the crawl self-limits itself to being applicable only to macro-level research examining web scale questions. Such “web scale” questions cannot be answered through any existing open dataset, and by incorporating specific design features Common Crawl ensures that more traditional research questions, like data mining the entirety of a single site, which might be viewed as redistribution of that site or competing with its owner’s ability to license its content for data mining, are simply not possible.
------------