https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps


http://www.basicsbehind.com/extract-text-webpage/
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965


http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
             File List                      #Files   Total Size Compressed (TiB)
WET files    CC-MAIN-2019-26/wet.paths.gz   56000    7.59
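
A minimal sketch of how the path listing above might be used. It assumes wet.paths.gz holds one relative path per line, and that the download prefix is the same https://commoncrawl.s3.amazonaws.com/ host that appears in the wget example in these notes (the host may have changed since). The name wet_urls_from_paths is made up for illustration.

```python
import gzip

# Assumption: same public HTTPS prefix as the wget example in these notes.
BASE_URL = "https://commoncrawl.s3.amazonaws.com/"


def wet_urls_from_paths(paths_gz_bytes):
    """Turn the gzipped bytes of a wet.paths.gz listing into full download URLs."""
    text = gzip.decompress(paths_gz_bytes).decode("utf-8")
    # One relative path per line; skip any empty lines.
    return [BASE_URL + line.strip() for line in text.splitlines() if line.strip()]
```

Given the bytes of a downloaded wet.paths.gz, this should return one URL per WET file, each of which can be fetched the same way as the single WET file in the example further down.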

http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
(Instructions)

https://gist.github.com/svemir/4207353
(Hadoop related) A Common Crawl Experiment

https://gist.github.com/Smerity/afe7430fdb4371015466

Extract just the text from Common Crawl WARC WET files

https://stackoverflow.com/tags/common-crawl/hot?filter=all

https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773

"The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."
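
A rough sketch of the "search the whole file" approach the quote describes, using only the Python standard library. It assumes each WET record starts with a WARC/ version line, carries WARC-Target-URI and Content-Length headers, and is followed by blank separator lines. The function names iter_wet_records and find_record are hypothetical; a dedicated WARC library may be more robust.

```python
import gzip


def iter_wet_records(stream):
    """Yield (headers, body_bytes) for each record in a WARC/WET byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # blank separator lines between records
        headers = {}
        for raw in iter(stream.readline, b""):
            raw = raw.strip()
            if not raw:
                break  # an empty line ends the header block
            name, _, value = raw.partition(b":")
            headers[name.strip().decode("utf-8", "replace")] = (
                value.strip().decode("utf-8", "replace")
            )
        # Read the body by length so text that looks like a header is not misparsed.
        length = int(headers.get("Content-Length", "0"))
        yield headers, stream.read(length)


def find_record(stream, target_url):
    """Linear scan: return the text of the first record for target_url, or None."""
    for headers, body in iter_wet_records(stream):
        if headers.get("WARC-Target-URI") == target_url:
            return body.decode("utf-8", "replace")
    return None


# Usage (the path and URL here are placeholders):
# with gzip.open("some-file.warc.wet.gz", "rb") as f:
#     text = find_record(f, "http://example.org/")
```

The scan is linear in the file size, which is the cost of having no offsets in the index for WAT/WET files.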

https://dmorgan.info/posts/common-crawl-python/
https://groups.google.com/forum/#!topic/common-crawl/pdI3w09AAbQ

Example:
WARC:
tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/crawldiagnostics/CC-MAIN-20190719115720-20190719141720-00077.warc.gz
WET:
tikauka:[142]/Scratch/anupama/maori-lang-detection>wget https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/segments/1563195526237.47/wet/CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
tikauka:[142]/Scratch/anupama/maori-lang-detection>gunzip CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz
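
For scripting the same two steps, here is a rough Python equivalent of the wget and gunzip commands above, using only the standard library. The helper names download and gunzip are made up for this sketch, and unlike the gunzip command the second helper keeps the compressed original.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path


def download(url, dest_dir="."):
    """Save url into dest_dir under its own file name, like a bare wget call."""
    dest = Path(dest_dir) / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # streams in chunks, not all in memory
    return dest


def gunzip(src):
    """Decompress src (ending in .gz) next to itself and return the new path."""
    src = Path(src)
    dest = src.with_suffix("")  # drops only the final ".gz"
    with gzip.open(src, "rb") as compressed, open(dest, "wb") as out:
        shutil.copyfileobj(compressed, out)
    return dest


# Usage, with the WET URL from the example above:
# wet_gz = download("https://commoncrawl.s3.amazonaws.com/crawl-data/CC-MAIN-2019-30/"
#                   "segments/1563195526237.47/wet/"
#                   "CC-MAIN-20190719115720-20190719141720-00508.warc.wet.gz")
# wet = gunzip(wet_gz)
```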