other.txt@ 33404

Last change on this file since 33404 was 33404, checked in by ak19, 5 years ago
1. Links to other Java ways of extracting text from web content. 2. Getting at Common Crawl WET files
File size: 1.4 KB

https://codereview.stackexchange.com/questions/198343/crawl-and-gather-all-the-urls-recursively-in-a-domain
http://lucene.472066.n3.nabble.com/Using-nutch-just-for-the-crawler-fetcher-td611918.html

https://www.quora.com/What-are-some-Web-crawler-tips-to-avoid-crawler-traps


http://www.basicsbehind.com/extract-text-webpage/
http://www.l3s.de/~kohlschuetter/publications/wsdm187-kohlschuetter.pdf
https://stackoverflow.com/questions/16649535/access-a-common-crawl-aws-public-dataset/25297965#25297965


http://commoncrawl.org/2019/07/june-2019-crawl-archive-now-available/
            File List                       #Files   Total Size Compressed (TiB)
WET files   CC-MAIN-2019-26/wet.paths.gz    56000    7.59
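
The wet.paths.gz listing above is a gzip-compressed text file with one relative path per line, so turning it into download URLs might look roughly like the sketch below. The prefix `https://data.commoncrawl.org/` and the function name `wet_urls` are assumptions for illustration and do not come from these notes; check the Common Crawl "Get Started" page for the current prefix.

```python
import gzip
import io

# Assumption: public HTTPS prefix for Common Crawl data; not stated in the notes.
BASE_URL = "https://data.commoncrawl.org/"


def wet_urls(paths_gz_bytes, base_url=BASE_URL):
    """Yield one download URL per non-empty line of a wet.paths.gz listing.

    paths_gz_bytes is the raw (still compressed) content of the listing file.
    """
    with gzip.open(io.BytesIO(paths_gz_bytes), mode="rt", encoding="utf-8") as listing:
        for line in listing:
            path = line.strip()
            if path:  # skip blank lines
                yield base_url + path
```

The function works on bytes already in memory, so the download step (for example with `urllib.request.urlopen`) stays separate and the parsing can be tried offline. The listing itself is assumed to live at `crawl-data/CC-MAIN-2019-26/wet.paths.gz` under the same prefix.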


http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
(Instructions)

https://gist.github.com/svemir/4207353

https://gist.github.com/Smerity/afe7430fdb4371015466

Extract just the text from Common Crawl WARC WET files

https://stackoverflow.com/tags/common-crawl/hot?filter=all

https://stackoverflow.com/questions/45920527/get-offset-and-length-of-a-subset-of-a-wat-archive-from-common-crawl-index-serve/46152773#46152773


"The Common Crawl index does not contain offsets into WAT and WET files. So, the only way is to search the whole WAT/WET file for the desired record/URL. Eventually, it would be possible to estimate the offset because the record order in WARC and WAT/WET files is the same."
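
The "search the whole WAT/WET file" approach from the quote can be sketched as a linear scan over the records. This is a minimal sketch using only the standard library, and it assumes the usual WARC layout: a `WARC/1.0` version line, `Name: value` headers including `WARC-Target-URI` and `Content-Length`, a blank line, then the payload. The function names `find_wet_record` and `find_in_wet_file` are hypothetical; a dedicated WARC library would be more robust.

```python
import gzip


def find_wet_record(stream, target_uri):
    """Scan records in order; return the text of the first record whose
    WARC-Target-URI equals target_uri, or None if no record matches.

    stream must be a binary file-like object over the uncompressed data.
    """
    while True:
        line = stream.readline()
        if not line:  # end of file, nothing matched
            return None
        if not line.startswith(b"WARC/"):  # blank separator between records
            continue
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):  # blank line ends the headers
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        # Read the payload by length so its content is never parsed as headers.
        body = stream.read(int(headers.get("content-length", 0)))
        if headers.get("warc-target-uri") == target_uri:
            return body.decode("utf-8", "replace")


def find_in_wet_file(path, target_uri):
    """Convenience wrapper: open a local .warc.wet.gz file and scan it."""
    with gzip.open(path, "rb") as stream:
        return find_wet_record(stream, target_uri)
```

Every record before the match is read in full, so the cost grows with the position of the record in the file. That is the trade-off the quote describes when no offset is available.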

https://dmorgan.info/posts/common-crawl-python/
