Timestamp: 2019-09-05T17:26:27+12:00
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
r33440 r33456
WARC, WET, WAT FILES
https://pypi.org/project/warc3-wet/
https://gist.github.com/Smerity/afe7430fdb4371015466
https://github.com/commoncrawl/commoncrawl/issues/11

https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
Sebastian Nagel
05/07/2017
Hi,

unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.

But it's easy to run the WET extractor on the WARC files, see:
  https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
  https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion

That's what you have to do:

# download the WARC files and place them in a directory "warc/"
# create sibling folders wat and wet
# |
# |-- warc/
# |   |-- CC-NEWS-20161001224340-00008.warc.gz
# |   |-- CC-NEWS-20161017145313-00000.warc.gz
# |   `-- ...
# |
# |-- wat/
# |
# `-- wet/

git clone https://github.com/commoncrawl/ia-web-commons
cd ia-web-commons
mvn install

cd ..
git clone https://github.com/commoncrawl/ia-hadoop-tools
cd ia-hadoop-tools
mvn package

java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
  -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz

The folders wat/ and wet/ will then contain the exports.

Best,
Sebastian
=======================
Latest version of the index's schema:
https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html


At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says