CommonCrawl.txt

Timestamp: 2019-09-05T17:26:27+12:00
Author: ak19
Message: Link to discussion on how to convert WARC to WET
File: 1 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt (modified) (1 diff)

gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

-              r33440
+              r33456
+WARC, WET, WAT FILES
+https://pypi.org/project/warc3-wet/
+https://gist.github.com/Smerity/afe7430fdb4371015466
+https://github.com/commoncrawl/commoncrawl/issues/11
+https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
+Sebastian Nagel
+/07/2017
+Hi,
+unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
+But it's easy to run the WET extractor on the WARC files, see:
+  https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
+  https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
+That's what you have to do:
+# download the WARC files and place them in a directory "warc/"
+# create sibling folders wat and wet
+# |
+# |-- warc/
+# |   |-- CC-NEWS-20161001224340-00008.warc.gz
+# |   |-- CC-NEWS-20161017145313-00000.warc.gz
+# |   `-- ...
+# |
+# |-- wat/
+# |
+# `-- wet/
+git clone https://github.com/commoncrawl/ia-web-commons
+cd ia-web-commons
+mvn install
+cd ..
+git clone https://github.com/commoncrawl/ia-hadoop-tools
+cd ia-hadoop-tools
+mvn package
+java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
+   -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
+The folders wat/ and wet/ will then contain the exports.
+Best,
+Sebastian
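The directory setup and the WEATGenerator invocation described in the message above could be wrapped in a couple of shell helpers. This is a minimal sketch and not part of the original message: the function names prepare_layout and print_weat_command are hypothetical, and the jar path, batch id, and base folder are supplied by the caller. The second helper only prints the command, so it can be inspected before anything is run.

```shell
#!/bin/sh
# Sketch only: helper names and argument conventions are assumptions.

# Create the sibling folders wat/ and wet/ next to an existing warc/ folder.
# $1 = base folder that already contains warc/
prepare_layout() {
    base=$1
    if [ ! -d "$base/warc" ]; then
        echo "missing $base/warc (download the WARC files there first)" >&2
        return 1
    fi
    mkdir -p "$base/wat" "$base/wet"
}

# Print (do not run) the WEATGenerator call quoted in the message.
# $1 = jar built by `mvn package`, $2 = batch id, $3 = base folder
print_weat_command() {
    printf 'java -jar %s WEATGenerator -strictMode -skipExisting %s %s/warc/*.warc.gz\n' \
        "$1" "$2" "$3"
}
```

For example, `prepare_layout /data/cc-news` followed by `print_weat_command "$PWD/target/ia-hadoop-tools-jar-with-dependencies.jar" batch-id-xyz /data/cc-news` would create the two output folders and show the command to run; the /data/cc-news path is a placeholder.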
+=======================
+Latest version of the index's schema:
+https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
 At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says


Changeset 33456 for gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
