Changeset 33456

Timestamp:

2019-09-05T17:26:27+12:00

Author:

ak19

Message:

Link to discussion on how to convert WARC to WET

Location:

gs3-extensions/maori-lang-detection/MoreReading

Files:

2 edited

CommonCrawl.txt (modified) (1 diff)
Vagrant-Spark-Hadoop.txt (modified) (2 diffs)

gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

-              r33440
+              r33456
+WARC, WET, WAT FILES
+https://pypi.org/project/warc3-wet/
+https://gist.github.com/Smerity/afe7430fdb4371015466
+https://github.com/commoncrawl/commoncrawl/issues/11
+https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
+Sebastian Nagel
+/07/2017
+Hi,
+unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
+But it's easy to run the WET extractor on the WARC files, see:
+  https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
+  https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
+That's what you have to do:
+# download the WARC files and place them in a directory "warc/"
+# create sibling folders wat and wet
+# |
+# |-- warc/
+# |   |-- CC-NEWS-20161001224340-00008.warc.gz
+# |   |-- CC-NEWS-20161017145313-00000.warc.gz
+# |   `-- ...
+# |
+# |-- wat/
+# |
+# `-- wet/
+git clone https://github.com/commoncrawl/ia-web-commons
+cd ia-web-commons
+mvn install
+cd ..
+git clone https://github.com/commoncrawl/ia-hadoop-tools
+cd ia-hadoop-tools
+mvn package
+java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
+   -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
+The folders wat/ and wet/ will then contain the exports.
+Best,
+Sebastian
+=======================
+Latest version of the index's schema:
+https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
 At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says
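
The folder layout described in Sebastian Nagel's message above (a warc/ directory with sibling wat/ and wet/ folders) can be prepared with a small script. This is only a sketch: the function names prepare_layout and count_warcs and the base-directory argument are hypothetical and do not appear in the original notes. The WEATGenerator invocation itself is left exactly as given in the message.

```shell
#!/bin/sh
# Sketch only: prepare_layout and count_warcs are hypothetical helper names.

# Create warc/ plus the sibling wat/ and wet/ folders under a base directory.
prepare_layout() {
    mkdir -p "$1/warc" "$1/wat" "$1/wet"
}

# Count the *.warc.gz files placed in <base>/warc (prints 0 if there are none).
count_warcs() {
    n=0
    for f in "$1"/warc/*.warc.gz; do
        # If the glob matched nothing, $f is the literal pattern and -e fails.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "$n"
}
```

A possible use is to call prepare_layout on the working directory, copy the downloaded CC-NEWS WARC files into warc/, and check that count_warcs reports a non-zero number before running the generator.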

gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

-              r33448
+              r33456
+Hadoop/Map-reduce
+https://www.guru99.com/create-your-first-hadoop-program.html
+--------------
 To run Firefox or any other graphical program inside the VM managed by vagrant, ssh with -Y first onto analytics and then from analytics onto the vagrant VM:
 . ssh analytics -Y
 …
 vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
+vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
+vagrant@node1:~/cc-index-table$ less file.csv.gz
+https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox
+When not using LIKE '%mri%' but = 'mri' instead:
+vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
 -----------------------------------------
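
The notes above concatenate the HDFS part files into a single file.csv.gz. Assuming those parts are gzip-compressed (the notes do not say so explicitly; the .gz name and the separate "unzipped" directory suggest it), the concatenation is itself a valid gzip stream, so the rows can be counted locally without unpacking to disk. The function name count_rows is hypothetical.

```shell
#!/bin/sh
# Sketch only: count_rows is a hypothetical helper name.
# Assumption: the input is gzip data, possibly several gzip members
# concatenated back to back (as produced by cat-ing gzipped part files).

# Print the number of lines in a (possibly multi-member) gzip file.
count_rows() {
    gzip -dc "$1" | wc -l | tr -d ' '
}
```

For example, count_rows file.csv.gz gives a local line count comparable to the "hdfs dfs -cat ... | wc -l" counts in the notes. If less shows compressed bytes for such a file, piping through gzip -dc first (gzip -dc file.csv.gz | less) is one alternative.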
