Changeset 33456
- Timestamp: 2019-09-05T17:26:27+12:00
- Location: gs3-extensions/maori-lang-detection/MoreReading
- Files: 2 edited
gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt
r33440 r33456
+WARC, WET, WAT FILES
+https://pypi.org/project/warc3-wet/
+https://gist.github.com/Smerity/afe7430fdb4371015466
+https://github.com/commoncrawl/commoncrawl/issues/11
+
+https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
+Sebastian Nagel
+05/07/2017
+Hi,
+
+unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
+
+But it's easy to run the WET extractor on the WARC files, see:
+https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
+https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
+
+That's what you have to do:
+
+# download the WARC files and place them in a directory "warc/"
+# create sibling folders wat and wet
+# |
+# |-- warc/
+# |   |-- CC-NEWS-20161001224340-00008.warc.gz
+# |   |-- CC-NEWS-20161017145313-00000.warc.gz
+# |   `-- ...
+# |
+# |-- wat/
+# |
+# `-- wet/
+
+git clone https://github.com/commoncrawl/ia-web-commons
+cd ia-web-commons
+mvn install
+
+cd ..
+git clone https://github.com/commoncrawl/ia-hadoop-tools
+cd ia-hadoop-tools
+mvn package
+
+java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
+  -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
+
+The folders wat/ and wet/ will then contain the exports.
+
+Best,
+Sebastian
+=======================
+Latest version of the index's schema:
+https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
+
 
 At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says -
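The quoted steps describe the folder layout only in comments (warc/ for the input, with wat/ and wet/ as siblings). A minimal sketch of preparing that layout before running the generator, assuming a hypothetical working directory named `cc-news-work` (the notes do not name one). It only creates the folders and counts the WARC files present; it does not run the extractor.

```shell
#!/bin/sh
# Hypothetical working directory; the notes do not name one.
WORK_DIR="${WORK_DIR:-cc-news-work}"

# warc/ holds the downloaded input; wat/ and wet/ receive the exports.
mkdir -p "$WORK_DIR/warc" "$WORK_DIR/wat" "$WORK_DIR/wet"

# Count the WARC files that are ready to be processed.
warc_count=$(find "$WORK_DIR/warc" -name '*.warc.gz' | wc -l | tr -d ' ')
echo "WARC files found: $warc_count"
```

A count of 0 means nothing has been downloaded into warc/ yet, so the generator would have no input.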
gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt
r33448 r33456
+Hadoop/Map-reduce
+
+https://www.guru99.com/create-your-first-hadoop-program.html
+
+--------------
 To run firefox/anything graphical inside the VM run by vagrant, have to ssh -Y onto both analytics and then to the vagrant VM from analytics:
 1. ssh analytics -Y
 …
 vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
 345625
+
+
+
+vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
+vagrant@node1:~/cc-index-table$ less file.csv.gz
+
+
+https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox
+
+
+When not using LIKE '%mri%' but = 'mri' instead:
+vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
+5767
 
 -----------------------------------------
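The drop from 345625 to 5767 lines would be consistent with the matched column holding several language codes per page: LIKE '%mri%' matches any row that mentions mri, while = 'mri' matches only rows where mri is the whole value. The sketch below imitates the two predicates with grep on a made-up four-row sample. The sample values, and the assumption that multiple languages are comma-separated in one field, are illustrative and not taken from the notes.

```shell
#!/bin/sh
# Made-up sample of a language column, one value per row (assumption:
# multiple detected languages are comma-separated within one field).
sample=$(printf '%s\n' 'mri' 'eng,mri' 'mri,eng,fra' 'eng')

# Substring match, analogous to SQL: col LIKE '%mri%'
like_count=$(printf '%s\n' "$sample" | grep -c 'mri')

# Whole-value match, analogous to SQL: col = 'mri'
equal_count=$(printf '%s\n' "$sample" | grep -c -x 'mri')

echo "LIKE '%mri%': $like_count rows"
echo "= 'mri': $equal_count rows"
```

On this sample the substring match counts 3 rows and the exact match counts 1, the same direction of difference as in the two `wc -l` results above.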