Changeset 33456

Timestamp:
05.09.2019 17:26:27
Author:
ak19
Message:

Link to discussion on how to convert WARC to WET

Location:
gs3-extensions/maori-lang-detection/MoreReading
Files:
2 modified

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33440 r33456  
                1  WARC, WET, WAT FILES
                2  https://pypi.org/project/warc3-wet/
                3  https://gist.github.com/Smerity/afe7430fdb4371015466
                4  https://github.com/commoncrawl/commoncrawl/issues/11
                5
                6  https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
                7  Sebastian Nagel
                8  05/07/2017
                9  Hi,
               10
               11  unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
               12
               13  But it's easy to run the WET extractor on the WARC files, see:
               14    https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
               15    https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
               16
               17  That's what you have to do:
               18
               19  # download the WARC files and place them in a directory "warc/"
               20  # create sibling folders wat and wet
               21  # |
               22  # |-- warc/
               23  # |   |-- CC-NEWS-20161001224340-00008.warc.gz
               24  # |   |-- CC-NEWS-20161017145313-00000.warc.gz
               25  # |   `-- ...
               26  # |
               27  # |-- wat/
               28  # |
               29  # `-- wet/
               30
               31  git clone https://github.com/commoncrawl/ia-web-commons
               32  cd ia-web-commons
               33  mvn install
               34
               35  cd ..
               36  git clone https://github.com/commoncrawl/ia-hadoop-tools
               37  cd ia-hadoop-tools
               38  mvn package
               39
               40  java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
               41     -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
               42
               43  The folders wat/ and wet/ will then contain the exports.
               44
               45  Best,
               46  Sebastian
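The directory layout described in the comments of the message above can be set up with a few shell commands. This is only a sketch: the function name make_layout and its base-directory argument are assumptions made for illustration and do not appear in the original instructions.

```shell
#!/bin/sh
# Hypothetical helper (not from the original message): create the
# warc/, wat/ and wet/ sibling folders under a base directory.
# The base directory defaults to the current directory.
make_layout() {
    base="${1:-.}"
    mkdir -p "$base/warc" "$base/wat" "$base/wet"
}
```

After calling make_layout with a directory of your choice, the downloaded *.warc.gz files would go into its warc/ subfolder before the extractor is run.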
               47  =======================
               48  Latest version of the index's schema:
               49  https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
               50
         1     51
         2     52  At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says
  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33448 r33456  
                1  Hadoop/Map-reduce
                2
                3  https://www.guru99.com/create-your-first-hadoop-program.html
                4
                5  --------------
         1      6  To run Firefox (or anything graphical) inside the VM managed by Vagrant, ssh with -Y first onto analytics and then from analytics onto the Vagrant VM:
         2      7  1. ssh analytics -Y
     
        41     46  vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
        42     47  345625
               48
               49
               50
               51  vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
               52  vagrant@node1:~/cc-index-table$ less file.csv.gz
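Redirecting several part files into a single file.csv.gz, as in the command above, works if each part* file is itself gzip-compressed (an assumption; the notes do not say so). The gzip format allows members to be concatenated, and the combined file decompresses to the parts in order. A small local sketch with made-up file names and rows, and no HDFS involved:

```shell
#!/bin/sh
# Local illustration only: part-0, part-1 and the CSV rows are made up.
demo_concat_gzip() {
    dir="$(mktemp -d)"
    printf 'a,1\nb,2\n' | gzip -c > "$dir/part-0.csv.gz"
    printf 'c,3\n'      | gzip -c > "$dir/part-1.csv.gz"
    # Concatenate the compressed parts byte for byte, the way
    # "cat part* > file.csv.gz" would.
    cat "$dir"/part-*.csv.gz > "$dir/file.csv.gz"
    # gzip -dc decompresses every member in sequence; wc -l counts
    # the rows of the combined output.
    gzip -dc "$dir/file.csv.gz" | wc -l
    rm -rf "$dir"
}
```

Calling demo_concat_gzip prints the number of rows across both parts.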
               53
               54
               55  https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox
               56
               57
               58  When not using LIKE '%mri%' but = 'mri' instead:
               59  vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
               60  5767
        43     61
        44     62  -----------------------------------------
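The gap between the two counts above (345625 rows with LIKE '%mri%' versus 5767 with = 'mri') is what one would expect if the filtered column holds a comma-separated list of language codes. A substring match then also catches rows where mri is only one of several detected languages. The sketch below imitates both filters on made-up rows. The two-column layout, the sample URLs and the function names are assumptions for illustration, not taken from the index.

```shell
#!/bin/sh
# Made-up sample rows: url,"languages" (languages is a quoted,
# comma-separated list). None of this data comes from Common Crawl.
sample_rows() {
    cat <<'EOF'
http://example.org/a,"mri"
http://example.org/b,"eng,mri"
http://example.org/c,"mri,eng"
http://example.org/d,"eng"
EOF
}

# Analogue of LIKE '%mri%': the languages field contains mri anywhere.
count_like() { sample_rows | grep -c ',".*mri.*"$'; }

# Analogue of = 'mri': the languages field is exactly mri.
count_exact() { sample_rows | grep -c ',"mri"$'; }
```

On this sample, count_like counts three rows and count_exact counts one, mirroring the much smaller result of the equality filter.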