Changeset 33456 for gs3-extensions


Timestamp: 2019-09-05T17:26:27+12:00
Author: ak19
Message: Link to discussion on how to convert WARC to WET
Location: gs3-extensions/maori-lang-detection/MoreReading
Files: 2 edited

  • gs3-extensions/maori-lang-detection/MoreReading/CommonCrawl.txt

    r33440 r33456  
        1  WARC, WET, WAT FILES
        2  https://pypi.org/project/warc3-wet/
        3  https://gist.github.com/Smerity/afe7430fdb4371015466
        4  https://github.com/commoncrawl/commoncrawl/issues/11
        5
        6  https://groups.google.com/forum/#!topic/common-crawl/hsb90GHq6to
        7  Sebastian Nagel
        8  05/07/2017
        9  Hi,
       10
       11  unfortunately, we do not provide (yet) text extracts of the CC-NEWS archives.
       12
       13  But it's easy to run the WET extractor on the WARC files, see:
       14    https://groups.google.com/d/topic/common-crawl/imv4hlLob4s/discussion
       15    https://groups.google.com/d/topic/common-crawl/b6yDG7EmnhM/discussion
       16
       17  That's what you have to do:
       18
       19  # download the WARC files and place them in a directory "warc/"
       20  # create sibling folders wat and wet
       21  # |
       22  # |-- warc/
       23  # |   |-- CC-NEWS-20161001224340-00008.warc.gz
       24  # |   |-- CC-NEWS-20161017145313-00000.warc.gz
       25  # |   `-- ...
       26  # |
       27  # |-- wat/
       28  # |
       29  # `-- wet/
       30
       31  git clone https://github.com/commoncrawl/ia-web-commons
       32  cd ia-web-commons
       33  mvn install
       34
       35  cd ..
       36  git clone https://github.com/commoncrawl/ia-hadoop-tools
       37  cd ia-hadoop-tools
       38  mvn package
       39
       40  java -jar $PWD/target/ia-hadoop-tools-jar-with-dependencies.jar WEATGenerator \
       41     -strictMode -skipExisting batch-id-xyz .../warc/*.warc.gz
       42
       43  The folders wat/ and wet/ will then contain the exports.
       44
       45  Best,
       46  Sebastian
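
The steps in the email above can be wrapped in two small shell functions. This is only a sketch: the function names and the base-directory argument are hypothetical, and it assumes the jar has already been built with the git clone / mvn steps shown in the email. The WEATGenerator arguments are copied from the email unchanged.

```shell
#!/usr/bin/env bash
# Sketch only: function names and arguments are illustrative assumptions.

# Create the warc/, wat/ and wet/ sibling folders under a base directory.
# $1 = base directory (hypothetical parameter)
prepare_layout() {
    local base_dir="$1"
    mkdir -p "$base_dir/warc" "$base_dir/wat" "$base_dir/wet"
}

# Run the generator on every WARC file found in <base_dir>/warc.
# $1 = path to ia-hadoop-tools-jar-with-dependencies.jar
# $2 = base directory holding warc/, wat/ and wet/
# $3 = batch id argument ("batch-id-xyz" in the email)
run_weat_generator() {
    local jar="$1" base_dir="$2" batch_id="$3"
    java -jar "$jar" WEATGenerator \
        -strictMode -skipExisting "$batch_id" "$base_dir"/warc/*.warc.gz
}
```

A possible call sequence would be prepare_layout on the chosen directory, copying the downloaded .warc.gz files into its warc/ folder, and then run_weat_generator with the jar path, the same directory and a batch id.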
       47  =======================
       48  Latest version of the index's schema:
       49  https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/index.html
       50
   1   51
   2   52  At http://commoncrawl.org/2018/10/september-2018-crawl-archive-now-available/, it says
  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33448 r33456  
        1  Hadoop/Map-reduce
        2
        3  https://www.guru99.com/create-your-first-hadoop-program.html
        4
        5  --------------
   1    6  To run firefox/anything graphical inside the VM run by vagrant, have to ssh -Y onto both analytics and then to the vagrant VM from analytics:
   2    7  1. ssh analytics -Y

  41   46  vagrant@node1:~/cc-index-table/src/script$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
  42   47  345625
       48
       49
       50
       51  vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* > file.csv.gz
       52  vagrant@node1:~/cc-index-table$ less file.csv.gz
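
The redirect above joins every part* file into a single file.csv.gz. Assuming the part files are gzip-compressed (as the .gz name suggests, though the notes do not say so), the result is a multi-member gzip stream, which gzip reads as one continuous file. A minimal sketch for counting its rows locally; the function name is hypothetical:

```shell
# Count the lines in a gzip file, including one built by concatenating
# several gzip members (gzip -dc decompresses all members in sequence).
# $1 = path to the .gz file
count_gz_lines() {
    gzip -dc "$1" | wc -l | tr -d ' '
}
```

For example, count_gz_lines file.csv.gz prints the total row count across all concatenated parts.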
       53
       54
       55  https://www.patricia-anong.com/blog/2017/11/1/extend-vmdk-on-virtualbox
       56
       57
       58  When not using LIKE '%mri%' but = 'mri' instead:
       59  vagrant@node1:~/cc-index-table$ hdfs dfs -cat hdfs:///user/vagrant/cc-mri-unzipped-csv/cc-mri.csv | wc -l
       60  5767
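
The drop from 345625 to 5767 rows is consistent with the filtered column holding a comma-separated list of language codes, so that LIKE '%mri%' also matches rows where mri appears alongside other languages, while = 'mri' keeps only rows where it is the sole value. The notes do not name the column; the assumption here is that it behaves like the content_languages field of the Common Crawl index. The sketch below imitates the two predicates with grep on made-up values (the function names and sample values are hypothetical):

```shell
# Reads one language-field value per line on stdin.

# Count values containing "mri" anywhere, like SQL  LIKE '%mri%'.
count_like_mri() {
    grep -c 'mri'
}

# Count values that are exactly "mri", like SQL  = 'mri'.
count_equals_mri() {
    grep -c -x 'mri'
}
```

Fed the four values mri, eng,mri, mri,eng and eng, the first function counts three matches and the second only one.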
  43   61
  44   62  -----------------------------------------