Changeset 33448

Timestamp:
30.08.2019 18:27:21 (3 weeks ago)
Author:
ak19
Message:

Minor clarification and inclusion of helpful command

Files:
1 modified

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33446  r33448
    116     116       Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
    117     117       The actual solution is to edit the CCIndexWarcExport.java as follows:
    118             - 1. set option(header) to false since the csv file contains no header row, only data rows.
            118     + 1. set option(header) to false since the csv file contains no header row, only data rows. You can confirm the csv has no header row by doing
            119     +    hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | head -5
            120     +
    119     121       2. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
    120     122
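
The naming behaviour described in step 2 can be illustrated with a small stand-alone sketch. It is not the code in CCIndexWarcExport.java and does not use Spark; it only mimics the positional names (_c0, _c1, ...) that a headerless CSV read produces. The class name, method name and sample row are hypothetical, and the comma split is deliberately naive (it ignores quoted fields).

```java
import java.util.ArrayList;
import java.util.List;

public class HeaderlessCsvNames {

    // Returns positional column names (_c0, _c1, ...) for one data row,
    // mirroring what a reader does when it is told there is no header row
    // (in Spark: spark.read().option("header", "false").csv(path)).
    // Assumption: fields are separated by plain commas with no quoting.
    static List<String> defaultColumnNames(String firstDataRow) {
        int columns = firstDataRow.split(",", -1).length;
        List<String> names = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            names.add("_c" + i);
        }
        return names;
    }

    public static void main(String[] args) {
        // Placeholder 4-field row; the values are made up for illustration.
        String row = "a,b,c,d";
        System.out.println(defaultColumnNames(row)); // prints [_c0, _c1, _c2, _c3]
    }
}
```

Because the names are purely positional, any code that later refers to columns such as url or warc_filename has to rename or map them explicitly.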