Changeset 33448 for gs3-extensions


Timestamp:
2019-08-30T18:27:21+12:00
Author:
ak19
Message:

Minor clarification and inclusion of helpful command

File:
1 edited

  • gs3-extensions/maori-lang-detection/MoreReading/Vagrant-Spark-Hadoop.txt

    r33446 r33448
    116 116   Hints to solve it were at https://stackoverflow.com/questions/45972929/scala-dataframereader-keep-column-headers
    117 117   The actual solution is to edit the CCIndexWarcExport.java as follows:
    118     - 1. set option(header) to false since the csv file contains no header row, only data rows.
        118 + 1. set option(header) to false since the csv file contains no header row, only data rows. You can confirm the csv has no header row by doing
        119 +    hdfs dfs -cat hdfs:///user/vagrant/cc-mri-csv/part* | head -5
        120 +
    119 121   2. The 4 column names are inferred as _c0 to _c3, not as url/warc_filename etc.
    120 122
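
The two numbered points in the diff can be illustrated with a small, Spark-free sketch. It is not the code from CCIndexWarcExport.java. It only mimics, in plain Java, what reading a header-less csv implies: every row is data, columns get the positional default names _c0, _c1, ..., and meaningful names such as url and warc_filename have to be assigned explicitly. The class name, the helper methods and the sample row are hypothetical. The names of the third and fourth columns are not given in the notes, so they are left at their defaults.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; does not use Spark and is not taken from CCIndexWarcExport.java.
public class HeaderlessCsvSketch {

    // Builds the positional default names (_c0, _c1, ...) that a reader
    // assigns when told the csv has no header row.
    static List<String> defaultColumnNames(int columnCount) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < columnCount; i++) {
            names.add("_c" + i);
        }
        return names;
    }

    // Replaces default names with meaningful ones where a mapping is supplied;
    // names without a mapping are kept unchanged.
    static List<String> renameColumns(List<String> names, Map<String, String> renames) {
        List<String> renamed = new ArrayList<>();
        for (String name : names) {
            renamed.add(renames.getOrDefault(name, name));
        }
        return renamed;
    }

    public static void main(String[] args) {
        // Made-up data row: with no header row, the first line is already data.
        String firstRow = "http://example.org/page,path/to/file.warc.gz,123,456";

        // Naive split; real csv parsing must also handle quoted fields.
        String[] fields = firstRow.split(",", -1);

        List<String> defaults = defaultColumnNames(fields.length);

        // Only the two column names mentioned in the notes are mapped here.
        List<String> columns = renameColumns(
                defaults, Map.of("_c0", "url", "_c1", "warc_filename"));

        for (int i = 0; i < fields.length; i++) {
            System.out.println(columns.get(i) + " = " + fields[i]);
        }
    }
}
```

In Spark itself the equivalent steps would presumably be reading with the header option set to false and then renaming the resulting _c0 to _c3 columns (for example via toDF with explicit names). How CCIndexWarcExport.java does this is not shown in this changeset.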