Changeset 33809


Timestamp: 12/17/19 19:53:17 (16 months ago)
Author: ak19
Message: Some more GS_README.txt instructions. Haven't put the mongodb queries in here yet; they're still in MoreReading/mongodb.txt, but the final queries that are useful will end up in this file later on.

File: 1 edited

  • other-projects/maori-lang-detection/hdfs-cc-work/GS_README.TXT

r33618 r33809
 I. Setting up Nutch v2 on its own Vagrant VM machine
 J. Automated crawling with Nutch v2.3.1 and post-processing
+K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
+---
+
+APPENDIX: Reading data from hbase tables and backing up hbase
 
 ----------------------------------------
 
 
+------------------------------------------------------------------------
+K. Sending the crawled data into mongodb with NutchTextDumpProcessor.java
+------------------------------------------------------------------------
+1. The crawled folder should contain all the batch crawls done with nutch (section J above).
+
+2. Set up mongodb connection properties in conf/config.properties.
+By default, the mongodb database name is configured to be ateacrawldata.
+
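As an illustration only, the connection settings in conf/config.properties might look something like the fragment below. The property names here are hypothetical (the actual keys expected by the code may differ); check the config.properties file shipped in the conf folder for the real ones.

```properties
# Hypothetical sketch of conf/config.properties -- property names are
# assumptions, not the verified keys read by NutchTextDumpToMongoDB.
mongodb.host=localhost
mongodb.port=27017
mongodb.dbname=ateacrawldata
```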
+3. Create a mongodb database with the specified name. Unless the default db name is changed, this means a database named "ateacrawldata" is to be created.
+
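MongoDB creates a database lazily on first write, so "creating" it from the mongo shell can be as simple as switching to it; the sketch below assumes a MongoDB instance on localhost at the default port.

```
// Run inside the mongo shell (mongo shell syntax, not standalone JavaScript).
use ateacrawldata
db.stats()   // confirms which database the shell is currently pointing at
```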
+4. Set up the environment and compile NutchTextDumpToMongoDB:
+   cd maori-lang-detection/apache-opennlp-1.9.1
+   export OPENNLP_HOME=`pwd`
+   cd maori-lang-detection/src
+
+   javac -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB.java
+
+5. Pass the crawled folder to NutchTextDumpToMongoDB:
+   java -cp ".:../conf:../lib/*:$OPENNLP_HOME/lib/opennlp-tools-1.9.1.jar" org/greenstone/atea/NutchTextDumpToMongoDB /PATH/TO/crawled
+
+6. It may take 1.5 hours or so to ingest the approximately 1450 crawled sites' data into mongodb.
+
+7. Launch the Robo 3T MongoDB client (version 1.3 is the one we tested). Use it to connect to MongoDB's "ateacrawldata" database.
+Now you can run queries.
+
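Until the useful queries are copied over from MoreReading/mongodb.txt, the sketch below shows the kind of exploratory queries one could try in Robo 3T's shell tab. The collection name placeholder is hypothetical: the real collection names created by the ingest should be listed first.

```
// Exploratory mongo-shell queries against the ateacrawldata database.
// "<coll>" is a placeholder -- substitute a name reported by the first call.
db.getCollectionNames()
db.getCollection("<coll>").countDocuments({})
db.getCollection("<coll>").findOne()
```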
 --------------------------------------------------------
-K. Reading data from hbase tables and backing up hbase
+APPENDIX: Reading data from hbase tables and backing up hbase
 --------------------------------------------------------
 