source: other-projects/hathitrust/wcsa/extracted-features-solr/trunk/solr-ingest

Diff Rev Age Author Log Message
(edit) @32109   5 years davidb Changes made after testing through YARN
(edit) @32108   5 years davidb Useful breadcrumb for compiling
(edit) @32107   5 years davidb Rekindling the ability to run a JSON-filelist Spark run via YARN
(edit) @32106   5 years davidb Rekindle ability to process a json-filelist.txt using Spark
(edit) @32104   5 years davidb Serial version
(edit) @32103   5 years davidb Tidy up of output
(edit) @32102   5 years davidb Version to project local JSON list serially
(edit) @32101   5 years davidb Tweaks to allow serial ingest to run
(edit) @31786   6 years davidb extra param in call; change to case-folding _htrctokentext
(edit) @31784   6 years davidb Output to highlight skipping per-page indexing
(edit) @31783   6 years davidb Solr Doc Add changed to include volume-level metadata within every …
(edit) @31779   6 years davidb Change in how POS words are checked against the Whitelist. Previously …
(edit) @31677   6 years davidb Suppress processing governmentDocument for now in JSON metadata record, …
(edit) @31676   6 years davidb To make it easier to remember how to kill off a YARN task at the …
(edit) @31675   6 years davidb More careful set of metadata fields indexed
(edit) @31598   6 years davidb Easier to remember what to do
(edit) @31597   6 years davidb Additional _s and _ss fields to help with faceting. Temporarily …
(edit) @31510   6 years davidb Turns out some languages fields can be empty. Need to test for this
(edit) @31509   6 years davidb LangPos determination changed to lock into first match, rather than …
(edit) @31506   6 years davidb Forgot to add initialization line. Doh!
(edit) @31505   6 years davidb Added in storing of top-level document metadata as separate solr-doc
(edit) @31504   6 years davidb Adjusted call to work with added parameter
(edit) @31503   6 years davidb Monitor for missing POS keys, and print out details first time each …
(edit) @31502   6 years davidb Comment out section, useful for controlling a smaller run
(edit) @31501   6 years davidb No longer used
(edit) @31500   6 years davidb Synchronize on reading in of white-list and universal-lang-pos
(edit) @31499   6 years davidb Better exception handling
(edit) @31498   6 years davidb Tidy up on print statements
(edit) @31453   6 years davidb Added size() method
(edit) @31452   6 years davidb Additional Spark progs to run
(edit) @31451   6 years davidb shift to using solr-base-url and a specified solr-collection
(edit) @31450   6 years davidb Some debugging output to help see what is happening with …
(edit) @31378   6 years davidb Fixed loop limit test
(edit) @31377   6 years davidb Switch to using URI not string
(edit) @31376   6 years davidb Universal language mappings for opennlp POS model tags
(edit) @31375   6 years davidb Initial cut at including POS information to solr index
(edit) @31374   6 years davidb simplified command line usage
(edit) @31372   6 years davidb Reworked to use sequenceFiles
(edit) @31371   6 years davidb Trying to get saveAsSequenceFile working
(edit) @31369   6 years davidb Trial new save
(edit) @31368   6 years davidb downsample-100 added
(edit) @31365   6 years davidb Quick code added to downsample
(edit) @31364   6 years davidb removed sample() line
(edit) @31363   6 years davidb Control num of partitions on sort
(edit) @31362   6 years davidb use Spark sample() to make for smaller test with Sequence files
(edit) @31361   6 years davidb Change from String to Text
(edit) @31360   6 years davidb Seems to be Text class not a String class coming out of the sequenceFiles
(edit) @31359   6 years davidb Changed over to use sequenceFiles as input
(edit) @31320   6 years davidb build Document rather than parse JSON string
(edit) @31319   6 years davidb Changed to replace existing MongoDB entry. Fixed up print statement
(edit) @31318   6 years davidb change to using contains()
(edit) @31317   6 years davidb added debug statement
(edit) @31316   6 years davidb fixed typo
(edit) @31315   6 years davidb Further tweak
(edit) @31314   6 years davidb Another go at avoiding concurrency update exception
(edit) @31313   6 years davidb Alternative to avoid concurrency update exception
(edit) @31312   6 years davidb MongoDB can't have 'period' and 'dollar' in key, as reserved characters
(edit) @31311   6 years davidb Processing print statement added
(edit) @31310   6 years davidb Initial cut at files for working with MongoDB
(edit) @31309   6 years davidb Sparked MongoDB connector added
(edit) @31308   6 years davidb Minor tidy-up
(edit) @31294   6 years davidb Version for language counting the catalog assignment language …
(edit) @31278   6 years davidb To avoid null pointer on ids.iterator()
(edit) @31277   6 years davidb Tweak to minimum value
(edit) @31276   6 years davidb Min num partition guard put in
(edit) @31274   6 years davidb Need to use JSONArray not JSONObject for a multifield item
(edit) @31273   6 years davidb Code moved to store fields for multilingual use using dynamic Solr …
(edit) @31272   6 years davidb Use disk and memory to store main language RDD
(edit) @31271   6 years davidb Updating of POS code to new files-per-partition parameter, plus some …
(edit) @31270   6 years davidb Changed over to repartition approach
(edit) @31269   6 years davidb Some variable name changes, and printing tidy up
(edit) @31268   6 years davidb Adjustments to memory allocation in response to test runs on 10% of dataset
(edit) @31267   6 years davidb Values trialed on gsliscluster1. Rekindling idea of per-vol processing
(edit) @31266   6 years davidb Rekindling of per-volume approach. Also some tweaking to verbosity …
(edit) @31264   6 years davidb Switching to 'long' in counts to allow higher number representation
(edit) @31263   6 years davidb Change to using long for higher word counts
(edit) @31261   6 years davidb Overlooked changes from POS to lang
(edit) @31260   6 years davidb Language counting
(edit) @31259   6 years davidb Lambda sort had wrong boolean arg to sort descending. Now fixed
(edit) @31258   6 years davidb POS Label count, similar to Whitelist word count
(edit) @31257   6 years davidb Fixed typo
(edit) @31256   6 years davidb Earlier check of output directory to prevent large scale processing, …
(edit) @31255   6 years davidb Changed to using lambda functions
(edit) @31254   6 years davidb Experimenting with Lucene lowercase filter
(edit) @31253   6 years davidb Identified a typo, and changed to being true anyway
(edit) @31252   6 years davidb Support for icu-tokenize property added, plus relevant refactoring.
(edit) @31251   6 years davidb Code tidy up. Timed experiment showed sorting by key with …
(edit) @31250   6 years davidb Minor mods
(edit) @31247   6 years davidb Change sort order. Pick better output directory name
(edit) @31246   6 years davidb Experimenting with sorting
(edit) @31245   6 years davidb Refactored so processing of words from TokenPosCount now done by the …
(edit) @31244   6 years davidb Tidy up
(edit) @31243   6 years davidb Experimenting with Lucene/Solr's ICU tokenizer
(edit) @31242   6 years davidb Method name refactor
(edit) @31228   6 years davidb Change to see if code can be made more unified. If so, then …
(edit) @31227   6 years davidb Code tidy up
(edit) @31226   6 years davidb Fixed bloom test for init
(edit) @31225   6 years davidb Relocated bloomfilter creation to within call() method, so done on the …
(edit) @31224   6 years davidb Debug added
(edit) @31223   6 years davidb Exception printStackTrace