Timestamp:
2017-12-08T17:58:07+13:00
Author:
ak19
Message:

Martin (mwilliman email id) on the mailing list found that Solr got SIGPIPE errors when he built his 3020-doc Solr collection. The problem occurred when the docs were sent in a single stream for Solr ingestion using the SimplePostTool (post.jar/solr-post.jar). The data stream becomes too large, since SimplePostTool doesn't cause a commit until after the pipe to it is closed.

Initially other methods were attempted. Increasing the Java VM memory size from 512 to 2048 only allowed a certain number of additional docs to be processed before a SIGPIPE occurred again. We also tried suffixing ?commit=true and ?commitWithin=15000 (ms) to the Solr update URL, but as the commit isn't done until after the pipe to SimplePostTool is closed, the URL change had no effect with SimplePostTool.

Though we retained an increase of the Java VM memory to 1024 when launching SimplePostTool, the actual present solution is to close and reopen the pipe to the post tool jar file executable after every x docs. Currently this batch size is set to 20. However, if any file is gigantic, we could see this problem again: it has to do with the overall size of the data stream rather than the number of docs. The actual problem lies in the HttpURLConnection that SimplePostTool opens, rather than in how often we open/close the pipe to the post tool.

This commit contains 3 changes:
1. Changed Java VM memory to 1024 when launching SimplePostTool (solr-post.jar).
2. Code changes to solrutil.pm and solr_passes.pl to close and reopen the pipe after every 20 docs, which flushes the data and forces a commit to Solr.
3. The existing code changes work with the old solr-post.jar (version 1.3), but version 1.5 is committed since it has a larger buffer and is found to be better by Dr Bainbridge. The new v1.5 solr-post.jar is from solr-4.7.2's example/examples/post.jar, renamed to the usual solr-post.jar.
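
The caller-side batching in solr_passes.pl is not part of the diff on this page, so the following is only a minimal sketch of the close-and-reopen idea, not the committed code. The helper name `post_in_batches` is hypothetical, and `cat > /dev/null` stands in for the real `java ... -jar solr-post.jar` command (this assumes a Unix-like shell).

```perl
use strict;
use warnings;

# Hypothetical helper: stream the documents in @$docs to $cmd's stdin,
# closing and reopening the pipe after every $batch_size documents so the
# tool can finish its request (and commit) before more data is sent.
# Returns the number of times the pipe was opened.
sub post_in_batches {
    my ($cmd, $batch_size, $docs) = @_;
    my ($pipe, $opens, $in_batch) = (undef, 0, 0);

    for my $doc (@$docs) {
        if (!defined $pipe) {
            open($pipe, '|-', $cmd) or die "Failed to run $cmd: $!\n";
            $opens++;
        }
        print {$pipe} $doc;
        if (++$in_batch >= $batch_size) {
            # Closing the pipe waits for the child process to exit.
            close($pipe) or die "Command failed: $cmd (status $?)\n";
            ($pipe, $in_batch) = (undef, 0);
        }
    }

    # Flush the final, partial batch if there is one.
    if (defined $pipe) {
        close($pipe) or die "Command failed: $cmd (status $?)\n";
    }
    return $opens;
}

# Stand-in command: 'cat' just discards the input here.
my @docs  = map { "<doc id=\"$_\"/>\n" } 1 .. 45;
my $opens = post_in_batches('cat > /dev/null', 20, \@docs);
print "pipe opened $opens times\n";
```

As the message notes, batching by document count does not bound the size of the stream, so a batch containing one very large file could still hit the same limit.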

File:
1 edited

  • gs3-extensions/solr/trunk/src/perllib/solrutil.pm

    --- gs3-extensions/solr/trunk/src/perllib/solrutil.pm (r31490)
    +++ gs3-extensions/solr/trunk/src/perllib/solrutil.pm (r32088)
    @@ -112,6 +112,5 @@
     }
     
    -
    -sub open_post_pipe
    +sub get_post_pipe_cmd
     {
         my ($core, $solr_base_url) = @_;
    @@ -125,15 +124,40 @@
     
         # Now run solr-post command
    +    # See https://wiki.apache.org/solr/UpdateXmlMessages
    +    # also https://lucene.apache.org/solr/4_2_1/tutorial.html
    +        # suffixing commit=true/commitWithin=10000 to solr's /update servlet didn't work, because
    +        # when using SimplePostTool, the commit only happens after the pipe to the tool is closed
         my $post_props = "-Durl=$solr_base_url/$core/update"; # robustness of protocol is taken care of too
     
         $post_props .= " -Ddata=stdin";
         $post_props .= " -Dcommit=yes";
    +
    +    # increased VM mem from 512 to 1024, but increasing to 2048M didn't help either when too much
    +    # data streamed to SimplePostTool before commit. Nothing works short of committing before the
    +    # data streamed gets too large. The solution is to close and reopen the pipe to force commits.
    +    my $post_java_cmd = "java -Xmx1024M $post_props -jar \"$full_post_jar\"";
     
    -    my $post_java_cmd = "java -Xmx512M $post_props -jar \"$full_post_jar\"";
    +       ##print STDERR "**** post cmd = $post_java_cmd\n";
     
    -    ##print STDERR "**** post cmd = $post_java_cmd\n";
    +    return $post_java_cmd;
    +}
    +
    +sub open_post_pipe
    +{
    +    my ($core, $solr_base_url) = @_;
    +    my $post_java_cmd = &get_post_pipe_cmd($core, $solr_base_url);
    +
    +    open (PIPEOUT, "| $post_java_cmd")
    +    || die "Error in solr_passes.pl: Failed to run $post_java_cmd\n!$\n";
    +
    +    return $post_java_cmd; # return the post_java_cmd so caller can store it and reopen_post_pipe()
    +}
    +
    +sub reopen_post_pipe
    +{
    +    my $post_java_cmd = shift(@_);
     
         open (PIPEOUT, "| $post_java_cmd")
    -    || die "Error in solr_passes.pl: Failed to run $post_java_cmd\n!$\n";
    +    || die "Error in solrutil::reopen_post_pipe: Failed to run $post_java_cmd\n!$\n";
     
     }
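
The `die` messages in this changeset end in `\n!$\n`, which looks like a transposition of `$!` (Perl's system error variable); as written, the string interpolates `$\` followed by a literal `n`. The sketch below shows one possible variant that reports `$!` and uses a lexical filehandle instead of the global `PIPEOUT`. The names `open_pipe_to` and `close_pipe` are hypothetical and not part of solrutil.pm.

```perl
use strict;
use warnings;

# Hypothetical variant of open_post_pipe/reopen_post_pipe: returns a lexical
# filehandle and includes the system error ($!) in the failure message.
sub open_pipe_to {
    my ($post_java_cmd) = @_;
    open(my $pipe, '|-', $post_java_cmd)
        or die "Failed to run $post_java_cmd: $!\n";
    return $pipe;
}

# For a piped open, close() waits for the child process and returns false
# if a system call failed ($! is non-zero) or if the child exited with a
# non-zero status ($! is 0 and the status is in $?).
sub close_pipe {
    my ($pipe, $post_java_cmd) = @_;
    return 1 if close($pipe);
    my ($errno, $status) = ($! + 0, $?);
    die "Closing pipe to '$post_java_cmd' failed: $!\n" if $errno != 0;
    die "Command '$post_java_cmd' failed with exit status " . ($status >> 8) . "\n";
}
```

When the command string goes through the shell, `open` usually succeeds even if the program itself cannot start, so checking the result of `close` is where most failures would surface.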