Changeset 31920

Show
Ignore:
Timestamp:
30.08.2017 19:29:36 (3 months ago)
Author:
ak19
Message:

Untested on Windows as yet. 1. Major overhaul to WgetDownload?'s useWget() and useWgetMonitored() subroutines. Their use of open3 was wrong and would cause blocking if proxy set wrong or if https_proxy not set/set wrong and the url entered was http but resolves to https. The problem was more fundamental than the symptoms indicated the open3() calls were used wrong and resulted in blocking. The blocking could be indefinite. To generally avoid blocking, needed to use IO::select() to loop to check any child streams that are ready. To avoid possibly indefinite blocking, needed to use IO::select() with a timeout on the can_read() method. The need for all these and their use is indicated in the links added to the committed version of this module. 2. After the use of select() worked in principle, there was still the large problem that terminating unnaturally did not stop a second wget that had been launched. This unexpectedly had to do with doublequotes around wget's path that attempted to preserve any spaces in the path, but which behaved differently with open3(): any double quotes launched a subshell to run the command passed to open3(). And the wget cmd launched by the subshell cmd wasn't actually a child process, so it could not be terminated via the parentpid used as a processgrouppid when doing the kill TERM -processgroupid. The solution lay with the unexpected cause of the problem, which was the double quotes. Now the command passed to open3() is an array of parameters and no double quotes. The array is meant to preserve spaces in any filepaths. 3. Removed the 2 tries parameter passed to wget, since we now loop a certain number of times trying to read from the child process' streams each time this times out. If it times out n times, then we give up and assume that the URL could not be read.

Location:
main/trunk/greenstone2/perllib/downloaders
Files:
2 modified

Legend:

Unmodified
Added
Removed
  • main/trunk/greenstone2/perllib/downloaders/WebDownload.pm

    r31898 r31920  
    118118    } 
    119119    #my $cmdWget = "-N -k -x -t 2 -P \"".$hashGeneralOptions->{"cache_dir"}."\" $strWgetOptions $strOptions ".$self->{'url'}; 
    120     my $cmdWget = "-N -k -x --tries=2 $strWgetOptions $strOptions $cache_dir " .$self->{'url'}; 
     120    #my $cmdWget = "-N -k -x --tries=2 $strWgetOptions $strOptions $cache_dir " .$self->{'url'}; 
     121    my $cmdWget = "-N -k -x $strWgetOptions $strOptions $cache_dir " .$self->{'url'}; 
    121122 
    122123    #print STDOUT "\n@@@@ RUNNING WGET CMD: $cmdWget\n\n"; 
     
    191192    my $strOptions = $self->getWgetOptions(); 
    192193 
    193     my $strBaseCMD = $strOptions." --tries=2 -q -O - \"$self->{'url'}\""; 
     194    #my $strBaseCMD = $strOptions." --tries=2 -q -O - \"$self->{'url'}\""; 
     195    my $strBaseCMD = $strOptions." -q -O - $self->{'url'}"; 
    194196 
    195197    #&util::print_env(STDERR, "https_proxy", "http_proxy", "ftp_proxy"); 
  • main/trunk/greenstone2/perllib/downloaders/WgetDownload.pm

    r31880 r31920  
    112112my ($serverSocket, $read_set); 
    113113 
     114my $TIMEOUT = 1; # seconds 
     115my $NUM_TRIES = 10; 
     116 
    114117# The port this script's server socket will be listening on, to handle  
    115118# incoming signals from GLI to terminate wget. This is also file global,  
     
    133136    # See http://perldoc.perl.org/perlipc.html#Signals  
    134137    # Warning on using kill at http://perldoc.perl.org/perlfork.html 
    135     kill("TERM", $childpid);  
     138    kill("TERM", $childpid); # prefix - to signal to kill process group 
    136139 
    137140    # If the SIGTERM sent on Linux calls this handler, we want to make 
     
    261264# terminate wget). True if we are running gli, or if the particular type 
    262265# of WgetDownload is *not* OAIDownload (in that case, the original way of  
    263 # terminating the perl script from Java terminated wget as well). 
     266# terminating the perl script from Java would terminate wget as well). 
    264267sub dealingWithSockets() { 
    265268    my ($self) = @_; 
     
    268271} 
    269272 
    270  
     273# Shouldn't use double quotes around wget path after all? See final comment at 
     274# http://www.perlmonks.org/?node_id=394709 
     275# http://coldattic.info/shvedsky/pro/blogs/a-foo-walks-into-a-bar/posts/63 
    271276sub useWget 
    272277{ 
     
    308313 
    309314    my $wget_file_path = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "bin", $ENV{'GSDLOS'}, "wget"); 
    310     $command = "\"$wget_file_path\" $cmdWget"; 
    311     #print STDOUT "Command is: $command\n"; # displayed in GLI output 
    312     #print STDERR "Command is: $command\n"; # goes into ServerInfoDialog 
     315    # compose the command as an array rather than as a string, to preserve spaces in the filepath 
     316    # because single/double quotes using open3 seem to launch a shell, see final comment at  
     317    # http://www.perlmonks.org/?node_id=394709 and that ends up causing problems in terminating wget 
     318    # as 2 processes then got launched which don't have parent-child pid relationship. 
     319    my @commandargs = split(' ', $cmdWget); 
     320    unshift(@commandargs, $wget_file_path); 
     321    $command = "$wget_file_path $cmdWget"; 
     322#    print STDOUT "Command is: $command\n"; # displayed in GLI output 
     323#    print STDERR "Command is: $command\n"; # goes into ServerInfoDialog 
    313324     
    314325    # Wget's output needs to be monitored to find out when it has naturally terminated. 
     
    316327    # On linux, 2>&1 launches a subshell which then launches wget, meaning that killing 
    317328    # the childpid does not kill wget on Linux but the subshell that launched it instead.  
    318     # Therefore, we use open3. Though the child process wget sends output only to its stdout,  
     329    # Therefore, we use open3. Though the child process wget sends output only to its stdout [is this meant to be "stderr"?],  
    319330    # using open3 says chld_err is undefined and the output of wget only comes in chld_out(!) 
    320331    # However that may be, it works with open3. But to avoid the confusion of managing and 
     
    326337    # Both open2 and open3 don't return on failure, but raise an exception. The handling 
    327338    # of the exception is described on p.568 of the Perl Cookbook 
    328     eval {  
    329     $childpid = open3($chld_in, $chld_out, $chld_out, $command); 
     339    eval { 
     340    #$childpid = open3($chld_in, $chld_out, $chld_out, $command); 
     341    $childpid = open3($chld_in, $chld_out, $chld_out, @commandargs);     
    330342    }; 
    331343    if ($@) { 
     
    336348    } 
    337349 
     350    # Switching to use IO::Select, which allows timeouts, instead of doing the potentially blocking 
     351    #     if defined(my $strLine=<$chld_out>) 
     352    # Google: perl open3 read timeout 
     353    # Google: perl open3 select() example 
     354    # https://stackoverflow.com/questions/10029406/why-does-ipcopen3-get-deadlocked 
     355    # https://codereview.stackexchange.com/questions/84496/the-right-way-to-use-ipcopen3-in-perl 
     356    # https://gist.github.com/shalk/6988937 
     357    # https://stackoverflow.com/questions/18373500/how-to-check-if-command-executed-with-ipcopen3-is-hung 
     358    # http://perldoc.perl.org/IO/Select.html 
     359    # http://perldoc.perl.org/IPC/Open3.html - explains the need for select()/IO::Select with open3 
     360    # http://www.perlmonks.org/?node_id=951554 
     361    # http://search.cpan.org/~dmuey/IPC-Open3-Utils-0.91/lib/IPC/Open3/Utils.pm 
     362    # https://stackoverflow.com/questions/3000907/wget-not-behaving-via-ipcopen3-vs-bash?rq=1 
     363 
     364    # create the select object and add our streamhandle(s) 
     365    my $sel = new IO::Select; 
     366    $sel->add($chld_out); 
     367 
     368    my $num_consecutive_timedouts = 0; 
     369    my $error = 0; 
    338370    my $loop = 1; 
     371     
    339372    while($loop) 
    340373    { 
    341     if (defined(my $strLine=<$chld_out>)) { # we're reading in from child process' stdout 
    342         if($blnShow) { 
    343         print STDERR "$strLine\n"; 
     374    # assume we're going to timeout trying to read from child process 
     375    $num_consecutive_timedouts++; 
     376 
     377     
     378    # block until data is available on the registered filehandles or until the timeout specified     
     379    if(my @readyhandles = $sel->can_read($TIMEOUT)) { 
     380 
     381        $num_consecutive_timedouts = 0; # re-zero, as we didn't timeout reading from child process after all 
     382        # since we're in this if statement 
     383         
     384        # now there's a list of registered filehandles we can read from to loop through reading from. 
     385        # though we've registered only one, chld_out 
     386        foreach my $fh (@readyhandles) { 
     387        my $strLine; 
     388        #sleep 3; 
     389         
     390        # read up to 4096 bytes from this filehandle fh. 
     391        # if there is less than 4096 bytes, we'll only get 
     392        # those available bytes and won't block.  If there  
     393        # is more than 4096 bytes, we'll only read 4096 and 
     394        # wait for the next iteration through the loop to  
     395        # read the rest. 
     396        my $len = sysread($fh, $strLine, 4096); 
     397         
     398        if($len) { # read something 
     399            if($blnShow) { 
     400            print STDERR "$strLine\n"; 
     401            } 
     402            $strReadIn .= $strLine; 
     403        } 
     404        else { # error or EOF: (!defined $len || $len == 0)          
     405             
     406            if(!defined $len) { # error reading          
     407            print STDERR "WgetDownload: Error reading from child stream: $!\n"; 
     408            # SHOULD THIS 'die "errmsg";' instead? - no, sockets may need closing 
     409            $error = 1; 
     410            } 
     411            elsif ($len == 0) { # EOF            
     412            # Finished reading from this filehand $fh because we read 0 bytes. 
     413            # wget finished, terminate naturally 
     414            print STDERR "WgetDownload: wget finished\n"; 
     415            #print STDOUT "\nPerl: open3 command, input streams closed. Wget terminated naturally.\n"; 
     416            } 
     417 
     418            $loop = 0; # error or EOF, either way will need to clean up and break out of outer loop 
     419             
     420            # last; # if we have more than one filehandle registered with IO::Select 
     421             
     422            $sel->remove($fh); # if more than one filehandle registered, we should unregister all of them here on error          
     423             
     424        } # end else error or EOF 
     425         
     426        } # end foreach on readyhandles 
     427    } # end if on can_read 
     428     
     429    if($num_consecutive_timedouts >= $NUM_TRIES) { 
     430        $error = 1; 
     431        $loop = 0;                          # to break out of outer while loop 
     432 
     433        $num_consecutive_timedouts = 0; 
     434 
     435        print STDERR "WARNING from WgetDownload: wget timed out $NUM_TRIES times waiting for a response\n"; 
     436        print STDERR "\tThe URL may be inaccessible or the proxy configuration is wrong or incomplete.\n"; 
     437        #print STDERR "\tConsider cancelling this download?\n";  
     438    } 
     439 
     440    if($loop == 0) { # error or EOF, either way, clean up 
     441        if($error) { 
     442        $self->{'forced_quit'} = 1;         # subclasses need to know we're quitting 
     443         
     444        if(kill(0, $childpid)) {  
     445            # If kill(0, $childpid) returns true, then the process is running 
     446            # and we need to kill it. 
     447            close($chld_in); 
     448            close($chld_out); 
     449            kill('TERM', $childpid); # kill the process group by prefixing - to signal 
     450 
     451            # https://coderwall.com/p/q-ovnw/killing-all-child-processes-in-a-shell-script 
     452            # https://stackoverflow.com/questions/392022/best-way-to-kill-all-child-processes 
     453            print STDERR "SENT SIGTERM TO CHILD PID: $childpid\n"; 
     454             
     455            #die "Perl terminated wget after timing out repeatedly and is about to exit\n"; 
     456        } 
    344457        } 
    345         $strReadIn .= $strLine; 
    346     } 
    347     else { # wget finished, terminate naturally 
    348         #print STDOUT "\nPerl: open2 command, input stream closed. Wget terminated naturally.\n"; 
    349         close($chld_in); 
    350         close($chld_out); 
    351         waitpid $childpid, 0; 
    352         $loop = 0; 
    353          
     458        else { # wget finished (no errors), terminate naturally 
     459        #print STDOUT "\nPerl: open2 command, input stream closed. Wget terminated naturally.\n"; 
     460        close($chld_in); 
     461        close($chld_out); 
     462        waitpid $childpid, 0;        
     463        } 
     464 
     465        # error or not 
    354466        $childpid = undef; 
     467        # Stop monitoring the read_handle and close the serverSocket  
     468        # (the Java end will close the client socket that Java opened) 
    355469        if(defined $port) { 
    356470        $read_set->remove($serverSocket); 
     
    359473    } 
    360474 
     475    # If we've already terminated, we can get out of the loop 
     476    #next if($loop == 0); 
     477     
    361478    # if we run this script from the command-line (as opposed to from GLI),  
    362     # then we're not working with sockets and can therefore can skip the next bits 
     479    # then we're not working with sockets and can therefore skip the next bits 
    363480    next unless(defined $port); 
    364  
     481     
    365482    # http://www.perlfect.com/articles/select.shtml 
    366483    # "multiplex between several filehandles within a single thread of control,  
     
    389506            # for it to start up, checking for whether it is running in order to kill it. 
    390507            for(my $seconds = 1; $seconds <= 5; $seconds++) { 
    391             if(kill(0, $childpid)) {  
     508            if(kill(0, $childpid)) { 
    392509                # If kill(0, $childpid) returns true, then the process is running 
    393510                # and we need to kill it. 
    394511                close($chld_in); 
    395512                close($chld_out); 
    396                 kill("TERM", $childpid); 
     513                kill("TERM", $childpid); # prefix - to signal to kill process group 
    397514                 
    398515                $childpid = undef; 
     
    460577 
    461578    my $wget_file_path = &FileUtils::filenameConcatenate($ENV{'GSDLHOME'}, "bin", $ENV{'GSDLOS'}, "wget"); 
    462     my $command = "\"$wget_file_path\" $cmdWget"; 
     579    # compose the command as an array for open3, to preserve spaces in any filepath 
     580    my @commandargs = split(' ', $cmdWget); 
     581    unshift(@commandargs, $wget_file_path); 
     582    my $command = "$wget_file_path $cmdWget"; 
    463583    #print STDOUT "Command is: $command\n"; 
    464584 
    465585    eval {     # see p.568 of Perl Cookbook 
    466     $childpid = open3($chld_in, $chld_out, $chld_out, $command); 
     586    $childpid = open3($chld_in, $chld_out, $chld_out, @commandargs); 
    467587    }; 
    468588    if ($@) { 
     
    478598    my $line; 
    479599 
     600    # create the select object and add our streamhandle(s) 
     601    my $sel = new IO::Select; 
     602    $sel->add($chld_out); 
     603     
     604    my $num_consecutive_timedouts = 0; 
     605    my $error = 0; 
    480606    my $loop = 1; 
    481607    while($loop) 
    482608    { 
    483     if (defined($line=<$chld_out>)) { # we're reading in from child process' stdout 
    484         if((defined $blnShow) && $blnShow) 
    485         { 
    486         print STDERR "$line"; 
     609    # assume we're going to timeout trying to read from child process 
     610    $num_consecutive_timedouts++; 
     611 
     612    # block until data is available on the registered filehandles or until the timeout specified     
     613    if(my @readyhandles = $sel->can_read($TIMEOUT)) { 
     614        $num_consecutive_timedouts = 0; # re-zero, as we didn't timeout reading from child process after all 
     615        # since we're in this if statement 
     616         
     617        foreach my $fh (@readyhandles) { 
     618        my $len = sysread($fh, $line, 4096); # read up to 4k from current ready filehandle 
     619        if($len) { # read something 
     620         
     621             
     622            if((defined $blnShow) && $blnShow) 
     623            { 
     624            print STDERR "$line"; 
     625            } 
     626             
     627            if ($line =~ m/^Location:\s*(.*?)\s*\[following\]\s*$/i) { 
     628            my $follow_url = $1; 
     629            push(@follow_list,$follow_url); 
     630            } 
     631             
     632            if ($line =~ m/ERROR\s+\d+/) { 
     633            $error_text .= $line; 
     634            } 
     635             
     636            $full_text .= $line; 
     637        } else { # error or EOF 
     638            if(!defined $len) { # error reading 
     639            #print STDERR "WgetDownload: Error reading from child stream: $!\n"; 
     640            $error = 1; 
     641            } 
     642            elsif ($len == 0) { # EOF, finished with this filehandle because 0 bytes read 
     643            #print STDERR "WgetDownload: wget finished\n"; # wget terminated naturally 
     644            } 
     645 
     646            $loop = 0; # error or EOF, either way will need to clean up and break out of outer loop 
     647             
     648            # last; # if we have more than one filehandle registered with IO::Select 
     649             
     650            $sel->remove($fh); # if more than one filehandle registered, we should unregister all of them here on error          
     651        } # end else error or EOF 
     652         
     653        } # end foreach on readyhandles 
     654    }  # end if on can_read 
     655 
     656    if($num_consecutive_timedouts >= $NUM_TRIES) { 
     657        $error = 1; 
     658        $loop = 0;                          # to break out of outer while loop 
     659 
     660        $num_consecutive_timedouts = 0; 
     661 
     662        #print STDERR "WARNING from WgetDownload: wget timed out $NUM_TRIES times waiting for a response\n"; 
     663        #print STDERR "\tThe URL may be inaccessible or the proxy configuration is wrong or incomplete.\n"; 
     664    } 
     665 
     666    if($loop == 0) { # error or EOF, either way, clean up 
     667         
     668        if($error) { 
     669        $self->{'forced_quit'} = 1;         # subclasses need to know we're quitting 
     670         
     671        if(kill(0, $childpid)) { 
     672            # If kill(0, $childpid) returns true, then the process is running 
     673            # and we need to kill it. 
     674            close($chld_in); 
     675            close($chld_out); 
     676            kill("TERM", $childpid); # prefix - to signal to kill process group 
     677             
     678            #die "Perl terminated wget after timing out repeatedly and is about to exit\n"; 
     679        } 
     680        } 
     681        else { # wget finished, terminate naturally 
     682        close($chld_in); 
     683        close($chld_out); 
     684        # Program terminates only when the following line is included  
     685        # http://perldoc.perl.org/IPC/Open2.html explains why this is necessary 
     686        # it prevents the child from turning into a "zombie process". 
     687        # While the wget process terminates without it, this perl script does not:  
     688        # the DOS prompt is not returned without it. 
     689        waitpid $childpid, 0; 
    487690        } 
    488691         
    489         if ($line =~ m/^Location:\s*(.*?)\s*\[following\]\s*$/i) { 
    490         my $follow_url = $1; 
    491         push(@follow_list,$follow_url); 
    492         } 
    493          
    494         if ($line =~ m/ERROR\s+\d+/) { 
    495         $error_text .= $line; 
    496         } 
    497          
    498         $full_text .= $line; 
    499     }  
    500     else { # wget finished, terminate naturally 
    501         #print STDERR "\nPerl: open2 command, input stream closed. Wget terminated naturally.\n"; 
    502         close($chld_in); 
    503         close($chld_out); 
    504         # Program terminates only when the following line is included  
    505         # http://perldoc.perl.org/IPC/Open2.html explains why this is necessary 
    506         # it prevents the child from turning into a "zombie process". 
    507         # While the wget process terminates without it, this perl script does not:  
    508         # the DOS prompt is not returned without it. 
    509         waitpid $childpid, 0; 
    510         $loop = 0; 
    511          
    512         $childpid = undef; 
     692        # error or not: 
     693        $childpid = undef;       
    513694        if(defined $port) { 
    514695        $read_set->remove($serverSocket); 
     
    518699 
    519700    # if we run this script from the command-line (as opposed to from GLI),  
    520     # then we're not working with sockets and can therefore can skip the next bits 
     701    # then we're not working with sockets and can therefore skip the next bits 
    521702    next unless(defined $port); 
    522703 
     
    547728            # for it to start up, checking for whether it is running in order to kill it. 
    548729            for(my $seconds = 1; $seconds <= 5; $seconds++) { 
    549             if(kill(0, $childpid)) {  
     730            if(kill(0, $childpid)) { 
    550731                # If kill(0, $childpid) returns true, then the process is running 
    551732                # and we need to kill it. 
    552733                close($chld_in); 
    553734                close($chld_out); 
    554                 kill("TERM", $childpid); 
     735                kill("TERM", $childpid); # prefix - to signal to kill process group 
    555736                 
    556737                $childpid = undef;