On 07/15/22 11:31, Richard W.M. Jones wrote:
On Fri, Jul 15, 2022 at 09:30:38AM +0200, Laszlo Ersek wrote:
> On 07/14/22 14:36, Richard W.M. Jones wrote:
>> Add a new, optional [?wait] parameter to On_exit.kill, allowing
>> programs to wait for a number of seconds for the subprocess to exit.
>> ---
>> mltools/on_exit.ml | 30 +++++++++++++++++++++---------
>> mltools/on_exit.mli | 14 ++++++++++++--
>> 2 files changed, 33 insertions(+), 11 deletions(-)
>>
>> diff --git a/mltools/on_exit.ml b/mltools/on_exit.ml
>> index e8353df..cdaa83c 100644
>> --- a/mltools/on_exit.ml
>> +++ b/mltools/on_exit.ml
>> @@ -24,10 +24,10 @@ open Unix
>> open Printf
>>
>> type action =
>> - | Unlink of string (* filename *)
>> - | Rm_rf of string (* directory *)
>> - | Kill of int * int (* signal, pid *)
>> - | Fn of (unit -> unit) (* generic function *)
>> + | Unlink of string (* filename *)
>> + | Rm_rf of string (* directory *)
>> + | Kill of int * int * int (* wait, signal, pid *)
>> + | Fn of (unit -> unit) (* generic function *)
>>
>> (* List of (priority, action). *)
>> let actions = ref []
>> @@ -35,18 +35,30 @@ let actions = ref []
>> (* Perform a single exit action, printing any exception but
>> * otherwise ignoring failures.
>> *)
>> -let do_action action =
>> +let rec do_action action =
>
> I'd slightly prefer (a) do_kill to be introduced either before
> do_action, or (b) for do_kill to be defined inside do_action, to using
> "let rec" here. I think I understand what "rec" does to scoping,
but
> still, we don't have actual recursion here.
>
> (I've noticed this pattern in many places in the v2v projects, and it
> always confuses me -- it doesn't bring too much convenience IMO, so I'd
> rather restrict it to actual recursion.)
It just allows you to write code top-down instead of bottom-up.
As you can see I mix both styles when I fancy it :-/
>> try
>> match action with
>> | Unlink file -> Unix.unlink file
>> | Rm_rf dir ->
>> let cmd = sprintf "rm -rf %s" (Filename.quote dir) in
>> ignore (Tools_utils.shell_command cmd)
>> - | Kill (signal, pid) ->
>> - kill pid signal
>> + | Kill (wait, signal, pid) ->
>> + do_kill ~wait ~signal ~pid
>> | Fn f -> f ()
>> with exn -> debug "%s" (Printexc.to_string exn)
>>
>> +and do_kill ~wait ~signal ~pid =
>> + kill pid signal;
>> +
>> + let rec loop i =
>> + if i > 0 then (
>> + let pid', _ = waitpid [ WNOHANG ] pid in
>> + if pid' = 0 then
>> + loop (i-1)
>
> Missing: "sleep 1;" before the tail-recursive call, I think.
Ugh, indeed.
>> + )
>> + in
>> + loop wait
>> +
>> (* Make sure the actions are performed only once. *)
>> let done_actions = ref false
>>
>> @@ -106,6 +118,6 @@ let rm_rf ?(prio = 1000) dir =
>> register ();
>> List.push_front (prio, Rm_rf dir) actions
>>
>> -let kill ?(prio = 1000) ?(signal = Sys.sigterm) pid =
>> +let kill ?(prio = 1000) ?(wait = 0) ?(signal = Sys.sigterm) pid =
>> register ();
>> - List.push_front (prio, Kill (signal, pid)) actions
>> + List.push_front (prio, Kill (wait, signal, pid)) actions
>> diff --git a/mltools/on_exit.mli b/mltools/on_exit.mli
>> index 910783e..dd35101 100644
>> --- a/mltools/on_exit.mli
>> +++ b/mltools/on_exit.mli
>> @@ -58,12 +58,22 @@ val unlink : ?prio:int -> string -> unit
>> val rm_rf : ?prio:int -> string -> unit
>> (** Recursively remove a temporary directory on exit (using [rm -rf]). *)
>>
>> -val kill : ?prio:int -> ?signal:int -> int -> unit
>> +val kill : ?prio:int -> ?wait:int -> ?signal:int -> int -> unit
>> (** Kill [PID] on exit. The signal sent defaults to [Sys.sigterm].
>>
>> Use this with care since you can end up unintentionally killing
>> another process if [PID] goes away or doesn't exist before the
>> - program exits. *)
>> + program exits.
>> +
>> + The optional [?wait] flag attempts to wait for a specified
>> + number of seconds for the subprocess to go away. For example
>> + using [~wait:5] will wait for up to 5 seconds. Since this
>> + runs when virt-v2v is exiting, it is best to keep waiting times
>> + as short as possible. Also there is no way to report errors
>> + in the subprocess. If reliable cleanup of a subprocess is
>> + required then this is not the correct place to do it.
>> +
>> + [?wait] defaults to [0] which means we do not try to wait. *)
>>
>> val register : unit -> unit
>> (** Force this module to register its at_exit function and signal
>>
>
> (please consider formatting *.mli before *.ml)
>
> I believe I take the opposite position on this; I'd rather wait forever.
> No subprocess is expected to hang, and we should leave no subprocess
> behind. If a subprocess hangs, the parent process should (apparently)
> hang forever too, and let users report bugs.
>
> That would also eliminate the question of *how* to wait for N seconds;
> we'd just drop WNOHANG from waitpid.
>
> But this is just my opinion; food for thought. :)
So another way to think about this is that if the process hasn't gone
away after N seconds, something has gone wrong, so we should report an
error (but also exit). How about that?
Good point! That's actually the best for the user. Don't let them wonder
forever, also don't pretend everything is fine when the child process
doesn't exist in 5 seconds and just silently give up.
I'm also not convinced that lazy umount is wrong in this
situation.
In general, and particularly in the livecd-creator case, it was wrong.
But in this case we know that nbdcopy flushed the data successfully,
so the data and metadata of the file must have been written to the NFS
server. If the umount fails here it's caused by some other problem
(eg. stuck nbdkit) but not one that should ever lead to data loss.
I'm nervous about lazy unmount; it's too indiscriminate. We just stop
waiting and assume nothing else is currently keeping a reference to the
mount point.
I think your idea about preserving the timeout, but failing hard when it
actually elapses, is the most user-friendly approach. I guess that could
mean an "else" branch in the "loop" function, where we should abort
hard, or something like that (we can't exit cleanly, as we're already
exiting and running cleanup handlers... but maybe we have a
counter-measure for that already. I seem to recall "run these actions
only once".)
Really, anything but lazy unmount please. :)
Thanks!
Laszlo