On 11/14/2017 04:25 AM, Richard W.M. Jones wrote:
> On Mon, Nov 13, 2017 at 12:42:48PM -0600, Eric Blake wrote:
> > I'm observing a difference in timing on nbdkit shutdown that is
> > dependent on client behavior, and I wonder if that's a bug, but I can't
> > figure out where to patch it.
> >
> > When there is no connection present, sending SIGINT to nbdkit terminates
> > immediately.
> >
> > If, on the other hand, there is a single client currently connected, the
> > termination behavior on SIGINT depends on what the client has done: if
> > the client is currently silent with regards to issuing
> > transmission-phase transactions, nbdkit hangs for 5 seconds, then
> > forcibly tears down the connection:
>
> I'm guessing it's because of commit 63f0eb0889c8f8a82ba06a02a8a92d695902baad
> which I added to fix a race in plugin_cleanup(). See also:
> https://www.redhat.com/archives/libguestfs/2017-September/msg00226.html
>
> [...]
>
> > Why does current traffic from the client cause the plugin to be torn
> > down faster? Does it matter?
>
> I believe because the main loop checks the !quit flag if there
> is traffic on the connection.
>
> There is most likely to be a better fix for the race than 63f0eb0889.
> I added that as a quick workaround for the segfault we saw in the
> tests. Perhaps we should actively cancel the threads on shutdown
> instead of waiting?

If the plugin is still responding to an active request, and it is not
expecting thread cancellation, that could be bad (writing code that is
safe in the presence of pthread_cancel is not always trivial). But the
scenario I described is when there are no active requests at the moment,
so the threads are under our control. A better way might be to have a
read-from-self pipe that we write on SIGINT, and a select() that checks
for activity both on the read-from-self pipe and from the client, so
that we can react to either situation immediately. But then we still
have to be careful to (try to) let the plugin's pending active requests
complete (and merely reject any new requests from the client during that
time), before giving up completely. In other words, I don't have a
problem with waiting 5 seconds for SIGINT to have an effect in some
cases, so much as a problem of explaining why we still have to wait even
when there is no reason for it.
At any rate, for now I'm just documenting the issues, rather than
planning on tackling the code associated with the problem; it will be a
task (for someone else?) down the road.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org