On Fri, Mar 27, 2020 at 05:33:28PM -0500, Eric Blake wrote:
We have been seeing sporadic hangs on test-nbd-tls-psk.sh, where
even
though the client to the 'nbdkit nbd' process has cleanly exited,
things are stalled in .close where nbd is trying to pthread_join() the
reader thread, while the reader thread is itself blocked on a poll()
that will never make additional progress. Tracing the race is
difficult: nbd_shutdown() sends NBD_CMD_DISC to the server, and the
NBD protocol does not require the server to send a response but does
not forbid the server from using gnutls_bye to at least gracefully end
the TLS session. So where a plaintext server just closes the socket
(and the resulting EOF exits our poll()), it appears that with TLS, we
can sometimes hit the situation where the server is waiting for a
response to gnutls_bye() (note that qemu does not use this function,
but nbdkit does), while at the same time our client is waiting to see
EOF on the socket, resulting in deadlock. In the past (see commit
430f8141), we added shutdown(SHUT_WR) for plaintext clients to prod
the server into closing the socket faster, but libnbd does not yet
have that counterpart action.
But I can at least reuse the same mechanism we have for waking up the
poll() loop when first sending a command to the server. That is,
since we already have a pipe-to-self in addition to reading from the
server, it's trivial to argue that closing the pipe-to-self will
guarantee that the reader thread sees something interesting to break
out of its poll() loop, regardless of whether it also sees something
interesting from the server after having sent NBD_CMD_DISC, and
regardless of whether I need to add in more gnutls_bye() calls to
either nbdkit or libnbd.
Fixes: ab7760fc
Signed-off-by: Eric Blake <eblake(a)redhat.com>
---
May be incomplete: I might also need to break out of the reader loop
when read() returns 0.
plugins/nbd/nbd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c
index e5b8f338..23a7da06 100644
--- a/plugins/nbd/nbd.c
+++ b/plugins/nbd/nbd.c
@@ -537,10 +537,10 @@ nbdplug_close_handle (struct handle *h)
{
if (nbd_shutdown (h->nbd, 0) == -1)
nbdkit_debug ("failed to clean up handle: %s", nbd_get_error ());
+ close (h->fds[1]);
if ((errno = pthread_join (h->reader, NULL)))
nbdkit_debug ("failed to join reader thread: %m");
close (h->fds[0]);
- close (h->fds[1]);
nbd_close (h->nbd);
free (h);
}
Probably deserves a comment in the code to mention to future Eric (and
Richard) not to move that close call. But yes this is also fine, so ACK.
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top