We have been seeing sporadic hangs on test-nbd-tls-psk.sh and
test-nbd-tls.sh (on my machine, running those two in a loop with
commits 0a76cae4 and 09e34ba2 reverted would fail within 100
attempts), where even though the client to the 'nbdkit nbd' process
has cleanly exited, things are stalled in .close where nbd is trying
to pthread_join() the reader thread, while the reader thread is itself
blocked on a poll() that will never make additional progress. In
hindsight, the problem is obvious: we used the wrong libnbd API
(synchronous instead of async), which often results in:
nbd .close nbd reader server
nbd_shutdown poll
- send NBD_CMD_DISC
receive NBD_CMD_DISC
shutdown(SHUT_WR)
wake up on EOF
nbd_aio_notify_read
- move state to CLOSED
nbd_aio_is_dead -> true
break loop
- poll
pthread_join
close fds
nbd_close
see EOF from client
but sometimes things happen differently:
nbd .close nbd reader server
nbd_shutdown poll
- send NBD_CMD_DISC
- poll
receive NBD_CMD_DISC
shutdown(SHUT_WR)
- wake up on EOF
- move state to CLOSED
pthread_join
stuck
As can be seen, having two threads compete on poll() on the same fd is
unwise. Worse, if the server uses a means that will cause us to see
EOF even though the socket is still open (such as shutdown(SHUT_WR) on
plaintext, or gnutls_bye() with TLS), but waits for us to close the
socket first, we end up with two processes deadlocked on each other.
(nbdkit as server has used gnutls_bye since commit bd9a52e0; qemu
however does not use it.)
Other commits such as 14ba4154, c70616f8, and 430f8141 show that
getting the shutdown sequence race-free has not been trivial.
Fixes: ab7760fc
Signed-off-by: Eric Blake <eblake(a)redhat.com>
---
plugins/nbd/nbd.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c
index ba1e7188..0ea2a4a2 100644
--- a/plugins/nbd/nbd.c
+++ b/plugins/nbd/nbd.c
@@ -532,7 +532,7 @@ nbdplug_open (int readonly)
static void
nbdplug_close_handle (struct handle *h)
{
- if (nbd_shutdown (h->nbd, 0) == -1)
+ if (nbd_aio_disconnect (h->nbd, 0) == -1)
nbdkit_debug ("failed to clean up handle: %s", nbd_get_error ());
if ((errno = pthread_join (h->reader, NULL)))
nbdkit_debug ("failed to join reader thread: %m");
--
2.26.0.rc2