On 3/19/20 7:13 AM, Richard W.M. Jones wrote:
[Dropping devel, adding libguestfs]
This can be reproduced on x86-64, so I can reproduce it locally. It
only appears to happen when the tests are run under rpmbuild, not when
I run them with ‘make check’, though it's not clear to me why.
As Eric described earlier, the test runs two copies of nbdkit and a
client, connected like this:
qemu-img info ===> nbdkit nbd ===> nbdkit example1
     [3]              [2]              [1]
These are started in order [1], [2] then [3]. When the client
(process [3]) completes it exits and then the test harness kills
processes [1] and [2] in that order.
I just hit a breakthrough in understanding the deadlock.
The stack trace of [2] at the hang is:

Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
#0  0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
#1  0x00007fabc090abba in poll (__timeout=-1, __nfds=2,
    __fds=0x7fabbf4f6bb0) at /usr/include/bits/poll2.h:46
#2  nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
#3  0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#4  0x00007fabc05cc063 in clone () from /lib64/libc.so.6

This thread is calling poll() at the same time as:

Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
#0  0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
#1  0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538

This one just finished a poll(), because I used the blocking
nbd_shutdown instead of the non-blocking nbd_aio_disconnect. Depending
on which of the two threads wakes up first to service the server's
reaction, the other one can be stranded.
Closing the pipe-to-self is a band-aid that ensures the reader thread
eventually wakes up, but using the right API to begin with is even better.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org