[Dropping devel, adding libguestfs]
This is reproducible on x86-64, so I can reproduce it locally. It
only appears to happen when the tests are run under rpmbuild, not when
I run them with ‘make check’, but I'm not sure why that is.
As Eric described earlier, the test runs two copies of nbdkit and a
client, connected like this:
qemu-img info ===> nbdkit nbd ===> nbdkit example1
     [3]              [2]                [1]
These are started in the order [1], [2], then [3]. When the client
(process [3]) completes, it exits, and the test harness then kills
processes [1] and [2], in that order.
The stack trace of [2] at the hang is:
Thread 3 (Thread 0x7fabbf4f7700 (LWP 3955842)):
#0 0x00007fabc05c0f0f in poll () from /lib64/libc.so.6
#1 0x00007fabc090abba in poll (__timeout=-1, __nfds=2, __fds=0x7fabbf4f6bb0) at
/usr/include/bits/poll2.h:46
#2 nbdplug_reader (handle=0x5584020e09b0) at nbd.c:323
#3 0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#4 0x00007fabc05cc063 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7fabbfcf8700 (LWP 3955793)):
#0 0x00007fabc069eab7 in __pthread_clockjoin_ex () from /lib64/libpthread.so.0
#1 0x00007fabc090af2b in nbdplug_close_handle (h=0x5584020e09b0) at nbd.c:538
#2 0x00005583f90caee0 in backend_close (b=<optimized out>) at backend.c:247
#3 0x00005583f90cdbf1 in free_connection (conn=0x5584020df890) at connections.c:359
#4 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>)
at connections.c:230
#5 0x00005583f90d63e8 in start_thread (datav=0x5584020bf1b0) at sockets.c:356
#6 0x00007fabc069d472 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fabc05cc063 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7fabc002ca40 (LWP 3955770)):
#0 0x00007fabc06a3b02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x00005583f90d6dbf in accept_incoming_connections (socks=0x5584020bf010, nr_socks=1)
at sockets.c:501
#2 0x00005583f90ca441 in start_serving () at main.c:936
#3 main (argc=<optimized out>, argv=<optimized out>) at main.c:702
The stack trace of [1] at the hang is:
Thread 2 (Thread 0x7f30ac822700 (LWP 3955798)):
#0 0x00007f30b32a384c in recv () from /lib64/libpthread.so.0
#1 0x00007f30b3342c8f in _gnutls_stream_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570
<system_read>, size=5, bufel=<synthetic pointer>, session=0x55c5cf561620) at
buffers.c:346
#2 _gnutls_read (ms=0x7f30ac821a2c, pull_func=0x7f30b336a570 <system_read>, size=5,
bufel=<synthetic pointer>, session=0x55c5cf561620) at buffers.c:426
#3 _gnutls_io_read_buffered (session=session@entry=0x55c5cf561620, total=5,
recv_type=recv_type@entry=4294967295, ms=0x7f30ac821a2c) at buffers.c:582
#4 0x00007f30b3338f3f in recv_headers (ms=<optimized out>, record=0x7f30ac821a80,
htype=4294967295, type=GNUTLS_ALERT, record_params=0x55c5cf565f60, session=0x55c5cf561620)
at record.c:1172
#5 _gnutls_recv_in_buffers (session=session@entry=0x55c5cf561620,
type=type@entry=GNUTLS_ALERT, htype=htype@entry=4294967295, ms=<optimized out>,
ms@entry=0) at record.c:1307
#6 0x00007f30b333b555 in _gnutls_recv_int (session=session@entry=0x55c5cf561620,
type=type@entry=GNUTLS_ALERT, data=data@entry=0x0, data_size=data_size@entry=0,
seq=seq@entry=0x0, ms=0) at record.c:1773
#7 0x00007f30b333b703 in gnutls_bye (session=session@entry=0x55c5cf561620,
how=how@entry=GNUTLS_SHUT_RDWR) at record.c:312
#8 0x000055c5c57af171 in crypto_close () at crypto.c:407
#9 0x000055c5c57aea58 in free_connection (conn=0x55c5cf560500) at connections.c:339
#10 handle_single_connection (sockin=<optimized out>, sockout=<optimized out>)
at connections.c:230
#11 0x000055c5c57b73e8 in start_thread (datav=0x55c5cf541550) at sockets.c:356
#12 0x00007f30b3299472 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f30b31c8063 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f30b2c28a40 (LWP 3955740)):
#0 0x00007f30b329fb02 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1 0x000055c5c57b7dbf in accept_incoming_connections (socks=0x55c5cf5413b0, nr_socks=1)
at sockets.c:501
#2 0x000055c5c57ab441 in start_serving () at main.c:936
#3 main (argc=<optimized out>, argv=<optimized out>) at main.c:702
It seems that process [2] is hanging in pthread_join (called from
nbdplug_close_handle), waiting for thread 3 (the reader loop) to
finish. The reader loop itself is stuck in poll with no timeout.
Meanwhile process [1] is trying to do a clean TLS disconnect:
gnutls_bye (GNUTLS_SHUT_RDWR) waits to receive the peer's close_notify
alert, so process [1] is waiting for process [2] to send something,
which it is never going to do.
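
To make the shape of the deadlock concrete, here is a minimal sketch
(hypothetical names, not the actual nbd.c code) of the pattern: a
reader thread blocked in poll() with an infinite timeout, and a close
path that joins it without ever waking it up, so the join can never
return.

/* Minimal sketch of the hang (hypothetical names, not the real
 * nbd.c code).  Compile with: cc -pthread hang.c -o hang
 */
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int sock_fd;   /* stands in for the socket to the NBD server [1] */

/* Shaped like nbdplug_reader: block until the server sends something. */
static void *
reader_thread (void *arg)
{
  struct pollfd fds[1] = { { .fd = sock_fd, .events = POLLIN } };

  (void) arg;
  for (;;) {
    if (poll (fds, 1, -1) == -1)    /* -1 = no timeout */
      break;
    /* ... dispatch replies ... */
  }
  return NULL;
}

int
main (void)
{
  int sv[2];
  pthread_t reader;

  socketpair (AF_UNIX, SOCK_STREAM, 0, sv);
  sock_fd = sv[0];
  pthread_create (&reader, NULL, reader_thread, NULL);

  /* Shaped like nbdplug_close_handle: join the reader on close.
   * Nothing shuts down sock_fd or otherwise wakes the reader out of
   * poll(), so this join never returns -- the state thread 2 of
   * process [2] is in.  The peer on the other end of the socket
   * (process [1] in the real test) is left waiting for data that
   * will never arrive.
   */
  pthread_join (reader, NULL);

  printf ("never reached\n");
  return 0;
}

Built with cc -pthread, this hangs with one thread in poll and the
other in pthread_join, matching the first two threads in the trace of
[2] above.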
I don't think this actually has anything to do with libnbd not cleanly
shutting down TLS connections. If libnbd had closed the socket
abruptly, process [1] would have seen EOF and would not be left
waiting. I think this might be a bug in the nbd plugin.
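
If that's right, the usual fix for this shape of bug is to give the
close path a way to wake the reader before joining it. Purely as an
illustration (again hypothetical names -- I'm not claiming this is how
the plugin should actually be patched), the sketch above would change
along these lines, using a self-pipe that the reader also polls:

/* Delta relative to the sketch above: a self-pipe lets the close path
 * wake the reader out of poll() before joining it.  stop_pipe[] would
 * be created with pipe() before the reader thread is started.
 */
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int sock_fd;
static int stop_pipe[2];

static void *
reader_thread (void *arg)
{
  struct pollfd fds[2] = {
    { .fd = sock_fd,      .events = POLLIN },
    { .fd = stop_pipe[0], .events = POLLIN },
  };

  (void) arg;
  for (;;) {
    if (poll (fds, 2, -1) == -1)
      break;
    if (fds[1].revents & POLLIN)    /* close path asked us to stop */
      break;
    /* ... dispatch replies ... */
  }
  return NULL;
}

static void
close_handle (pthread_t reader)
{
  char c = 0;

  if (write (stop_pipe[1], &c, 1) == -1)
    perror ("write");
  pthread_join (reader, NULL);      /* now this can complete ...      */
  close (sock_fd);                  /* ... and the connection can be  */
                                    /* torn down, letting the other   */
                                    /* side's gnutls_bye() return.    */
}

In the real plugin libnbd owns the socket, so the wake-up would
presumably have to go through whatever shutdown mechanism libnbd
provides rather than a raw close, but the ordering problem is the
same.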
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/