[replying here, as I seem to have been dropped from cc on the subthread
at
https://lists.fedoraproject.org/archives/list/devel@lists.fedoraproject.o...
- maybe I should subscribe to devel@ instead of seeing this second-hand...
hmm - I can't even post to devel@ without subscribing, so now just
sending this to libguestfs]
[adding libguestfs - now that devel@ has helped point to a bug in nbdkit
itself]
On 3/18/20 4:49 AM, Richard W.M. Jones wrote:
On Wed, Mar 18, 2020 at 09:38:52AM +0000, Peter Robinson wrote:
>> This might be a bug in the package itself, but has anyone seen builds
>> hanging in weird places, in Rawhide, especially on armv7 and s390x?
>>
>> This package build has hung 3 times in the same place, once on armv7
>> and twice on s390x:
>>
>> https://koji.fedoraproject.org/koji/taskinfo?taskID=42570766
>>
>> It's hard to explain how it could hang at that place in the build
>> unless something fundamental is broken like make.
>
> Well make 4.3 did land recently (March 12th) in rawhide so that's
> entirely possible.
Yes, Eric Blake pointed this out to me too. However I don't really
want to blame make unless others have seen similar hangs. It could
easily be a new bug in the package itself.
If anyone has access to that builder, it might be interesting to get a
process listing, or strace of whatever process is hanging.
Dan Horak added:
it's a deadlock in the tests, not in make. Reproduced with
"fedpkg local" in a cycle.
sharkcz 1649225 0.0 0.0 222288 3904 pts/5 S+ 06:24 0:00 /bin/sh -e /var/tmp/rpm-tmp.RXcMRr
sharkcz 1649230 0.0 0.0 10372 3248 pts/5 S+ 06:24 0:00 make -j4 check
sharkcz 1658088 0.0 0.0 251236 3400 pts/5 Sl+ 06:25 0:00 /home/sharkcz/nbdkit/nbdkit-1.19.3/server/nbdkit -v -P test-nbd-tls-psk.pid1 -U /tmp/tmp.7e7Gv5MPmZ --tls=require --tls-psk=keys.psk -- /home/sharkcz/nbdkit/nbdkit-1.19.3/plugins/example1/.libs/nbdkit-example1-plugin.so
sharkcz 1658091 0.0 0.1 192944 4464 pts/5 Sl+ 06:25 0:00 /home/sharkcz/nbdkit/nbdkit-1.19.3/server/nbdkit -v -P test-nbd-tls-psk.pid2 -U /tmp/tmp.yp61yXx09y --tls=off -- /home/sharkcz/nbdkit/nbdkit-1.19.3/plugins/nbd/.libs/nbdkit-nbd-plugin.so tls=require tls-psk=keys.psk tls-username=qemu socket=/tmp/tmp.7e7Gv5MPmZ
the 2 nbdkit processes are stuck in the futex() syscall
Reconstructing the state from those command lines, we have a TLS test
that runs 3 processes:
    client <=> nbdkit nbd <=> nbdkit example1
It looks like this particular test was checking a plain-text client
connecting to nbdkit nbd, which in turn was connecting as a TLS client
to nbdkit example1. I also know that 'nbdkit nbd' uses libnbd to
support TLS, and that we have not fully implemented clean TLS teardown
in libnbd. So it could be that the nbd side has told the example1 side
that it will be shutting down soon but, due to unclean TLS library
usage, is missing a poll() wakeup to realize that no further response
is coming from the example1 side, while the example1 side is doing
blocking I/O waiting for the nbd side to close the socket. The
overall test that spawned both nbdkit processes in the background
(tests/test-nbd-tls-psk.sh) has completed, though, stranding those two
hung child processes without their original parent but letting 'make
check' report testsuite success.
As to why make is hanging, that is beyond me. Maybe something new in
make 4.3 is detecting that we have stranded indirect processes, and is
waiting for them to complete?
Ideally, we need to fix libnbd TLS support to do cleaner shutdown.
Pragmatically, nbdkit's tests/functions.sh start_nbdkit() function right
now tries only a single:
    cleanup_fn kill "$(cat "$pidfile")"
without waiting to see if it actually worked. We could probably turn
that into a more robust kill_nbdkit() function that first tries the
graceful SIGTERM, waits a few seconds to confirm whether the process
actually died, and follows up with a harder SIGKILL as needed
(preferably failing a test whenever SIGTERM was insufficient). It may
not solve the bug in libnbd TLS shutdown, but would at least prevent
stuck processes.
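As a minimal sketch of what such a kill_nbdkit() could look like (this
is not existing nbdkit code; the function name comes from the suggestion
above, while the default 5-second timeout, the one-second polling
interval, and the return convention are assumptions made purely for
illustration):

```shell
# Hypothetical helper, NOT the actual tests/functions.sh implementation.
# Usage: kill_nbdkit PID [TIMEOUT_SECONDS]
# Returns 0 if the process was already gone or exited after SIGTERM,
# and 1 if it had to be killed with SIGKILL.
kill_nbdkit ()
{
    _kn_pid=$1
    _kn_timeout=${2:-5}     # assumed default: wait up to 5 seconds

    # Nothing to do if the process is already gone.
    kill -0 "$_kn_pid" 2>/dev/null || return 0

    # Ask politely first.
    kill -TERM "$_kn_pid" 2>/dev/null

    # Poll once per second until the process exits or the timeout expires.
    _kn_i=0
    while [ "$_kn_i" -lt "$_kn_timeout" ]; do
        kill -0 "$_kn_pid" 2>/dev/null || return 0
        sleep 1
        _kn_i=$((_kn_i + 1))
    done

    # One last check after the final sleep.
    kill -0 "$_kn_pid" 2>/dev/null || return 0

    # Still alive: force it, and report failure so the caller can
    # fail the test.
    echo "kill_nbdkit: pid $_kn_pid ignored SIGTERM, sending SIGKILL" >&2
    kill -KILL "$_kn_pid" 2>/dev/null
    return 1
}
```

It could then be registered in place of the bare kill, along the lines
of 'cleanup_fn kill_nbdkit "$(cat "$pidfile")"'. Whether a non-zero
status from a cleanup handler actually fails the test depends on how
cleanup_fn runs its handlers, which is not shown here.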
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org