On Thu, Mar 09, 2023 at 09:50:00AM +0000, Richard W.M. Jones wrote:
When a highly multi-threaded program such as nbdcopy encounters an
error, there is a race condition in the library which can cause an
assertion failure and thus a core dump:
(1) An error occurs on one of the threads. nbdcopy calls exit(3).
(2) In lib/errors.c, the destructor calls pthread_key_delete.
(3) Another thread which is still running also encounters an error,
and inside libnbd the library calls set_error().
(4) The call to set_error() calls pthread_getspecific which returns
NULL (since the key has already been destroyed in step (2)), and this
causes us to call pthread_setspecific which returns EINVAL because
glibc is able to detect invalid use of a pthread_key_t after it has
been destroyed. In any case, the error message is lost, and any
subsequent call to nbd_get_error() will return NULL.
(5) We enter the %DEAD state, where there is an assertion that
nbd_get_error() is not NULL. This assertion is supposed to be for
checking that the code called set_error() before entering the %DEAD
state. Although we did call set_error(), the error message was lost
at step (4) and nbd_get_error() will always return NULL. This
assertion failure causes a crash.
There aren't any good ways to fix this, so I chose the same method as
used by libvirt in this commit:
https://gitlab.com/libvirt/libvirt/-/commit/8e44e5593eb9b89fbc0b54fde15f1...
(a) Use '-z nodelete' to prevent the library from being unloaded on
dlclose().
(b) Do not call pthread_key_destroy (thus leaking the key).
(c) When threads exit they are still able to call free_errors_key
because it is still present in memory.
Thanks: Daniel P. Berrangé, Eric Blake
---
configure.ac | 9 +++++++++
lib/Makefile.am | 1 +
lib/errors.c | 6 +++++-
3 files changed, 15 insertions(+), 1 deletion(-)
Reviewed-by: Daniel P. Berrangé <berrange(a)redhat.com>
With regards,
Daniel
--
|: