On Thu, Mar 09, 2023 at 11:11:50AM +0000, Daniel P. Berrangé wrote:
On Thu, Mar 09, 2023 at 09:50:00AM +0000, Richard W.M. Jones wrote:
> When a highly multi-threaded program such as nbdcopy encounters an
> error, there is a race condition in the library which can cause an
> assertion failure and thus a core dump:
>
> (1) An error occurs on one of the threads. nbdcopy calls exit(3).
>
> (2) In lib/errors.c, the destructor calls pthread_key_delete.
>
> (3) Another thread which is still running also encounters an error,
> and inside libnbd the library calls set_error().
>
> (4) The call to set_error() calls pthread_getspecific which returns
> NULL (since the key has already been destroyed in step (2)), and this
> causes us to call pthread_setspecific which returns EINVAL because
> glibc is able to detect invalid use of a pthread_key_t after it has
> been destroyed. In any case, the error message is lost, and any
> subsequent call to nbd_get_error() will return NULL.
>
> (5) We enter the %DEAD state, where there is an assertion that
> nbd_get_error() is not NULL. This assertion is supposed to be for
> checking that the code called set_error() before entering the %DEAD
> state. Although we did call set_error(), the error message was lost
> at step (4) and nbd_get_error() will always return NULL. This
> assertion failure causes a crash.
>
> There aren't any good ways to fix this, so I chose the same method as
> used by libvirt in this commit:
>
>
https://gitlab.com/libvirt/libvirt/-/commit/8e44e5593eb9b89fbc0b54fde15f1...
>
> (a) Use '-z nodelete' to prevent the library from being unloaded on
> dlclose().
>
> (b) Do not call pthread_key_destroy (thus leaking the key).
>
> (c) When threads exit they are still able to call free_errors_key
> because it is still present in memory.
>
> Thanks: Daniel P. Berrangé, Eric Blake
> ---
> configure.ac | 9 +++++++++
> lib/Makefile.am | 1 +
> lib/errors.c | 6 +++++-
> 3 files changed, 15 insertions(+), 1 deletion(-)
Reviewed-by: Daniel P. Berrangé <berrange(a)redhat.com>
Thanks.
I have pushed this as commit 368e3d0d5a because I'm under a bit of
time pressure to make a release of libnbd this morning. We can make
further revisions afterwards if there are issues found, although it
works fine for me, it fixes the original crash, and all the tests
(including valgrind) pass.
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit