On Thu, Mar 09, 2023 at 09:13:24AM +0000, Daniel P. Berrangé wrote:
On Thu, Mar 09, 2023 at 08:44:51AM +0000, Richard W.M. Jones wrote:
> When a highly multi-threaded program such as nbdcopy encounters an
> error, there is a race condition in the library which can cause an
> assertion failure and thus a core dump:
>
> (1) An error occurs on one of the threads. nbdcopy calls exit(3).
>
> (2) In lib/errors.c, the destructor calls pthread_key_delete.
>
> (3) Another thread which is still running also encounters an error,
> and inside libnbd the library calls set_error().
>
> (4) The call to set_error() calls pthread_getspecific which returns
> NULL (since the key has already been destroyed in step (2)), and this
> causes us to call pthread_setspecific which returns EINVAL because
> glibc is able to detect invalid use of a pthread_key_t after it has
> been destroyed. In any case, the error message is lost, and any
> subsequent call to nbd_get_error() will return NULL.
>
> (5) We enter the %DEAD state, where there is an assertion that
> nbd_get_error() is not NULL. This assertion is supposed to be for
> checking that the code called set_error() before entering the %DEAD
> state. Although we did call set_error(), the error message was lost
> at step (4) and nbd_get_error() will always return NULL. This
> assertion failure causes a crash.
>
> There aren't any good ways to fix this. I chose to leak the
> pthread_key_t on the exit path.
> ---
> lib/errors.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/lib/errors.c b/lib/errors.c
> index 8b77650ef3..6fbfaacd34 100644
> --- a/lib/errors.c
> +++ b/lib/errors.c
> @@ -69,7 +69,11 @@ errors_key_destroy (void)
> free (last_error->error);
> free (last_error);
> }
> - pthread_key_delete (errors_key);
> +
> + /* We could do this, but that causes a race condition described here:
> + *
https://listman.redhat.com/archives/libguestfs/2023-March/031002.html
> + */
> + //pthread_key_delete (errors_key);
> }
While I think this fixes the problem with pthread_setspecific crashing,
I believe this then opens apps up to the problem hit by libvirt with
destructors being run which don't exist in memory.
Consider a thread is running that has done something with libnbd which
caused allocate_last_error_on_demand() to run, which populates the
'errors_key' data for that thread. This thread is no longer doing anything
with libnbd, but the 'errors_key' data remains populated none the less
because that's just how thread locals work.
Now nothing at all is using libnbd.so, so some thread calls dlclose(),
causing libnbd.so to be unmapped from memory. Since we've not called
pthread_key_delete, the thread local still exists and, crucially, so
does its registered destructor function.
The thread mentioned in the previous paragraph now terminates and this
causes the destructor function 'free_errors_key' to be run.... except
this function isn't mapped in memory any more. This will cause the
program to crash.
Indeed it does exactly this. v4 posted.
Rich.
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.