On Wed, Mar 08, 2023 at 03:22:29PM -0600, Eric Blake wrote:
On Wed, Mar 08, 2023 at 08:29:41PM +0000, Richard W.M. Jones wrote:
> When a highly multi-threaded program such as nbdcopy encounters an
> error, there is a race condition in the library which can cause an
> assertion failure and thus a core dump:
>
> (1) An error occurs on one of the threads. nbdcopy calls exit(3).
>
> (2) In lib/errors.c, the destructor calls pthread_key_delete.
>
Note - POSIX says that calling pthread_key_delete() can leak memory.
If the reason we are calling pthread_key_delete() is because libnbd
was dlopen'd and is now being dlclose'd, and an app repeatedly reloads
libnbd, causes an error, and then unloads libnbd, this leak could grow
to be rather sizeable. But that is a rather unlikely scenario, and
safely coordinating cleanup is hard when we consider both normal
thread exits (where the key destructor runs if the key is not yet
deleted) and module unloads (where pthread_key_delete skips key
destructors).
> (3) Another thread which is still running also encounters an error,
> and inside libnbd the library calls set_error().
>
> (4) The call to set_error() calls pthread_getspecific which returns
> NULL (since the key has already been destroyed in step (2)), and this
This is undefined behavior per POSIX. It is not safe to call
pthread_getspecific() on a destroyed key (even though glibc lets us
get away with it by giving us EINVAL).
> causes us to call pthread_setspecific which returns EINVAL because
> glibc is able to detect invalid use of a pthread_key_t after it has
> been destroyed. In any case, the error message is lost, and any
> subsequent call to nbd_get_error() will return NULL.
Indeed - once we call pthread_key_delete() in step 2, ANY further use
of the key is going to cause undefined behavior (hopefully EINVAL
failures, but worse behavior is also possible).
One possible idea: have a non-thread-local global that starts life
false, but which errors_free() atomically sets to true prior to
calling pthread_key_delete(). All other functions which reference
errors_key check that the global is still false before using
pthread_{get,set}specific.
Another idea: have a counter of the number of threads that have set a
thread-local error but not yet exited. Increment the counter any time
allocate_last_error_on_demand() says a new thread is triggering an
error, and decrement the counter any time free_errors_key() calls the
destructor at the end of a thread. Only when all threads with an
allocated error have exited will the counter reach zero, so only the
last thread to clean up needs to call pthread_key_delete(). The
counter might not reach zero before the process exits, but that's okay
(if the reason the module is being unloaded is because the process is
going away, cleanup is less important), although I don't know if
valgrind would complain. In fact, POSIX suggests:
https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_key_de...
"For an application to know that it is safe to delete a key, it has to
know that all the threads that might potentially ever use the key do
not attempt to use it again. For example, it could know this if all
the client threads have called a cleanup procedure declaring that they
are through with the module that is being shut down, perhaps by
setting a reference count to zero."
IMHO the suggested ways to solve this problem are all complex enough
that I would not have much confidence in them even once implemented.
The use of pthread thread-data destructors is inherantly dangerous
when combined with dlclose, to the extent that the best course of
action is to simply forbid this scenario.
Libvirt hit this problem with pthread data destructors and dlclose()
many years ago and we merely blocked the functional problem by linking
libvirt with '-z nodelete'...
commit 8e44e5593eb9b89fbc0b54fde15f130707a0d81e
Author: Daniel P. Berrangé <berrange(a)redhat.com>
Date: Thu Sep 1 17:57:06 2011 +0100
Prevent crash from dlclose() of libvirt.so
When libvirt calls virInitialize it creates a thread local
for the virErrorPtr storage, and registers a callback to
cleanup memory when a thread exits. When libvirt is dlclose()d
or otherwise made non-resident, the callback function is
removed from memory, but the thread local may still exist
and if a thread later exists, it will invoke the callback
and SEGV. There may also be other thread locals with callbacks
pointing to libvirt code, so it is in general never safe to
unload libvirt.so from memory once initialized.
To allow dlclose() to succeed, but keep libvirt.so resident
in memory, link with '-z nodelete'. This issue was first
found with the libvirt CIM provider, but can potentially
hit many of the dynamic language bindings which all ultimately
involve dlopen() in some way, either on libvirt.so itself,
or on the glue code for the binding which in turns links
to libvirt
Yes, this prevents you from unloading the library from memory so you
potentially have permanent memory usage, but unless the library is
really huge, I have a hard time believing that is a problem that
matters in practice. I doubt anyone has even noticed that libvirt
doesn't get removed from memory with dlcose()...
With regards,
Daniel
--
|:
https://berrange.com -o-
https://www.flickr.com/photos/dberrange :|
|:
https://libvirt.org -o-
https://fstop138.berrange.com :|
|:
https://entangle-photo.org -o-
https://www.instagram.com/dberrange :|