Re: [Libguestfs] [PATCH libnbd] lib/errors.c: Fix assert fail in exit path in multi-threaded code

Wednesday, 8 March 2023

On Wed, Mar 08, 2023 at 03:22:29PM -0600, Eric Blake wrote:
...
 On Wed, Mar 08, 2023 at 08:29:41PM +0000, Richard W.M. Jones wrote:
 > When a highly multi-threaded program such as nbdcopy encounters an
 > error, there is a race condition in the library which can cause an
 > assertion failure and thus a core dump:
 > 
 > (1) An error occurs on one of the threads.  nbdcopy calls exit(3).
 > 
 > (2) In lib/errors.c, the destructor calls pthread_key_delete.
 >

 Note - POSIX says that calling pthread_key_delete() can leak memory.
 If the reason we are calling pthread_key_delete() is because libnbd
 was dlopen'd and is now being dlclose'd, and an app repeatedly reloads
 libnbd, causes an error, and then unloads libnbd, this leak could grow
 to be rather sizeable.  But that is a rather unlikely scenario, and
 safely coordinating cleanup is hard when we consider both normal
 thread exits (where the key destructor runs if the key is not yet
 deleted) and module unloads (where pthread_key_delete skips key
 destructors).

 > (3) Another thread which is still running also encounters an error,
 > and inside libnbd the library calls set_error().
 > 
 > (4) The call to set_error() calls pthread_getspecific which returns
 > NULL (since the key has already been destroyed in step (2)), and this

 This is undefined behavior per POSIX.  It is not safe to call
 pthread_getspecific() on a destroyed key (even though glibc lets us
 get away with it by giving us EINVAL).

 > causes us to call pthread_setspecific which returns EINVAL because
 > glibc is able to detect invalid use of a pthread_key_t after it has
 > been destroyed.  In any case, the error message is lost, and any
 > subsequent call to nbd_get_error() will return NULL.

 Indeed - once we call pthread_key_delete() in step 2, ANY further use
 of the key is going to cause undefined behavior (hopefully EINVAL
 failures, but worse behavior is also possible).

 One possible idea: have a non-thread-local global that starts life
 false, but which errors_free() atomically sets to true prior to
 calling pthread_key_delete().  All other functions which reference
 errors_key check that the global is still false before using
 pthread_{get,set}specific.

 Another idea: have a counter of the number of threads that have set a
 thread-local error but not yet exited.  Increment the counter any time
 allocate_last_error_on_demand() says a new thread is triggering an
 error, and decrement the counter any time free_errors_key() calls the
 destructor at the end of a thread.  Only when all threads with an
 allocated error have exited will the counter reach zero, so only the
 last thread to clean up needs to call pthread_key_delete().  The
 counter might not reach zero before the process exits, but that's okay
 (if the reason the module is being unloaded is because the process is
 going away, cleanup is less important), although I don't know if
 valgrind would complain.  In fact, POSIX suggests:

 https://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_key_de...
 "For an application to know that it is safe to delete a key, it has to
 know that all the threads that might potentially ever use the key do
 not attempt to use it again. For example, it could know this if all
 the client threads have called a cleanup procedure declaring that they
 are through with the module that is being shut down, perhaps by
 setting a reference count to zero." 
IMHO the suggested ways to solve this problem are all complex enough
that I would not have much confidence in them even once implemented.

The use of pthread thread-data destructors is inherantly dangerous
when combined with dlclose, to the extent that the best course of
action is to simply forbid this scenario.

Libvirt hit this problem with pthread data destructors and dlclose()
many years ago and we merely blocked the functional problem by linking
libvirt with '-z nodelete'...

  commit 8e44e5593eb9b89fbc0b54fde15f130707a0d81e
  Author: Daniel P. Berrangé <berrange(a)redhat.com&gt;
  Date:   Thu Sep 1 17:57:06 2011 +0100

    Prevent crash from dlclose() of libvirt.so

    When libvirt calls virInitialize it creates a thread local
    for the virErrorPtr storage, and registers a callback to
    cleanup memory when a thread exits. When libvirt is dlclose()d
    or otherwise made non-resident, the callback function is
    removed from memory, but the thread local may still exist
    and if a thread later exists, it will invoke the callback
    and SEGV. There may also be other thread locals with callbacks
    pointing to libvirt code, so it is in general never safe to
    unload libvirt.so from memory once initialized.

    To allow dlclose() to succeed, but keep libvirt.so resident
    in memory, link with '-z nodelete'. This issue was first
    found with the libvirt CIM provider, but can potentially
    hit many of the dynamic language bindings which all ultimately
    involve dlopen() in some way, either on libvirt.so itself,
    or on the glue code for the binding which in turns links
    to libvirt

Yes, this prevents you from unloading the library from memory so you
potentially have permanent memory usage, but unless the library is
really huge, I have a hard time believing that is a problem that
matters in practice. I doubt anyone has even noticed that libvirt
doesn't get removed from memory with dlcose()...

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [Libguestfs] [PATCH libnbd] lib/errors.c: Fix assert fail in exit path in multi-threaded code