When a highly multi-threaded program such as nbdcopy encounters an
error, there is a race condition in the library which can cause an
assertion failure and thus a core dump:
(1) An error occurs on one of the threads. nbdcopy calls exit(3).
(2) In lib/errors.c, the destructor calls pthread_key_delete.
(3) Another thread which is still running also encounters an error,
and inside libnbd the library calls set_error().
(4) The call to set_error() calls pthread_getspecific which returns
NULL (since the key has already been destroyed in step (2)), and this
causes us to call pthread_setspecific which returns EINVAL because
glibc is able to detect invalid use of a pthread_key_t after it has
been destroyed. In any case, the error message is lost, and any
subsequent call to nbd_get_error() will return NULL.
(5) We enter the %DEAD state, where there is an assertion that
nbd_get_error() is not NULL. This assertion is supposed to be for
checking that the code called set_error() before entering the %DEAD
state. Although we did call set_error(), the error message was lost
at step (4) and nbd_get_error() will always return NULL. This
assertion failure causes a crash.
There aren't any good ways to fix this. I chose to remove the
assertions that might trigger, and ignore pthread_getspecific
returning EINVAL.
This reverts a small part of commit 831cbf7ee14c ("generator: Allow
DEAD state actions to run").
---
generator/state_machine_generator.ml | 5 +----
generator/states.c | 2 --
lib/errors.c | 5 ++++-
3 files changed, 5 insertions(+), 7 deletions(-)
diff --git a/generator/state_machine_generator.ml b/generator/state_machine_generator.ml
index 43c1ce4c14..9f94bed1f3 100644
--- a/generator/state_machine_generator.ml
+++ b/generator/state_machine_generator.ml
@@ -449,10 +449,7 @@ let
pr " abort (); /* Should never happen, but keeps GCC happy. */\n";
pr " }\n";
pr "\n";
- pr " if (r == -1) {\n";
- pr " assert (nbd_get_error () != NULL);\n";
- pr " return -1;\n";
- pr " }\n";
+ pr " if (r == -1) return -1;\n";
pr " } while (!blocked);\n";
pr " return 0;\n";
pr "}\n";
diff --git a/generator/states.c b/generator/states.c
index fa0f8d716e..c0cf5a7f26 100644
--- a/generator/states.c
+++ b/generator/states.c
@@ -191,8 +191,6 @@ STATE_MACHINE {
return 0;
DEAD:
- /* The caller should have used set_error() before reaching here */
- assert (nbd_get_error ());
abort_option (h);
nbd_internal_abort_commands (h, &h->cmds_to_issue);
nbd_internal_abort_commands (h, &h->cmds_in_flight);
diff --git a/lib/errors.c b/lib/errors.c
index 8b77650ef3..9142a0843e 100644
--- a/lib/errors.c
+++ b/lib/errors.c
@@ -93,7 +93,10 @@ allocate_last_error_on_demand (void)
last_error = calloc (1, sizeof *last_error);
if (last_error) {
int err = pthread_setspecific (errors_key, last_error);
- if (err != 0) {
+ /* Ignore EINVAL because that can happen in a race with other
+ * threads when we are exiting.
+ */
+ if (err != 0 && err != EINVAL) {
/* This is not supposed to happen (XXX). */
fprintf (stderr, "%s: %s: %s\n", "libnbd",
"pthread_setspecific",
strerror (err));
--
2.39.2