Re: [Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

Wednesday, 2 February 2022

On 02/02/22 11:43, Richard W.M. Jones wrote:
...
 On Wed, Feb 02, 2022 at 09:23:09AM +0100, Laszlo Ersek wrote:
> On 02/01/22 20:44, Eric Blake wrote:
>> These are the trivial patches modeled after Nir's commit 7ba6ef67,
>> fixing further places that fail to check *error during callbacks.
>
> I find these interfaces very difficult to use.
>
> https://libguestfs.org/nbd_aio_block_status.3.html
>
> We have two callbacks here, each takes a *error, and I have zero idea of
> the order the callbacks are supposed to be called in.

 The documentation is not great, but I can tell you that they are
 called in the order:

  - extent_cb (usually once, but in theory several times)
  - completion_cb (once at the end)

 The *error parameter is also not well documented, but the behaviour is:

  - There is a single error value stored for the asynch block status
    command (cmd->error).

  - *error is initialized from cmd->error on entry to the callback,
    ie. *error != 0 if an error has been reported by a previous
    callback.

  - If the callback sets *error != 0 _and_ returns -1 then cmd->error
    is updated.

  - The callback can also fail by returning -1 and not setting *error,
    in which case cmd->error is set to EPROTO (but don't do this as
    it means your program is buggy - but see below about OCaml).

  - Since there's only one error value, updating it overwrites
    previous values.

  - Returning -1 doesn't stop future callbacks, because basically we're
    pulling data off the wire and there's no way to tell the server to
    stop.  So extent and completion callbacks are still called.

  - Don't forget that its not just callbacks that can fail, commands
    can fail for other reasons, eg. network errors.  In this case the
    whole connection will drop into what we call the %DEAD state
    (because NBD can't resynchronize).  Every call returns -1 and you
    have to close and reopen the connection.  It's possible to
    explicitly test for this (nbd_aio_is_dead).

https://gitlab.com/nbdkit/libnbd/-/blob/1373d423f5ac42255fe606d2a7ab1caf1...

 To find out if a command had an error you can do either (but not both):

  - Check for *error != 0 in completion_cb.  Return 1 from
    completion_cb which auto-retires the command.

  - Call nbd_aio_command_completed which will return -1 and
    then call nbd_get_errno which will return cmd->error.  This
    also retires the command.  (Note in this case you also have
    to deal with nbd_aio_command_completed returning 0, meaning
    the command has not completed, or 1, meaning the command
    completed without error).

 The first method is what nbdcopy should be doing but isn't.

> The last paragraph of the commit message speaks about this, and the code
> updates seem to suggest some ordering, but I'm still unsure.
>
> ... In fact, I'm now uncertain whether ignoring "err" in
virt-v2v's
> "lib/utils.ml", function "get_disk_allocated", is safe. It's
a "sync"
> (not aio) nbd_block_status() call, but the documentation
> <https://libguestfs.org/nbd_block_status.3.html> says:
>
> "On entry to the callback, the error parameter contains the errno value
> of any previously detected error."

 The above applies to synch commands as well, with the difference that:

  - There's no completion callback (because there's no need for it in a
    synch call).

  - Once the command completes, it is retired and the error status is
    copied to nbd_get_errno.

  - The synch command returns -1 if there was an error or 0 if successful.

 See lib/rw.c: wait_for_command and nbd_unlocked_block_status.

 So in answer to your question, virt-v2v can ignore !err, unless there
 is a reason why you want to know if previous callbacks encountered an
 error (maybe you want to save some work).  You'll still get an
 exception thrown by NBD.block_status.

> So what should we do then, in "get_disk_allocated"? Return -1
> immediately? And if we do so, what is that going to mean for
> "NBD.block_status"? Will it raise an exception?

 All OCaml wrappers check for the underlying nbd_* call failing and
 turn that into an exception which carries the errno and error string.
 See ocaml/nbd-c.c and ocaml/helpers.c:nbd_internal_ocaml_raise_error

 In addition, when calling back into OCaml code (ie. into the extent
 callback), if the callback throws an exception then that is
 automatically turned into setting *error from the err reference and
 returning -1.

 OCaml code ought to do this when encountering an error inside the
 callback:

   if some_error then (
     err := errno;
     raise AnException
   )

 But there are a number of difficulties with that in practice:

  - OCaml functions that you call might raise exceptions which you
    don't catch, so you don't end up setting !err on all exceptions.

  - OCaml code doesn't have access to C errno values!

 I wouldn't worry too much about this.  What happens in practice is
 that nbd_internal_ocaml_exception_in_wrapper prints this on stderr:

   libnbd: <operation>: uncaught OCaml exception: <exception name>

 cmd->error is set to EPROTO and the function as a whole throws some
 NBD.Error exception.  The error doesn't get lost in this case, it's
 just that the output is not ideal.

 See ocaml/nbd-c.c:extent_wrapper_locked and
 ocaml/helpers.c:nbd_internal_ocaml_exception_in_wrapper 
OK -- so end-to-end (IIUC), all I could learn, with reading !err in the
OCaml callback, is whether an earlier invocation of the callback raised
an exception (possibly indirectly, i.e. by returning -1 to the wrapper,
or directly by raising an exception, or even by allowing a deeper
exception to escape). Eric said the extents array and the metacontext ID
are still valid in this case, so eliminating the loop body would only be
a speed-up, and the outermost call will throw an exception anyway.

... What happens "if the server replies with NBD_REPLY_TYPE_ERROR first
and then NBD_REPLY_TYPE_BLOCK_STATUS second (a custom server)"?

Hmm I don't think it matters much: we already assume that the entries
make sense (e.g. we already assert() that the server does not send
zero-length entries), so we'd just complete the inner loop (ignoring
!err being nonzero), and then the NBD.block_status function would raise
an exception. OK!

Thanks!
Laszlo

...

 Rich.

> Thanks,
> Laszlo
>
>> nbdcopy still has a bigger bug in that it is NOT properly checking for
>> *error in the pread/pwrite completion handlers, but fixing that
>> gracefully (rather than just quitting the process immediately, as we
>> can do in the examples), requires more invasive patches which I will
>> be posting separately for review, while the ones posted here are
>> trivial enough that I'm pushing them now (commits 23bcea4a..1373d423).
>>
>> I also audited other uses of completion callbacks:
>> - in tests, closure-lifetimes, errors, server-death, and
>>   shutdown-flags are all safe (either the callback returns 0 and the
>>   code later calls nbd_aio_command_completed, or we are explicitly
>>   testing aspects of completion callbacks)
>> - in fuse/operations.c, the completion callback returns 0 and calls
>>   nbd_aio_command_completed later
>> - in examples/strict-structured-reads.c, the completion callback
>>   properly checks *error
>>
>> I also checked that nbdkit's nbd plugin properly checks *error.
>>
>> Eric Blake (4):
>>   examples: aio-connect-read.c: Fix error handling
>>   examples: glib-main-loop: don't strand commands when gsource
>>     disappears
>>   examples: glib-main-loop: Fix error handling
>>   copy: Ignore extents if we encountered earlier error
>>
>>  examples/aio-connect-read.c |  7 +++++++
>>  examples/glib-main-loop.c   | 14 ++++++++++++--
>>  copy/nbd-ops.c              |  4 ++--
>>  3 files changed, 21 insertions(+), 4 deletions(-)
>>

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption