On 02/02/22 11:43, Richard W.M. Jones wrote:
On Wed, Feb 02, 2022 at 09:23:09AM +0100, Laszlo Ersek wrote:
> On 02/01/22 20:44, Eric Blake wrote:
>> These are the trivial patches modeled after Nir's commit 7ba6ef67,
>> fixing further places that fail to check *error during callbacks.
>
> I find these interfaces very difficult to use.
>
>
https://libguestfs.org/nbd_aio_block_status.3.html
>
> We have two callbacks here, each takes a *error, and I have zero idea of
> the order the callbacks are supposed to be called in.
The documentation is not great, but I can tell you that they are
called in the order:
- extent_cb (usually once, but in theory several times)
- completion_cb (once at the end)
The *error parameter is also not well documented, but the behaviour is:
- There is a single error value stored for the asynch block status
command (cmd->error).
- *error is initialized from cmd->error on entry to the callback,
ie. *error != 0 if an error has been reported by a previous
callback.
- If the callback sets *error != 0 _and_ returns -1 then cmd->error
is updated.
- The callback can also fail by returning -1 and not setting *error,
in which case cmd->error is set to EPROTO (but don't do this as
it means your program is buggy - but see below about OCaml).
- Since there's only one error value, updating it overwrites
previous values.
- Returning -1 doesn't stop future callbacks, because basically we're
pulling data off the wire and there's no way to tell the server to
stop. So extent and completion callbacks are still called.
- Don't forget that its not just callbacks that can fail, commands
can fail for other reasons, eg. network errors. In this case the
whole connection will drop into what we call the %DEAD state
(because NBD can't resynchronize). Every call returns -1 and you
have to close and reopen the connection. It's possible to
explicitly test for this (nbd_aio_is_dead).
https://gitlab.com/nbdkit/libnbd/-/blob/1373d423f5ac42255fe606d2a7ab1caf1...
https://gitlab.com/nbdkit/libnbd/-/blob/1373d423f5ac42255fe606d2a7ab1caf1...
To find out if a command had an error you can do either (but not both):
- Check for *error != 0 in completion_cb. Return 1 from
completion_cb which auto-retires the command.
- Call nbd_aio_command_completed which will return -1 and
then call nbd_get_errno which will return cmd->error. This
also retires the command. (Note in this case you also have
to deal with nbd_aio_command_completed returning 0, meaning
the command has not completed, or 1, meaning the command
completed without error).
The first method is what nbdcopy should be doing but isn't.
> The last paragraph of the commit message speaks about this, and the code
> updates seem to suggest some ordering, but I'm still unsure.
>
> ... In fact, I'm now uncertain whether ignoring "err" in
virt-v2v's
> "lib/utils.ml", function "get_disk_allocated", is safe. It's
a "sync"
> (not aio) nbd_block_status() call, but the documentation
> <
https://libguestfs.org/nbd_block_status.3.html> says:
>
> "On entry to the callback, the error parameter contains the errno value
> of any previously detected error."
The above applies to synch commands as well, with the difference that:
- There's no completion callback (because there's no need for it in a
synch call).
- Once the command completes, it is retired and the error status is
copied to nbd_get_errno.
- The synch command returns -1 if there was an error or 0 if successful.
See lib/rw.c: wait_for_command and nbd_unlocked_block_status.
So in answer to your question, virt-v2v can ignore !err, unless there
is a reason why you want to know if previous callbacks encountered an
error (maybe you want to save some work). You'll still get an
exception thrown by NBD.block_status.
> So what should we do then, in "get_disk_allocated"? Return -1
> immediately? And if we do so, what is that going to mean for
> "NBD.block_status"? Will it raise an exception?
All OCaml wrappers check for the underlying nbd_* call failing and
turn that into an exception which carries the errno and error string.
See ocaml/nbd-c.c and ocaml/helpers.c:nbd_internal_ocaml_raise_error
In addition, when calling back into OCaml code (ie. into the extent
callback), if the callback throws an exception then that is
automatically turned into setting *error from the err reference and
returning -1.
OCaml code ought to do this when encountering an error inside the
callback:
if some_error then (
err := errno;
raise AnException
)
But there are a number of difficulties with that in practice:
- OCaml functions that you call might raise exceptions which you
don't catch, so you don't end up setting !err on all exceptions.
- OCaml code doesn't have access to C errno values!
I wouldn't worry too much about this. What happens in practice is
that nbd_internal_ocaml_exception_in_wrapper prints this on stderr:
libnbd: <operation>: uncaught OCaml exception: <exception name>
cmd->error is set to EPROTO and the function as a whole throws some
NBD.Error exception. The error doesn't get lost in this case, it's
just that the output is not ideal.
See ocaml/nbd-c.c:extent_wrapper_locked and
ocaml/helpers.c:nbd_internal_ocaml_exception_in_wrapper
OK -- so end-to-end (IIUC), all I could learn, with reading !err in the
OCaml callback, is whether an earlier invocation of the callback raised
an exception (possibly indirectly, i.e. by returning -1 to the wrapper,
or directly by raising an exception, or even by allowing a deeper
exception to escape). Eric said the extents array and the metacontext ID
are still valid in this case, so eliminating the loop body would only be
a speed-up, and the outermost call will throw an exception anyway.
... What happens "if the server replies with NBD_REPLY_TYPE_ERROR first
and then NBD_REPLY_TYPE_BLOCK_STATUS second (a custom server)"?
Hmm I don't think it matters much: we already assume that the entries
make sense (e.g. we already assert() that the server does not send
zero-length entries), so we'd just complete the inner loop (ignoring
!err being nonzero), and then the NBD.block_status function would raise
an exception. OK!
Thanks!
Laszlo
Rich.
> Thanks,
> Laszlo
>
>> nbdcopy still has a bigger bug in that it is NOT properly checking for
>> *error in the pread/pwrite completion handlers, but fixing that
>> gracefully (rather than just quitting the process immediately, as we
>> can do in the examples), requires more invasive patches which I will
>> be posting separately for review, while the ones posted here are
>> trivial enough that I'm pushing them now (commits 23bcea4a..1373d423).
>>
>> I also audited other uses of completion callbacks:
>> - in tests, closure-lifetimes, errors, server-death, and
>> shutdown-flags are all safe (either the callback returns 0 and the
>> code later calls nbd_aio_command_completed, or we are explicitly
>> testing aspects of completion callbacks)
>> - in fuse/operations.c, the completion callback returns 0 and calls
>> nbd_aio_command_completed later
>> - in examples/strict-structured-reads.c, the completion callback
>> properly checks *error
>>
>> I also checked that nbdkit's nbd plugin properly checks *error.
>>
>> Eric Blake (4):
>> examples: aio-connect-read.c: Fix error handling
>> examples: glib-main-loop: don't strand commands when gsource
>> disappears
>> examples: glib-main-loop: Fix error handling
>> copy: Ignore extents if we encountered earlier error
>>
>> examples/aio-connect-read.c | 7 +++++++
>> examples/glib-main-loop.c | 14 ++++++++++++--
>> copy/nbd-ops.c | 4 ++--
>> 3 files changed, 21 insertions(+), 4 deletions(-)
>>