Re: [Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption

Wednesday, 2 February 2022

On 02/02/22 09:23, Laszlo Ersek wrote:
...
 On 02/01/22 20:44, Eric Blake wrote:
> These are the trivial patches modeled after Nir's commit 7ba6ef67,
> fixing further places that fail to check *error during callbacks.

 I find these interfaces very difficult to use.

 https://libguestfs.org/nbd_aio_block_status.3.html

 We have two callbacks here, each takes a *error, and I have zero idea of
 the order the callbacks are supposed to be called in.

 The last paragraph of the commit message speaks about this, and the code
 updates seem to suggest some ordering, but I'm still unsure.

 ... In fact, I'm now uncertain whether ignoring "err" in virt-v2v's
 "lib/utils.ml", function "get_disk_allocated", is safe. It's a
"sync"
 (not aio) nbd_block_status() call, but the documentation
 <https://libguestfs.org/nbd_block_status.3.html> says:

 "On entry to the callback, the error parameter contains the errno value
 of any previously detected error."

 So what should we do then, in "get_disk_allocated"? Return -1
 immediately? 
...
From patch#4, it seems that we need exactly this. Here's the
function: 
let get_disk_allocated ~dir ~disknr =
  let socket = sprintf "%s/out%d" dir disknr
  and alloc_ctx = "base:allocation" in
  with_nbd_connect_unix ~socket ~meta_contexts:[alloc_ctx]
    (fun nbd ->
         if NBD.can_meta_context nbd alloc_ctx then (
           (* Get the list of extents, using a 2GiB chunk size as hint. *)
           let size = NBD.get_size nbd
           and allocated = ref 0_L
           and fetch_offset = ref 0_L in
           while !fetch_offset < size do
             let remaining = size -^ !fetch_offset in
             let fetch_size = min 0x8000_0000_L remaining in
             NBD.block_status nbd fetch_size !fetch_offset
               (fun ctx offset entries err ->
                  assert (ctx = alloc_ctx);
                  for i = 0 to Array.length entries / 2 - 1 do
                    let len = entries.(i * 2)
                    and typ = entries.(i * 2 + 1) in
                    assert (len > 0_L);
                    if typ &^ 1_L = 0_L then
                      allocated := !allocated +^ len;
                    fetch_offset := !fetch_offset +^ len
                  done;
                  0
               )
           done;
           Some !allocated
         ) else None
       )

If our (nameless) function is entered with a nonzero "err", then we
cannot trust the "entries" array. Which means not only "allocated"
may
be miscalculated, but even "fetch_offset" -- and the latter can as well
break the chunking (= outer) loop. This seems to imply we *must not*
continue the outer loop.

So how do we handle this in OCaml? If we return -1 from the (nameless)
extents callback, will NBD.block_status raise an exception?

Hmmmm: maybe. In the generated nbd_internal_ocaml_nbd_block_status()
function, in file "ocaml/nbd-c.c", we have:

  caml_enter_blocking_section ();
  r =  nbd_block_status (h, count, offset, extent_callback, flags);
  caml_leave_blocking_section ();

  if (r == -1)
    nbd_internal_ocaml_raise_error ();

However, in extent_wrapper() -> extent_wrapper_locked(), there is more
confusion for me:

  metacontextv = caml_copy_string (metacontext);
  offsetv = caml_copy_int64 (offset);
  entriesv = nbd_internal_ocaml_alloc_int64_from_uint32_array (entries, nr_entries);
  errorv = caml_alloc_tuple (1);
  Store_field (errorv, 0, Val_int (*error));
  args[0] = metacontextv;
  args[1] = offsetv;
  args[2] = entriesv;
  args[3] = errorv;
  rv = caml_callbackN_exn (data->fnv, 4, args);
  *error = Int_val (Field (errorv, 0));
  if (Is_exception_result (rv)) {
    nbd_internal_ocaml_exception_in_wrapper ("extent", rv);
    CAMLreturnT (int, -1);
  }

  r = Int_val (rv);
  assert (r >= 0);
  CAMLreturnT (int, r);

If I understand correctly, if we returned (-1) from the (nameless)
extents callback, that would trip the assert(); would it not? Thus, we
would not reach the nbd_internal_ocaml_raise_error() in the above-quoted
nbd_internal_ocaml_nbd_block_status() function.

Thanks,
Laszlo

...
 And if we do so, what is that going to mean for
 "NBD.block_status"? Will it raise an exception?

 Thanks,
 Laszlo

> nbdcopy still has a bigger bug in that it is NOT properly checking for
> *error in the pread/pwrite completion handlers, but fixing that
> gracefully (rather than just quitting the process immediately, as we
> can do in the examples), requires more invasive patches which I will
> be posting separately for review, while the ones posted here are
> trivial enough that I'm pushing them now (commits 23bcea4a..1373d423).
>
> I also audited other uses of completion callbacks:
> - in tests, closure-lifetimes, errors, server-death, and
>   shutdown-flags are all safe (either the callback returns 0 and the
>   code later calls nbd_aio_command_completed, or we are explicitly
>   testing aspects of completion callbacks)
> - in fuse/operations.c, the completion callback returns 0 and calls
>   nbd_aio_command_completed later
> - in examples/strict-structured-reads.c, the completion callback
>   properly checks *error
>
> I also checked that nbdkit's nbd plugin properly checks *error.
>
> Eric Blake (4):
>   examples: aio-connect-read.c: Fix error handling
>   examples: glib-main-loop: don't strand commands when gsource
>     disappears
>   examples: glib-main-loop: Fix error handling
>   copy: Ignore extents if we encountered earlier error
>
>  examples/aio-connect-read.c |  7 +++++++
>  examples/glib-main-loop.c   | 14 ++++++++++++--
>  copy/nbd-ops.c              |  4 ++--
>  3 files changed, 21 insertions(+), 4 deletions(-)
>

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [Libguestfs] [libnbd PATCH 0/4] Work towards fixing nbdcopy silent corruption