[PATCH libnbd 0/2] copy: Use preferred block size when writing
by Richard W.M. Jones
This patch series attempts to fix the well-known problem in nbdcopy
that it ignores the preferred block size. It only attempts to fix it
for writing (which is the case we care about for virt-v2v).
Unfortunately it's subtly incorrect, and in a way that is difficult
for me to see how to fix right now. Hence posting it so I can return
to it later.
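(For reference, the "preferred block size" meant here is the one the
destination server advertises during negotiation. A minimal libnbd
sketch for reading it is below; it is illustrative only and not taken
from the patches.)

#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <libnbd.h>

int
main (int argc, char *argv[])
{
  struct nbd_handle *nbd;
  int64_t pref;

  if (argc != 2) {
    fprintf (stderr, "usage: %s URI\n", argv[0]);
    exit (EXIT_FAILURE);
  }

  nbd = nbd_create ();
  if (nbd == NULL || nbd_connect_uri (nbd, argv[1]) == -1) {
    fprintf (stderr, "%s\n", nbd_get_error ());
    exit (EXIT_FAILURE);
  }

  /* Returns 0 if the server did not advertise a preferred block size. */
  pref = nbd_get_block_size (nbd, LIBNBD_SIZE_PREFERRED);
  printf ("preferred block size: %" PRIi64 "\n", pref);

  nbd_shutdown (nbd, 0);
  nbd_close (nbd);
  exit (EXIT_SUCCESS);
}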
[None of the problem description below will make any sense until you
look at the patches.]
The problem with the second patch is that the assembled blocks are
written synchronously on handle[0]. However the handle can be in use
at the same time by multi-thread-copying.c, resulting in two poll(2)
loops running at the same time trying to consume events.
It's very hard to reproduce this problem -- all the tests run fine --
but the following command sometimes demonstrates it:
$ nbdkit -U - --filter=checkwrite --filter=offset data '1 @0x100000000 1' offset=1 --run './run nbdcopy "$uri" "$uri"'
It will either run normally, deadlock, or give a weird error from the
state machine which is characteristic of the two poll loops competing
for events:
> nbd+unix://?socket=/tmp/nbdkitVOF5BR/socket: nbd_shutdown: nothing to poll for in state REPLY.START: Invalid argument
One way to fix this would be to open an extra handle to the
destination NBD server for sending these completed blocks. However
that breaks various assumptions, and wouldn't work for !multi-conn
servers.
Another way would be some kind of lock around handle[0], but that
seems hard to do given that we're doing asynch operations.
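For what it's worth, the naive form of that lock would look something
like the hypothetical sketch below (not from the patches). The awkward
part is that the copying threads would also have to hold the same lock
around every nbd_aio_*() / nbd_poll() call they make on handles[0],
which defeats much of the point of doing the copy asynchronously.

#include <pthread.h>
#include <stddef.h>
#include <stdint.h>
#include <libnbd.h>

/* Hypothetical: serialize every use of the shared destination handle.
 * On its own this is not enough -- the multi-threaded copier would
 * also need to take the lock for each poll/aio call on handles[0]. */
static pthread_mutex_t handle0_lock = PTHREAD_MUTEX_INITIALIZER;

static int
write_assembled_block (struct nbd_handle *h, const void *buf,
                       size_t count, uint64_t offset)
{
  int r;

  pthread_mutex_lock (&handle0_lock);
  r = nbd_pwrite (h, buf, count, offset, 0);  /* synchronous write */
  pthread_mutex_unlock (&handle0_lock);
  return r;
}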
Rich.
[PATCH v3] spec: Clarify BLOCK_STATUS reply details
by Eric Blake
Our docs were inconsistent on whether a NBD_REPLY_TYPE_BLOCK_STATUS
reply chunk can exceed the client's requested length, and silent on
whether the lengths must be consistent when multiple contexts were
negotiated. Clarify this to match existing practice as implemented in
qemu-nbd. Clean up some nearby grammatical errors while at it.
---
Another round of rewording attempts, based on feedback from Rich on
v2. I went ahead and pushed patch 1 and 2 of the v2 series, as they
were less controversial.
doc/proto.md | 42 ++++++++++++++++++++++++++++--------------
1 file changed, 28 insertions(+), 14 deletions(-)
diff --git a/doc/proto.md b/doc/proto.md
index 8a817d2..bacccfa 100644
--- a/doc/proto.md
+++ b/doc/proto.md
@@ -882,15 +882,25 @@ The procedure works as follows:
server supports.
- During transmission, a client can then indicate interest in metadata
for a given region by way of the `NBD_CMD_BLOCK_STATUS` command,
- where *offset* and *length* indicate the area of interest. The
- server MUST then respond with the requested information, for all
- contexts which were selected during negotiation. For every metadata
- context, the server sends one set of extent chunks, where the sizes
- of the extents MUST be less than or equal to the length as specified
- in the request. Each extent comes with a *flags* field, the
- semantics of which are defined by the metadata context.
-- A server MUST reply to `NBD_CMD_BLOCK_STATUS` with a structured
- reply of type `NBD_REPLY_TYPE_BLOCK_STATUS`.
+ where *offset* and *length* indicate the area of interest. On
+ success, the server MUST respond with one structured reply chunk of
+ type `NBD_REPLY_TYPE_BLOCK_STATUS` per metadata context selected
+ during negotiation, where each reply chunk is a list of one or more
+ consecutive extents for that context. Each extent comes with a
+ *flags* field, the semantics of which are defined by the metadata
+ context.
+
+The client's requested *length* is only a hint to the server, so the
+cumulative extent length contained in a chunk of the server's reply
+may be shorter or longer than the original request. When more than one
+metadata context was negotiated, the reply chunks for the different
+contexts of a single block status request need not have the same
+number of extents or cumulative extent length.
+
+In the request, the client may use the `NBD_CMD_FLAG_REQ_ONE` command
+flag to further constrain the server's reply so that each chunk
+contains exactly one extent whose length does not exceed the client's
+original *length*.
A client MUST NOT use `NBD_CMD_BLOCK_STATUS` unless it selected a
nonzero number of metadata contexts during negotiation, and used the
@@ -1778,8 +1788,8 @@ MUST initiate a hard disconnect.
*length* MUST be 4 + (a positive integer multiple of 8). This reply
represents a series of consecutive block descriptors where the sum
of the length fields within the descriptors is subject to further
- constraints documented below. This chunk type MUST appear
- exactly once per metadata ID in a structured reply.
+ constraints documented below. A successful block status request MUST
+ have exactly one status chunk per negotiated metadata context ID.
The payload starts with:
@@ -1801,15 +1811,19 @@ MUST initiate a hard disconnect.
*length* of the final extent MAY result in a sum larger than the
original requested length, if the server has that information anyway
as a side effect of reporting the status of the requested region.
+ When multiple metadata contexts are negotiated, the reply chunks for
+ the different contexts need not have the same number of extents or
+ cumulative extent length.
Even if the client did not use the `NBD_CMD_FLAG_REQ_ONE` flag in
its request, the server MAY return fewer descriptors in the reply
than would be required to fully specify the whole range of requested
information to the client, if looking up the information would be
too resource-intensive for the server, so long as at least one
- extent is returned. Servers should however be aware that most
- clients implementations will then simply ask for the next extent
- instead.
+ extent is returned. Servers should however be aware that most
+ client implementations will likely follow up with a request for
+ extent information at the first offset not covered by a
+ reduced-length reply.
All error chunk types have bit 15 set, and begin with the same
*error*, *message length*, and optional *message* fields as
--
2.35.1
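For readers without proto.md to hand, the payload of the
NBD_REPLY_TYPE_BLOCK_STATUS chunk being clarified has roughly the
following on-the-wire shape. This is a sketch derived from the existing
spec text, not part of the patch; all fields are big-endian on the wire
and the C names here are made up.

#include <stdint.h>

/* Payload of one NBD_REPLY_TYPE_BLOCK_STATUS chunk; one such chunk
 * is sent per negotiated metadata context. */
struct block_status_payload {
  uint32_t metadata_context_id;   /* ID agreed during negotiation */
  struct block_status_descriptor {
    uint32_t length;              /* length of this extent in bytes */
    uint32_t status_flags;        /* semantics defined by the context */
  } descriptors[];                /* one or more consecutive extents */
};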
Re: [Libguestfs] nbdkit error: "write reply: NBD_CMD_WRITE: Broken pipe"
by Wouter Verhelst
Same story.
On Fri, Jun 17, 2022 at 01:09:25PM +0200, Wouter Verhelst wrote:
> Hi, and sorry for the delay (I was overseas for a month in May to visit family
> etc)
>
> On Tue, May 03, 2022 at 09:07:17AM +0100, Richard W.M. Jones wrote:
> > On Mon, May 02, 2022 at 03:36:33PM +0100, Nikolaus Rath wrote:
> > > So I tried to reproduce this, and noticed something odd. It seems I can
> > > disconnect the nbd device (nbd-client -d) while there are still requests
> > > in flight:
> > >
> > > May 02 15:20:50 vostro.rath.org kernel: nbd1: detected capacity change from 0 to 52428800
> > > May 02 15:20:50 vostro.rath.org kernel: block nbd1: NBD_DISCONNECT
> > > May 02 15:20:50 vostro.rath.org kernel: block nbd1: Disconnected due to user request.
> > > May 02 15:20:50 vostro.rath.org kernel: block nbd1: shutting down sockets
> > > May 02 15:20:50 vostro.rath.org kernel: I/O error, dev nbd1, sector 776 op 0x0:(READ) flags 0x80700 phys_seg 29 prio class 0
> > > May 02 15:20:50 vostro.rath.org kernel: I/O error, dev nbd1, sector 776 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
> > > May 02 15:20:50 vostro.rath.org kernel: Buffer I/O error on dev nbd1, logical block 97, async page read
> > > May 02 15:20:50 vostro.rath.org kernel: block nbd1: Attempted send on invalid socket
> > > May 02 15:20:50 vostro.rath.org kernel: I/O error, dev nbd1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> > > May 02 15:20:50 vostro.rath.org kernel: block nbd1: Attempted send on invalid socket
> > > May 02 15:20:50 vostro.rath.org kernel: I/O error, dev nbd1, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
> > >
> > > This was generated by running:
> > >
> > > $ nbd-client localhost /dev/nbd1 && mkfs.ext4 /dev/nbd1 && nbd-client -d /dev/nbd1
> > >
> > > Is that expected behavior?
> >
> > It's a bit unexpected to me. Adding Wouter to the thread - he might
> > have an idea here, especially if there's a way to have "nbd-client -d"
> > wait for pending requests to finish before disconnecting.
> >
> > I don't use the kernel client very much myself. We mostly use either
> > libnbd or the qemu client.
>
> The kernel is supposed to deal with this. If there are still processes
> using the device, nbd-client -d should receive an error (I believe it
> was EPERM). If there are outstanding writes, the disconnect should flush
> those, wait for them to return, and *then* handle the disconnect.
>
> At least those are the assumptions made in nbd-client ;-)
>
> (I notice you did send this to the kernel maintainer too, but just
> wanted to follow up on what nbd-client assumes)
>
> --
> w(a)uter.{be,co.za}
> wouter(a){grep.be,fosdem.org,debian.org}
--
w(a)uter.{be,co.za}
wouter(a){grep.be,fosdem.org,debian.org}
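For context, the disconnect that nbd-client -d requests boils down to
an ioctl on the device node, roughly as in the simplified sketch below
(the real tool does more, e.g. option parsing and error handling); the
question in this thread is what the kernel does about requests still in
flight when these ioctls arrive.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <linux/nbd.h>

int
main (void)
{
  /* Ask the kernel to disconnect /dev/nbd1 and drop its socket. */
  int fd = open ("/dev/nbd1", O_RDWR);
  if (fd == -1) { perror ("open"); exit (EXIT_FAILURE); }
  if (ioctl (fd, NBD_DISCONNECT) == -1) perror ("NBD_DISCONNECT");
  if (ioctl (fd, NBD_CLEAR_SOCK) == -1) perror ("NBD_CLEAR_SOCK");
  close (fd);
  exit (EXIT_SUCCESS);
}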
Re: [Libguestfs] Kernel driver I/O block size hinting
by Wouter Verhelst
Sorry for the late reply.
I just noticed that my mail config was borked; I was happily sending out
emails, but none of them reached anyone :-/
Fixed now.
On Fri, Jun 17, 2022 at 12:59:04PM +0200, Wouter Verhelst wrote:
> Hi,
>
> On Tue, Jun 14, 2022 at 03:38:19PM +0100, Richard W.M. Jones wrote:
> > This is a follow-up to this thread:
> >
> > https://listman.redhat.com/archives/libguestfs/2022-June/thread.html#29210
> >
> > about getting the kernel client (nbd.ko) to obey block size
> > constraints sent by the NBD server:
> >
> > https://github.com/NetworkBlockDevice/nbd/blob/master/doc/proto.md#block-...
> >
> > I was sent this very interesting design document about the original
> > intent behind the kernel's I/O limits:
> >
> > https://people.redhat.com/msnitzer/docs/io-limits.txt
> >
> > There are four or five kernel block layer settings we could usefully
> > adjust, and there are three NBD block size constraints, and in my
> > opinion there's not a very clear mapping between them. But I'll have
> > a go at what I think we should do.
> >
> > - - -
> >
> > (1) Kernel physical_block_size & logical_block_size: The example given
> > is of a hard disk with 4K physical sectors (AF) which can nevertheless
> > emulate 512-byte sectors. In this case you'd set physical_block_size
> > = 4K, logical_block_size = 512b.
> >
> > Data structures (partition tables, etc) should be aligned to
> > physical_block_size to avoid unnecessary RMW cycles. But the
> > fundamental until of I/O is logical_block_size.
> >
> > Current behaviour of nbd.ko is that logical_block_size ==
> > physical_block_size == the nbd-client "-b" option (default: 512 bytes,
> > contradicting the documentation).
>
> Whoops, indeed. Fixed in git.
>
> > I think we should set logical_block_size == physical_block_size ==
> > MAX (512, NBD minimum block size constraint).
> >
> > What should happen to the nbd-client -b option?
>
> I believe it remains useful to have an override for exceptional
> situations. I think I'll leave it (but we can provide an appropriate
> warning about this possibly being a bad idea in the man page)
>
> It might be useful to extend the syntax to specify more than one block
> size, given that there are going to be multiple ones now.
>
> > (2) Kernel minimum_io_size: The documentation says this is the
> > "preferred minimum unit for random I/O".
> >
> > Current behaviour of nbd.ko is this is not set.
> >
> > I think the NBD's preferred block size should map to minimum_io_size.
> >
> >
> > (3) Kernel optimal_io_size: The documentation says this is the
> > "[preferred] streaming I/O [size]".
> >
> > Current behaviour of nbd.ko is this is not set.
> >
> > NBD doesn't really have the concept of streaming vs random I/O, so we
> > could either ignore this or set it to the same value as
> > minimum_io_size.
> >
> > I have a kernel patch allowing nbd-client to set both minimum_io_size
> > and optimal_io_size from userspace.
> >
> >
> > (4) Kernel blk_queue_max_hw_sectors: This is documented as: "set max
> > sectors for a request ... Enables a low level driver to set a hard
> > upper limit, max_hw_sectors, on the size of requests."
> >
> > Current behaviour of nbd.ko is that we set this to 65536 (sectors?
> > blocks?), which for 512b sectors is 32M.
> >
> > I think we could set this to MIN (32M, NBD maximum block size constraint),
> > converting the result to sectors.
> >
> > - - -
> >
> > What do people think?
>
> Yes, this all looks reasonable to me. Thanks.
>
> --
> w(a)uter.{be,co.za}
> wouter(a){grep.be,fosdem.org,debian.org}
--
w(a)uter.{be,co.za}
wouter(a){grep.be,fosdem.org,debian.org}
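To make the mapping proposed in points (1)-(4) concrete, here is a tiny
standalone sketch of the arithmetic. The names are made up and this is
plain userspace C, not driver code.

#include <inttypes.h>
#include <stdio.h>

#define MAX(a,b) ((a) > (b) ? (a) : (b))
#define MIN(a,b) ((a) < (b) ? (a) : (b))

/* Map NBD block size constraints (minimum, preferred, maximum) to the
 * kernel block queue limits as proposed in the thread above. */
static void
map_block_sizes (uint32_t nbd_min, uint32_t nbd_pref, uint32_t nbd_max)
{
  uint32_t logical  = MAX (512, nbd_min);   /* logical_block_size */
  uint32_t physical = logical;              /* physical_block_size */
  uint32_t io_min   = nbd_pref;             /* minimum_io_size */
  uint32_t io_opt   = nbd_pref;             /* optimal_io_size */
  /* max_hw_sectors is expressed in 512-byte sectors. */
  uint32_t max_hw_sectors = MIN (32U * 1024 * 1024, nbd_max) / 512;

  printf ("logical=%" PRIu32 " physical=%" PRIu32 " io_min=%" PRIu32
          " io_opt=%" PRIu32 " max_hw_sectors=%" PRIu32 "\n",
          logical, physical, io_min, io_opt, max_hw_sectors);
}

int
main (void)
{
  /* Example: server advertises min=512, preferred=64K, max=32M. */
  map_block_sizes (512, 65536, 32U * 1024 * 1024);
  return 0;
}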