On Jun 13 2022, "Richard W.M. Jones" <rjones(a)redhat.com> wrote:
> On Mon, Jun 13, 2022 at 11:58:11AM +0100, Nikolaus Rath wrote:
> > On Jun 13 2022, "Richard W.M. Jones" <rjones(a)redhat.com> wrote:
> > > On Mon, Jun 13, 2022 at 10:33:58AM +0100, Nikolaus Rath wrote:
> > >> Hello,
> > >>
> > >> I am trying to improve performance of the scenario where the kernel's
> > >> NBD client talks to nbdkit's S3 plugin.
> > >>
> > >> For me, the main bottleneck is currently the fact that the kernel
> > >> aligns requests to only 512 B, no matter the block size reported by
> > >> nbdkit.
> > >>
> > >> Using a 512 B object size is not feasible (due to latency and request
> > >> overhead). However, with a larger object size there are two conflicting
> > >> objectives:
> > >>
> > >> 1. To maximize parallelism (which is important to reduce the effects of
> > >> connection latency), it's best to limit the size of the kernel's NBD
> > >> requests to the object size.
> > >>
> > >> 2. To minimize unaligned writes, it's best to allow arbitrarily large
> > >> NBD requests, because the larger the request, the larger the number of
> > >> full blocks that are written. Unfortunately this means that all objects
> > >> touched by the request are written sequentially.
> > >>
> > >> I see a number of ways to address that:
> > >>
> > >> 1. Change the kernel's NBD code to honor the block size reported by the
> > >> NBD server. This would be ideal, but I don't feel up to making this
> > >> happen. Theoretical solution only.
> > >
> > > This would be the ideal solution. I wonder how technically
> > > complicated it would actually be?
> > >
> > > AIUI you'd have to modify nbd-client to query the block limits from
> > > the server, which is the hardest part of this, but it's all userspace
> > > code. Then you'd pass those down to the kernel via the ioctl (see
> > > drivers/block/nbd.c:__nbd_ioctl). Then inside the kernel you'd call
> > > blk_queue_io_min & blk_queue_io_opt with the values (I'm not sure how
> > > you set the max request size, or if that's possible). See
> > > block/blk-settings.c for details of these functions.
> >
> > If it's only about getting the block size from the NBD server, then I
> > certainly feel up to the task.
> >
> > However, nbd-client already has:
> >
> >   -block-size block size
> >   -b             Use a blocksize of "block size". Default is 1024;
> >                  allowed values are either 512, 1024, 2048 or 4096.
> >
> > So my worry is that more complicated in-kernel changes will be needed to
> > make other values work. In particular, nbd_is_valid_blksize() (in nbd.c)
> > checks that the block size is less than or equal to PAGE_SIZE.
> >
> > (I'm interested in 32 kB and 512 kB block sizes)
>
> This setting controls ioctl(nbd, NBD_SET_BLKSIZE, ...), which inside the
> kernel calls:
>
>   blk_queue_logical_block_size(nbd->disk->queue, blksize);
>   blk_queue_physical_block_size(nbd->disk->queue, blksize);
>
> These functions are documented in block/blk-settings.c, but basically
> control the size of LBAs. For most devices that would be 512. (ISTR
> we changed the default in NBD a while back too, since 1024 caused
> problems for creating and reading some filesystems.)
>
> You can't really increase this setting to 2M or whatever S3 needs,
> because firstly it has to be smaller than the page size, as you pointed
> out above, but mainly it'll radically change how filesystems get
> created, since they use the block size as a basic unit to size other
> disk structures. In fact I wouldn't be surprised if most filesystems
> just don't function at all if the block size is massive.
>
> Nevertheless, tracing the code which sets this is instructive to see
> how you would adjust the same kernel code to set the minimum and
> preferred I/O settings via blk_queue_io_min / blk_queue_io_opt. These
> settings are separate from the block size (although they must be
> multiples of the block size).

Ah, this is helpful. Thank you for clarifying!
I'll probably start with some experiments where I just hardcode a larger
value in the kernel and see what happens.
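
Concretely, I was thinking of something along these lines, next to the
existing NBD_SET_BLKSIZE handling in drivers/block/nbd.c (untested
sketch; the 32 kB value is hardcoded only for the experiment, and I'm
not sure yet whether blk_queue_max_hw_sectors is the right knob for
capping the request size):

```c
/* Untested sketch; 32768 (32 kB) hardcoded just for the experiment. */

/* Existing calls: the LBA size stays small so filesystems keep working. */
blk_queue_logical_block_size(nbd->disk->queue, blksize);
blk_queue_physical_block_size(nbd->disk->queue, blksize);

/* New: advertise an object-sized minimum/preferred I/O size (see
 * block/blk-settings.c); both must be multiples of the LBA size. */
blk_queue_io_min(nbd->disk->queue, 32768);
blk_queue_io_opt(nbd->disk->queue, 32768);

/* Maybe also cap the request size at one object, to keep requests
 * parallelizable (the argument is in 512-byte sectors). */
blk_queue_max_hw_sectors(nbd->disk->queue, 32768 / 512);
```

A real patch would of course take these values from userspace via the
ioctl rather than hardcoding them.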
Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
»Time flies like an arrow, fruit flies like a Banana.«