On Tue, Jun 14, 2022 at 08:30:15PM +0100, Nikolaus Rath wrote:
> On Jun 14 2022, "Richard W.M. Jones" <rjones(a)redhat.com> wrote:
> > I think we should set logical_block_size == physical_block_size ==
> > MAX (512, NBD minimum block size constraint).
> Why the lower bound of 512?
I suspect the kernel can't handle sector sizes smaller than 512 bytes.
By default the NBD protocol advises advertising a minimum size of 1
byte, and I'm almost certain setting logical_block_size == 1 would
break everything.
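A minimal sketch of the proposed clamping, with the server's advertised minimum held in a variable (`nbd_min` is a hypothetical name, not anything nbd-client actually exposes):

```shell
# Hypothetical: nbd_min holds the NBD server's advertised minimum block size.
nbd_min=1
# Proposed logical/physical block size: MAX (512, NBD minimum constraint).
lbs=$(( nbd_min > 512 ? nbd_min : 512 ))
echo "logical_block_size = $lbs"
```

With the protocol-default minimum of 1, this yields 512; a server advertising, say, 4096 would push the result up to 4096.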
> > What should happen to the nbd-client -b option?
> Perhaps it should become the lower bound (instead of the hardcoded 512)?
> That's assuming there is a reason for having a client-specified lower
> bound.
Right, I don't think there's a reason to continue with the -b option.
I only use it to set -b 512 to work around the annoying default in
older versions (which was 1024).
> > (4) Kernel blk_queue_max_hw_sectors: This is documented as: "set max
> > sectors for a request ... Enables a low level driver to set a hard
> > upper limit, max_hw_sectors, on the size of requests."
> >
> > Current behaviour of nbd.ko is that we set this to 65536 (sectors?
> > blocks?), which for 512b sectors is 32M.
> FWIW, on my 5.16 kernel, the default is 65 kB (according to
> /sys/block/nbdX/queue/max_sectors_kb x 512b).
I have:
$ cat /sys/devices/virtual/block/nbd0/queue/max_hw_sectors_kb
32768
(ie. 32 MB) which I think comes from the nbd module setting:
blk_queue_max_hw_sectors(disk->queue, 65536);
multiplied by 512b sectors.
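As a sanity check on that arithmetic (65536 sectors of 512 bytes each, converted to the KB units the sysfs file uses):

```shell
# 65536 sectors * 512 bytes/sector, expressed in KB and MB.
sectors=65536
kb=$(( sectors * 512 / 1024 ))
mb=$(( kb / 1024 ))
echo "${kb} KB = ${mb} MB"
```

which matches the 32768 shown by max_hw_sectors_kb above.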
> > I think we could set this to MIN (32M, NBD maximum block size
> > constraint), converting the result to sectors.
> I don't think that's right. Rather, it should be NBD's preferred block
> size.
> Setting this to the preferred block size means that NBD requests will be
> this large whenever there are enough sequential dirty pages, and that no
> requests will ever be larger than this. I think this is exactly what the
> NBD server would like to have.
This kernel setting limits the maximum request size on the queue.
In my testing reading and writing files with the default [above] the
kernel never got anywhere near sending multi-megabyte requests. In
fact the largest request it sent was 128K, even when I did stuff like:
# dd if=/dev/zero of=/tmp/mnt/zero bs=100M count=10
128K happens to be 2 x blk_queue_io_opt, but I need to do more testing
to see if that relationship always holds.
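For reference, here is how the MIN (32M, NBD maximum block size constraint) calculation I proposed works out in sectors; the 4M server maximum is a made-up example value, not something a particular server advertises:

```shell
# Hypothetical: nbd_max is the NBD server's advertised maximum block size.
nbd_max=$(( 4 * 1024 * 1024 ))   # example value: 4M
cap=$(( 32 * 1024 * 1024 ))      # nbd.ko's current 32M limit
bytes=$(( nbd_max < cap ? nbd_max : cap ))
echo "max_hw_sectors = $(( bytes / 512 )) sectors"
```

A server advertising no maximum (or one above 32M) would simply keep the current 65536-sector limit.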
> Setting this to the maximum block size would mean that NBD requests
> will exceed the preferred size whenever there are enough sequential
> dirty pages (while still obeying the maximum). This seems strictly
> worse.
> Unrelated to the proposed changes (all of which I think are technically
> correct), I am wondering if this will have much practical benefit. As
> far as I can tell, the kernel currently aligns NBD requests to the
> logical/physical block size rather than the size of the NBD request. Are
> there NBD servers that would benefit from the kernel honoring the
> preferred blocksize if the data is not also aligned to this blocksize?
I'm not sure I parsed this. Can you give an example?
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit