On May 21 2022, "Richard W.M. Jones" <rjones(a)redhat.com> wrote:
On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> Hi,
>
> How does the blocksize filter take into account writes that end up
> overlapping due to read-modify-write cycles?
>
> Specifically, suppose there are two non-overlapping writes handled
> by two different threads, that, due to blocksize requirements,
> overlap when expanded. I think there is a risk that one thread may
> partially undo the work of the other here.
>
> Looking at the code, it seems that writes of unaligned heads and
> tails are protected with a global lock, but writes of aligned data
> can occur concurrently.

I agree.

Assuming the underlying plugin is NBDKIT_THREAD_MODEL_PARALLEL and no
other filters impose thread model limits, the blocksize filter does
not limit the thread model, so the thread model of nbdkit would also
be NBDKIT_THREAD_MODEL_PARALLEL.

That means that two writes either on different connections or
pipelined on the same connection could happen at the same time.
"blocksize_pwrite" would be called concurrently for the two requests.

> However, does this not miss the case where there is one unaligned
> write that overlaps with an aligned one?
>
> For example, with blocksize 10, we could have:
>
> Thread 1: receives write request for offset=0, size=10
> Thread 2: receives write request for offset=4, size=16
> Thread 1: acquires lock, reads bytes 0-4
> Thread 2: does aligned write (no locking needed), writes bytes 0-10
> Thread 1: writes bytes 0-10, overwriting data from Thread 2

I believe this analysis is correct. (CC'd to Eric who knows a lot
more about this.)
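
The interleaving in the quoted example can be replayed deterministically.
What follows is a minimal Python sketch, not nbdkit code: the 20-byte
"disk", the fill characters, and the name simulate_lost_update are all
assumptions made for illustration, and the two writers are simply run in
the problematic order instead of on real threads.

```python
BLOCKSIZE = 10


def simulate_lost_update():
    # Hypothetical backing store: two blocks of 10 bytes, initially all ".".
    disk = bytearray(b"." * 20)

    # Unaligned writer wants to write 6 bytes at offset 4 (bytes 4-10).
    # Step 1: it reads the whole block so that it can preserve bytes 0-4.
    block = bytearray(disk[0:BLOCKSIZE])

    # Aligned writer now writes a full block at offset 0 without taking
    # the lock, because its request needs no read-modify-write.
    disk[0:BLOCKSIZE] = b"A" * BLOCKSIZE

    # Step 2: the unaligned writer merges its data into the stale copy
    # and writes the whole block back.
    block[4:10] = b"U" * 6
    disk[0:BLOCKSIZE] = block
    return bytes(disk)


result = simulate_lost_update()
print(result)  # b'....UUUUUU..........'
```

Bytes 0-4 end up holding the stale "." data. Neither serial ordering of
the two writes would produce that: running them one after the other
leaves "A" in bytes 0-4 in both cases.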
However, I don't think it's a bug. If a client doesn't want writes to
squash each other, then it shouldn't send overlapping requests. I bet
the same thing happens with an SSD.

But the requests are not overlapping from the client's point of view. They
only become overlapping when the server applies its read-modify-write
operation to align them to the blocksize.
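
To make the arithmetic concrete, here is a small Python sketch of how two
disjoint requests can expand to the same block. The helper names expand
and overlaps, and the two sample requests, are assumptions chosen for
illustration; they are not taken from the filter's source.

```python
def expand(offset, size, blocksize):
    """Round a byte range outward to block boundaries; returns [start, end)."""
    start = (offset // blocksize) * blocksize
    end = ((offset + size + blocksize - 1) // blocksize) * blocksize
    return start, end


def overlaps(a, b):
    """True if the half-open ranges a and b share at least one byte."""
    return a[0] < b[1] and b[0] < a[1]


BLOCKSIZE = 10
req_a = (0, 4)  # hypothetical request: offset=0, size=4
req_b = (4, 6)  # hypothetical request: offset=4, size=6

raw_a = (req_a[0], req_a[0] + req_a[1])
raw_b = (req_b[0], req_b[0] + req_b[1])

print(overlaps(raw_a, raw_b))  # False: disjoint as sent by the client
print(overlaps(expand(*req_a, BLOCKSIZE),
               expand(*req_b, BLOCKSIZE)))  # True: both become bytes 0-10
```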
I think you elsewhere said that the blocksize reported by the NBD server
is only a preferred blocksize, so I'd be surprised if not following this
"preference" results in data corruption.

NBD_CMD_FLAG_FUA is provided for clients that wish to ensure that a
write has been committed before sending another request.

Do you have an example of a client which sends overlapping requests
and depends on particular behaviour of the server? You may be able to
get it to work by using nbdkit-noparallel-filter which can be used to
serialize nbdkit.

I'm working with the kernel's NBD client, and it would explain all the
mysterious data corruption issues that I've seen with the S3 plugin. But
I have not yet confirmed definitively that this is the root cause.

For now, I'll avoid the blocksize filter and instead do the
read-modify-write in the plugin with proper locking. If that fixes it,
then I think we can conclude that the kernel is sending such requests
(but, as I said above, I would not consider them overlapping nor would I
consider this a bug).
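
As a rough idea of what "read-modify-write with proper locking" could
look like, here is a minimal Python sketch. It is not the S3 plugin's
code: the class LockedRMWStore, its in-memory backend, and the single
coarse lock are all assumptions for illustration. The point it shows is
that aligned writes take the same lock as unaligned ones, so a full-block
write can never slip in between the read and the write-back of a partial
one.

```python
import threading


class LockedRMWStore:
    """Hypothetical in-memory backend that only stores whole blocks."""

    def __init__(self, size, blocksize):
        assert size % blocksize == 0  # sketch assumes whole blocks only
        self.blocksize = blocksize
        self._data = bytearray(size)
        self._lock = threading.Lock()

    def _read_block(self, index):
        start = index * self.blocksize
        return bytearray(self._data[start:start + self.blocksize])

    def _write_block(self, index, block):
        start = index * self.blocksize
        self._data[start:start + self.blocksize] = block

    def pwrite(self, buf, offset):
        bs = self.blocksize
        # One lock covers aligned and unaligned writes alike.
        with self._lock:
            pos = 0
            while pos < len(buf):
                index, within = divmod(offset + pos, bs)
                n = min(bs - within, len(buf) - pos)
                if n == bs:
                    # Whole block: no read needed.
                    block = bytearray(buf[pos:pos + n])
                else:
                    # Partial block: read, merge, then write back.
                    block = self._read_block(index)
                    block[within:within + n] = buf[pos:pos + n]
                self._write_block(index, block)
                pos += n

    def pread(self, size, offset):
        with self._lock:
            return bytes(self._data[offset:offset + size])


store = LockedRMWStore(size=20, blocksize=10)
writes = [(b"A" * 4, 0), (b"B" * 6, 4), (b"C" * 10, 10)]
threads = [threading.Thread(target=store.pwrite, args=w) for w in writes]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(store.pread(20, 0))  # b'AAAABBBBBBCCCCCCCCCC'
```

A single global lock serializes all writes, which is the simplest safe
choice; per-block locks would allow more parallelism at the cost of more
bookkeeping.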
Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
»Time flies like an arrow, fruit flies like a Banana.«