On Sat, May 21, 2022 at 05:37:10PM +0100, Nikolaus Rath wrote:
> On May 21 2022, "Richard W.M. Jones" <rjones(a)redhat.com> wrote:
> > On Sat, May 21, 2022 at 01:21:11PM +0100, Nikolaus Rath wrote:
> >> Hi,
> >>
> >> How does the blocksize filter take into account writes that end up
> >> overlapping due to read-modify-write cycles?
> >>
> >> Specifically, suppose there are two non-overlapping writes handled
> >> by two different threads that, due to blocksize requirements,
> >> overlap when expanded. I think there is a risk that one thread may
> >> partially undo the work of the other here.
> >>
> >> Looking at the code, it seems that writes of unaligned heads and
> >> tails are protected with a global lock, but writes of aligned data
> >> can occur concurrently.
> >
> > I agree.
> >
> > Assuming the underlying plugin is NBDKIT_THREAD_MODEL_PARALLEL and no
> > other filters impose thread model limits, the blocksize filter does
> > not limit the thread model, so the thread model of nbdkit would also
> > be NBDKIT_THREAD_MODEL_PARALLEL.
> >
> > That means that two writes either on different connections or
> > pipelined on the same connection could happen at the same time.
> > “blocksize_pwrite” would be called concurrently for the two requests.
> >
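(Aside for anyone not familiar with the filter API: a filter that wanted to
rule this out could narrow the thread model itself, roughly as in the sketch
below.  This is illustrative only; the filter name is made up and the exact
callback is documented in nbdkit-filter(3).  The point is just that blocksize
does no such thing, so the plugin's PARALLEL model is what nbdkit ends up
using.)

#include <nbdkit-filter.h>

/* Purely illustrative, not part of the blocksize filter: a filter can
 * narrow the thread model like this. */
static int
example_thread_model (void)
{
  return NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS;
}

static struct nbdkit_filter filter = {
  .name         = "serialize-example",   /* made-up name */
  .thread_model = example_thread_model,
};

NBDKIT_REGISTER_FILTER (filter)
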
> >> However, does this not miss the case where there is one unaligned
> >> write that overlaps with an aligned one?
> >>
> >> For example, with blocksize 10, we could have:
> >>
> >> Thread 1: receives write request for offset=4, size=16
> >> Thread 2: receives write request for offset=0, size=10
> >> Thread 1: acquires lock, reads bytes 0-4
> >> Thread 2: does aligned write (no locking needed), writes bytes 0-10
> >> Thread 1: writes bytes 0-10, overwriting data from Thread 2
> >
> > I believe this analysis is correct. (CC'd to Eric who knows a lot
> > more about this.)
> >
> > However I don't think it's a bug. If a client doesn't want writes to
> > squash each other, then it shouldn't send overlapping requests. I bet
> > the same thing happens with an SSD.
>
> But the requests are not overlapping from the client point of view.
> They only become overlapping when the server applies its
> read-modify-write operation to align them to the blocksize.

I'm going to leave this one to Eric who's an expert on this ("write
tearing", I think).
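
To make the interleaving concrete, the write path being discussed has
roughly this shape.  This is only a sketch from memory, not the actual
filter source; next_pread/next_pwrite, the bounce buffer and the fixed
BLOCKSIZE are stand-ins for the real filter machinery, and error handling
is omitted:

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE 4096        /* stand-in for the configured blocksize */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static char bounce[BLOCKSIZE];

/* Stand-ins for the calls into the next layer (assumptions). */
extern int next_pread (char *buf, uint32_t count, uint64_t offset);
extern int next_pwrite (const char *buf, uint32_t count, uint64_t offset);

static int
sketch_pwrite (const char *buf, uint32_t count, uint64_t offset)
{
  /* Unaligned head: read-modify-write of the enclosing block, done
   * while holding the global lock. */
  if (offset % BLOCKSIZE != 0) {
    uint64_t blkstart = offset - offset % BLOCKSIZE;
    uint32_t drop = BLOCKSIZE - offset % BLOCKSIZE;
    if (drop > count) drop = count;

    pthread_mutex_lock (&lock);
    next_pread (bounce, BLOCKSIZE, blkstart);          /* read whole block */
    memcpy (bounce + (offset - blkstart), buf, drop);  /* merge new bytes */
    next_pwrite (bounce, BLOCKSIZE, blkstart);         /* write whole block */
    pthread_mutex_unlock (&lock);

    buf += drop; offset += drop; count -= drop;
  }

  /* Aligned middle: written directly with NO lock held, so it can run
   * at the same time as another request's locked head/tail above. */
  if (count >= BLOCKSIZE) {
    uint32_t keep = count - count % BLOCKSIZE;
    next_pwrite (buf, keep, offset);
    buf += keep; offset += keep; count -= keep;
  }

  /* Unaligned tail: another locked read-modify-write. */
  if (count > 0) {
    pthread_mutex_lock (&lock);
    next_pread (bounce, BLOCKSIZE, offset);
    memcpy (bounce, buf, count);
    next_pwrite (bounce, BLOCKSIZE, offset);
    pthread_mutex_unlock (&lock);
  }
  return 0;
}

Because the aligned path never takes the lock, it can land between another
request's locked pread and pwrite of the same block, which is exactly the
sequence in your example.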

> I think you elsewhere said that the blocksize reported by the NBD
> server is only a preferred blocksize, so I'd be surprised if not
> following this "preference" results in data corruption.

This is true for NBD at the moment, but I think everyone accepts it's
a mistake in the protocol. Eric was looking into this too.
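
For reference, a client can see what the server advertises during
negotiation; with libnbd it is something like the following (the URI is
just an example, and it is worth double-checking nbd_get_block_size(3)
rather than trusting my memory):

#include <stdio.h>
#include <inttypes.h>
#include <libnbd.h>

int
main (void)
{
  struct nbd_handle *nbd = nbd_create ();
  if (nbd == NULL) {
    fprintf (stderr, "%s\n", nbd_get_error ());
    return 1;
  }

  /* Example URI only: point this at the server under test. */
  if (nbd_connect_uri (nbd, "nbd://localhost") == -1) {
    fprintf (stderr, "%s\n", nbd_get_error ());
    return 1;
  }

  /* 0 means the server did not advertise that constraint. */
  int64_t min  = nbd_get_block_size (nbd, LIBNBD_SIZE_MINIMUM);
  int64_t pref = nbd_get_block_size (nbd, LIBNBD_SIZE_PREFERRED);
  int64_t max  = nbd_get_block_size (nbd, LIBNBD_SIZE_MAXIMUM);
  printf ("min=%" PRIi64 " preferred=%" PRIi64 " max=%" PRIi64 "\n",
          min, pref, max);

  nbd_shutdown (nbd, 0);
  nbd_close (nbd);
  return 0;
}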

> > NBD_CMD_FLAG_FUA is provided for clients that wish to ensure that a
> > write has been committed before sending another request.
> >
> > Do you have an example of a client which sends overlapping requests
> > and depends on particular behaviour of the server? You may be able to
> > get it to work by using nbdkit-noparallel-filter which can be used to
> > serialize nbdkit.
>
> I'm working with the kernel's NBD client, and it would explain all the
> mysterious data corruption issues that I've seen with the S3 plugin.
> But I have not yet definitively confirmed that this is the root cause.
>
> For now, I'll avoid the blocksize filter and instead do the
> read-modify-write in the plugin with proper locking. If that fixes it,
> then I think we can conclude that the kernel is sending such requests
> (but, as I said above, I would not consider them overlapping, nor would
> I consider this a bug).
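
If it is useful while you experiment: the usual shape of plugin-side RMW
is to hold one mutex across both the read and the write-back so nothing
can slip in between.  A rough sketch, where s3_read_block/s3_write_block
and BLKSIZE are made-up stand-ins for whatever the plugin really uses:

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define BLKSIZE 4096
/* Hypothetical backend calls, for illustration only. */
extern int s3_read_block (uint64_t blkno, char *buf);
extern int s3_write_block (uint64_t blkno, const char *buf);

static pthread_mutex_t rmw_lock = PTHREAD_MUTEX_INITIALIZER;

/* Write an unaligned range by read-modify-write of one block.  The
 * whole cycle happens under rmw_lock, so a concurrent writer cannot
 * slip in between the read and the write-back.  The caller is assumed
 * to have split the request so it fits inside a single block. */
static int
rmw_pwrite (const char *buf, uint32_t count, uint64_t offset)
{
  uint64_t blkno = offset / BLKSIZE;
  uint32_t off_in_blk = offset % BLKSIZE;
  char block[BLKSIZE];
  int r = 0;

  pthread_mutex_lock (&rmw_lock);
  if (s3_read_block (blkno, block) == -1)
    r = -1;
  else {
    memcpy (block + off_in_blk, buf, count);
    if (s3_write_block (blkno, block) == -1)
      r = -1;
  }
  pthread_mutex_unlock (&rmw_lock);
  return r;
}

Note that aligned writes touching the same block would need to take the
same lock (or a per-block lock), otherwise the problem just moves from
the filter into the plugin.
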
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
nbdkit - Flexible, fast NBD server with plugins
https://gitlab.com/nbdkit/nbdkit