On May 25 2022, Eric Blake <eblake(a)redhat.com> wrote:
> On Tue, May 24, 2022 at 08:45:02PM +0100, Nikolaus Rath wrote:
> > > However, you are worried that a third possibility occurs:
> > >
> > > T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
> > > for the unaligned head (it only needs 0x00-0x03, but we have to read a
> > > block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
> > >
> > > T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
> > > blocking it.
> > >
> > > T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
> > >
> > > 0   0   0   0   1   1   1   1
> > > 0...4...8...c...0...4...8...c...
> > > end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
> > >
> > > which does NOT reflect either of the possibilities where T1 and T2
> > > write atomically. Basically, we have become the victim of sharding.
> >
> > Yes, this is the scenario that I am worried about.
> >
> > I think this is a data corruption problem no matter whether we assume
> > that writes should be atomic or not.
> Writes to a single block should be atomic. Writes larger than a block
> need not be atomic. But when we advertise a particular block size,
> clients should be safe in assuming that anything smaller than that
> block size is inadvisable (either it will fail as too small, or it
> will incur a RMW penalty), but that something at the block size should
> not need client-side protection when no other parallel requests are
> overlapping that block. And that is where our blocksize filter is
> currently failing.
Agreed.
> > > You have just demonstrated that our current approach (grabbing a
> > > single semaphore, only around the unaligned portions), does not do
> > > what we hoped. So what WOULD protect us, while still allowing as much
> > > parallelism as possible?
> >
> > How about a per-block lock as implemented for the S3 plugin in
> > https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?
> That feels pretty heavyweight. It may allow more parallelism, but at
> the expense of more resources. I'm hoping that a single rwlock will
> do, although it may be dominated by starvation time when toggling
> between unaligned actions (serialized) vs aligned actions (parallel).
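If I follow, the single-rwlock idea would look roughly like this (a
sketch only; Python for brevity, and the RWLock class and method names
are mine, not nbdkit code). Aligned writes take the shared side and run
in parallel; an unaligned RMW takes the exclusive side, so the T1/T2
interleaving above cannot occur:

```python
import threading

class RWLock:
    """Minimal reader-writer lock: many aligned requests (readers) may
    proceed in parallel; an unaligned RMW (writer) runs exclusively."""
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0      # aligned requests in flight
        self._writer = False   # an unaligned RMW in flight

    def acquire_shared(self):
        with self._cond:
            while self._writer:            # wait out any RMW
                self._cond.wait()
            self._readers += 1

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()    # let a waiting RMW proceed

    def acquire_exclusive(self):
        with self._cond:
            while self._writer or self._readers:
                self._cond.wait()          # drain all aligned requests
            self._writer = True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

This also makes the starvation concern visible: every unaligned request
drains all in-flight aligned requests before it can start, and vice
versa, so workloads that alternate between the two serialize on the
toggle.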
The resource usage is a single shared variable with an underlying mutex,
and a set (probably best implemented as a hash) containing at most as
many elements as blocks that are being written to concurrently. This
does not sound like much to me. Am I missing something?
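Concretely, the scheme I have in mind is roughly the following
(illustrative Python sketch, not the actual code from the merge request;
the names are mine). One mutex plus condition variable guards a set of
in-flight block numbers; a thread wanting block N waits until N leaves
the set, claims it, does its RMW, then releases it:

```python
import threading

class BlockLocks:
    """Per-block locking via a single mutex/condvar and a set holding
    only the block numbers currently being written."""
    def __init__(self):
        self._cond = threading.Condition()
        self._busy = set()   # block numbers with an RMW in flight

    def acquire(self, blknum):
        with self._cond:
            while blknum in self._busy:
                self._cond.wait()       # another thread owns this block
            self._busy.add(blknum)

    def release(self, blknum):
        with self._cond:
            self._busy.discard(blknum)
            self._cond.notify_all()     # wake threads waiting on any block
```

Memory stays proportional to the number of concurrent writers, and
requests touching different blocks never block each other; only threads
contending for the same block serialize.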
Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
»Time flies like an arrow, fruit flies like a Banana.«