On May 24 2022, Eric Blake <eblake(a)redhat.com> wrote:
minblock = 0x10
Thread 1: receives write request for offset 0x00, size 0x10 (aligned request)
Thread 2: receives write request for offset 0x04, size 0x16 (unaligned offset, unaligned size)
Graphically, we are wanting to write the following, given initial disk
contents of I:
      0   0   0   0   1   1   1   1
      0...4...8...c...0...4...8...c...
start IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
T1:   AAAAAAAAAAAAAAAA
T2:       BBBBBBBBBBBBBBBBBBBBBB
Because both writes are issued simultaneously, we do not know whether
bytes 0x04 through 0x0f will be written as A or B. But our assumption
is that because blocks are written atomically, we hope to get exactly
one of the two following results, where either T1 beat T2:
      0   0   0   0   1   1   1   1
      0...4...8...c...0...4...8...c...
end1: AAAABBBBBBBBBBBBBBBBBBBBBBIIIIII
or where T2 beat T1:
      0   0   0   0   1   1   1   1
      0...4...8...c...0...4...8...c...
end2: AAAAAAAAAAAAAAAABBBBBBBBBBIIIIII
However, you are worried that a third possibility occurs:
T2 sees that it needs to do RMW, grabs the lock, and reads 0x00-0x0f
for the unaligned head (it only needs 0x00-0x03, but we have to read a
block at a time), to populate its buffer with IIIIBBBBBBBBBBBB.
T1 now writes 0x00-0x0f with AAAAAAAAAAAAAAAA, without any lock
blocking it.
T2 now writes 0x00-0x0f using the contents of its buffer, resulting in:
      0   0   0   0   1   1   1   1
      0...4...8...c...0...4...8...c...
end3: IIIIBBBBBBBBBBBBBBBBBBBBBBIIIIII
which does NOT reflect either of the possibilities where T1 and T2
write atomically. Basically, we have become the victim of sharding.
Yes, this is the scenario that I am worried about.
I think this is a data corruption problem no matter whether we assume that
writes are atomic or not.
In this scenario, the client has issued exactly one request that writes
(among other things) bytes 0x00-0x03. This request was executed successfully,
so those bytes should have the new contents. There was no other write that
affected this byte range, so whether the write was done atomically or not
does not matter.
You are correct that it is annoying that this third possibility (where
T1 appears to have never run) is possible with the blocksize filter.
And we should probably consider it as a data-corruption bug. Your
blocksize example of 16 (0x10) bytes is unlikely, but we are more
likely to hit scenarios where an older guest assumes it is writing to
512-byte aligned hardware, while using the blocksize filter to try to
guarantee atomic RMW access to modern 4k hardware. The older client
will be unaware that it must avoid parallel writes that are
512-aligned but land in the same 4k page, so it seems like the
blocksize filter should be preventing that hazard on its behalf.
Yes, I was just picking a very small number to illustrate the problem. I
have seen this happen in practice with much larger blocksizes (32 kB+).
You have just demonstrated that our current approach (grabbing a
single semaphore, held only around the unaligned portions) does not do
what we hoped. So what WOULD protect us, while still allowing as much
parallelism as possible?
How about a per-block lock as implemented for the S3 plugin in
https://gitlab.com/nbdkit/nbdkit/-/merge_requests/10?
It might be a bit harder to do in plain C because of the absence of
set datatypes, but I think it should work for the blocksize filter as well.
Best,
-Nikolaus
--
GPG Fingerprint: ED31 791B 2C5C 1613 AF38 8B8A D113 FCAC 3C4E 599F
»Time flies like an arrow, fruit flies like a Banana.«