On 2/24/21 11:46 AM, Richard W.M. Jones wrote:
> On Wed, Feb 24, 2021 at 11:40:09AM -0600, Eric Blake wrote:
>> On 2/24/21 11:33 AM, Eric Blake wrote:
>>> Implement a TODO item of emulating multi-connection consistency via
>>> multiple plugin flush calls to allow a client to assume that a flush
>>> on a single connection is good enough. This also gives us some
>>> fine-tuning over whether to advertise the bit, including some setups
>>> that are unsafe but may be useful in timing tests.
>>>
>>> Testing is interesting: I used the sh plugin to implement a server
>>> that intentionally keeps a per-connection cache.
>>>
>>> Note that this filter assumes that multiple connections will still
>>> share the same data (other than caching effects); effects are not
>>> guaranteed when trying to mix it with more exotic plugins like info
>>> that violate that premise.
>>> ---
>>>
>>> I'm still working on the test; the sh plugin is good enough that it
>>> does what I want when playing with it manually, but I still need to
>>> write up various scenarios in test-multi-conn.sh to match what I've
>>> played with manually.
>>>
>>> I'm open to feedback on the set of options I've exposed during .config
>>> (too many, not enough, better names?) Right now, it is:
>>> multi-conn-mode=auto|plugin|disable|emulate|unsafe
>>> multi-conn-track-dirty=fast|connection|off
> The patch itself is fine. Was there a particular reason to post it as
> an RFC? I think it should be fine as it is. The one thought was ...
RFC because testing might turn up issues. And I think I already found
my first one - with mode=emulate track-dirty=connection, I did two passes:
for all connections but mine:
  if conn->dirty & WRITE
    flush, conn->dirty &= ~WRITE, updated=true
flush mine
for all connections but mine:
  if updated && conn->dirty & READ
    flush, conn->dirty &= ~READ
The goal was to avoid flushing connections that have neither pending
writes nor cached reads. But I think my use of conn->dirty &= ~WRITE is
not strong enough. A flush has two purposes: get all pending writes out of
cache, and force all future reads to go to the source instead of the
cache. Making two passes is wise (if no connection has pending writes,
then no connection has a stale cache), but for connections that had both
cached reads and pending writes, I end up flushing twice, even though
only one flush is required. So I think the first pass can get away with
conn->dirty=0 instead of &= ~WRITE.
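In code form, the corrected pair of passes looks roughly like this. This
is only a minimal sketch: struct conn, the conns list, the DIRTY_* bits,
and the flush callback are illustrative names of mine rather than the
filter's real identifiers, and locking of the connection list is omitted.

#include <stdbool.h>
#include <stddef.h>

#define DIRTY_READ  1   /* connection may hold stale cached reads */
#define DIRTY_WRITE 2   /* connection may hold unflushed writes */

struct conn {
  struct conn *next;
  unsigned dirty;                          /* DIRTY_READ | DIRTY_WRITE */
  int (*flush) (struct conn *c, int *err); /* flush this connection */
};

static struct conn *conns;                 /* list of open connections */

static int
emulate_flush (struct conn *me, int *err)
{
  bool updated = false;
  struct conn *c;

  /* Pass 1: flush every other connection with pending writes.  The
   * flush also invalidates that connection's read cache, so clear the
   * whole mask (dirty = 0) rather than only DIRTY_WRITE; that is what
   * avoids the redundant second flush. */
  for (c = conns; c != NULL; c = c->next) {
    if (c != me && (c->dirty & DIRTY_WRITE)) {
      if (c->flush (c, err) == -1)
        return -1;
      c->dirty = 0;
      updated = true;
    }
  }

  /* Flush the connection the client actually asked about. */
  if (me->flush (me, err) == -1)
    return -1;
  me->dirty = 0;

  /* Pass 2: only needed if some other connection had pending writes;
   * in that case any remaining cached reads elsewhere may be stale. */
  if (updated) {
    for (c = conns; c != NULL; c = c->next) {
      if (c != me && (c->dirty & DIRTY_READ)) {
        if (c->flush (c, err) == -1)
          return -1;
        c->dirty = 0;
      }
    }
  }
  return 0;
}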
I'm also in the middle of composing an email to the NBD list (will cc
here); in qemu-nbd (via nbd/server.c) we have:
case NBD_CMD_READ:
  if (flags & NBD_CMD_FLAG_FUA)
    flush()
  perform pread()
The NBD spec permits FUA on reads, but does not directly say what it means
there (the hard requirements are stated only for writes); it does, however,
mention indirectly that NBD_CMD_FLAG_FUA was modeled on the kernel's semantics:
https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
The way qemu-nbd is behaving is as if NBD_CMD_FLAG_FUA were to trigger a
kernel bio read with REQ_PREFLUSH (that is, ensure that any writes that
may affect what I'm about to read have landed so that my read is
guaranteed to be non-stale). For NBD_CMD_WRITE, qemu-nbd is behaving
like triggering a kernel bio write with REQ_FUA. NBD_CMD_FLUSH is
triggering an empty bio with REQ_PREFLUSH. The kernel docs imply that
REQ_FUA semantics are a no-op except on writes, but that REQ_PREFLUSH
indeed has the dual behavior of flushing all pending writes as well as
dropping all stale caches prior to reads.
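Spelled out as a rough sketch (the enums below are illustrative stand-ins,
not actual qemu or kernel source), the mapping described above is:

#include <stdbool.h>

enum nbd_cmd  { CMD_READ, CMD_WRITE, CMD_FLUSH };
enum bio_flag { BIO_NONE = 0, BIO_REQ_FUA, BIO_REQ_PREFLUSH };

static enum bio_flag
equivalent_bio_flag (enum nbd_cmd cmd, bool fua)
{
  switch (cmd) {
  case CMD_READ:
    /* FUA on a read acts like REQ_PREFLUSH: pending writes must land
     * first so the read cannot return stale data. */
    return fua ? BIO_REQ_PREFLUSH : BIO_NONE;
  case CMD_WRITE:
    /* FUA on a write acts like REQ_FUA: the write reaches stable
     * storage before the reply is sent. */
    return fua ? BIO_REQ_FUA : BIO_NONE;
  case CMD_FLUSH:
    /* NBD_CMD_FLUSH acts like an empty bio with REQ_PREFLUSH. */
    return BIO_REQ_PREFLUSH;
  }
  return BIO_NONE;
}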
It's odd that NBD_CMD_FLAG_FUA sometimes behaves like REQ_FUA and
sometimes like REQ_PREFLUSH, but being able to request REQ_PREFLUSH
semantics on a read makes sense. I wonder if the NBD protocol should be
updated to document how qemu-nbd already behaves, vs. adding yet another
NBD_CMD_FLAG_*, since not all existing servers with FUA support offer
pre-flush semantics on NBD_CMD_READ with the FUA flag set.
>> Another thing I thought about: should I add a knob that lets multi-conn
>> know whether export names are significant? When set, it only has to
>> maintain consistency by flushing on other connections visiting the same
>> export name, and not all connections.
> ... While there are only a few plugins where the exportname is vital
> (eg. tmpdisk), and there are a few more where the exportname can be
> significant but is not usually, it might be worth this extra knob.
Okay, I'll work that in for my v2.
> I'm not sure how to implement it though - the filter only sees
> exportnames flying past so it must remember ones it has seen. But
> then I think the problem is you may be in a position where you need to
> issue a flush, but you no longer have the next_ops struct(?) (It
> would be the same kind of problem as the background threads TODO)
Same as everything else - I have a mutex covering .preconnect and
.finalize for manipulating the list of open connections, and would
merely add the connection's export name to the struct being stored in
that list. We are guaranteed by nbdkit-filter.pod that the
next_ops/nxdata lifetimes are stable for the entire span from .preconnect
to .finalize, so all I have to ensure is that the export name is likewise
stable.
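A rough sketch of that bookkeeping, with names of my own choosing (not
the filter's actual struct or hooks) and assuming the export name is
available at connection-open time and gets strdup'd so it stays stable
for the connection's lifetime:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct conn {
  struct conn *next;
  char *exportname;             /* strdup'd copy, freed on close */
  /* ... next_ops/nxdata, dirty bits, etc. ... */
};

static struct conn *conns;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Called from the per-connection open path, where the export name
 * requested by the client is assumed to be available. */
static struct conn *
track_connection (const char *exportname)
{
  struct conn *c = calloc (1, sizeof *c);
  if (c == NULL)
    return NULL;
  c->exportname = strdup (exportname ? exportname : "");
  if (c->exportname == NULL) {
    free (c);
    return NULL;
  }
  pthread_mutex_lock (&lock);
  c->next = conns;
  conns = c;
  pthread_mutex_unlock (&lock);
  return c;
}

/* Called from the matching close/finalize path. */
static void
untrack_connection (struct conn *c)
{
  pthread_mutex_lock (&lock);
  for (struct conn **p = &conns; *p; p = &(*p)->next) {
    if (*p == c) {
      *p = c->next;
      break;
    }
  }
  pthread_mutex_unlock (&lock);
  free (c->exportname);
  free (c);
}

With the name stored next to each connection, the flush passes can then
strcmp() exportnames and skip connections visiting a different export.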
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org