On 2/19/21 9:43 AM, Richard W.M. Jones wrote:
perf analysis of the cow filter showed that a great deal of time was
spent holding the global lock. This change makes the locking more
fine-grained so now we are only using the global lock to protect the
bitmap from concurrent access and nothing else.
I added a new read-modify-write lock to protect a few places where we
use an RMW cycle to modify an unaligned block, which should be a
fairly rare occurrence. Previously the global lock protected these.
A benchmark shows this is a considerable improvement. Running the test
from here:
http://git.annexia.org/?p=libguestfs-talks.git;a=blob;f=2021-pipelines/be...
Before this change:
real 3m51.086s
user 0m0.994s
sys 0m3.568s
After this change:
real 2m1.091s
user 0m1.038s
sys 0m3.576s
Compared to a similar pipeline using qemu-img convert and qcow2:
http://git.annexia.org/?p=libguestfs-talks.git;a=blob;f=2021-pipelines/be...
real 2m22.368s
user 0m54.415s
sys 1m27.410s
Nice speedups. In my review, I'm double-checking whether my recent
proposal to advertise multi-conn will be impacted in any way.
+++ b/filters/cow/blk.c
@@ -90,9 +90,12 @@
#include <alloca.h>
#endif
+#include <pthread.h>
+
#include <nbdkit-filter.h>
#include "bitmap.h"
+#include "cleanup.h"
#include "fdatasync.h"
#include "rounding.h"
#include "pread.h"
@@ -104,6 +107,9 @@
/* The temporary overlay. */
static int fd = -1;
+/* This lock protects the bitmap from parallel access. */
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
So pre-patch, we were locking both the bitmap access and the calls into
next_ops. Without reading further, it looks like your goal is to move
the next_ops calls out of the lock.
@@ -205,6 +213,7 @@ blk_set_size (uint64_t new_size)
void
blk_status (uint64_t blknum, bool *present, bool *trimmed)
{
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
*present = state != BLOCK_NOT_ALLOCATED;
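For anyone reading along who hasn't looked at cleanup.h:
ACQUIRE_LOCK_FOR_CURRENT_SCOPE is the usual __attribute__((cleanup))
scoped-lock trick, so the mutex is dropped on every exit path from the
scope. A minimal sketch of the pattern (the names below are mine, not
the exact nbdkit macro):

#include <pthread.h>

static void
scoped_unlock (pthread_mutex_t **ptr)
{
  if (*ptr)
    pthread_mutex_unlock (*ptr);
}

/* Unlock automatically when the variable goes out of scope. */
#define SCOPED_LOCK(mutex)                                      \
  __attribute__ ((cleanup (scoped_unlock)))                     \
  pthread_mutex_t *scoped_lock_ = (mutex);                      \
  pthread_mutex_lock (scoped_lock_)

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

static int
demo (void)
{
  SCOPED_LOCK (&demo_lock);
  /* ... critical section; unlocked on any return from here on. */
  return 0;
}

That is what lets the early returns in these functions stay lock-safe
without explicit unlock calls.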
@@ -219,7 +228,18 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t blknum, uint8_t *block, int *err)
{
off_t offset = blknum * BLKSIZE;
- enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ enum bm_entry state;
+
+ /* The state might be modified from another thread - for example
+ * another thread might write (BLOCK_NOT_ALLOCATED ->
+ * BLOCK_ALLOCATED) while we are reading from the plugin, returning
+ * the old data. However a read issued after the write returns
+ * should always return the correct data.
+ */
+ {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ }
Pre-patch, if two threads compete on a read and write, we have exactly
one of two outcomes:
threadA threadB
CMD_READ CMD_WRITE
lock blocked on lock...
state = not allocated
read old data
return
lock
write new data, update state
return
or
threadA threadB
CMD_READ CMD_WRITE
blocked on lock... lock
write new data, update state
return
lock
state = allocated
read new data
return
As we were locking per-block, we could shred large requests (a read
could see some old blocks and some new ones), but shredding is already
a problem for the client to worry about and is not unique to our cow
filter.
The new code now allows reading old data to occur in parallel with
writing new data (for more performance), but otherwise doesn't seem to
be any worse in behavior; I still only see one of two valid outcomes
(the read either sees an entirely old block, or an entirely new block).
Large reads may still be shredded by block, but a client that is already
avoiding overlapping reads at the same time as a write will not be any
worse off.
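To make that last guarantee concrete: a read that is issued only after
the write has returned must see the new data. A hypothetical
standalone check of that (not part of this patch; it assumes libnbd
and an nbdkit binary on PATH, and abbreviates error handling):

#include <assert.h>
#include <string.h>
#include <libnbd.h>

int
main (void)
{
  char *args[] = { "nbdkit", "-s", "--exit-with-parent",
                   "--filter=cow", "memory", "1M", NULL };
  struct nbd_handle *nbd = nbd_create ();
  char wbuf[512], rbuf[512];

  assert (nbd != NULL);
  assert (nbd_connect_command (nbd, args) == 0);

  memset (wbuf, 0x55, sizeof wbuf);
  assert (nbd_pwrite (nbd, wbuf, sizeof wbuf, 0, 0) == 0);

  /* The write has completed, so this read must observe the new data,
   * regardless of what else may be in flight.
   */
  assert (nbd_pread (nbd, rbuf, sizeof rbuf, 0, 0) == 0);
  assert (memcmp (wbuf, rbuf, sizeof rbuf) == 0);

  nbd_shutdown (nbd, 0);
  nbd_close (nbd);
  return 0;
}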
nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64
") is %s",
blknum, (uint64_t) offset, state_to_string (state));
@@ -260,6 +280,8 @@ int
blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t blknum, uint8_t *block, enum cache_mode mode, int *err)
{
+ /* XXX Could make this lock more fine-grained with some thought. */
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
On the other hand, NBD_CMD_CACHE is not frequently used, so I'm less
worried about this one being heavy-handed.
off_t offset = blknum * BLKSIZE;
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
unsigned n = BLKSIZE, tail = 0;
@@ -322,6 +344,8 @@ blk_write (uint64_t blknum, const uint8_t *block, int *err)
nbdkit_error ("pwrite: %m");
return -1;
}
+
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
bitmap_set_blk (&bm, blknum, BLOCK_ALLOCATED);
pre-patch, if two threads contend on a write, we had:
threadA threadB
CMD_WRITE CMD_WRITE
lock blocked on lock...
write new data
update map
return
lock
write new data
update map (no-op)
return
post-patch, we have:
threadA threadB
CMD_WRITE CMD_WRITE
write new data write new data
lock blocked on lock...
update map
return
lock
update map (no-op)
return
Because we always pwrite() exactly one block, we're depending on the OS
to guarantee that exactly one of the two writes ends up as the final
state. (For client transactions larger than a block, shredding is
possible, but that was true without this patch; as with reads above,
avoiding overlapping I/O so as not to see shredding is already the
client's responsibility.) Again, looks safe to me.
+++ b/filters/cow/cow.c
@@ -45,16 +45,17 @@
#include <nbdkit-filter.h>
#include "cleanup.h"
-
-#include "blk.h"
#include "isaligned.h"
#include "minmax.h"
#include "rounding.h"
-/* In order to handle parallel requests safely, this lock must be held
- * when calling any blk_* functions.
+#include "blk.h"
Looks a bit odd to move the #include, but not wrong.
+
+/* Read-modify-write requests are serialized through this global lock.
+ * This is only used for unaligned requests which should be
+ * infrequent.
*/
They would be even more infrequent if I ever got around to finishing my
patches on exposing block sizing from plugins and filters all the way
back to the client... But not this patch's problem.
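To spell out what the new lock is protecting: an unaligned head (or
tail) has to go through a read-modify-write cycle, and two such cycles
on the same block must not interleave. A simplified sketch in the
spirit of cow_pwrite() (the lock is called rmw_lock below since its
declaration isn't in the quoted hunk; it also assumes the filter's
blk_read()/blk_write(), BLKSIZE and CLEANUP_FREE, and is not the
literal cow.c code):

/* Write n bytes from buf into the middle of a block, where offset is
 * not block-aligned and n <= BLKSIZE - (offset % BLKSIZE).
 */
static int
write_unaligned_head (struct nbdkit_next_ops *next_ops, void *nxdata,
                      const uint8_t *buf, uint32_t n, uint64_t offset,
                      int *err)
{
  CLEANUP_FREE uint8_t *block = malloc (BLKSIZE);
  uint64_t blknum = offset / BLKSIZE;
  uint64_t blkoffs = offset % BLKSIZE;

  if (block == NULL) {
    *err = errno;
    nbdkit_error ("malloc: %m");
    return -1;
  }

  /* Serialize the whole read-modify-write cycle so that two unaligned
   * writers touching the same block cannot interleave and lose data.
   */
  ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
  if (blk_read (next_ops, nxdata, blknum, block, err) == -1)
    return -1;
  memcpy (&block[blkoffs], buf, n);
  return blk_write (blknum, block, err);
}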
@@ -588,12 +581,6 @@ cow_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
assert (IS_ALIGNED (count, BLKSIZE));
assert (count > 0); /* We must make forward progress. */
- /* We hold the lock for the whole time, even when requesting extents
- * from the plugin, because we want to present an atomic picture of
- * the current state.
- */
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
-
while (count > 0) {
bool present, trimmed;
struct nbdkit_extent e;
Our extents picture is no longer atomic, but I don't think that makes us
any worse off.
I agree with your analysis that any read issued by the client on one
connection after a write/flush has completed on another connection will
see only the cached data. The only race where we can see stale data on
a read is when the read is done in parallel with an outstanding write or
flush - but those aren't the cases we worry about when deciding whether
multi-conn makes sense. So I don't think anything in this patch
interferes with my patch to enable multi-conn.
ACK. Looks like a nice job!
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org