[nbdkit PATCH] cow: Always advertise multi-conn
by Eric Blake
Our cow data is locked for proper sharing across connections, so even
though NBD_CMD_FLUSH is an intentional no-op, a client is guaranteed
that after a flush by connection A, a read by connection B will not
see any stale data, which matches the requirements for always
advertising multi-conn.
---
filters/cow/cow.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 93e10f24..1085ee36 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -181,6 +181,14 @@ cow_can_cache (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
return NBDKIT_CACHE_NATIVE;
}
+static int
+cow_can_multi_conn (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ /* Our cache is consistent between connections. */
+ return 1;
+}
+
/* Override the plugin's .can_fast_zero, because our .zero is not fast */
static int
cow_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -712,6 +720,7 @@ static struct nbdkit_filter filter = {
.can_fua = cow_can_fua,
.can_cache = cow_can_cache,
.can_fast_zero = cow_can_fast_zero,
+ .can_multi_conn = cow_can_multi_conn,
.pread = cow_pread,
.pwrite = cow_pwrite,
.zero = cow_zero,
--
2.30.1
[nbdkit PATCH] cache: Advertise flush/fua/multi_conn if cache=unsafe
by Eric Blake
When using cache=unsafe, we have already stated that the plugin may
never see our cached data written back, but locally everything is
consistent (we grab a shared lock before consulting the cache), so we
can behave as if client flush requests always work in a multi-conn
safe manner, even without bothering with fdatasync() on our cache
file.
When using cache=writethrough, our behavior under multiple connections
is at the mercy of the plugin (if connection A and B both write, but
only B flushes, we can only guarantee that the data from A was flushed
if the plugin advertises multi-conn; as otherwise, the write by
connection A may still be in a cache that was not flushed by B's
command). But when using cache=writeback, we are guaranteed that only
one connection is ever writing back to the server at a time (that is,
once we flush, we are flushing data from ALL connections), so we can
always advertise multi-conn.
---
This is fallout from IRC discussion we had earlier today about whether
it was safe to enable multi-conn on the ssh plugin.
After re-reading
https://sourceforge.net/p/nbd/mailman/nbd-general/?viewmonth=201610&viewd...,
my take is that the original goal for MULTI_CONN is that when clear,
if connection A and B both write, then both must FLUSH to guarantee
that their writes reached persistent storage (if only one flushes, the
other connection may still have cached data that remains unflushed).
But when MULTI_CONN is set, only one of the two connections has to
issue a flush after both connections have received replies to all
writes that are intended to be made persistent. And the way we share
our cache for cache=unsafe and cache=writeback meets that goal.
filters/cache/cache.c | 56 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 2 deletions(-)
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index b979f256..9e5a3e80 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -257,6 +257,53 @@ cache_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
return 1;
}
+/* Override the plugin's .can_flush, if we are cache=unsafe */
+static int
+cache_can_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ if (cache_mode == CACHE_MODE_UNSAFE)
+ return 1;
+ return next_ops->can_flush (nxdata);
+}
+
+
+/* Override the plugin's .can_fua, if we are cache=unsafe */
+static int
+cache_can_fua (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ if (cache_mode == CACHE_MODE_UNSAFE)
+ return NBDKIT_FUA_NATIVE;
+ return next_ops->can_fua (nxdata);
+}
+
+/* Override the plugin's .can_multi_conn, if we are not cache=writethrough */
+static int
+cache_can_multi_conn (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ /* For CACHE_MODE_NONE, we always advertise a no-op flush because
+ * our local cache access is consistent between connections, and we
+ * don't care about persisting the data to the underlying plugin.
+ *
+ * For CACHE_MODE_WRITEBACK, things are more subtle: we only write
+ * to the plugin during NBD_CMD_FLUSH, at which point that one
+ * connection writes back ALL cached blocks regardless of which
+ * connection originally wrote them, so a client can be assured that
+ * blocks from all connections have reached the plugin's permanent
+ * storage with only one connection having to send a flush.
+ *
+ * But for CACHE_MODE_WRITETHROUGH, we are at the mercy of the
+ * plugin; data written by connection A is not guaranteed to be made
+ * persistent by a flush from connection B unless the plugin itself
+ * supports multi-conn.
+ */
+ if (cache_mode != CACHE_MODE_WRITETHROUGH)
+ return 1;
+ return next_ops->can_multi_conn (nxdata);
+}
+
/* Read data. */
static int
cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -352,7 +399,8 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
}
if ((flags & NBDKIT_FLAG_FUA) &&
- next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+ (cache_mode == CACHE_MODE_UNSAFE ||
+ next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE)) {
flags &= ~NBDKIT_FLAG_FUA;
need_flush = true;
}
@@ -442,7 +490,8 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
flags &= ~NBDKIT_FLAG_MAY_TRIM;
if ((flags & NBDKIT_FLAG_FUA) &&
- next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+ (cache_mode == CACHE_MODE_UNSAFE ||
+ next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE)) {
flags &= ~NBDKIT_FLAG_FUA;
need_flush = true;
}
@@ -639,6 +688,9 @@ static struct nbdkit_filter filter = {
.get_size = cache_get_size,
.can_cache = cache_can_cache,
.can_fast_zero = cache_can_fast_zero,
+ .can_flush = cache_can_flush,
+ .can_fua = cache_can_fua,
+ .can_multi_conn = cache_can_multi_conn,
.pread = cache_pread,
.pwrite = cache_pwrite,
.zero = cache_zero,
--
2.30.1
[nbdkit PATCH] blocksize: Allow parallel requests
by Eric Blake
The only time where we have to be careful is when the client sends
unaligned data; if we add a lock around our use of the bounce buffer,
then we can provide the same parallelism as the underlying plugin for
all other transactions.
When we first added .can_multi_conn, we waffled on what the blocksize
filter should do[1]; blindly slamming to 1 when all requests are
serial would have been correct at the time, but we opted to stay
conservative by instead slamming things to 0 with a promise to revisit
it when changing the thread model. Now that we are allowing parallel
transactions, blindly slamming to 1 is no longer a valid option, but
slamming to 0 is overkill when we can instead rely on passthrough to
the plugin's .can_multi_conn.
[1] https://listman.redhat.com/archives/libguestfs/2019-January/msg00074.html
---
filters/blocksize/blocksize.c | 42 +++++++++++++----------------------
1 file changed, 16 insertions(+), 26 deletions(-)
diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
index c3a2d60d..81d0e76c 100644
--- a/filters/blocksize/blocksize.c
+++ b/filters/blocksize/blocksize.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -49,23 +49,20 @@
#define BLOCKSIZE_MIN_LIMIT (64U * 1024)
-/* As long as we don't have parallel requests, we can reuse a common
- * buffer for alignment purposes; size it to the maximum we allow for
- * minblock. */
+/* In order to handle parallel requests safely, this lock must be held
+ * when using the bounce buffer.
+ */
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
+/* A single bounce buffer for alignment purposes, guarded by the lock.
+ * Size it to the maximum we allow for minblock.
+ */
static char bounce[BLOCKSIZE_MIN_LIMIT];
+
static unsigned int minblock;
static unsigned int maxdata;
static unsigned int maxlen;
-/* We are using a common bounce buffer (see above) so we must
- * serialize all requests.
- */
-static int
-blocksize_thread_model (void)
-{
- return NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS;
-}
-
static int
blocksize_parse (const char *name, const char *s, unsigned int *v)
{
@@ -157,17 +154,6 @@ blocksize_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return ROUND_DOWN (size, minblock);
}
-static int
-blocksize_can_multi_conn (struct nbdkit_next_ops *next_ops, void *nxdata,
- void *handle)
-{
- /* Although we are serializing all requests anyway so this likely
- * doesn't make a difference, return false because the bounce buffer
- * is not consistent for flush.
- */
- return 0;
-}
-
static int
blocksize_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle, void *b, uint32_t count, uint64_t offs,
@@ -179,6 +165,7 @@ blocksize_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned head */
if (offs & (minblock - 1)) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
drop = offs & (minblock - 1);
keep = MIN (minblock - drop, count);
if (next_ops->pread (nxdata, bounce, minblock, offs - drop, flags,
@@ -202,6 +189,7 @@ blocksize_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
if (next_ops->pread (nxdata, bounce, minblock, offs, flags, err) == -1)
return -1;
memcpy (buf, bounce, count);
@@ -228,6 +216,7 @@ blocksize_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned head */
if (offs & (minblock - 1)) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
drop = offs & (minblock - 1);
keep = MIN (minblock - drop, count);
if (next_ops->pread (nxdata, bounce, minblock, offs - drop, 0, err) == -1)
@@ -253,6 +242,7 @@ blocksize_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
if (next_ops->pread (nxdata, bounce, minblock, offs, 0, err) == -1)
return -1;
memcpy (bounce, buf, count);
@@ -332,6 +322,7 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned head */
if (offs & (minblock - 1)) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
drop = offs & (minblock - 1);
keep = MIN (minblock - drop, count);
if (next_ops->pread (nxdata, bounce, minblock, offs - drop, 0, err) == -1)
@@ -355,6 +346,7 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
if (next_ops->pread (nxdata, bounce, minblock, offs, 0, err) == -1)
return -1;
memset (bounce, 0, count);
@@ -437,12 +429,10 @@ blocksize_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
static struct nbdkit_filter filter = {
.name = "blocksize",
.longname = "nbdkit blocksize filter",
- .thread_model = blocksize_thread_model,
.config = blocksize_config,
.config_complete = blocksize_config_complete,
.config_help = blocksize_config_help,
.get_size = blocksize_get_size,
- .can_multi_conn = blocksize_can_multi_conn,
.pread = blocksize_pread,
.pwrite = blocksize_pwrite,
.trim = blocksize_trim,
--
2.30.1
[PATCH nbdkit] cow: Use more fine-grained locking.
by Richard W.M. Jones
perf analysis of the cow filter showed that a great deal of time was
spent holding the global lock. This change makes the locking more
fine-grained so now we are only using the global lock to protect the
bitmap from concurrent access and nothing else.
I added a new read-modify-write lock to protect a few places where we
use an RMW cycle to modify an unaligned block, which should be a
fairly rare occurrence. Previously the global lock protected these.
A benchmark shows this is a considerable improvement. Running the test
from here:
http://git.annexia.org/?p=libguestfs-talks.git;a=blob;f=2021-pipelines/be...
Before this change:
real 3m51.086s
user 0m0.994s
sys 0m3.568s
After this change:
real 2m1.091s
user 0m1.038s
sys 0m3.576s
Compared to a similar pipeline using qemu-img convert and qcow2:
http://git.annexia.org/?p=libguestfs-talks.git;a=blob;f=2021-pipelines/be...
real 2m22.368s
user 0m54.415s
sys 1m27.410s
---
filters/cow/blk.h | 7 -------
filters/cow/blk.c | 27 ++++++++++++++++++++++++++-
filters/cow/cow.c | 41 ++++++++++++++---------------------------
3 files changed, 40 insertions(+), 35 deletions(-)
diff --git a/filters/cow/blk.h b/filters/cow/blk.h
index d96b9924..be3e764a 100644
--- a/filters/cow/blk.h
+++ b/filters/cow/blk.h
@@ -44,13 +44,6 @@ extern int blk_init (void);
/* Close the overlay, free the bitmap. */
extern void blk_free (void);
-/*----------------------------------------------------------------------
- * ** NOTE **
- *
- * An exclusive lock must be held when you call any function below
- * this line.
- */
-
/* Allocate or resize the overlay and bitmap. */
extern int blk_set_size (uint64_t new_size);
diff --git a/filters/cow/blk.c b/filters/cow/blk.c
index 51ba91c4..cbd40188 100644
--- a/filters/cow/blk.c
+++ b/filters/cow/blk.c
@@ -90,9 +90,12 @@
#include <alloca.h>
#endif
+#include <pthread.h>
+
#include <nbdkit-filter.h>
#include "bitmap.h"
+#include "cleanup.h"
#include "fdatasync.h"
#include "rounding.h"
#include "pread.h"
@@ -104,6 +107,9 @@
/* The temporary overlay. */
static int fd = -1;
+/* This lock protects the bitmap from parallel access. */
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
/* Bitmap. */
static struct bitmap bm;
@@ -186,6 +192,8 @@ static uint64_t size = 0;
int
blk_set_size (uint64_t new_size)
{
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+
size = new_size;
if (bitmap_resize (&bm, size) == -1)
@@ -205,6 +213,7 @@ blk_set_size (uint64_t new_size)
void
blk_status (uint64_t blknum, bool *present, bool *trimmed)
{
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
*present = state != BLOCK_NOT_ALLOCATED;
@@ -219,7 +228,18 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t blknum, uint8_t *block, int *err)
{
off_t offset = blknum * BLKSIZE;
- enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ enum bm_entry state;
+
+ /* The state might be modified from another thread - for example
+ * another thread might write (BLOCK_NOT_ALLOCATED ->
+ * BLOCK_ALLOCATED) while we are reading from the plugin, returning
+ * the old data. However a read issued after the write returns
+ * should always return the correct data.
+ */
+ {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ }
nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64 ") is %s",
blknum, (uint64_t) offset, state_to_string (state));
@@ -260,6 +280,8 @@ int
blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t blknum, uint8_t *block, enum cache_mode mode, int *err)
{
+ /* XXX Could make this lock more fine-grained with some thought. */
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
off_t offset = blknum * BLKSIZE;
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
unsigned n = BLKSIZE, tail = 0;
@@ -322,6 +344,8 @@ blk_write (uint64_t blknum, const uint8_t *block, int *err)
nbdkit_error ("pwrite: %m");
return -1;
}
+
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
bitmap_set_blk (&bm, blknum, BLOCK_ALLOCATED);
return 0;
@@ -339,6 +363,7 @@ blk_trim (uint64_t blknum, int *err)
* here. However it's not trivial since BLKSIZE is unrelated to the
* overlay filesystem block size.
*/
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
bitmap_set_blk (&bm, blknum, BLOCK_TRIMMED);
return 0;
}
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 93e10f24..54291476 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -45,16 +45,17 @@
#include <nbdkit-filter.h>
#include "cleanup.h"
-
-#include "blk.h"
#include "isaligned.h"
#include "minmax.h"
#include "rounding.h"
-/* In order to handle parallel requests safely, this lock must be held
- * when calling any blk_* functions.
+#include "blk.h"
+
+/* Read-modify-write requests are serialized through this global lock.
+ * This is only used for unaligned requests which should be
+ * infrequent.
*/
-static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+static pthread_mutex_t rmw_lock = PTHREAD_MUTEX_INITIALIZER;
bool cow_on_cache;
@@ -117,7 +118,6 @@ cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
nbdkit_debug ("cow: underlying file size: %" PRIi64, size);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_set_size (size);
if (r == -1)
return -1;
@@ -221,7 +221,6 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t n = MIN (BLKSIZE - blkoffs, count);
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r == -1)
return -1;
@@ -242,7 +241,6 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
* smarter here.
*/
while (count >= BLKSIZE) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_read (next_ops, nxdata, blknum, buf, err);
if (r == -1)
return -1;
@@ -256,7 +254,6 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r == -1)
return -1;
@@ -294,10 +291,10 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t n = MIN (BLKSIZE - blkoffs, count);
/* Do a read-modify-write operation on the current block.
- * Hold the lock over the whole operation.
+ * Hold the rmw_lock over the whole operation.
*/
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memcpy (&block[blkoffs], buf, n);
@@ -314,7 +311,6 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Aligned body */
while (count >= BLKSIZE) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_write (blknum, buf, err);
if (r == -1)
return -1;
@@ -328,7 +324,7 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memcpy (block, buf, count);
@@ -376,9 +372,9 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t n = MIN (BLKSIZE - blkoffs, count);
/* Do a read-modify-write operation on the current block.
- * Hold the lock over the whole operation.
+ * Hold the rmw_lock over the whole operation.
*/
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[blkoffs], 0, n);
@@ -399,7 +395,6 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* XXX There is the possibility of optimizing this: since this loop is
* writing a whole, aligned block, we should use FALLOC_FL_ZERO_RANGE.
*/
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_write (blknum, block, err);
if (r == -1)
return -1;
@@ -411,7 +406,7 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[count], 0, BLKSIZE - count);
@@ -455,7 +450,7 @@ cow_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Do a read-modify-write operation on the current block.
* Hold the lock over the whole operation.
*/
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[blkoffs], 0, n);
@@ -471,7 +466,6 @@ cow_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Aligned body */
while (count >= BLKSIZE) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_trim (blknum, err);
if (r == -1)
return -1;
@@ -483,7 +477,7 @@ cow_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[count], 0, BLKSIZE - count);
@@ -553,7 +547,6 @@ cow_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Aligned body */
while (remaining) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_cache (next_ops, nxdata, blknum, block, mode, err);
if (r == -1)
return -1;
@@ -588,12 +581,6 @@ cow_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
assert (IS_ALIGNED (count, BLKSIZE));
assert (count > 0); /* We must make forward progress. */
- /* We hold the lock for the whole time, even when requesting extents
- * from the plugin, because we want to present an atomic picture of
- * the current state.
- */
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
-
while (count > 0) {
bool present, trimmed;
struct nbdkit_extent e;
--
2.29.0.rc2
[PATCH] nbdcopy: Speed up raw images copy
by Nir Soffer
When exporting a sparse raw image, qemu-nbd reports unallocated areas as
zero:
$ qemu-nbd --persistent --socket /tmp/nbd.sock --read-only --shared=10 \
--format raw empty-6g.raw --cache=none --aio=native
$ nbdinfo --map nbd+unix:///?socket=/tmp/nbd.sock
0 6442450944 2 zero
When using qcow2 image, it reports unallocated areas as a hole:
$ qemu-nbd --persistent --socket /tmp/nbd.sock --read-only --shared=10 \
--format qcow2 empty-6g.qcow2 --cache=none --aio=native
$ nbdinfo --map nbd+unix:///?socket=/tmp/nbd.sock
0 6442450944 3 hole,zero
Since nbdcopy is ignoring the ZERO flag and using only the HOLE flag,
copying raw images is extremely slow:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 1.595 s ± 0.034 s [User: 2.284 s, System: 3.565 s]
Range (min … max): 1.522 s … 1.630 s 10 runs
This is 69 times slower than qemu-img:
$ hyperfine -w3 "qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock \
'json:{\"file.driver\":\"null-co\",\"file.size\":\"6g\"}'"
Benchmark #1: qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock 'json:{"file.driver":"null-co","file.size":"6g"}'
Time (mean ± σ): 23.1 ms ± 0.5 ms [User: 6.3 ms, System: 16.5 ms]
Range (min … max): 22.6 ms … 25.5 ms 124 runs
Using ZERO instead of HOLE, nbdcopy does not read zero extents from the
server so it can copy this image 165 times faster:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 9.8 ms ± 0.8 ms [User: 6.7 ms, System: 5.2 ms]
Range (min … max): 9.2 ms … 15.4 ms 287 runs
Real images show a smaller speedup, only 2 times faster:
$ qemu-nbd --persistent --socket /tmp/nbd.sock --read-only --shared=10 \
--format raw fedora-32.raw --cache=none --aio=native
Before:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 1.613 s ± 0.181 s [User: 1.843 s, System: 2.820 s]
Range (min … max): 1.407 s … 1.829 s 10 runs
After:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 795.5 ms ± 78.7 ms [User: 198.3 ms, System: 743.1 ms]
Range (min … max): 743.3 ms … 1012.0 ms 10 runs
For reference, copying the same image with qemu-img:
$ hyperfine -w3 "qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock \
'json:{\"file.driver\":\"null-co\",\"file.size\":\"6g\"}'"
Benchmark #1: qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock 'json:{"file.driver":"null-co","file.size":"6g"}'
Time (mean ± σ): 1.046 s ± 0.028 s [User: 122.3 ms, System: 354.5 ms]
Range (min … max): 1.026 s … 1.121 s 10 runs
This issue does not exist when copying from a file, since in that case
we detect unallocated areas as holes.
Signed-off-by: Nir Soffer <nsoffer(a)redhat.com>
---
copy/file-ops.c | 4 ++--
copy/main.c | 2 +-
copy/multi-thread-copying.c | 4 ++--
copy/nbd-ops.c | 10 ++++++++--
copy/nbdcopy.h | 2 +-
copy/synch-copying.c | 2 +-
6 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/copy/file-ops.c b/copy/file-ops.c
index f61b67e..2a239d0 100644
--- a/copy/file-ops.c
+++ b/copy/file-ops.c
@@ -250,7 +250,7 @@ file_get_extents (struct rw *rw, uintptr_t index,
if (pos > offset) {
e.offset = offset;
e.length = pos - offset;
- e.hole = true;
+ e.zero = true;
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
@@ -271,7 +271,7 @@ file_get_extents (struct rw *rw, uintptr_t index,
if (pos > offset) {
e.offset = offset;
e.length = pos - offset;
- e.hole = false;
+ e.zero = false;
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
diff --git a/copy/main.c b/copy/main.c
index cfecb32..68a6030 100644
--- a/copy/main.c
+++ b/copy/main.c
@@ -667,7 +667,7 @@ default_get_extents (struct rw *rw, uintptr_t index,
e.offset = offset;
e.length = count;
- e.hole = false;
+ e.zero = false;
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
diff --git a/copy/multi-thread-copying.c b/copy/multi-thread-copying.c
index 4576119..98b4056 100644
--- a/copy/multi-thread-copying.c
+++ b/copy/multi-thread-copying.c
@@ -157,8 +157,8 @@ worker_thread (void *indexp)
char *data;
size_t len;
- if (exts.ptr[i].hole) {
- /* The source is a hole so we can proceed directly to
+ if (exts.ptr[i].zero) {
+ /* The source is zero so we can proceed directly to
* skipping, trimming or writing zeroes at the destination.
*/
command = calloc (1, sizeof *command);
diff --git a/copy/nbd-ops.c b/copy/nbd-ops.c
index f7dc37c..0bcf29b 100644
--- a/copy/nbd-ops.c
+++ b/copy/nbd-ops.c
@@ -190,8 +190,14 @@ add_extent (void *vp, const char *metacontext,
e.offset = offset;
e.length = entries[i];
- /* Note we deliberately don't care about the ZERO flag. */
- e.hole = (entries[i+1] & LIBNBD_STATE_HOLE) != 0;
+
+ /*
+ * Note we deliberately don't care about the HOLE flag. There is no need to
+ * read an extent that reads as zeroes. We will convert it to a hole or an
+ * allocated extent based on the command line arguments.
+ */
+ e.zero = (entries[i+1] & LIBNBD_STATE_ZERO) != 0;
+
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
diff --git a/copy/nbdcopy.h b/copy/nbdcopy.h
index f586fc5..69fac2a 100644
--- a/copy/nbdcopy.h
+++ b/copy/nbdcopy.h
@@ -100,7 +100,7 @@ struct command {
struct extent {
uint64_t offset;
uint64_t length;
- bool hole;
+ bool zero;
};
DEFINE_VECTOR_TYPE(extent_list, struct extent);
diff --git a/copy/synch-copying.c b/copy/synch-copying.c
index 043893f..2712c10 100644
--- a/copy/synch-copying.c
+++ b/copy/synch-copying.c
@@ -68,7 +68,7 @@ synch_copying (void)
for (i = 0; i < exts.size; ++i) {
assert (exts.ptr[i].length <= count);
- if (exts.ptr[i].hole) {
+ if (exts.ptr[i].zero) {
if (!dst.ops->synch_trim (&dst, offset, exts.ptr[i].length) &&
!dst.ops->synch_zero (&dst, offset, exts.ptr[i].length)) {
/* If neither trimming nor efficient zeroing are possible,
--
2.26.2
[PATCH nbdkit] cache, cow: Do not round size down.
by Richard W.M. Jones
Both the cache and cow filters have the annoying behaviour that the
size of the underlying plugin is rounded down to the block size used
by the filters (ie. 4K). This is especially visible for single sector
disk images:
$ ./nbdkit memory size=512 --filter=cow --run 'nbdinfo --size $uri'
0
$ ./nbdkit memory size=512 --filter=cache --run 'nbdinfo --size $uri'
0
but affects all users of these filters somewhat randomly and
unexpectedly. (For example I had a GPT partitioned disk image from a
customer which did not have a 4K aligned size, which was unexpectedly
"corrupted" when I tried to open it using the cow filter - the reason
for the corruption was the backup partition table at the end of the
disk being truncated.)
Getting rid of this assumption is awkward: the general idea is that we
round the size of the backing file / bitmap up to a full block, but
whenever we are asked to read or write the underlying plugin we handle
the tail case carefully.
This also tests various corner cases.
---
filters/cache/nbdkit-cache-filter.pod | 9 +---
filters/cow/nbdkit-cow-filter.pod | 11 +----
tests/Makefile.am | 4 ++
filters/cache/blk.c | 54 +++++++++++++++++++---
filters/cache/cache.c | 12 ++---
filters/cow/blk.c | 49 +++++++++++++++++---
filters/cow/cow.c | 23 +++++++---
tests/test-cache-unaligned.sh | 66 +++++++++++++++++++++++++++
tests/test-cow-unaligned.sh | 56 +++++++++++++++++++++++
9 files changed, 239 insertions(+), 45 deletions(-)
diff --git a/filters/cache/nbdkit-cache-filter.pod b/filters/cache/nbdkit-cache-filter.pod
index 3601dcd5..34fd0b29 100644
--- a/filters/cache/nbdkit-cache-filter.pod
+++ b/filters/cache/nbdkit-cache-filter.pod
@@ -25,12 +25,6 @@ does not have effective caching, or (with C<cache=unsafe>) to defeat
flush requests from the client (which is unsafe and can cause data
loss, as the name suggests).
-Note that the use of this filter rounds the image size down to a
-multiple of the caching granularity (the larger of 4096 or the
-C<f_bsize> field of L<fstatvfs(3)>), to ease the implementation. If
-you need to round the image size up instead to access the last few
-bytes, combine this filter with L<nbdkit-truncate-filter(1)>.
-
This filter only caches image contents. To cache image metadata, use
L<nbdkit-cacheextents-filter(1)> between this filter and the plugin.
To accelerate sequential reads, use L<nbdkit-readahead-filter(1)>
@@ -168,7 +162,6 @@ L<nbdkit(1)>,
L<nbdkit-file-plugin(1)>,
L<nbdkit-cacheextents-filter(1)>,
L<nbdkit-readahead-filter(1)>,
-L<nbdkit-truncate-filter(1)>,
L<nbdkit-filter(3)>,
L<qemu-img(1)>.
@@ -180,4 +173,4 @@ Richard W.M. Jones
=head1 COPYRIGHT
-Copyright (C) 2018-2020 Red Hat Inc.
+Copyright (C) 2018-2021 Red Hat Inc.
diff --git a/filters/cow/nbdkit-cow-filter.pod b/filters/cow/nbdkit-cow-filter.pod
index 4d5ae856..64df3fbd 100644
--- a/filters/cow/nbdkit-cow-filter.pod
+++ b/filters/cow/nbdkit-cow-filter.pod
@@ -31,14 +31,6 @@ that away. It also allows us to create diffs, see below.
The plugin is opened read-only (as if the I<-r> flag was passed), but
you should B<not> pass the I<-r> flag to nbdkit.
-=item *
-
-Note that the use of this filter rounds the image size down to a
-multiple of the caching granularity (4096), to ease the
-implementation. If you need to round the image size up instead to
-access the last few bytes, combine this filter with
-L<nbdkit-truncate-filter(1)>.
-
=back
Limitations of the filter include:
@@ -140,7 +132,6 @@ C<nbdkit-cow-filter> first appeared in nbdkit 1.2.
L<nbdkit(1)>,
L<nbdkit-file-plugin(1)>,
-L<nbdkit-truncate-filter(1)>,
L<nbdkit-xz-filter(1)>,
L<nbdkit-filter(3)>,
L<qemu-img(1)>.
@@ -153,4 +144,4 @@ Richard W.M. Jones
=head1 COPYRIGHT
-Copyright (C) 2018 Red Hat Inc.
+Copyright (C) 2018-2021 Red Hat Inc.
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 42e842b2..70898f20 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -1353,11 +1353,13 @@ TESTS += \
test-cache.sh \
test-cache-on-read.sh \
test-cache-max-size.sh \
+ test-cache-unaligned.sh \
$(NULL)
EXTRA_DIST += \
test-cache.sh \
test-cache-on-read.sh \
test-cache-max-size.sh \
+ test-cache-unaligned.sh \
$(NULL)
# cacheextents filter test.
@@ -1380,6 +1382,7 @@ TESTS += \
test-cow.sh \
test-cow-extents1.sh \
test-cow-extents2.sh \
+ test-cow-unaligned.sh \
$(NULL)
endif
TESTS += test-cow-null.sh
@@ -1388,6 +1391,7 @@ EXTRA_DIST += \
test-cow-extents1.sh \
test-cow-extents2.sh \
test-cow-null.sh \
+ test-cow-unaligned.sh \
$(NULL)
# delay filter tests.
diff --git a/filters/cache/blk.c b/filters/cache/blk.c
index dc1bcc57..b9af6909 100644
--- a/filters/cache/blk.c
+++ b/filters/cache/blk.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -54,6 +54,7 @@
#include "bitmap.h"
#include "minmax.h"
+#include "rounding.h"
#include "utils.h"
#include "cache.h"
@@ -165,18 +166,25 @@ blk_free (void)
lru_free ();
}
+/* Because blk_set_size is called before the other blk_* functions
+ * this should be set to the true size before we need it.
+ */
+static uint64_t size = 0;
+
int
blk_set_size (uint64_t new_size)
{
- if (bitmap_resize (&bm, new_size) == -1)
+ size = new_size;
+
+ if (bitmap_resize (&bm, size) == -1)
return -1;
- if (ftruncate (fd, new_size) == -1) {
+ if (ftruncate (fd, ROUND_UP (size, blksize)) == -1) {
nbdkit_error ("ftruncate: %m");
return -1;
}
- if (lru_set_size (new_size) == -1)
+ if (lru_set_size (size) == -1)
return -1;
return 0;
@@ -199,9 +207,22 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
"unknown");
if (state == BLOCK_NOT_CACHED) { /* Read underlying plugin. */
- if (next_ops->pread (nxdata, block, blksize, offset, 0, err) == -1)
+ unsigned n = blksize, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
return -1;
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+
/* If cache-on-read, copy the block to the cache. */
if (cache_on_read) {
nbdkit_debug ("cache: cache-on-read block %" PRIu64
@@ -247,9 +268,22 @@ blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
if (state == BLOCK_NOT_CACHED) {
/* Read underlying plugin, copy to cache regardless of cache-on-read. */
- if (next_ops->pread (nxdata, block, blksize, offset, 0, err) == -1)
+ unsigned n = blksize, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
return -1;
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+
nbdkit_debug ("cache: cache block %" PRIu64 " (offset %" PRIu64 ")",
blknum, (uint64_t) offset);
@@ -281,6 +315,12 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
int *err)
{
off_t offset = blknum * blksize;
+ unsigned n = blksize, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
reclaim (fd, &bm);
@@ -293,7 +333,7 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
return -1;
}
- if (next_ops->pwrite (nxdata, block, blksize, offset, flags, err) == -1)
+ if (next_ops->pwrite (nxdata, block, n, offset, flags, err) == -1)
return -1;
bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index 91dcc43d..b979f256 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -201,9 +201,7 @@ cache_config_complete (nbdkit_next_config_complete *next, void *nxdata)
return next (nxdata);
}
-/* Get the file size; round it down to cache granularity before
- * setting cache size.
- */
+/* Get the file size, set the cache size. */
static int64_t
cache_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle)
@@ -216,7 +214,6 @@ cache_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return -1;
nbdkit_debug ("cache: underlying file size: %" PRIi64, size);
- size = ROUND_DOWN (size, blksize);
ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_set_size (size);
@@ -226,8 +223,9 @@ cache_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return size;
}
-/* Force an early call to cache_get_size, consequently truncating the
- * cache to the correct size.
+/* Force an early call to cache_get_size because we have to set the
+ * backing file size and bitmap size before any other read or write
+ * calls.
*/
static int
cache_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
diff --git a/filters/cow/blk.c b/filters/cow/blk.c
index ae0db1fe..51ba91c4 100644
--- a/filters/cow/blk.c
+++ b/filters/cow/blk.c
@@ -94,6 +94,7 @@
#include "bitmap.h"
#include "fdatasync.h"
+#include "rounding.h"
#include "pread.h"
#include "pwrite.h"
#include "utils.h"
@@ -176,14 +177,21 @@ blk_free (void)
bitmap_free (&bm);
}
+/* Because blk_set_size is called before the other blk_* functions
+ * this should be set to the true size before we need it.
+ */
+static uint64_t size = 0;
+
/* Allocate or resize the overlay file and bitmap. */
int
blk_set_size (uint64_t new_size)
{
- if (bitmap_resize (&bm, new_size) == -1)
+ size = new_size;
+
+ if (bitmap_resize (&bm, size) == -1)
return -1;
- if (ftruncate (fd, new_size) == -1) {
+ if (ftruncate (fd, ROUND_UP (size, BLKSIZE)) == -1) {
nbdkit_error ("ftruncate: %m");
return -1;
}
@@ -216,8 +224,24 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64 ") is %s",
blknum, (uint64_t) offset, state_to_string (state));
- if (state == BLOCK_NOT_ALLOCATED) /* Read underlying plugin. */
- return next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err);
+ if (state == BLOCK_NOT_ALLOCATED) { /* Read underlying plugin. */
+ unsigned n = BLKSIZE, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
+ return -1;
+
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+ return 0;
+ }
else if (state == BLOCK_ALLOCATED) { /* Read overlay. */
if (pread (fd, block, BLKSIZE, offset) == -1) {
*err = errno;
@@ -238,6 +262,12 @@ blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
{
off_t offset = blknum * BLKSIZE;
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ unsigned n = BLKSIZE, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
nbdkit_debug ("cow: blk_cache block %" PRIu64 " (offset %" PRIu64 ") is %s",
blknum, (uint64_t) offset, state_to_string (state));
@@ -258,9 +288,16 @@ blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
if (mode == BLK_CACHE_IGNORE)
return 0;
if (mode == BLK_CACHE_PASSTHROUGH)
- return next_ops->cache (nxdata, BLKSIZE, offset, 0, err);
- if (next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err) == -1)
+ return next_ops->cache (nxdata, n, offset, 0, err);
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
return -1;
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+
if (mode == BLK_CACHE_COW) {
if (pwrite (fd, block, BLKSIZE, offset) == -1) {
*err = errno;
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 75d3b907..93e10f24 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -103,9 +103,7 @@ cow_open (nbdkit_next_open *next, void *nxdata,
return NBDKIT_HANDLE_NOT_NEEDED;
}
-/* Get the file size; round it down to overlay granularity before
- * setting overlay size.
- */
+/* Get the file size, set the cache size. */
static int64_t
cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle)
@@ -118,7 +116,6 @@ cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return -1;
nbdkit_debug ("cow: underlying file size: %" PRIi64, size);
- size = ROUND_DOWN (size, BLKSIZE);
ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_set_size (size);
@@ -128,8 +125,9 @@ cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return size;
}
-/* Force an early call to cow_get_size, consequently truncating the
- * overlay to the correct size.
+/* Force an early call to cow_get_size because we have to set the
+ * backing file size and bitmap size before any other read or write
+ * calls.
*/
static int
cow_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -627,6 +625,7 @@ cow_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t range_offset = offset;
uint32_t range_count = 0;
size_t i;
+ int64_t size;
/* Asking the plugin for a single block of extents is not
* efficient for some plugins (eg. VDDK) so ask for as much data
@@ -643,6 +642,16 @@ cow_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
if (present) break;
}
+ /* Don't ask for extent data beyond the end of the plugin. */
+ size = next_ops->get_size (nxdata);
+ if (size == -1)
+ return -1;
+
+ if (range_offset + range_count > size) {
+ unsigned tail = range_offset + range_count - size;
+ range_count -= tail;
+ }
+
CLEANUP_EXTENTS_FREE struct nbdkit_extents *extents2 =
nbdkit_extents_full (next_ops, nxdata,
range_count, range_offset, flags, err);
diff --git a/tests/test-cache-unaligned.sh b/tests/test-cache-unaligned.sh
new file mode 100755
index 00000000..80b3d128
--- /dev/null
+++ b/tests/test-cache-unaligned.sh
@@ -0,0 +1,66 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2021 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+source ./functions.sh
+set -e
+set -x
+
+requires_filter cache
+requires_nbdsh_uri
+
+sock=$(mktemp -u /tmp/nbdkit-test-sock.XXXXXX)
+files="cache-unaligned.img $sock cache-unaligned.pid"
+rm -f $files
+cleanup_fn rm -f $files
+
+# Create an empty base image which is not a multiple of 4K.
+truncate -s 130000 cache-unaligned.img
+
+# Run nbdkit with the caching filter.
+start_nbdkit -P cache-unaligned.pid -U $sock --filter=cache \
+ file cache-unaligned.img
+
+nbdsh --connect "nbd+unix://?socket=$sock" \
+ -c '
+# Write some pattern data to the overlay and check it reads back OK.
+buf = b"abcdefghijklm" * 10000
+h.pwrite(buf, 0)
+buf2 = h.pread(130000, 0)
+assert buf == buf2
+
+# Flushing should write through to the underlying file.
+h.flush()
+
+with open("cache-unaligned.img", "rb") as file:
+ buf2 = file.read(130000)
+ assert buf == buf2
+'
diff --git a/tests/test-cow-unaligned.sh b/tests/test-cow-unaligned.sh
new file mode 100755
index 00000000..d89b84b9
--- /dev/null
+++ b/tests/test-cow-unaligned.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2021 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+source ./functions.sh
+set -e
+set -x
+
+requires_filter cow
+requires_nbdsh_uri
+
+sock=$(mktemp -u /tmp/nbdkit-test-sock.XXXXXX)
+files="$sock cow-unaligned.pid"
+rm -f $files
+cleanup_fn rm -f $files
+
+# Run nbdkit with the cow filter and a size which is not a multiple of
+# the cow filter block size (4K).
+start_nbdkit -P cow-unaligned.pid -U $sock --filter=cow memory 130000
+
+nbdsh --connect "nbd+unix://?socket=$sock" \
+ -c '
+# Write some pattern data to the overlay and check it reads back OK.
+buf = b"abcdefghijklm" * 10000
+h.pwrite(buf, 0)
+buf2 = h.pread(130000, 0)
+assert buf == buf2
+'
--
2.29.0.rc2
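The tail-handling pattern repeated in blk_read/blk_cache above can be sketched in isolation. The following is an illustrative Python model of the technique, not nbdkit API: clamp the request for the final partial block at EOF, then zero-fill the tail of the buffer.

```python
BLKSIZE = 4096  # cache/cow granularity used by the patch

def read_block(pread, offset, size):
    """Read one cache block starting at `offset` from a disk of `size`
    bytes, clamping the request at EOF and zero-filling the tail, as
    the patched blk_read()/blk_cache() do.
    """
    n, tail = BLKSIZE, 0
    if offset + n > size:          # the very last block may be partial
        tail = offset + n - size
        n -= tail
    data = pread(offset, n)        # read only the bytes that exist
    return data + b"\0" * tail     # zero the part beyond EOF

# Model a 130000-byte image (not a multiple of 4096), as in the tests.
disk = b"abcdefghijklm" * 10000
pread = lambda off, cnt: disk[off:off + cnt]

last = (len(disk) // BLKSIZE) * BLKSIZE   # offset of the final, partial block
block = read_block(pread, last, len(disk))
assert len(block) == BLKSIZE
assert block[:len(disk) - last] == disk[last:]
```

Writes take the mirror image of this: the full 4K block is kept in the cache file (hence the ROUND_UP in ftruncate), but only the first `n` bytes are written through to the plugin.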
[PATCH] v2v: support configuration of viosock driver
by Valeriy Vdovin
Check that the install_drivers function has copied the viosock driver
files to the Windows guest file system. If it has, the drivers have
passed the minor/major version check and the guest is able to use them.
Once we know the drivers are on the guest, we can enable the virtio
sock option in the configuration and startup script files.
Signed-off-by: Valeriy Vdovin <valeriy.vdovin(a)virtuozzo.com>
---
v2v/convert_linux.ml | 1 +
v2v/convert_windows.ml | 4 +++-
v2v/create_json.ml | 1 +
v2v/create_libvirt_xml.ml | 6 ++++++
v2v/linux_kernels.ml | 4 ++++
v2v/linux_kernels.mli | 1 +
v2v/output_qemu.ml | 4 ++++
v2v/types.ml | 1 +
v2v/types.mli | 2 +-
v2v/windows_virtio.ml | 5 +++--
v2v/windows_virtio.mli | 2 +-
11 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/v2v/convert_linux.ml b/v2v/convert_linux.ml
index 86d387f1..9f22fe3c 100644
--- a/v2v/convert_linux.ml
+++ b/v2v/convert_linux.ml
@@ -154,6 +154,7 @@ let convert (g : G.guestfs) inspect source_disks output rcaps _ =
gcaps_virtio_rng = kernel.ki_supports_virtio_rng;
gcaps_virtio_balloon = kernel.ki_supports_virtio_balloon;
gcaps_isa_pvpanic = kernel.ki_supports_isa_pvpanic;
+ gcaps_virtio_socket = kernel.ki_supports_virtio_socket;
gcaps_machine = machine;
gcaps_arch = Utils.kvm_arch inspect.i_arch;
gcaps_acpi = acpi;
diff --git a/v2v/convert_windows.ml b/v2v/convert_windows.ml
index b452c09b..7842f443 100644
--- a/v2v/convert_windows.ml
+++ b/v2v/convert_windows.ml
@@ -214,7 +214,8 @@ let convert (g : G.guestfs) inspect _ output rcaps static_ips =
video_driver,
virtio_rng_supported,
virtio_ballon_supported,
- isa_pvpanic_supported =
+ isa_pvpanic_supported,
+ virtio_socket_supported =
Registry.with_hive_write g inspect.i_windows_system_hive
update_system_hive in
@@ -256,6 +257,7 @@ let convert (g : G.guestfs) inspect _ output rcaps static_ips =
gcaps_virtio_rng = virtio_rng_supported;
gcaps_virtio_balloon = virtio_ballon_supported;
gcaps_isa_pvpanic = isa_pvpanic_supported;
+ gcaps_virtio_socket = virtio_socket_supported;
gcaps_machine = machine;
gcaps_arch = Utils.kvm_arch inspect.i_arch;
gcaps_acpi = true;
diff --git a/v2v/create_json.ml b/v2v/create_json.ml
index fdf7b12f..316a5536 100644
--- a/v2v/create_json.ml
+++ b/v2v/create_json.ml
@@ -229,6 +229,7 @@ let create_json_metadata source targets target_buses
"virtio-rng", JSON.Bool guestcaps.gcaps_virtio_rng;
"virtio-balloon", JSON.Bool guestcaps.gcaps_virtio_balloon;
"isa-pvpanic", JSON.Bool guestcaps.gcaps_isa_pvpanic;
+ "virtio-socket", JSON.Bool guestcaps.gcaps_virtio_socket;
"acpi", JSON.Bool guestcaps.gcaps_acpi;
] in
List.push_back doc ("guestcaps", JSON.Dict guestcaps_dict);
diff --git a/v2v/create_libvirt_xml.ml b/v2v/create_libvirt_xml.ml
index 212ace2d..6a764cb2 100644
--- a/v2v/create_libvirt_xml.ml
+++ b/v2v/create_libvirt_xml.ml
@@ -521,6 +521,12 @@ let create_libvirt_xml ?pool source targets target_buses guestcaps
e "address" ["type", "isa"; "iobase", "0x505"] []
]
);
+ List.push_back devices (
+ e "viosock"
+ ["model",
+ if guestcaps.gcaps_virtio_socket then "virtio" else "none"]
+ []
+ );
(* Standard devices added to every guest. *)
List.push_back_list devices [
diff --git a/v2v/linux_kernels.ml b/v2v/linux_kernels.ml
index 7e171eae..6dead217 100644
--- a/v2v/linux_kernels.ml
+++ b/v2v/linux_kernels.ml
@@ -44,6 +44,7 @@ type kernel_info = {
ki_supports_virtio_rng : bool;
ki_supports_virtio_balloon : bool;
ki_supports_isa_pvpanic : bool;
+ ki_supports_virtio_socket : bool;
ki_is_xen_pv_only_kernel : bool;
ki_is_debug : bool;
ki_config_file : string option;
@@ -246,6 +247,8 @@ let detect_kernels (g : G.guestfs) inspect family bootloader =
kernel_supports "virtio_balloon" "VIRTIO_BALLOON" in
let supports_isa_pvpanic =
kernel_supports "pvpanic" "PVPANIC" in
+ let supports_virtio_socket =
+ kernel_supports "virtio_socket" "VIRTIO_SOCKET" in
let is_xen_pv_only_kernel =
check_config "X86_XEN" config_file ||
check_config "X86_64_XEN" config_file in
@@ -272,6 +275,7 @@ let detect_kernels (g : G.guestfs) inspect family bootloader =
ki_supports_virtio_rng = supports_virtio_rng;
ki_supports_virtio_balloon = supports_virtio_balloon;
ki_supports_isa_pvpanic = supports_isa_pvpanic;
+ ki_supports_virtio_socket = supports_virtio_socket;
ki_is_xen_pv_only_kernel = is_xen_pv_only_kernel;
ki_is_debug = is_debug;
ki_config_file = config_file;
diff --git a/v2v/linux_kernels.mli b/v2v/linux_kernels.mli
index 028eba81..fe81a036 100644
--- a/v2v/linux_kernels.mli
+++ b/v2v/linux_kernels.mli
@@ -33,6 +33,7 @@ type kernel_info = {
ki_supports_virtio_rng : bool; (** Kernel supports virtio-rng? *)
ki_supports_virtio_balloon : bool; (** Kernel supports memory balloon? *)
ki_supports_isa_pvpanic : bool; (** Kernel supports ISA pvpanic device? *)
+ ki_supports_virtio_socket : bool; (** Kernel supports virtio-socket? *)
ki_is_xen_pv_only_kernel : bool; (** Is a Xen paravirt-only kernel? *)
ki_is_debug : bool; (** Is debug kernel? *)
ki_config_file : string option; (** Path of config file, if found. *)
diff --git a/v2v/output_qemu.ml b/v2v/output_qemu.ml
index be3a3c5e..d6d70c23 100644
--- a/v2v/output_qemu.ml
+++ b/v2v/output_qemu.ml
@@ -247,6 +247,10 @@ object
arg "-balloon" "none";
if guestcaps.gcaps_isa_pvpanic then
arg_list "-device" ["pvpanic"; "ioport=0x505"];
+ if guestcaps.gcaps_virtio_socket then
+ arg "-viosock" "virtio"
+ else
+ arg "-viosock" "none";
(* Add a serial console to Linux guests. *)
if inspect.i_type = "linux" then
diff --git a/v2v/types.ml b/v2v/types.ml
index a8949e4b..4c7ee864 100644
--- a/v2v/types.ml
+++ b/v2v/types.ml
@@ -411,6 +411,7 @@ type guestcaps = {
gcaps_virtio_rng : bool;
gcaps_virtio_balloon : bool;
gcaps_isa_pvpanic : bool;
+ gcaps_virtio_socket : bool;
gcaps_machine : guestcaps_machine;
gcaps_arch : string;
gcaps_acpi : bool;
diff --git a/v2v/types.mli b/v2v/types.mli
index f474dcaa..42a80d9d 100644
--- a/v2v/types.mli
+++ b/v2v/types.mli
@@ -252,7 +252,7 @@ type guestcaps = {
gcaps_virtio_rng : bool; (** Guest supports virtio-rng. *)
gcaps_virtio_balloon : bool; (** Guest supports virtio balloon. *)
gcaps_isa_pvpanic : bool; (** Guest supports ISA pvpanic device. *)
-
+ gcaps_virtio_socket : bool; (** Guest supports virtio socket. *)
gcaps_machine : guestcaps_machine; (** Machine model. *)
gcaps_arch : string; (** Architecture that KVM must emulate. *)
gcaps_acpi : bool; (** True if guest supports acpi. *)
diff --git a/v2v/windows_virtio.ml b/v2v/windows_virtio.ml
index 74a43cc7..cf417a45 100644
--- a/v2v/windows_virtio.ml
+++ b/v2v/windows_virtio.ml
@@ -68,7 +68,7 @@ let rec install_drivers ((g, _) as reg) inspect rcaps =
match net_type with
| Some model -> model
| None -> RTL8139 in
- (IDE, net_type, Cirrus, false, false, false)
+ (IDE, net_type, Cirrus, false, false, false, false)
)
else (
(* Can we install the block driver? *)
@@ -178,9 +178,10 @@ let rec install_drivers ((g, _) as reg) inspect rcaps =
let virtio_rng_supported = g#exists (driverdir // "viorng.inf") in
let virtio_ballon_supported = g#exists (driverdir // "balloon.inf") in
let isa_pvpanic_supported = g#exists (driverdir // "pvpanic.inf") in
+ let virtio_socket_supported = g#exists (driverdir // "viosock.inf") in
(block, net, video,
- virtio_rng_supported, virtio_ballon_supported, isa_pvpanic_supported)
+ virtio_rng_supported, virtio_ballon_supported, isa_pvpanic_supported, virtio_socket_supported)
)
and install_linux_tools g inspect =
diff --git a/v2v/windows_virtio.mli b/v2v/windows_virtio.mli
index c063af3f..642317b1 100644
--- a/v2v/windows_virtio.mli
+++ b/v2v/windows_virtio.mli
@@ -20,7 +20,7 @@
val install_drivers
: Registry.t -> Types.inspect -> Types.requested_guestcaps ->
- Types.guestcaps_block_type * Types.guestcaps_net_type * Types.guestcaps_video_type * bool * bool * bool
+ Types.guestcaps_block_type * Types.guestcaps_net_type * Types.guestcaps_video_type * bool * bool * bool * bool
(** [install_drivers reg inspect rcaps]
installs virtio drivers from the driver directory or driver
ISO into the guest driver directory and updates the registry
--
2.27.0
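The shape of this change can be modelled outside OCaml. Below is an illustrative Python sketch (the names mirror the patch; this is not a real v2v API) of the two halves: driver-file presence gates the capability flag, and the flag in turn decides the emitted qemu argument.

```python
def driver_caps(driverdir_files):
    """Model of the probing in windows_virtio.ml: a feature is enabled
    only if install_drivers actually copied its .inf file to the guest,
    i.e. the driver passed the minor/major version check.
    """
    exists = lambda name: name in driverdir_files
    return {
        "virtio_rng":     exists("viorng.inf"),
        "virtio_balloon": exists("balloon.inf"),
        "isa_pvpanic":    exists("pvpanic.inf"),
        "virtio_socket":  exists("viosock.inf"),   # added by this patch
    }

def qemu_viosock_args(caps):
    """Model of the output_qemu.ml hunk: -viosock is always emitted,
    with the model chosen by the probed capability."""
    return ["-viosock", "virtio" if caps["virtio_socket"] else "none"]

caps = driver_caps({"viorng.inf", "viosock.inf"})
assert caps["virtio_socket"] and not caps["isa_pvpanic"]
assert qemu_viosock_args(caps) == ["-viosock", "virtio"]
```

The same flag also drives the libvirt XML (`<viosock model="virtio|none">`) and the JSON metadata, so all output modes stay consistent with what was actually installed.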
Re: [Libguestfs] [libguestfs/libnbd] nbd_connect_tcp: handshake: server policy prevents NBD_OPT_GO (#4)
by Richard W.M. Jones
On Thu, Feb 11, 2021 at 08:37:10AM -0800, Shem Pasamba wrote:
> Running ./nbdfuse ../mnt --tcp 172.27.20.55 2222 results in:
>
> nbd_connect_tcp: handshake: server policy prevents NBD_OPT_GO
>
> # ./nbdfuse --version
> nbdfuse 1.6.1
> libnbd 1.7.1
>
> With nbd-server started on Ubuntu 20.04 with command:
>
> sudo nbd-server 2222 /dev/md1
In general I would recommend using a more featureful server,
i.e. qemu-nbd or nbdkit.
The error means that the server returned error NBD_REP_ERR_POLICY, but
the true reason will only be contained in the logs on the server side.
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
ANNOUNCE: libnbd and nbdkit projects are moving to GitLab
by Martin Kletzander
I am pleased to announce that we made a successful switch from GitHub to
GitLab for nbdkit[0] and libnbd[1] projects today under the collective
group name nbdkit[2].
GitLab is the primary repository from now on. All changes will be
automatically mirrored to GitHub on a regular basis in order not to
break any current workflow. It is advised, however, that since GitLab
is the new primary repository all repository URLs should switch to
GitLab as well.
This move will help us expand CI for these projects as we will be able
to utilise all the effort that went to the great libvirt-ci[3] project
and I hope to post patches soon as I have them almost ready.
If you have any questions or concerns, don't hesitate to contact me
and/or Rich.
Have a nice day everyone!
Martin
[0] https://gitlab.com/nbdkit/nbdkit
[1] https://gitlab.com/nbdkit/libnbd
[2] https://gitlab.com/nbdkit
[3] https://gitlab.com/libvirt/libvirt-ci