[nbdkit PATCH] cow: Always advertise multi-conn
by Eric Blake
Our cow data is locked for proper sharing across connections, so even
though NBD_CMD_FLUSH is an intentional no-op, a client is guaranteed
that after a flush by connection A, a read by connection B will not
see any stale data, which matches the requirements for always
advertising multi-conn.
---
filters/cow/cow.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 93e10f24..1085ee36 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -181,6 +181,14 @@ cow_can_cache (struct nbdkit_next_ops *next_ops, void *nxdata, void *handle)
return NBDKIT_CACHE_NATIVE;
}
+static int
+cow_can_multi_conn (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ /* Our cache is consistent between connections. */
+ return 1;
+}
+
/* Override the plugin's .can_fast_zero, because our .zero is not fast */
static int
cow_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -712,6 +720,7 @@ static struct nbdkit_filter filter = {
.can_fua = cow_can_fua,
.can_cache = cow_can_cache,
.can_fast_zero = cow_can_fast_zero,
+ .can_multi_conn = cow_can_multi_conn,
.pread = cow_pread,
.pwrite = cow_pwrite,
.zero = cow_zero,
--
2.30.1
[nbdkit PATCH] cache: Advertise flush/fua/multi_conn if cache=unsafe
by Eric Blake
When using cache=unsafe, we have already stated that the plugin may
never see our cached data written back, but locally everything is
consistent (we grab a shared lock before consulting the cache), so we
can behave as if client flush requests always work in a multi-conn
safe manner, even without bothering with fdatasync() on our cache
file.
When using cache=writethrough, our behavior under multiple connections
is at the mercy of the plugin (if connection A and B both write, but
only B flushes, we can only guarantee that the data from A was flushed
if the plugin advertises multi-conn; as otherwise, the write by
connection A may still be in a cache that was not flushed by B's
command). But when using cache=writeback, we are guaranteed that only
one connection is ever writing back to the server at a time (that is,
once we flush, we are flushing data from ALL connections), so we can
always advertise multi-conn.
---
This is fallout from IRC discussion we had earlier today about whether
it was safe to enable multi-conn on the ssh plugin.
After re-reading
https://sourceforge.net/p/nbd/mailman/nbd-general/?viewmonth=201610&viewd...,
my take is that the original goal for MULTI_CONN is that when clear,
if connection A and B both write, then both must FLUSH to guarantee
that their writes reached persistent storage (if only one flushes, the
other connection may still have cached data that remains unflushed).
But when MULTI_CONN is set, only one of the two connections has to
issue a flush after both connections have received replies to all
writes that are intended to be made persistent. And the way we share
our cache for cache=unsafe and cache=writeback meets that goal.
filters/cache/cache.c | 56 +++++++++++++++++++++++++++++++++++++++++--
1 file changed, 54 insertions(+), 2 deletions(-)
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index b979f256..9e5a3e80 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -257,6 +257,53 @@ cache_can_fast_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
return 1;
}
+/* Override the plugin's .can_flush, if we are cache=unsafe */
+static int
+cache_can_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ if (cache_mode == CACHE_MODE_UNSAFE)
+ return 1;
+ return next_ops->can_flush (nxdata);
+}
+
+
+/* Override the plugin's .can_fua, if we are cache=unsafe */
+static int
+cache_can_fua (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ if (cache_mode == CACHE_MODE_UNSAFE)
+ return NBDKIT_FUA_NATIVE;
+ return next_ops->can_fua (nxdata);
+}
+
+/* Override the plugin's .can_multi_conn, if we are not cache=writethrough */
+static int
+cache_can_multi_conn (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ /* For CACHE_MODE_NONE, we always advertise a no-op flush because
+ * our local cache access is consistent between connections, and we
+ * don't care about persisting the data to the underlying plugin.
+ *
+ * For CACHE_MODE_WRITEBACK, things are more subtle: we only write
+ * to the plugin during NBD_CMD_FLUSH, at which point that one
+ * connection writes back ALL cached blocks regardless of which
+ * connection originally wrote them, so a client can be assured that
+ * blocks from all connections have reached the plugin's permanent
+ * storage with only one connection having to send a flush.
+ *
+ * But for CACHE_MODE_WRITETHROUGH, we are at the mercy of the
+ * plugin; data written by connection A is not guaranteed to be made
+ * persistent by a flush from connection B unless the plugin itself
+ * supports multi-conn.
+ */
+ if (cache_mode != CACHE_MODE_WRITETHROUGH)
+ return 1;
+ return next_ops->can_multi_conn (nxdata);
+}
+
/* Read data. */
static int
cache_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -352,7 +399,8 @@ cache_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
}
if ((flags & NBDKIT_FLAG_FUA) &&
- next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+ (cache_mode == CACHE_MODE_UNSAFE ||
+ next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE)) {
flags &= ~NBDKIT_FLAG_FUA;
need_flush = true;
}
@@ -442,7 +490,8 @@ cache_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
flags &= ~NBDKIT_FLAG_MAY_TRIM;
if ((flags & NBDKIT_FLAG_FUA) &&
- next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE) {
+ (cache_mode == CACHE_MODE_UNSAFE ||
+ next_ops->can_fua (nxdata) == NBDKIT_FUA_EMULATE)) {
flags &= ~NBDKIT_FLAG_FUA;
need_flush = true;
}
@@ -639,6 +688,9 @@ static struct nbdkit_filter filter = {
.get_size = cache_get_size,
.can_cache = cache_can_cache,
.can_fast_zero = cache_can_fast_zero,
+ .can_flush = cache_can_flush,
+ .can_fua = cache_can_fua,
+ .can_multi_conn = cache_can_multi_conn,
.pread = cache_pread,
.pwrite = cache_pwrite,
.zero = cache_zero,
--
2.30.1
[nbdkit PATCH] blocksize: Allow parallel requests
by Eric Blake
The only time where we have to be careful is when the client sends
unaligned data; if we add a lock around our use of the bounce buffer,
then we can provide the same parallelism as the underlying plugin for
all other transactions.
When we first added .can_multi_conn, we waffled on what the blocksize
filter should do[1]; blindly slamming to 1 when all requests are
serial would have been correct at the time, but we opted to stay
conservative by instead slamming things to 0 with a promise to revisit
it when changing the thread model. Now that we are allowing parallel
transactions, blindly slamming to 1 is no longer a valid option, but
slamming to 0 is overkill when we can instead rely on passthrough to
the plugin's .can_multi_conn.
[1] https://listman.redhat.com/archives/libguestfs/2019-January/msg00074.html
---
filters/blocksize/blocksize.c | 42 +++++++++++++----------------------
1 file changed, 16 insertions(+), 26 deletions(-)
diff --git a/filters/blocksize/blocksize.c b/filters/blocksize/blocksize.c
index c3a2d60d..81d0e76c 100644
--- a/filters/blocksize/blocksize.c
+++ b/filters/blocksize/blocksize.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -49,23 +49,20 @@
#define BLOCKSIZE_MIN_LIMIT (64U * 1024)
-/* As long as we don't have parallel requests, we can reuse a common
- * buffer for alignment purposes; size it to the maximum we allow for
- * minblock. */
+/* In order to handle parallel requests safely, this lock must be held
+ * when using the bounce buffer.
+ */
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
+/* A single bounce buffer for alignment purposes, guarded by the lock.
+ * Size it to the maximum we allow for minblock.
+ */
static char bounce[BLOCKSIZE_MIN_LIMIT];
+
static unsigned int minblock;
static unsigned int maxdata;
static unsigned int maxlen;
-/* We are using a common bounce buffer (see above) so we must
- * serialize all requests.
- */
-static int
-blocksize_thread_model (void)
-{
- return NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS;
-}
-
static int
blocksize_parse (const char *name, const char *s, unsigned int *v)
{
@@ -157,17 +154,6 @@ blocksize_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return ROUND_DOWN (size, minblock);
}
-static int
-blocksize_can_multi_conn (struct nbdkit_next_ops *next_ops, void *nxdata,
- void *handle)
-{
- /* Although we are serializing all requests anyway so this likely
- * doesn't make a difference, return false because the bounce buffer
- * is not consistent for flush.
- */
- return 0;
-}
-
static int
blocksize_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle, void *b, uint32_t count, uint64_t offs,
@@ -179,6 +165,7 @@ blocksize_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned head */
if (offs & (minblock - 1)) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
drop = offs & (minblock - 1);
keep = MIN (minblock - drop, count);
if (next_ops->pread (nxdata, bounce, minblock, offs - drop, flags,
@@ -202,6 +189,7 @@ blocksize_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
if (next_ops->pread (nxdata, bounce, minblock, offs, flags, err) == -1)
return -1;
memcpy (buf, bounce, count);
@@ -228,6 +216,7 @@ blocksize_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned head */
if (offs & (minblock - 1)) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
drop = offs & (minblock - 1);
keep = MIN (minblock - drop, count);
if (next_ops->pread (nxdata, bounce, minblock, offs - drop, 0, err) == -1)
@@ -253,6 +242,7 @@ blocksize_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
if (next_ops->pread (nxdata, bounce, minblock, offs, 0, err) == -1)
return -1;
memcpy (bounce, buf, count);
@@ -332,6 +322,7 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned head */
if (offs & (minblock - 1)) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
drop = offs & (minblock - 1);
keep = MIN (minblock - drop, count);
if (next_ops->pread (nxdata, bounce, minblock, offs - drop, 0, err) == -1)
@@ -355,6 +346,7 @@ blocksize_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
if (next_ops->pread (nxdata, bounce, minblock, offs, 0, err) == -1)
return -1;
memset (bounce, 0, count);
@@ -437,12 +429,10 @@ blocksize_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
static struct nbdkit_filter filter = {
.name = "blocksize",
.longname = "nbdkit blocksize filter",
- .thread_model = blocksize_thread_model,
.config = blocksize_config,
.config_complete = blocksize_config_complete,
.config_help = blocksize_config_help,
.get_size = blocksize_get_size,
- .can_multi_conn = blocksize_can_multi_conn,
.pread = blocksize_pread,
.pwrite = blocksize_pwrite,
.trim = blocksize_trim,
--
2.30.1
[PATCH nbdkit] cow: Use more fine-grained locking.
by Richard W.M. Jones
perf analysis of the cow filter showed that a great deal of time was
spent holding the global lock. This change makes the locking more
fine-grained so now we are only using the global lock to protect the
bitmap from concurrent access and nothing else.
I added a new read-modify-write lock to protect a few places where we
use an RMW cycle to modify an unaligned block, which should be a
fairly rare occurrence. Previously the global lock protected these.
A benchmark shows this is a considerable improvement. Running the test
from here:
http://git.annexia.org/?p=libguestfs-talks.git;a=blob;f=2021-pipelines/be...
Before this change:
real 3m51.086s
user 0m0.994s
sys 0m3.568s
After this change:
real 2m1.091s
user 0m1.038s
sys 0m3.576s
Compared to a similar pipeline using qemu-img convert and qcow2:
http://git.annexia.org/?p=libguestfs-talks.git;a=blob;f=2021-pipelines/be...
real 2m22.368s
user 0m54.415s
sys 1m27.410s
---
filters/cow/blk.h | 7 -------
filters/cow/blk.c | 27 ++++++++++++++++++++++++++-
filters/cow/cow.c | 41 ++++++++++++++---------------------------
3 files changed, 40 insertions(+), 35 deletions(-)
diff --git a/filters/cow/blk.h b/filters/cow/blk.h
index d96b9924..be3e764a 100644
--- a/filters/cow/blk.h
+++ b/filters/cow/blk.h
@@ -44,13 +44,6 @@ extern int blk_init (void);
/* Close the overlay, free the bitmap. */
extern void blk_free (void);
-/*----------------------------------------------------------------------
- * ** NOTE **
- *
- * An exclusive lock must be held when you call any function below
- * this line.
- */
-
/* Allocate or resize the overlay and bitmap. */
extern int blk_set_size (uint64_t new_size);
diff --git a/filters/cow/blk.c b/filters/cow/blk.c
index 51ba91c4..cbd40188 100644
--- a/filters/cow/blk.c
+++ b/filters/cow/blk.c
@@ -90,9 +90,12 @@
#include <alloca.h>
#endif
+#include <pthread.h>
+
#include <nbdkit-filter.h>
#include "bitmap.h"
+#include "cleanup.h"
#include "fdatasync.h"
#include "rounding.h"
#include "pread.h"
@@ -104,6 +107,9 @@
/* The temporary overlay. */
static int fd = -1;
+/* This lock protects the bitmap from parallel access. */
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
/* Bitmap. */
static struct bitmap bm;
@@ -186,6 +192,8 @@ static uint64_t size = 0;
int
blk_set_size (uint64_t new_size)
{
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+
size = new_size;
if (bitmap_resize (&bm, size) == -1)
@@ -205,6 +213,7 @@ blk_set_size (uint64_t new_size)
void
blk_status (uint64_t blknum, bool *present, bool *trimmed)
{
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
*present = state != BLOCK_NOT_ALLOCATED;
@@ -219,7 +228,18 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t blknum, uint8_t *block, int *err)
{
off_t offset = blknum * BLKSIZE;
- enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ enum bm_entry state;
+
+ /* The state might be modified from another thread - for example
+ * another thread might write (BLOCK_NOT_ALLOCATED ->
+ * BLOCK_ALLOCATED) while we are reading from the plugin, returning
+ * the old data. However a read issued after the write returns
+ * should always return the correct data.
+ */
+ {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ }
nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64 ") is %s",
blknum, (uint64_t) offset, state_to_string (state));
@@ -260,6 +280,8 @@ int
blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t blknum, uint8_t *block, enum cache_mode mode, int *err)
{
+ /* XXX Could make this lock more fine-grained with some thought. */
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
off_t offset = blknum * BLKSIZE;
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
unsigned n = BLKSIZE, tail = 0;
@@ -322,6 +344,8 @@ blk_write (uint64_t blknum, const uint8_t *block, int *err)
nbdkit_error ("pwrite: %m");
return -1;
}
+
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
bitmap_set_blk (&bm, blknum, BLOCK_ALLOCATED);
return 0;
@@ -339,6 +363,7 @@ blk_trim (uint64_t blknum, int *err)
* here. However it's not trivial since BLKSIZE is unrelated to the
* overlay filesystem block size.
*/
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
bitmap_set_blk (&bm, blknum, BLOCK_TRIMMED);
return 0;
}
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 93e10f24..54291476 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -45,16 +45,17 @@
#include <nbdkit-filter.h>
#include "cleanup.h"
-
-#include "blk.h"
#include "isaligned.h"
#include "minmax.h"
#include "rounding.h"
-/* In order to handle parallel requests safely, this lock must be held
- * when calling any blk_* functions.
+#include "blk.h"
+
+/* Read-modify-write requests are serialized through this global lock.
+ * This is only used for unaligned requests which should be
+ * infrequent.
*/
-static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+static pthread_mutex_t rmw_lock = PTHREAD_MUTEX_INITIALIZER;
bool cow_on_cache;
@@ -117,7 +118,6 @@ cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
nbdkit_debug ("cow: underlying file size: %" PRIi64, size);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_set_size (size);
if (r == -1)
return -1;
@@ -221,7 +221,6 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t n = MIN (BLKSIZE - blkoffs, count);
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r == -1)
return -1;
@@ -242,7 +241,6 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
* smarter here.
*/
while (count >= BLKSIZE) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_read (next_ops, nxdata, blknum, buf, err);
if (r == -1)
return -1;
@@ -256,7 +254,6 @@ cow_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r == -1)
return -1;
@@ -294,10 +291,10 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t n = MIN (BLKSIZE - blkoffs, count);
/* Do a read-modify-write operation on the current block.
- * Hold the lock over the whole operation.
+ * Hold the rmw_lock over the whole operation.
*/
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memcpy (&block[blkoffs], buf, n);
@@ -314,7 +311,6 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Aligned body */
while (count >= BLKSIZE) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_write (blknum, buf, err);
if (r == -1)
return -1;
@@ -328,7 +324,7 @@ cow_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
assert (block);
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memcpy (block, buf, count);
@@ -376,9 +372,9 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t n = MIN (BLKSIZE - blkoffs, count);
/* Do a read-modify-write operation on the current block.
- * Hold the lock over the whole operation.
+ * Hold the rmw_lock over the whole operation.
*/
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[blkoffs], 0, n);
@@ -399,7 +395,6 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* XXX There is the possibility of optimizing this: since this loop is
* writing a whole, aligned block, we should use FALLOC_FL_ZERO_RANGE.
*/
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_write (blknum, block, err);
if (r == -1)
return -1;
@@ -411,7 +406,7 @@ cow_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[count], 0, BLKSIZE - count);
@@ -455,7 +450,7 @@ cow_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Do a read-modify-write operation on the current block.
* Hold the lock over the whole operation.
*/
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[blkoffs], 0, n);
@@ -471,7 +466,6 @@ cow_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Aligned body */
while (count >= BLKSIZE) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_trim (blknum, err);
if (r == -1)
return -1;
@@ -483,7 +477,7 @@ cow_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Unaligned tail */
if (count) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&rmw_lock);
r = blk_read (next_ops, nxdata, blknum, block, err);
if (r != -1) {
memset (&block[count], 0, BLKSIZE - count);
@@ -553,7 +547,6 @@ cow_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
/* Aligned body */
while (remaining) {
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_cache (next_ops, nxdata, blknum, block, mode, err);
if (r == -1)
return -1;
@@ -588,12 +581,6 @@ cow_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
assert (IS_ALIGNED (count, BLKSIZE));
assert (count > 0); /* We must make forward progress. */
- /* We hold the lock for the whole time, even when requesting extents
- * from the plugin, because we want to present an atomic picture of
- * the current state.
- */
- ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
-
while (count > 0) {
bool present, trimmed;
struct nbdkit_extent e;
--
2.29.0.rc2
[PATCH] nbdcopy: Speed up raw images copy
by Nir Soffer
When exporting a sparse raw image, qemu-nbd reports unallocated areas as
zero:
$ qemu-nbd --persistent --socket /tmp/nbd.sock --read-only --shared=10 \
--format raw empty-6g.raw --cache=none --aio=native
$ nbdinfo --map nbd+unix:///?socket=/tmp/nbd.sock
0 6442450944 2 zero
When using qcow2 image, it reports unallocated areas as a hole:
$ qemu-nbd --persistent --socket /tmp/nbd.sock --read-only --shared=10 \
--format qcow2 empty-6g.qcow2 --cache=none --aio=native
$ nbdinfo --map nbd+unix:///?socket=/tmp/nbd.sock
0 6442450944 3 hole,zero
Since nbdcopy is ignoring the ZERO flag and using only the HOLE flag,
copying raw images is extremely slow:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 1.595 s ± 0.034 s [User: 2.284 s, System: 3.565 s]
Range (min … max): 1.522 s … 1.630 s 10 runs
This is 69 times slower than qemu-img:
$ hyperfine -w3 "qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock \
'json:{\"file.driver\":\"null-co\",\"file.size\":\"6g\"}'"
Benchmark #1: qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock 'json:{"file.driver":"null-co","file.size":"6g"}'
Time (mean ± σ): 23.1 ms ± 0.5 ms [User: 6.3 ms, System: 16.5 ms]
Range (min … max): 22.6 ms … 25.5 ms 124 runs
Using ZERO instead of HOLE, nbdcopy does not read zero extents from the
server so it can copy this image 165 times faster:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 9.8 ms ± 0.8 ms [User: 6.7 ms, System: 5.2 ms]
Range (min … max): 9.2 ms … 15.4 ms 287 runs
Real images show a smaller speedup, only 2 times faster:
$ qemu-nbd --persistent --socket /tmp/nbd.sock --read-only --shared=10 \
--format raw fedora-32.raw --cache=none --aio=native
Before:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 1.613 s ± 0.181 s [User: 1.843 s, System: 2.820 s]
Range (min … max): 1.407 s … 1.829 s 10 runs
After:
$ hyperfine -w3 "./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:"
Benchmark #1: ./nbdcopy nbd+unix:///?socket=/tmp/nbd.sock null:
Time (mean ± σ): 795.5 ms ± 78.7 ms [User: 198.3 ms, System: 743.1 ms]
Range (min … max): 743.3 ms … 1012.0 ms 10 runs
For reference, copying the same image with qemu-img:
$ hyperfine -w3 "qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock \
'json:{\"file.driver\":\"null-co\",\"file.size\":\"6g\"}'"
Benchmark #1: qemu-img convert -n nbd+unix:///?socket=/tmp/nbd.sock 'json:{"file.driver":"null-co","file.size":"6g"}'
Time (mean ± σ): 1.046 s ± 0.028 s [User: 122.3 ms, System: 354.5 ms]
Range (min … max): 1.026 s … 1.121 s 10 runs
This issue does not exist when copying from a file, since in that case
we detect unallocated areas as holes.
Signed-off-by: Nir Soffer <nsoffer(a)redhat.com>
---
copy/file-ops.c | 4 ++--
copy/main.c | 2 +-
copy/multi-thread-copying.c | 4 ++--
copy/nbd-ops.c | 10 ++++++++--
copy/nbdcopy.h | 2 +-
copy/synch-copying.c | 2 +-
6 files changed, 15 insertions(+), 9 deletions(-)
diff --git a/copy/file-ops.c b/copy/file-ops.c
index f61b67e..2a239d0 100644
--- a/copy/file-ops.c
+++ b/copy/file-ops.c
@@ -250,7 +250,7 @@ file_get_extents (struct rw *rw, uintptr_t index,
if (pos > offset) {
e.offset = offset;
e.length = pos - offset;
- e.hole = true;
+ e.zero = true;
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
@@ -271,7 +271,7 @@ file_get_extents (struct rw *rw, uintptr_t index,
if (pos > offset) {
e.offset = offset;
e.length = pos - offset;
- e.hole = false;
+ e.zero = false;
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
diff --git a/copy/main.c b/copy/main.c
index cfecb32..68a6030 100644
--- a/copy/main.c
+++ b/copy/main.c
@@ -667,7 +667,7 @@ default_get_extents (struct rw *rw, uintptr_t index,
e.offset = offset;
e.length = count;
- e.hole = false;
+ e.zero = false;
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
diff --git a/copy/multi-thread-copying.c b/copy/multi-thread-copying.c
index 4576119..98b4056 100644
--- a/copy/multi-thread-copying.c
+++ b/copy/multi-thread-copying.c
@@ -157,8 +157,8 @@ worker_thread (void *indexp)
char *data;
size_t len;
- if (exts.ptr[i].hole) {
- /* The source is a hole so we can proceed directly to
+ if (exts.ptr[i].zero) {
+ /* The source is zero so we can proceed directly to
* skipping, trimming or writing zeroes at the destination.
*/
command = calloc (1, sizeof *command);
diff --git a/copy/nbd-ops.c b/copy/nbd-ops.c
index f7dc37c..0bcf29b 100644
--- a/copy/nbd-ops.c
+++ b/copy/nbd-ops.c
@@ -190,8 +190,14 @@ add_extent (void *vp, const char *metacontext,
e.offset = offset;
e.length = entries[i];
- /* Note we deliberately don't care about the ZERO flag. */
- e.hole = (entries[i+1] & LIBNBD_STATE_HOLE) != 0;
+
+ /*
+ * Note we deliberately don't care about the HOLE flag. There is no need to
+ * read an extent that reads as zeroes. We will convert it to a hole or an
+ * allocated extent based on the command line arguments.
+ */
+ e.zero = (entries[i+1] & LIBNBD_STATE_ZERO) != 0;
+
if (extent_list_append (ret, e) == -1) {
perror ("realloc");
exit (EXIT_FAILURE);
diff --git a/copy/nbdcopy.h b/copy/nbdcopy.h
index f586fc5..69fac2a 100644
--- a/copy/nbdcopy.h
+++ b/copy/nbdcopy.h
@@ -100,7 +100,7 @@ struct command {
struct extent {
uint64_t offset;
uint64_t length;
- bool hole;
+ bool zero;
};
DEFINE_VECTOR_TYPE(extent_list, struct extent);
diff --git a/copy/synch-copying.c b/copy/synch-copying.c
index 043893f..2712c10 100644
--- a/copy/synch-copying.c
+++ b/copy/synch-copying.c
@@ -68,7 +68,7 @@ synch_copying (void)
for (i = 0; i < exts.size; ++i) {
assert (exts.ptr[i].length <= count);
- if (exts.ptr[i].hole) {
+ if (exts.ptr[i].zero) {
if (!dst.ops->synch_trim (&dst, offset, exts.ptr[i].length) &&
!dst.ops->synch_zero (&dst, offset, exts.ptr[i].length)) {
/* If neither trimming nor efficient zeroing are possible,
--
2.26.2
[PATCH nbdkit] cache, cow: Do not round size down.
by Richard W.M. Jones
Both the cache and cow filters have the annoying behaviour that the
size of the underlying plugin is rounded down to the block size used
by the filters (ie. 4K). This is especially visible for single sector
disk images:
$ ./nbdkit memory size=512 --filter=cow --run 'nbdinfo --size $uri'
0
$ ./nbdkit memory size=512 --filter=cache --run 'nbdinfo --size $uri'
0
but affects all users of these filters somewhat randomly and
unexpectedly. (For example I had a GPT partitioned disk image from a
customer which did not have a 4K aligned size, which was unexpectedly
"corrupted" when I tried to open it using the cow filter - the reason
for the corruption was the backup partition table at the end of the
disk being truncated.)
Getting rid of this assumption is awkward: the general idea is that we
round the size of the backing file / bitmap up to a full block, but
whenever we are asked to read or write the underlying plugin we handle
the tail case carefully.
This also tests various corner cases.
---
filters/cache/nbdkit-cache-filter.pod | 9 +---
filters/cow/nbdkit-cow-filter.pod | 11 +----
tests/Makefile.am | 4 ++
filters/cache/blk.c | 54 +++++++++++++++++++---
filters/cache/cache.c | 12 ++---
filters/cow/blk.c | 49 +++++++++++++++++---
filters/cow/cow.c | 23 +++++++---
tests/test-cache-unaligned.sh | 66 +++++++++++++++++++++++++++
tests/test-cow-unaligned.sh | 56 +++++++++++++++++++++++
9 files changed, 239 insertions(+), 45 deletions(-)
diff --git a/filters/cache/nbdkit-cache-filter.pod b/filters/cache/nbdkit-cache-filter.pod
index 3601dcd5..34fd0b29 100644
--- a/filters/cache/nbdkit-cache-filter.pod
+++ b/filters/cache/nbdkit-cache-filter.pod
@@ -25,12 +25,6 @@ does not have effective caching, or (with C<cache=unsafe>) to defeat
flush requests from the client (which is unsafe and can cause data
loss, as the name suggests).
-Note that the use of this filter rounds the image size down to a
-multiple of the caching granularity (the larger of 4096 or the
-C<f_bsize> field of L<fstatvfs(3)>), to ease the implementation. If
-you need to round the image size up instead to access the last few
-bytes, combine this filter with L<nbdkit-truncate-filter(1)>.
-
This filter only caches image contents. To cache image metadata, use
L<nbdkit-cacheextents-filter(1)> between this filter and the plugin.
To accelerate sequential reads, use L<nbdkit-readahead-filter(1)>
@@ -168,7 +162,6 @@ L<nbdkit(1)>,
L<nbdkit-file-plugin(1)>,
L<nbdkit-cacheextents-filter(1)>,
L<nbdkit-readahead-filter(1)>,
-L<nbdkit-truncate-filter(1)>,
L<nbdkit-filter(3)>,
L<qemu-img(1)>.
@@ -180,4 +173,4 @@ Richard W.M. Jones
=head1 COPYRIGHT
-Copyright (C) 2018-2020 Red Hat Inc.
+Copyright (C) 2018-2021 Red Hat Inc.
diff --git a/filters/cow/nbdkit-cow-filter.pod b/filters/cow/nbdkit-cow-filter.pod
index 4d5ae856..64df3fbd 100644
--- a/filters/cow/nbdkit-cow-filter.pod
+++ b/filters/cow/nbdkit-cow-filter.pod
@@ -31,14 +31,6 @@ that away. It also allows us to create diffs, see below.
The plugin is opened read-only (as if the I<-r> flag was passed), but
you should B<not> pass the I<-r> flag to nbdkit.
-=item *
-
-Note that the use of this filter rounds the image size down to a
-multiple of the caching granularity (4096), to ease the
-implementation. If you need to round the image size up instead to
-access the last few bytes, combine this filter with
-L<nbdkit-truncate-filter(1)>.
-
=back
Limitations of the filter include:
@@ -140,7 +132,6 @@ C<nbdkit-cow-filter> first appeared in nbdkit 1.2.
L<nbdkit(1)>,
L<nbdkit-file-plugin(1)>,
-L<nbdkit-truncate-filter(1)>,
L<nbdkit-xz-filter(1)>,
L<nbdkit-filter(3)>,
L<qemu-img(1)>.
@@ -153,4 +144,4 @@ Richard W.M. Jones
=head1 COPYRIGHT
-Copyright (C) 2018 Red Hat Inc.
+Copyright (C) 2018-2021 Red Hat Inc.
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 42e842b2..70898f20 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -1353,11 +1353,13 @@ TESTS += \
test-cache.sh \
test-cache-on-read.sh \
test-cache-max-size.sh \
+ test-cache-unaligned.sh \
$(NULL)
EXTRA_DIST += \
test-cache.sh \
test-cache-on-read.sh \
test-cache-max-size.sh \
+ test-cache-unaligned.sh \
$(NULL)
# cacheextents filter test.
@@ -1380,6 +1382,7 @@ TESTS += \
test-cow.sh \
test-cow-extents1.sh \
test-cow-extents2.sh \
+ test-cow-unaligned.sh \
$(NULL)
endif
TESTS += test-cow-null.sh
@@ -1388,6 +1391,7 @@ EXTRA_DIST += \
test-cow-extents1.sh \
test-cow-extents2.sh \
test-cow-null.sh \
+ test-cow-unaligned.sh \
$(NULL)
# delay filter tests.
diff --git a/filters/cache/blk.c b/filters/cache/blk.c
index dc1bcc57..b9af6909 100644
--- a/filters/cache/blk.c
+++ b/filters/cache/blk.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -54,6 +54,7 @@
#include "bitmap.h"
#include "minmax.h"
+#include "rounding.h"
#include "utils.h"
#include "cache.h"
@@ -165,18 +166,25 @@ blk_free (void)
lru_free ();
}
+/* Because blk_set_size is called before the other blk_* functions
+ * this should be set to the true size before we need it.
+ */
+static uint64_t size = 0;
+
int
blk_set_size (uint64_t new_size)
{
- if (bitmap_resize (&bm, new_size) == -1)
+ size = new_size;
+
+ if (bitmap_resize (&bm, size) == -1)
return -1;
- if (ftruncate (fd, new_size) == -1) {
+ if (ftruncate (fd, ROUND_UP (size, blksize)) == -1) {
nbdkit_error ("ftruncate: %m");
return -1;
}
- if (lru_set_size (new_size) == -1)
+ if (lru_set_size (size) == -1)
return -1;
return 0;
@@ -199,9 +207,22 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
"unknown");
if (state == BLOCK_NOT_CACHED) { /* Read underlying plugin. */
- if (next_ops->pread (nxdata, block, blksize, offset, 0, err) == -1)
+ unsigned n = blksize, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
return -1;
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+
/* If cache-on-read, copy the block to the cache. */
if (cache_on_read) {
nbdkit_debug ("cache: cache-on-read block %" PRIu64
@@ -247,9 +268,22 @@ blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
if (state == BLOCK_NOT_CACHED) {
/* Read underlying plugin, copy to cache regardless of cache-on-read. */
- if (next_ops->pread (nxdata, block, blksize, offset, 0, err) == -1)
+ unsigned n = blksize, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
return -1;
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+
nbdkit_debug ("cache: cache block %" PRIu64 " (offset %" PRIu64 ")",
blknum, (uint64_t) offset);
@@ -281,6 +315,12 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
int *err)
{
off_t offset = blknum * blksize;
+ unsigned n = blksize, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
reclaim (fd, &bm);
@@ -293,7 +333,7 @@ blk_writethrough (struct nbdkit_next_ops *next_ops, void *nxdata,
return -1;
}
- if (next_ops->pwrite (nxdata, block, blksize, offset, flags, err) == -1)
+ if (next_ops->pwrite (nxdata, block, n, offset, flags, err) == -1)
return -1;
bitmap_set_blk (&bm, blknum, BLOCK_CLEAN);
diff --git a/filters/cache/cache.c b/filters/cache/cache.c
index 91dcc43d..b979f256 100644
--- a/filters/cache/cache.c
+++ b/filters/cache/cache.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -201,9 +201,7 @@ cache_config_complete (nbdkit_next_config_complete *next, void *nxdata)
return next (nxdata);
}
-/* Get the file size; round it down to cache granularity before
- * setting cache size.
- */
+/* Get the file size, set the cache size. */
static int64_t
cache_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle)
@@ -216,7 +214,6 @@ cache_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return -1;
nbdkit_debug ("cache: underlying file size: %" PRIi64, size);
- size = ROUND_DOWN (size, blksize);
ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_set_size (size);
@@ -226,8 +223,9 @@ cache_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return size;
}
-/* Force an early call to cache_get_size, consequently truncating the
- * cache to the correct size.
+/* Force an early call to cache_get_size because we have to set the
+ * backing file size and bitmap size before any other read or write
+ * calls.
*/
static int
cache_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
diff --git a/filters/cow/blk.c b/filters/cow/blk.c
index ae0db1fe..51ba91c4 100644
--- a/filters/cow/blk.c
+++ b/filters/cow/blk.c
@@ -94,6 +94,7 @@
#include "bitmap.h"
#include "fdatasync.h"
+#include "rounding.h"
#include "pread.h"
#include "pwrite.h"
#include "utils.h"
@@ -176,14 +177,21 @@ blk_free (void)
bitmap_free (&bm);
}
+/* Because blk_set_size is called before the other blk_* functions
+ * this should be set to the true size before we need it.
+ */
+static uint64_t size = 0;
+
/* Allocate or resize the overlay file and bitmap. */
int
blk_set_size (uint64_t new_size)
{
- if (bitmap_resize (&bm, new_size) == -1)
+ size = new_size;
+
+ if (bitmap_resize (&bm, size) == -1)
return -1;
- if (ftruncate (fd, new_size) == -1) {
+ if (ftruncate (fd, ROUND_UP (size, BLKSIZE)) == -1) {
nbdkit_error ("ftruncate: %m");
return -1;
}
@@ -216,8 +224,24 @@ blk_read (struct nbdkit_next_ops *next_ops, void *nxdata,
nbdkit_debug ("cow: blk_read block %" PRIu64 " (offset %" PRIu64 ") is %s",
blknum, (uint64_t) offset, state_to_string (state));
- if (state == BLOCK_NOT_ALLOCATED) /* Read underlying plugin. */
- return next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err);
+ if (state == BLOCK_NOT_ALLOCATED) { /* Read underlying plugin. */
+ unsigned n = BLKSIZE, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
+ return -1;
+
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+ return 0;
+ }
else if (state == BLOCK_ALLOCATED) { /* Read overlay. */
if (pread (fd, block, BLKSIZE, offset) == -1) {
*err = errno;
@@ -238,6 +262,12 @@ blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
{
off_t offset = blknum * BLKSIZE;
enum bm_entry state = bitmap_get_blk (&bm, blknum, BLOCK_NOT_ALLOCATED);
+ unsigned n = BLKSIZE, tail = 0;
+
+ if (offset + n > size) {
+ tail = offset + n - size;
+ n -= tail;
+ }
nbdkit_debug ("cow: blk_cache block %" PRIu64 " (offset %" PRIu64 ") is %s",
blknum, (uint64_t) offset, state_to_string (state));
@@ -258,9 +288,16 @@ blk_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
if (mode == BLK_CACHE_IGNORE)
return 0;
if (mode == BLK_CACHE_PASSTHROUGH)
- return next_ops->cache (nxdata, BLKSIZE, offset, 0, err);
- if (next_ops->pread (nxdata, block, BLKSIZE, offset, 0, err) == -1)
+ return next_ops->cache (nxdata, n, offset, 0, err);
+
+ if (next_ops->pread (nxdata, block, n, offset, 0, err) == -1)
return -1;
+ /* Normally we're reading whole blocks, but at the very end of the
+ * file we might read a partial block. Deal with that case by
+ * zeroing the tail.
+ */
+ memset (block + n, 0, tail);
+
if (mode == BLK_CACHE_COW) {
if (pwrite (fd, block, BLKSIZE, offset) == -1) {
*err = errno;
diff --git a/filters/cow/cow.c b/filters/cow/cow.c
index 75d3b907..93e10f24 100644
--- a/filters/cow/cow.c
+++ b/filters/cow/cow.c
@@ -1,5 +1,5 @@
/* nbdkit
- * Copyright (C) 2018-2020 Red Hat Inc.
+ * Copyright (C) 2018-2021 Red Hat Inc.
*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are
@@ -103,9 +103,7 @@ cow_open (nbdkit_next_open *next, void *nxdata,
return NBDKIT_HANDLE_NOT_NEEDED;
}
-/* Get the file size; round it down to overlay granularity before
- * setting overlay size.
- */
+/* Get the file size, set the cache size. */
static int64_t
cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle)
@@ -118,7 +116,6 @@ cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return -1;
nbdkit_debug ("cow: underlying file size: %" PRIi64, size);
- size = ROUND_DOWN (size, BLKSIZE);
ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
r = blk_set_size (size);
@@ -128,8 +125,9 @@ cow_get_size (struct nbdkit_next_ops *next_ops, void *nxdata,
return size;
}
-/* Force an early call to cow_get_size, consequently truncating the
- * overlay to the correct size.
+/* Force an early call to cow_get_size because we have to set the
+ * backing file size and bitmap size before any other read or write
+ * calls.
*/
static int
cow_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -627,6 +625,7 @@ cow_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
uint64_t range_offset = offset;
uint32_t range_count = 0;
size_t i;
+ int64_t size;
/* Asking the plugin for a single block of extents is not
* efficient for some plugins (eg. VDDK) so ask for as much data
@@ -643,6 +642,16 @@ cow_extents (struct nbdkit_next_ops *next_ops, void *nxdata,
if (present) break;
}
+ /* Don't ask for extent data beyond the end of the plugin. */
+ size = next_ops->get_size (nxdata);
+ if (size == -1)
+ return -1;
+
+ if (range_offset + range_count > size) {
+ unsigned tail = range_offset + range_count - size;
+ range_count -= tail;
+ }
+
CLEANUP_EXTENTS_FREE struct nbdkit_extents *extents2 =
nbdkit_extents_full (next_ops, nxdata,
range_count, range_offset, flags, err);
diff --git a/tests/test-cache-unaligned.sh b/tests/test-cache-unaligned.sh
new file mode 100755
index 00000000..80b3d128
--- /dev/null
+++ b/tests/test-cache-unaligned.sh
@@ -0,0 +1,66 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2021 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+source ./functions.sh
+set -e
+set -x
+
+requires_filter cache
+requires_nbdsh_uri
+
+sock=$(mktemp -u /tmp/nbdkit-test-sock.XXXXXX)
+files="cache-unaligned.img $sock cache-unaligned.pid"
+rm -f $files
+cleanup_fn rm -f $files
+
+# Create an empty base image which is not a multiple of 4K.
+truncate -s 130000 cache-unaligned.img
+
+# Run nbdkit with the caching filter.
+start_nbdkit -P cache-unaligned.pid -U $sock --filter=cache \
+ file cache-unaligned.img
+
+nbdsh --connect "nbd+unix://?socket=$sock" \
+ -c '
+# Write some pattern data to the overlay and check it reads back OK.
+buf = b"abcdefghijklm" * 10000
+h.pwrite(buf, 0)
+buf2 = h.pread(130000, 0)
+assert buf == buf2
+
+# Flushing should write through to the underlying file.
+h.flush()
+
+with open("cache-unaligned.img", "rb") as file:
+ buf2 = file.read(130000)
+ assert buf == buf2
+'
diff --git a/tests/test-cow-unaligned.sh b/tests/test-cow-unaligned.sh
new file mode 100755
index 00000000..d89b84b9
--- /dev/null
+++ b/tests/test-cow-unaligned.sh
@@ -0,0 +1,56 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2021 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+source ./functions.sh
+set -e
+set -x
+
+requires_filter cow
+requires_nbdsh_uri
+
+sock=$(mktemp -u /tmp/nbdkit-test-sock.XXXXXX)
+files="$sock cow-unaligned.pid"
+rm -f $files
+cleanup_fn rm -f $files
+
+# Run nbdkit with the cow filter and a size which is not a multiple of
+# the cow filter block size (4K).
+start_nbdkit -P cow-unaligned.pid -U $sock --filter=cow memory 130000
+
+nbdsh --connect "nbd+unix://?socket=$sock" \
+ -c '
+# Write some pattern data to the overlay and check it reads back OK.
+buf = b"abcdefghijklm" * 10000
+h.pwrite(buf, 0)
+buf2 = h.pread(130000, 0)
+assert buf == buf2
+'
--
2.29.0.rc2
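The tail-handling pattern repeated in blk_read/blk_cache above can be sketched in isolation. The following is an illustrative Python model of the technique, not nbdkit API: clamp the request for the final partial block at EOF, then zero-fill the tail of the buffer.

```python
BLKSIZE = 4096  # cache/cow granularity used by the patch

def read_block(pread, offset, size):
    """Read one cache block starting at `offset` from a disk of `size`
    bytes, clamping the request at EOF and zero-filling the tail, as
    the patched blk_read()/blk_cache() do.
    """
    n, tail = BLKSIZE, 0
    if offset + n > size:          # the very last block may be partial
        tail = offset + n - size
        n -= tail
    data = pread(offset, n)        # read only the bytes that exist
    return data + b"\0" * tail     # zero the part beyond EOF

# Model a 130000-byte image (not a multiple of 4096), as in the tests.
disk = b"abcdefghijklm" * 10000
pread = lambda off, cnt: disk[off:off + cnt]

last = (len(disk) // BLKSIZE) * BLKSIZE   # offset of the final, partial block
block = read_block(pread, last, len(disk))
assert len(block) == BLKSIZE
assert block[:len(disk) - last] == disk[last:]
```

Writes take the mirror image of this: the full 4K block is kept in the cache file (hence the ROUND_UP in ftruncate), but only the first `n` bytes are written through to the plugin.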
[PATCH] v2v: support configuration of viosock driver
by Valeriy Vdovin
Check that the install_drivers function has copied the viosock driver
files to the Windows guest file system. If it has, the drivers have
passed the minor/major version check and the guest is able to use them.
Once we know the drivers are on the guest, we can enable the virtio
sock option in the configuration and startup script files.
Signed-off-by: Valeriy Vdovin <valeriy.vdovin(a)virtuozzo.com>
---
v2v/convert_linux.ml | 1 +
v2v/convert_windows.ml | 4 +++-
v2v/create_json.ml | 1 +
v2v/create_libvirt_xml.ml | 6 ++++++
v2v/linux_kernels.ml | 4 ++++
v2v/linux_kernels.mli | 1 +
v2v/output_qemu.ml | 4 ++++
v2v/types.ml | 1 +
v2v/types.mli | 2 +-
v2v/windows_virtio.ml | 5 +++--
v2v/windows_virtio.mli | 2 +-
11 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/v2v/convert_linux.ml b/v2v/convert_linux.ml
index 86d387f1..9f22fe3c 100644
--- a/v2v/convert_linux.ml
+++ b/v2v/convert_linux.ml
@@ -154,6 +154,7 @@ let convert (g : G.guestfs) inspect source_disks output rcaps _ =
gcaps_virtio_rng = kernel.ki_supports_virtio_rng;
gcaps_virtio_balloon = kernel.ki_supports_virtio_balloon;
gcaps_isa_pvpanic = kernel.ki_supports_isa_pvpanic;
+ gcaps_virtio_socket = kernel.ki_supports_virtio_socket;
gcaps_machine = machine;
gcaps_arch = Utils.kvm_arch inspect.i_arch;
gcaps_acpi = acpi;
diff --git a/v2v/convert_windows.ml b/v2v/convert_windows.ml
index b452c09b..7842f443 100644
--- a/v2v/convert_windows.ml
+++ b/v2v/convert_windows.ml
@@ -214,7 +214,8 @@ let convert (g : G.guestfs) inspect _ output rcaps static_ips =
video_driver,
virtio_rng_supported,
virtio_ballon_supported,
- isa_pvpanic_supported =
+ isa_pvpanic_supported,
+ virtio_socket_supported =
Registry.with_hive_write g inspect.i_windows_system_hive
update_system_hive in
@@ -256,6 +257,7 @@ let convert (g : G.guestfs) inspect _ output rcaps static_ips =
gcaps_virtio_rng = virtio_rng_supported;
gcaps_virtio_balloon = virtio_ballon_supported;
gcaps_isa_pvpanic = isa_pvpanic_supported;
+ gcaps_virtio_socket = virtio_socket_supported;
gcaps_machine = machine;
gcaps_arch = Utils.kvm_arch inspect.i_arch;
gcaps_acpi = true;
diff --git a/v2v/create_json.ml b/v2v/create_json.ml
index fdf7b12f..316a5536 100644
--- a/v2v/create_json.ml
+++ b/v2v/create_json.ml
@@ -229,6 +229,7 @@ let create_json_metadata source targets target_buses
"virtio-rng", JSON.Bool guestcaps.gcaps_virtio_rng;
"virtio-balloon", JSON.Bool guestcaps.gcaps_virtio_balloon;
"isa-pvpanic", JSON.Bool guestcaps.gcaps_isa_pvpanic;
+ "virtio-socket", JSON.Bool guestcaps.gcaps_virtio_socket;
"acpi", JSON.Bool guestcaps.gcaps_acpi;
] in
List.push_back doc ("guestcaps", JSON.Dict guestcaps_dict);
diff --git a/v2v/create_libvirt_xml.ml b/v2v/create_libvirt_xml.ml
index 212ace2d..6a764cb2 100644
--- a/v2v/create_libvirt_xml.ml
+++ b/v2v/create_libvirt_xml.ml
@@ -521,6 +521,12 @@ let create_libvirt_xml ?pool source targets target_buses guestcaps
e "address" ["type", "isa"; "iobase", "0x505"] []
]
);
+ List.push_back devices (
+ e "viosock"
+ ["model",
+ if guestcaps.gcaps_virtio_socket then "virtio" else "none"]
+ []
+ );
(* Standard devices added to every guest. *)
List.push_back_list devices [
diff --git a/v2v/linux_kernels.ml b/v2v/linux_kernels.ml
index 7e171eae..6dead217 100644
--- a/v2v/linux_kernels.ml
+++ b/v2v/linux_kernels.ml
@@ -44,6 +44,7 @@ type kernel_info = {
ki_supports_virtio_rng : bool;
ki_supports_virtio_balloon : bool;
ki_supports_isa_pvpanic : bool;
+ ki_supports_virtio_socket : bool;
ki_is_xen_pv_only_kernel : bool;
ki_is_debug : bool;
ki_config_file : string option;
@@ -246,6 +247,8 @@ let detect_kernels (g : G.guestfs) inspect family bootloader =
kernel_supports "virtio_balloon" "VIRTIO_BALLOON" in
let supports_isa_pvpanic =
kernel_supports "pvpanic" "PVPANIC" in
+ let supports_virtio_socket =
+ kernel_supports "virtio_socket" "VIRTIO_SOCKET" in
let is_xen_pv_only_kernel =
check_config "X86_XEN" config_file ||
check_config "X86_64_XEN" config_file in
@@ -272,6 +275,7 @@ let detect_kernels (g : G.guestfs) inspect family bootloader =
ki_supports_virtio_rng = supports_virtio_rng;
ki_supports_virtio_balloon = supports_virtio_balloon;
ki_supports_isa_pvpanic = supports_isa_pvpanic;
+ ki_supports_virtio_socket = supports_virtio_socket;
ki_is_xen_pv_only_kernel = is_xen_pv_only_kernel;
ki_is_debug = is_debug;
ki_config_file = config_file;
diff --git a/v2v/linux_kernels.mli b/v2v/linux_kernels.mli
index 028eba81..fe81a036 100644
--- a/v2v/linux_kernels.mli
+++ b/v2v/linux_kernels.mli
@@ -33,6 +33,7 @@ type kernel_info = {
ki_supports_virtio_rng : bool; (** Kernel supports virtio-rng? *)
ki_supports_virtio_balloon : bool; (** Kernel supports memory balloon? *)
ki_supports_isa_pvpanic : bool; (** Kernel supports ISA pvpanic device? *)
+ ki_supports_virtio_socket : bool; (** Kernel supports virtio-socket? *)
ki_is_xen_pv_only_kernel : bool; (** Is a Xen paravirt-only kernel? *)
ki_is_debug : bool; (** Is debug kernel? *)
ki_config_file : string option; (** Path of config file, if found. *)
diff --git a/v2v/output_qemu.ml b/v2v/output_qemu.ml
index be3a3c5e..d6d70c23 100644
--- a/v2v/output_qemu.ml
+++ b/v2v/output_qemu.ml
@@ -247,6 +247,10 @@ object
arg "-balloon" "none";
if guestcaps.gcaps_isa_pvpanic then
arg_list "-device" ["pvpanic"; "ioport=0x505"];
+ if guestcaps.gcaps_virtio_socket then
+ arg "-viosock" "virtio"
+ else
+ arg "-viosock" "none";
(* Add a serial console to Linux guests. *)
if inspect.i_type = "linux" then
diff --git a/v2v/types.ml b/v2v/types.ml
index a8949e4b..4c7ee864 100644
--- a/v2v/types.ml
+++ b/v2v/types.ml
@@ -411,6 +411,7 @@ type guestcaps = {
gcaps_virtio_rng : bool;
gcaps_virtio_balloon : bool;
gcaps_isa_pvpanic : bool;
+ gcaps_virtio_socket : bool;
gcaps_machine : guestcaps_machine;
gcaps_arch : string;
gcaps_acpi : bool;
diff --git a/v2v/types.mli b/v2v/types.mli
index f474dcaa..42a80d9d 100644
--- a/v2v/types.mli
+++ b/v2v/types.mli
@@ -252,7 +252,7 @@ type guestcaps = {
gcaps_virtio_rng : bool; (** Guest supports virtio-rng. *)
gcaps_virtio_balloon : bool; (** Guest supports virtio balloon. *)
gcaps_isa_pvpanic : bool; (** Guest supports ISA pvpanic device. *)
-
+ gcaps_virtio_socket : bool; (** Guest supports virtio socket. *)
gcaps_machine : guestcaps_machine; (** Machine model. *)
gcaps_arch : string; (** Architecture that KVM must emulate. *)
gcaps_acpi : bool; (** True if guest supports acpi. *)
diff --git a/v2v/windows_virtio.ml b/v2v/windows_virtio.ml
index 74a43cc7..cf417a45 100644
--- a/v2v/windows_virtio.ml
+++ b/v2v/windows_virtio.ml
@@ -68,7 +68,7 @@ let rec install_drivers ((g, _) as reg) inspect rcaps =
match net_type with
| Some model -> model
| None -> RTL8139 in
- (IDE, net_type, Cirrus, false, false, false)
+ (IDE, net_type, Cirrus, false, false, false, false)
)
else (
(* Can we install the block driver? *)
@@ -178,9 +178,10 @@ let rec install_drivers ((g, _) as reg) inspect rcaps =
let virtio_rng_supported = g#exists (driverdir // "viorng.inf") in
let virtio_ballon_supported = g#exists (driverdir // "balloon.inf") in
let isa_pvpanic_supported = g#exists (driverdir // "pvpanic.inf") in
+ let virtio_socket_supported = g#exists (driverdir // "viosock.inf") in
(block, net, video,
- virtio_rng_supported, virtio_ballon_supported, isa_pvpanic_supported)
+ virtio_rng_supported, virtio_ballon_supported, isa_pvpanic_supported, virtio_socket_supported)
)
and install_linux_tools g inspect =
diff --git a/v2v/windows_virtio.mli b/v2v/windows_virtio.mli
index c063af3f..642317b1 100644
--- a/v2v/windows_virtio.mli
+++ b/v2v/windows_virtio.mli
@@ -20,7 +20,7 @@
val install_drivers
: Registry.t -> Types.inspect -> Types.requested_guestcaps ->
- Types.guestcaps_block_type * Types.guestcaps_net_type * Types.guestcaps_video_type * bool * bool * bool
+ Types.guestcaps_block_type * Types.guestcaps_net_type * Types.guestcaps_video_type * bool * bool * bool * bool
(** [install_drivers reg inspect rcaps]
installs virtio drivers from the driver directory or driver
ISO into the guest driver directory and updates the registry
--
2.27.0
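The shape of this change can be modelled outside OCaml. Below is an illustrative Python sketch (the names mirror the patch; this is not a real v2v API) of the two halves: driver-file presence gates the capability flag, and the flag in turn decides the emitted qemu argument.

```python
def driver_caps(driverdir_files):
    """Model of the probing in windows_virtio.ml: a feature is enabled
    only if install_drivers actually copied its .inf file to the guest,
    i.e. the driver passed the minor/major version check.
    """
    exists = lambda name: name in driverdir_files
    return {
        "virtio_rng":     exists("viorng.inf"),
        "virtio_balloon": exists("balloon.inf"),
        "isa_pvpanic":    exists("pvpanic.inf"),
        "virtio_socket":  exists("viosock.inf"),   # added by this patch
    }

def qemu_viosock_args(caps):
    """Model of the output_qemu.ml hunk: -viosock is always emitted,
    with the model chosen by the probed capability."""
    return ["-viosock", "virtio" if caps["virtio_socket"] else "none"]

caps = driver_caps({"viorng.inf", "viosock.inf"})
assert caps["virtio_socket"] and not caps["isa_pvpanic"]
assert qemu_viosock_args(caps) == ["-viosock", "virtio"]
```

The same flag also drives the libvirt XML (`<viosock model="virtio|none">`) and the JSON metadata, so all output modes stay consistent with what was actually installed.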
Re: [Libguestfs] [libguestfs/libnbd] nbd_connect_tcp: handshake: server policy prevents NBD_OPT_GO (#4)
by Richard W.M. Jones
On Thu, Feb 11, 2021 at 08:37:10AM -0800, Shem Pasamba wrote:
> Running ./nbdfuse ../mnt --tcp 172.27.20.55 2222 results in:
>
> nbd_connect_tcp: handshake: server policy prevents NBD_OPT_GO
>
> # ./nbdfuse --version
> nbdfuse 1.6.1
> libnbd 1.7.1
>
> With nbd-server started on Ubuntu 20.04 with command:
>
> sudo nbd-server 2222 /dev/md1
In general I would recommend using a more featureful server,
i.e. qemu-nbd or nbdkit.
The error means that the server returned error NBD_REP_ERR_POLICY, but
the true reason will only be contained in the logs on the server side.
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
ANNOUNCE: libnbd and nbdkit projects are moving to GitLab
by Martin Kletzander
I am pleased to announce that we made a successful switch from GitHub to
GitLab for nbdkit[0] and libnbd[1] projects today under the collective
group name nbdkit[2].
GitLab is the primary repository from now on. All changes will be
automatically mirrored to GitHub on a regular basis in order not to
break any current workflow. It is advised, however, that since GitLab
is the new primary repository all repository URLs should switch to
GitLab as well.
This move will help us expand CI for these projects as we will be able
to utilise all the effort that went to the great libvirt-ci[3] project
and I hope to post patches soon as I have them almost ready.
If you have any questions or concerns, don't hesitate to contact me
and/or Rich.
Have a nice day everyone!
Martin
[0] https://gitlab.com/nbdkit/nbdkit
[1] https://gitlab.com/nbdkit/libnbd
[2] https://gitlab.com/nbdkit
[3] https://gitlab.com/libvirt/libvirt-ci