The calculation of eflags skips calling .can_zero, .can_trim, and
.can_fua if .can_write returns false, because 1) time spent calling
.can_zero would be wasted, and 2) plugin writers can be lazy and let
.can_zero return constant true, regardless of what they return for
.can_write, and are relying on nbdkit to not advertise or call .zero.
We don't want to regress on this (that is, it is easy to write a sh
plugin script that would be noticeably slower if we call into
.can_zero in spite of .can_write returning false). But at the same
time, just as plugins can blindly return .can_zero==true in spite of
.can_write==false, it is also possible for a plugin to blindly return
.can_write==true in spite of the readonly=true argument to .open (that
is, when the command line option -r is in effect), yet we were NOT
skipping that call into .can_write (meaning we can write a sh plugin
to observe a wasted call to .can_write [1]).
Next, consider that although we don't have such a filter today, it is
feasible that someone could write a filter that exposes .can_write==0
to the client but which wants to open the underlying plugin read-write
and uses next_ops->pwrite and next_ops->zero (for example, a scenario
of exposing read-only contents of a single file extracted from an
underlying file system, where the underlying device containing the
file system must be opened read-write even though its file system will
only be mounted read-only, because the act of mounting triggers a
journal replay that modifies the underlying device, but not the file
being extracted). This implies that -r must only affect the outermost
backend, and not every backend along the chain.
Then, to date, we've documented that such a filter merely need check
next_ops->can_zero before calling next_ops->zero, but do not actually
enforce that; furthermore, we are back to the case of what to do when
a plugin hard-codes can_zero to return true even when can_write is
false. All filters are in-tree, so we could merely change our
documentation to require the filters to also check
next_ops->can_write, but it is nicer to just do the extra checking as
part of backend.c.
At any rate, adding more calls to can_write implies that it is
important enough to be cached; and that's easy to do using the same
ideas as already used in commit b9d4875b for .get_size. (Other things
like can_zero and can_fua are also worth caching, but this patch is
already going to be big enough to save that one for later.)
The next consideration is when we should check can_write and can_zero.
If the first time we actually call into the plugin is during
next_ops->zero itself, we have to deal with any errors reported by
next_ops->can_zero - except that the backend interface has no *err
parameter for can_zero, and we are not ensured that there will be a
sane errno to report for the failed .zero. It would be a lot nicer to
just assert during .zero that the cache for .can_write is already
populated from the handshake phase (eflags computation for the
outermost backend, and during a filter's .prepare for the next
backend), to prove that the caller is not using .zero except when the
backend is writable. Thankfully, all filters which have both an
override of .can_zero and which call next_ops->can_zero do check
next_ops->can_zero during .prepare, so as long as we fix can_zero to
trigger the call to can_write under the hood, we don't have to modify
those filters.
Finally, with all these changes in place, we can drop the gating logic
during eflags computation as redundant with the gating logic in
backend.c; eflags computations can just blindly call all of the
underlying functions (and when the next patch adds even more can_FOO
caching, we'll never risk those values being uncached during request
handling). Likewise, plugins.c no longer has to fall back to
.can_write during its computation of .can_zero (because we have
ensured that .can_zero won't be reached).
[1] Demo time:
$ cat script
case "$1" in
get_size) echo 1m;;
can_write) echo can_write >/dev/tty; sleep 1; ${DIE} ;;
can_zero) echo can_zero >/dev/tty; sleep 2;;
*) exit 2 ;;
esac
Pre-patch:
$ /bin/time -f %e ./nbdkit -U - -r sh script \
--run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
1.10
Wasted call to .can_write, but at least .can_zero was skipped
$ /bin/time -f %e ./nbdkit -U - sh script \
--run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
can_zero
can_write
4.10
Without -r, eflags computation calls .can_write, then .can_zero; but
then plugin_can_zero calls .can_write again.
$ DIE='exit 3' /bin/time -f %e ./nbdkit -U - sh script \
--run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
1.06
But when .can_write fails, we see how .can_zero was skipped.
$ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \
--run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_zero
can_write
can_write
4.11
The nozero filter probes next_ops->can_zero unconditionally during
.prepare (here, if we had the theoretical filter on top of nozero that
advertises no write support but uses writes under the hood, then this
is actually good - although perhaps it is worth a followup patch to
filters to be smarter during .prepare if .open was passed
readonly=true), then plugin_can_zero calls .can_write. The second
.can_write comes from eflags computation (since the nozero filter did
not intercept that one).
$ DIE='exit 3' /bin/time -f %e ./nbdkit -U - --filter=nozero sh script \
zeromode=notrim --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_zero
can_write
nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support
qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes were read
Command exited with non-zero status 1
3.07
A bit faster (we didn't get to eflags computation because of the
nozero failure), but still long.
With this applied:
$ /bin/time -f %e ./nbdkit -U - -r sh script \
--run 'qemu-nbd --list -k $unixsocket >/dev/null'
0.07
Yay - no time wasted!
$ /bin/time -f %e ./nbdkit -U - -r --filter=nozero sh script zeromode=notrim \
--run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
can_zero
3.10
Note the act of calling .can_zero from nozero's .prepare forces
.can_write to happen first, and because it was cached, we don't see it
again even during eflags computation.
$ DIE='exit 3' /bin/time -f %e ./nbdkit -U - --filter=nozero sh script \
zeromode=notrim --run 'qemu-nbd --list -k $unixsocket >/dev/null'
can_write
nbdkit: sh[1]: error: zeromode 'notrim' requires plugin zero support
qemu-nbd: Failed to read initial magic: Unexpected end-of-file before all bytes were read
Command exited with non-zero status 1
1.06
We still get the nozero failure, but this time much faster (without
wasting time on the .can_zero call, since the .can_write failure
already constrains us).
Signed-off-by: Eric Blake <eblake(a)redhat.com>
---
server/internal.h | 2 +-
server/backend.c | 33 ++++++++++++++++-
server/connections.c | 3 +-
server/plugins.c | 2 +-
server/protocol-handshake.c | 74 ++++++++++++++++++-------------------
5 files changed, 72 insertions(+), 42 deletions(-)
diff --git a/server/internal.h b/server/internal.h
index 4db55315..a55d6406 100644
--- a/server/internal.h
+++ b/server/internal.h
@@ -155,6 +155,7 @@ struct b_conn_handle {
void *handle;
uint64_t exportsize;
+ int can_write;
};
struct connection {
@@ -172,7 +173,6 @@ struct connection {
uint32_t cflags;
uint16_t eflags;
- bool readonly;
bool can_flush;
bool is_rotational;
bool can_trim;
diff --git a/server/backend.c b/server/backend.c
index 52d53f80..749b1f15 100644
--- a/server/backend.c
+++ b/server/backend.c
@@ -174,6 +174,8 @@ backend_set_handle (struct backend *b, struct connection *conn, void
*handle)
conn->handles[b->i].handle = handle;
}
+/* Wrappers for all callbacks in a filter's struct nbdkit_next_ops. */
+
int64_t
backend_get_size (struct backend *b, struct connection *conn)
{
@@ -189,9 +191,17 @@ backend_get_size (struct backend *b, struct connection *conn)
int
backend_can_write (struct backend *b, struct connection *conn)
{
+ struct b_conn_handle *h = &conn->handles[b->i];
+
debug ("%s: can_write", b->name);
- return b->can_write (b, conn);
+ if (h->can_write == -1) {
+ /* Special case for outermost backend when -r is in effect. */
+ if (readonly && b == backend)
+ return h->can_write = 0;
+ h->can_write = b->can_write (b, conn);
+ }
+ return h->can_write;
}
int
@@ -213,16 +223,26 @@ backend_is_rotational (struct backend *b, struct connection *conn)
int
backend_can_trim (struct backend *b, struct connection *conn)
{
+ int r;
+
debug ("%s: can_trim", b->name);
+ r = backend_can_write (b, conn);
+ if (r != 1)
+ return r;
return b->can_trim (b, conn);
}
int
backend_can_zero (struct backend *b, struct connection *conn)
{
+ int r;
+
debug ("%s: can_zero", b->name);
+ r = backend_can_write (b, conn);
+ if (r != 1)
+ return r;
return b->can_zero (b, conn);
}
@@ -237,8 +257,13 @@ backend_can_extents (struct backend *b, struct connection *conn)
int
backend_can_fua (struct backend *b, struct connection *conn)
{
+ int r;
+
debug ("%s: can_fua", b->name);
+ r = backend_can_write (b, conn);
+ if (r != 1)
+ return r; /* Relies on 0 == NBDKIT_FUA_NONE */
return b->can_fua (b, conn);
}
@@ -280,8 +305,10 @@ backend_pwrite (struct backend *b, struct connection *conn,
const void *buf, uint32_t count, uint64_t offset,
uint32_t flags, int *err)
{
+ struct b_conn_handle *h = &conn->handles[b->i];
int r;
+ assert (h->can_write == 1);
assert (!(flags & ~NBDKIT_FLAG_FUA));
debug ("%s: pwrite count=%" PRIu32 " offset=%" PRIu64 "
fua=%d",
b->name, count, offset, !!(flags & NBDKIT_FLAG_FUA));
@@ -312,8 +339,10 @@ backend_trim (struct backend *b, struct connection *conn,
uint32_t count, uint64_t offset, uint32_t flags,
int *err)
{
+ struct b_conn_handle *h = &conn->handles[b->i];
int r;
+ assert (h->can_write == 1);
assert (flags == 0);
debug ("%s: trim count=%" PRIu32 " offset=%" PRIu64 "
fua=%d",
b->name, count, offset, !!(flags & NBDKIT_FLAG_FUA));
@@ -329,8 +358,10 @@ backend_zero (struct backend *b, struct connection *conn,
uint32_t count, uint64_t offset, uint32_t flags,
int *err)
{
+ struct b_conn_handle *h = &conn->handles[b->i];
int r;
+ assert (h->can_write == 1);
assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA)));
debug ("%s: zero count=%" PRIu32 " offset=%" PRIu64 "
may_trim=%d fua=%d",
b->name, count, offset, !!(flags & NBDKIT_FLAG_MAY_TRIM),
diff --git a/server/connections.c b/server/connections.c
index 2a5150cd..7609e9a7 100644
--- a/server/connections.c
+++ b/server/connections.c
@@ -274,8 +274,9 @@ new_connection (int sockin, int sockout, int nworkers)
return NULL;
}
conn->nr_handles = backend->i + 1;
+ memset (conn->handles, -1, conn->nr_handles * sizeof *conn->handles);
for_each_backend (b)
- conn->handles[b->i].exportsize = -1;
+ conn->handles[b->i].handle = NULL;
conn->status = 1;
conn->nworkers = nworkers;
diff --git a/server/plugins.c b/server/plugins.c
index 828b0dee..1dec4008 100644
--- a/server/plugins.c
+++ b/server/plugins.c
@@ -361,7 +361,7 @@ plugin_can_zero (struct backend *b, struct connection *conn)
if (p->plugin.can_zero &&
p->plugin.can_zero (connection_get_handle (conn, 0)) == -1)
return -1;
- return plugin_can_write (b, conn);
+ return 1;
}
static int
diff --git a/server/protocol-handshake.c b/server/protocol-handshake.c
index 4d12b3dc..6d44deb5 100644
--- a/server/protocol-handshake.c
+++ b/server/protocol-handshake.c
@@ -52,37 +52,37 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags)
uint16_t eflags = NBD_FLAG_HAS_FLAGS;
int fl;
+ /* Check all flags even if they won't be advertised, to prime the
+ * cache and make later request validation easier.
+ */
fl = backend_can_write (backend, conn);
if (fl == -1)
return -1;
- if (readonly || !fl) {
+ if (!fl)
eflags |= NBD_FLAG_READ_ONLY;
- conn->readonly = true;
+
+ fl = backend_can_zero (backend, conn);
+ if (fl == -1)
+ return -1;
+ if (fl) {
+ eflags |= NBD_FLAG_SEND_WRITE_ZEROES;
+ conn->can_zero = true;
}
- if (!conn->readonly) {
- fl = backend_can_zero (backend, conn);
- if (fl == -1)
- return -1;
- if (fl) {
- eflags |= NBD_FLAG_SEND_WRITE_ZEROES;
- conn->can_zero = true;
- }
- fl = backend_can_trim (backend, conn);
- if (fl == -1)
- return -1;
- if (fl) {
- eflags |= NBD_FLAG_SEND_TRIM;
- conn->can_trim = true;
- }
+ fl = backend_can_trim (backend, conn);
+ if (fl == -1)
+ return -1;
+ if (fl) {
+ eflags |= NBD_FLAG_SEND_TRIM;
+ conn->can_trim = true;
+ }
- fl = backend_can_fua (backend, conn);
- if (fl == -1)
- return -1;
- if (fl) {
- eflags |= NBD_FLAG_SEND_FUA;
- conn->can_fua = true;
- }
+ fl = backend_can_fua (backend, conn);
+ if (fl == -1)
+ return -1;
+ if (fl) {
+ eflags |= NBD_FLAG_SEND_FUA;
+ conn->can_fua = true;
}
fl = backend_can_flush (backend, conn);
@@ -101,16 +101,14 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags)
conn->is_rotational = true;
}
- /* multi-conn is useless if parallel connections are not allowed */
- if (backend->thread_model (backend) >
- NBDKIT_THREAD_MODEL_SERIALIZE_CONNECTIONS) {
- fl = backend_can_multi_conn (backend, conn);
- if (fl == -1)
- return -1;
- if (fl) {
- eflags |= NBD_FLAG_CAN_MULTI_CONN;
- conn->can_multi_conn = true;
- }
+ /* multi-conn is useless if parallel connections are not allowed. */
+ fl = backend_can_multi_conn (backend, conn);
+ if (fl == -1)
+ return -1;
+ if (fl && (backend->thread_model (backend) >
+ NBDKIT_THREAD_MODEL_SERIALIZE_CONNECTIONS)) {
+ eflags |= NBD_FLAG_CAN_MULTI_CONN;
+ conn->can_multi_conn = true;
}
fl = backend_can_cache (backend, conn);
@@ -122,10 +120,10 @@ protocol_compute_eflags (struct connection *conn, uint16_t *flags)
conn->emulate_cache = fl == NBDKIT_CACHE_EMULATE;
}
- /* The result of this is not returned to callers here (or at any
- * time during the handshake). However it makes sense to do it once
- * per connection and store the result in the handle anyway. This
- * protocol_compute_eflags function is a bit misnamed XXX.
+ /* The result of this is not directly advertised as part of the
+ * handshake, but priming the cache here makes BLOCK_STATUS handling
+ * not have to worry about errors, and makes test-layers easier to
+ * write.
*/
fl = backend_can_extents (backend, conn);
if (fl == -1)
--
2.21.0