Originally, I thought that since the plugin always emulates
.zero with a fallback to .pwrite, we didn't need to expose the
backend's .can_zero to plugins. But there is another
consideration, as shown at least in the nbd plugin: a plugin
may implement a .zero callback that depends on support from
a remote endpoint. If the remote endpoint doesn't support
zeroes, then the plugin HAS to fail .zero with EOPNOTSUPP to
get the automatic fallback to .pwrite; this is slightly
wasteful to just telling nbdkit to not use the .zero callback
when it won't work.
At the same time, we still want to advertise zero support to
the client, even if we won't be calling .zero, since handling
NBD_CMD_WRITE_ZEROES allows for less network traffic; if it
is ever truly necessary to avoid advertising to the guest, the
nozero filter can accomplish that task.
So, this patch exposes a .can_zero callback for plugins,
which can usually be omitted, but which can be used to short-
circuit calls to .zero, with the nbd plugin as the first user;
while plugins.c continues to report true (well, the same result
as .can_write) to the backend regardless of the plugin's
answer, unless the plugin hits an error.
Testing of the nbd changes (and thus of the general nbdkit
handling of plugin .can_zero) is best done as part of the
nozero filter: using the filter on the server side means the
nbd plugin won't see a zero advertisement, and thus the client
side does not call nbd's .zero, but still advertises zero to
the client.
Now that the set of feature callbacks is the same for plugins
and filters, we can consolidate the documentation for filters
and merely focus on the slight differences.
This patch breaks ABI from the earlier exposure of .can_fua to
plugins a few patches ago, but that is okay as there has not
been a release in the meantime (the logical grouping was nicer
this way). Had a release happened, .can_zero would have to
be placed after .zero in struct nbdkit_plugin. Also, in
plugins.c, rearrange get_error() to a more logical location.
Signed-off-by: Eric Blake <eblake(a)redhat.com>
---
This sort of fell out from the v2v rhv-upload conversation on
the list, although Rich's v7 patch there demonstrates that
even when oVirt lacks zero support, the rhv-upload plugin still
wants .zero to be called (because it can optimize zeros done
beyond the highest write), so that particular plugin would not
be able to take advantage of this change. At any rate, in
writing it up, I noticed that the nbd plugin benefits from it,
so it's still useful in that right.
docs/nbdkit-filter.pod | 79 ++++++++++++++++++-------------------------------
docs/nbdkit-plugin.pod | 22 +++++++++++++-
include/nbdkit-plugin.h | 1 +
src/plugins.c | 72 ++++++++++++++++++++++++++++----------------
plugins/nbd/nbd.c | 15 ++++++----
tests/test-nozero.sh | 34 +++++++++++++++++----
6 files changed, 135 insertions(+), 88 deletions(-)
diff --git a/docs/nbdkit-filter.pod b/docs/nbdkit-filter.pod
index e6bf0c3..9bdc690 100644
--- a/docs/nbdkit-filter.pod
+++ b/docs/nbdkit-filter.pod
@@ -343,6 +343,10 @@ calls.
=head2 C<.can_trim>
+=head2 C<.can_zero>
+
+=head2 C<.can_fua>
+
int (*can_write) (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle);
int (*can_flush) (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -352,66 +356,39 @@ calls.
void *handle);
int (*can_trim) (struct nbdkit_next_ops *next_ops, void *nxdata,
void *handle);
+ int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle);
+ int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle);
These intercept the corresponding plugin methods, and control feature
bits advertised to the client.
+Of note, the C<.can_zero> callback in the filter controls whether the
+server advertises zero support to the client, which is slightly
+different semantics than the plugin; that is,
+C<next_ops-E<gt>can_zero> always returns true for a plugin, even when
+the plugin's own C<.can_zero> callback returned false, because nbdkit
+implements a fallback to C<.pwrite> at the plugin layer.
+
+Remember that most of the feature check functions return merely a
+boolean success value, while C<.can_fua> has three success values.
+The difference between values may affect choices made in the filter:
+when splitting a write request that requested FUA from the client, if
+C<next_ops-E<gt>can_fua> returns C<NBDKIT_FUA_NATIVE>, then the filter
+should pass the FUA flag on to each sub-request; while if it is known
+that FUA is emulated by a flush because of a return of
+C<NBDKIT_FUA_EMULATE>, it is more efficient to only flush once after
+all sub-requests have completed (often by passing C<NBDKIT_FLAG_FUA>
+on to only the final sub-request, or by dropping the flag and ending
+with a direct call to C<next_ops-E<gt>flush>).
+
If there is an error, the callback should call C<nbdkit_error> with an
error message and return C<-1>. If these functions are called more
than once for the same connection, they should return the same value;
similarly, the filter may cache the results of each counterpart in
C<next_ops> for a given connection rather than repeating calls.
-=head2 C<.can_zero>
-
- int (*can_zero) (struct nbdkit_next_ops *next_ops, void *nxdata,
- void *handle);
-
-This controls whether write zero support will be advertised to the
-client. This function has no counterpart in plugins, because nbdkit
-can always emulate zero by using pwrite; but a filter may want to
-force different handling than the nbdkit implementation. If this
-callback is omitted, the default returned for the plugin layer is
-true.
-
-If there is an error, the callback should call C<nbdkit_error> with an
-error message and return C<-1>. Like the other initial queries
-documented above, per-connection caching the of return value of this
-function is allowed.
-
-=head2 C<.can_fua>
-
- int (*can_fua) (struct nbdkit_next_ops *next_ops, void *nxdata,
- void *handle);
-
-This controls how the Forced Unit Access flag will be handled; it is
-only relevant for connections that are not read-only. This intercepts
-the corresponding plugin method, to control feature bits advertised to
-the client. Unlike other feature functions with just two success
-states, this one returns three success values: C<NBDKIT_FUA_NONE> to
-avoid advertising FUA support to the client, C<NBDKIT_FUA_EMULATE> if
-FUA is emulated by nbdkit calling flush after a write transaction, and
-C<NBDKIT_FUA_NATIVE> if FUA semantics are handled by the plugin
-without the overhead of an extra flush from nbdkit.
-
-The difference between the two non-zero values may affect choices made
-in the filter: when splitting a write request that requested FUA from
-the client, native support should pass the FUA flag on to each
-sub-request; while if it is known that FUA is emulated by a flush, it
-is more efficient to only flush once after all sub-requests have
-completed (often by passing C<NBDKIT_FLAG_FUA> on to only the final
-sub-request, or by dropping the flag and ending with a direct call to
-C<next_ops-E<gt>flush>).
-
-If this callback is omitted, the default returned for the plugin layer
-is C<NBDKIT_FUA_EMULATE> if C<can_flush> returned true, otherwise it
-is C<NBDKIT_FUA_NONE>.
-
-If there is an error, the callback should call C<nbdkit_error> with an
-error message and return C<-1>. Like the other initial queries
-documented above, per-connection caching of the return value of this
-function is allowed.
-
=head2 C<.pread>
int (*pread) (struct nbdkit_next_ops *next_ops, void *nxdata,
@@ -505,7 +482,7 @@ value to return to the client.
This intercepts the plugin C<.zero> method and can be used to modify
zero requests.
-This function will not be called if C<.can_write> returned false; in
+This function will not be called if C<.can_zero> returned false; in
turn, the filter should not call C<next_ops-E<gt>zero> if
C<next_ops-E<gt>can_zero> did not return true.
diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
index d6d080a..6e89315 100644
--- a/docs/nbdkit-plugin.pod
+++ b/docs/nbdkit-plugin.pod
@@ -487,6 +487,25 @@ error message and return C<-1>.
This callback is not required. If omitted, then we return true iff a
C<.trim> callback has been defined.
+=head2 C<.can_zero>
+
+ int can_zero (void *handle);
+
+This is called during the option negotiation phase to find out if the
+plugin wants the C<.zero> callback to be utilized. Support for
+writing zeros is still advertised to the client (unless the nbdkit
+filter nozero is also used), so returning false merely serves as a way
+to avoid complicating the C<.zero> callback to have to fail with
+C<EOPNOTSUPP> on the connections where it will never be more efficient
+than using C<.pwrite> up front.
+
+If there is an error, C<.can_zero> should call C<nbdkit_error> with an
+error message and return C<-1>.
+
+This callback is not required. If omitted, then nbdkit always tries
+C<.zero> first if it is present, and gracefully falls back to
+C<.pwrite> if C<.zero> was absent or failed with C<EOPNOTSUPP>.
+
=head2 C<.can_fua>
int can_fua (void *handle);
@@ -601,7 +620,7 @@ error message, and C<nbdkit_set_error> to record an appropriate
error
During the data serving phase, this callback is used to write C<count>
bytes of zeroes at C<offset> in the backing store.
-This function will not be called if C<.can_write> returned false. On
+This function will not be called if C<.can_zero> returned false. On
input, the parameter C<flags> may include C<NBDKIT_FLAG_MAY_TRIM>
unconditionally, and C<NBDKIT_FLAG_FUA> based on the result of
C<.can_fua>.
@@ -826,6 +845,7 @@ and then users will be able to run it like this:
L<nbdkit(1)>,
L<nbdkit-filter(3)>,
+L<nbdkit-nozero-filter(3)>,
L<nbdkit-example1-plugin(1)>,
L<nbdkit-example2-plugin(1)>,
L<nbdkit-example3-plugin(1)>,
diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h
index 3f541a1..65a2e0c 100644
--- a/include/nbdkit-plugin.h
+++ b/include/nbdkit-plugin.h
@@ -104,6 +104,7 @@ struct nbdkit_plugin {
void (*dump_plugin) (void);
+ int (*can_zero) (void *handle);
int (*can_fua) (void *handle);
#if NBDKIT_API_VERSION == 1
int (*_unused1) (void *, void *, uint32_t, uint64_t);
diff --git a/src/plugins.c b/src/plugins.c
index d44e724..ff4facf 100644
--- a/src/plugins.c
+++ b/src/plugins.c
@@ -355,25 +355,24 @@ plugin_can_trim (struct backend *b, struct connection *conn)
return p->plugin.trim || p->plugin._trim_old;
}
-/* Grab the appropriate error value.
- */
-static int
-get_error (struct backend_plugin *p)
-{
- int ret = threadlocal_get_error ();
-
- if (!ret && p->plugin.errno_is_preserved)
- ret = errno;
- return ret ? ret : EIO;
-}
-
static int
plugin_can_zero (struct backend *b, struct connection *conn)
{
+ struct backend_plugin *p = container_of (b, struct backend_plugin, backend);
+
+ assert (connection_get_handle (conn, 0));
+
debug ("can_zero");
- /* We always allow .zero to fall back to .write, so plugins don't
- * need to override this. */
+ /* Note the special case here: the plugin's .can_zero controls only
+ * whether we call .zero; while the backend expects .can_zero to
+ * return whether to advertise zero support. Since we ALWAYS know
+ * how to fall back to .pwrite in plugin_zero(), we ignore the
+ * difference between the plugin's true or false return, and only
+ * call it to catch a -1 failure during negotiation. */
+ if (p->plugin.can_zero &&
+ p->plugin.can_zero (connection_get_handle (conn, 0)) == -1)
+ return -1;
return plugin_can_write (b, conn);
}
@@ -401,6 +400,18 @@ plugin_can_fua (struct backend *b, struct connection *conn)
return NBDKIT_FUA_EMULATE;
}
+/* Grab the appropriate error value.
+ */
+static int
+get_error (struct backend_plugin *p)
+{
+ int ret = threadlocal_get_error ();
+
+ if (!ret && p->plugin.errno_is_preserved)
+ ret = errno;
+ return ret ? ret : EIO;
+}
+
static int
plugin_pread (struct backend *b, struct connection *conn,
void *buf, uint32_t count, uint64_t offset, uint32_t flags,
@@ -534,6 +545,7 @@ plugin_zero (struct backend *b, struct connection *conn,
bool fua = flags & NBDKIT_FLAG_FUA;
bool emulate = false;
bool need_flush = false;
+ int can_zero = 1; /* TODO cache this per-connection? */
assert (connection_get_handle (conn, 0));
assert (!(flags & ~(NBDKIT_FLAG_MAY_TRIM | NBDKIT_FLAG_FUA)));
@@ -547,18 +559,26 @@ plugin_zero (struct backend *b, struct connection *conn,
}
if (!count)
return 0;
- errno = 0;
- if (p->plugin.zero)
- r = p->plugin.zero (connection_get_handle (conn, 0), count, offset, flags);
- else if (p->plugin._zero_old)
- r = p->plugin._zero_old (connection_get_handle (conn, 0), count, offset,
- may_trim);
- else
- emulate = true;
- if (r == -1)
- *err = emulate ? EOPNOTSUPP : get_error (p);
- if (r == 0 || *err != EOPNOTSUPP)
- goto done;
+ if (p->plugin.can_zero) {
+ can_zero = p->plugin.can_zero (connection_get_handle (conn, 0));
+ assert (can_zero != -1);
+ }
+
+ if (can_zero) {
+ errno = 0;
+ if (p->plugin.zero)
+ r = p->plugin.zero (connection_get_handle (conn, 0), count, offset,
+ flags);
+ else if (p->plugin._zero_old)
+ r = p->plugin._zero_old (connection_get_handle (conn, 0), count, offset,
+ may_trim);
+ else
+ emulate = true;
+ if (r == -1)
+ *err = emulate ? EOPNOTSUPP : get_error (p);
+ if (r == 0 || *err != EOPNOTSUPP)
+ goto done;
+ }
assert (p->plugin.pwrite || p->plugin._pwrite_old);
flags &= ~NBDKIT_FLAG_MAY_TRIM;
diff --git a/plugins/nbd/nbd.c b/plugins/nbd/nbd.c
index dd84b04..51de178 100644
--- a/plugins/nbd/nbd.c
+++ b/plugins/nbd/nbd.c
@@ -618,6 +618,14 @@ nbd_can_trim (void *handle)
return h->flags & NBD_FLAG_SEND_TRIM;
}
+static int
+nbd_can_zero (void *handle)
+{
+ struct handle *h = handle;
+
+ return h->flags & NBD_FLAG_SEND_WRITE_ZEROES;
+}
+
static int
nbd_can_fua (void *handle)
{
@@ -662,11 +670,7 @@ nbd_zero (void *handle, uint32_t count, uint64_t offset, uint32_t
flags)
int f = 0;
assert (!(flags & ~(NBDKIT_FLAG_FUA | NBDKIT_FLAG_MAY_TRIM)));
- if (!(h->flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
- /* Trigger a fall back to regular writing */
- errno = EOPNOTSUPP;
- return -1;
- }
+ assert (h->flags & NBD_FLAG_SEND_WRITE_ZEROES);
if (!(flags & NBDKIT_FLAG_MAY_TRIM))
f |= NBD_CMD_FLAG_NO_HOLE;
@@ -716,6 +720,7 @@ static struct nbdkit_plugin plugin = {
.can_flush = nbd_can_flush,
.is_rotational = nbd_is_rotational,
.can_trim = nbd_can_trim,
+ .can_zero = nbd_can_zero,
.can_fua = nbd_can_fua,
.pread = nbd_pread,
.pwrite = nbd_pwrite,
diff --git a/tests/test-nozero.sh b/tests/test-nozero.sh
index 95ed045..57c4452 100755
--- a/tests/test-nozero.sh
+++ b/tests/test-nozero.sh
@@ -36,7 +36,9 @@ set -e
files="nozero1.img nozero1.log nozero1.sock nozero1.pid
nozero2.img nozero2.log nozero2.sock nozero2.pid
nozero3.img nozero3.log nozero3.sock nozero3.pid
- nozero4.img nozero4.log nozero4.sock nozero4.pid"
+ nozero4.img nozero4.log nozero4.sock nozero4.pid
+ nozero5.img nozero5a.log nozero5b.log nozero5a.sock nozero5b.sock
+ nozero5a.pid nozero5b.pid"
rm -f $files
# Prep images, and check that qemu-io understands the actions we plan on
@@ -45,6 +47,7 @@ for f in {0..1023}; do printf '%1024s' . >> nozero1.img;
done
cp nozero1.img nozero2.img
cp nozero1.img nozero3.img
cp nozero1.img nozero4.img
+cp nozero1.img nozero5.img
if ! qemu-io -f raw -d unmap -c 'w -z -u 0 1M' nozero1.img; then
echo "$0: missing or broken qemu-io"
rm nozero?.img
@@ -57,7 +60,7 @@ if test "$(stat -c %b nozero1.img)" = "$(stat -c %b
nozero2.img)"; then
fi
cp nozero2.img nozero1.img
-pid1= pid2= pid3= pid4=
+pid1= pid2= pid3= pid4= pid5a= pid5b=
# Kill any nbdkit processes on exit.
cleanup ()
@@ -68,6 +71,8 @@ cleanup ()
test "$pid2" && kill $pid2
test "$pid3" && kill $pid3
test "$pid4" && kill $pid4
+ test "$pid5a" && kill $pid5a
+ test "$pid5b" && kill $pid5b
# For easier debugging, dump the final log files before removing them.
echo "Log 1 file contents:"
cat nozero1.log || :
@@ -77,6 +82,10 @@ cleanup ()
cat nozero3.log || :
echo "Log 4 file contents:"
cat nozero4.log || :
+ echo "Log 5a file contents:"
+ cat nozero5a.log || :
+ echo "Log 5b file contents:"
+ cat nozero5b.log || :
rm -f $files
exit $status
@@ -88,6 +97,8 @@ trap cleanup INT QUIT TERM EXIT ERR
# 2: log before filter with zeromode=none (default), to ensure no ZERO request
# 3: log before filter with zeromode=emulate, to ensure ZERO from client
# 4: log after filter with zeromode=emulate, to ensure no ZERO to plugin
+# 5a/b: both sides of nbd plugin: even though server side does not advertise
+# ZERO, the client side still exposes it, and just skips calling nbd's .zero
nbdkit -P nozero1.pid -U nozero1.sock --filter=log \
file logfile=nozero1.log file=nozero1.img
nbdkit -P nozero2.pid -U nozero2.sock --filter=log --filter=nozero \
@@ -96,11 +107,15 @@ nbdkit -P nozero3.pid -U nozero3.sock --filter=log --filter=nozero \
file logfile=nozero3.log file=nozero3.img zeromode=emulate
nbdkit -P nozero4.pid -U nozero4.sock --filter=nozero --filter=log \
file logfile=nozero4.log file=nozero4.img zeromode=emulate
+nbdkit -P nozero5a.pid -U nozero5a.sock --filter=log --filter=nozero \
+ file logfile=nozero5a.log file=nozero5.img
+nbdkit -P nozero5b.pid -U nozero5b.sock --filter=log \
+ nbd logfile=nozero5b.log socket=nozero5a.sock
# We may have to wait a short time for the pid files to appear.
for i in `seq 1 10`; do
if test -f nozero1.pid && test -f nozero2.pid && test -f nozero3.pid
&&
- test -f nozero4.pid; then
+ test -f nozero4.pid && test -f nozero5a.pid && test -f
nozero5b.pid; then
break
fi
sleep 1
@@ -110,9 +125,11 @@ pid1="$(cat nozero1.pid)" || :
pid2="$(cat nozero2.pid)" || :
pid3="$(cat nozero3.pid)" || :
pid4="$(cat nozero4.pid)" || :
+pid5a="$(cat nozero5a.pid)" || :
+pid5b="$(cat nozero5b.pid)" || :
if ! test -f nozero1.pid || ! test -f nozero2.pid || ! test -f nozero3.pid ||
- ! test -f nozero4.pid; then
+ ! test -f nozero4.pid || ! test -f nozero5a.pid || ! test -f nozero5b.pid; then
echo "$0: PID files were not created"
exit 1
fi
@@ -122,6 +139,7 @@ qemu-io -f raw -c 'w -z -u 0 1M'
'nbd+unix://?socket=nozero1.sock'
qemu-io -f raw -c 'w -z -u 0 1M' 'nbd+unix://?socket=nozero2.sock'
qemu-io -f raw -c 'w -z -u 0 1M' 'nbd+unix://?socket=nozero3.sock'
qemu-io -f raw -c 'w -z -u 0 1M' 'nbd+unix://?socket=nozero4.sock'
+qemu-io -f raw -c 'w -z -u 0 1M' 'nbd+unix://?socket=nozero5b.sock'
# Check for expected ZERO vs. WRITE results
grep 'connection=1 Zero' nozero1.log
@@ -134,10 +152,16 @@ if grep 'connection=1 Zero' nozero4.log; then
echo "filter should have converted zero into write"
exit 1
fi
+grep 'connection=1 Zero' nozero5b.log
+if grep 'connection=1 Zero' nozero5a.log; then
+ echo "nbdkit should have converted zero into write before nbd plugin"
+ exit 1
+fi
-# Sanity check on contents - all 4 files should read identically
+# Sanity check on contents - all 5 files should read identically
cmp nozero1.img nozero2.img
cmp nozero2.img nozero3.img
cmp nozero3.img nozero4.img
+cmp nozero4.img nozero5.img
# The cleanup() function is called implicitly on exit.
--
2.14.3