Hi,

We've been using a modified nbdkit (ours is cbdkit internally) for about half a year now and since you guys appear to be working on a next version of the API I wanted to go over some of the limitations we hit with nbdkit that we think others may also hit for consideration into the batch of changes you are making to the api.

About Us: Our primary use for nbdkit is to facilitate a disk driver implementation in userspace. Our plugin provides a disk interface to cloud/blob/object storages (aws s3, gcs, openstack swift) and so every read and write that doesn't hit an internal cache within the plugin is going out over the network to one of these cloud storages. Because each read or write operation could take somewhere between 1ms (hitting an internal cache) and 30 seconds (hitting a cloud storage bucket that is overloaded causing slowdowns and retries) depending on load, we needed to support as many operations in parallel as possible to overcome potentially high latency and provide high throughput.

Limitation: The kernel will (with today's default settings) typically be willing to send up to 128 requests of 128kB size to the driver in parallel. We wanted to support 128 parallel read operations on different areas of the disk without requiring 128 separate threads and connections for the driver. Right now in nbdkit that is impossible. The main loop in connection.c will pull an nbd request off the socket and block until that read request is complete before sending a response and getting the next request, blocking other requests on the socket unless running X connections/threads in parallel. For write operations we can overcome this through flush mechanics and early success calls but with reads we get stuck.

Change: We introduced an additional set of functions to the nbdkit_plugin struct that supports asynchronous handling of the requests and a few helper functions for the plugin to use to respond when it has finished the request. This is very similar to the fuse filesystem low level api (async supported) vs the high level fuse fs api (sync only). The design goal here is that a single connection/thread on nbdkit can support as many requests in parallel as the plugin allows. The nbdkit side pulls the request off the socket and if the async function pointer is non-null it will wrap the request in an op struct and use the async plugin call for read/write/etc capturing any buffer allocated and some op details into the op pointer. The plugin async_* will start the op and return to nbdkit while the plugin works on it in the background. Nbdkit will then go back to the socket and begin the next request. Our plugin uses 1 connection/nbdkit thread and 2-4 threads internally with boost asio over sockets to service the requests to cloud. We are able to achieve ~1GB/s (yes bytes) read/write performance to aws s3 from an ec2 node with 10 gigabit networking on < 100MB of memory in the driver with this approach.

Here are some of what our function prototypes look like that support an asynchronous nbdkit model

 #define CBDKIT_THREAD_MODEL_SERIALIZE_REQUESTS        2
 #define CBDKIT_THREAD_MODEL_PARALLEL                  3
 #define CBDKIT_THREAD_MODEL_ASYNC                     4

 struct cbdkit_plugin {
 ...
  int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
  int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t offset);
  int (*flush) (void *handle);
  int (*trim) (void *handle, uint32_t count, uint64_t offset);
  int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);

  int errno_is_preserved;

  void (*async_pread) (void *op, void *handle, void *buf, uint32_t count, uint64_t offset);
  void (*async_pwrite) (void *op, void *handle, const void *buf, uint32_t count, uint64_t offset, int fua);
  void (*async_flush) (void *op, void *handle);
  void (*async_trim) (void *op, void *handle, uint32_t count, uint64_t offset, int fua);
  void (*async_zero) (void *op, void *handle, uint32_t count, uint64_t offset, int may_trim, int fua);
 ...
 }

Additionally there are a few helper functions for the plugin to use to respond back to nbdkit when the job is eventually finished. The plugin contract when using the async functions is that every async func guarantees it will call an appropriate async_reply function.

 /* call for completion of successful async_pwrite, async_flush, async_trim, or async_zero */
 extern CBDKIT_CXX_LANG_C int cbdkit_async_reply (void *op);
 /* call for complete of successful async_pread */
 extern CBDKIT_CXX_LANG_C int cbdkit_async_reply_read (void *op);
 /* call for completion of any async operation with error */
 extern CBDKIT_CXX_LANG_C int cbdkit_async_reply_error (void *op, uint32_t error);
 
If there is any interest in supporting async ops in the next api version I am able to share the entire modified nbdkit (cbdkit) source that we use that supports this async op framework, fua, as well as some buffer pooling.

Thanks for consideration,
 - Shaun


On Fri, Jan 19, 2018 at 8:40 AM, Eric Blake <eblake@redhat.com> wrote:
[still a work in progress, as I finish rebasing to capture the
ideas raised on the list, but posting now for initial feedback]

The NBD protocol supports Forced Unit Access (FUA) as a more efficient
way to wait for just one write to land in persistent storage, rather
than all outstanding writes at the time of a flush; modeled after
the kernel's block I/O flag of the same name.  While we can emulate
the proper semantics with a full-blown flush, there are some plugins
that can properly pass the FUA flag on to the end storage and thereby
avoid some overhead.

This patch introduces new callbacks and documentations for those
callbacks, although the actual implementation to take advantage of
the new callbacks will be in later patches.  The biggest thing to
note is that we now support 2 API versions for the plugin, where
the plugin author chooses whether to keep version 1 (default, no
FUA support) or opt in to version 2 (FUA support).

Signed-off-by: Eric Blake <eblake@redhat.com>
---
 docs/nbdkit-plugin.pod  | 89 +++++++++++++++++++++++++++++++++++++++++++++++--
 docs/nbdkit.pod         |  7 +++-
 include/nbdkit-plugin.h | 31 ++++++++++++++++-
 src/internal.h          |  4 +--
 src/plugins.c           |  2 +-
 5 files changed, 125 insertions(+), 8 deletions(-)

diff --git a/docs/nbdkit-plugin.pod b/docs/nbdkit-plugin.pod
index 3cafc42..d982e65 100644
--- a/docs/nbdkit-plugin.pod
+++ b/docs/nbdkit-plugin.pod
@@ -6,6 +6,8 @@ nbdkit-plugin - How to write nbdkit plugins

 =head1 SYNOPSIS

+ #define NBDKIT_API_VERSION 2
+
  #include <nbdkit-plugin.h>

  #define THREAD_MODEL NBDKIT_THREAD_MODEL_SERIALIZE_ALL_REQUESTS
@@ -51,9 +53,21 @@ L<nbdkit-perl-plugin(3)>,
 L<nbdkit-python-plugin(3)>,
 L<nbdkit-ruby-plugin(3)>.

+=head1 C<#define NBDKIT_API_VERSION>
+
+Plugins must choose which API version they want to use.  The default
+version is 1; but if a plugin defines NBDKIT_API_VERSION to a positive
+integer prior to including C<nbdkit-plugin.h>, the signature of
+several callbacks is enhanced.  A newer nbdkit will always support
+plugins compiled against an older API version, but plugins that opt in
+to newer versions require a new enough nbdkit.  For now, the maximum
+version is 2, which enables fine-tuned response to client flags
+including efficient Forced Unit Access (FUA) on writes.
+
 =head1 C<nbdkit-plugin.h>

-All plugins should start by including this header file:
+All plugins should start by including this header file, after
+optionally choosing an API version:

  #include <nbdkit-plugin.h>

@@ -400,7 +414,28 @@ If there is an error, C<.can_trim> should call C<nbdkit_error> with an
 error message and return C<-1>.

 This callback is not required.  If omitted, then we return true iff a
-C<.trim> callback has been defined.
+C<.trim> or C<.trim_fua> callback has been defined.
+
+=head2 C<.can_fua>
+
+ int can_fua (void *handle);
+
+This is called during the option negotiation phase to find out if the
+plugin supports the Forced Unit Access (FUA) flag on write and trim
+requests.
+
+If there is an error, C<.can_fua> should call C<nbdkit_error> with an
+error message and return C<-1>.
+
+This callback is not required.  If omitted, then we return true iff
+either the C<.pwrite_fua> callback has been defined, or if C<.can_flush>
+returns true (in the latter case, FUA semantics are emulated by nbdkit
+calling C<.flush> before completing any write or trim operation with
+the FUA flag set).
+
+Note that if this defaults to true and C<.can_trim> also returns true,
+the plugin must provide either C<.flush> or C<.trim_fua> for correct
+FUA semantics.

 =head2 C<.pread>

@@ -442,6 +477,21 @@ recovered from), C<.pwrite> should call C<nbdkit_error> with an error
 message, and C<nbdkit_set_error> to record an appropriate error
 (unless C<errno> is sufficient), then return C<-1>.

+If the plugin can provide efficient Forced Unit Access (FUA) semantics,
+it should define C<.pwrite_fua> instead.
+
+=head2 C<.pwrite_fua>
+
+ int pwrite_fua (void *handle, const void *buf, uint32_t count, uint64_t offset, int fua);
+
+This callback has the same requirements as C<.pwrite>, with the
+additional parameter C<fua> set to a non-zero value if the client
+wants FUA semantics (where the command must not return until the
+actions of the write have landed in persistent storage).  If the
+plugin cannot provide efficient FUA, but C<.can_flush> returns true
+and C<.can_fua> does not return false, then client requests for FUA
+semantics are emulated by nbdkit calling C<.flush>.
+
 =head2 C<.flush>

  int flush (void *handle);
@@ -455,6 +505,11 @@ If there is an error, C<.flush> should call C<nbdkit_error> with an
 error message, and C<nbdkit_set_error> to record an appropriate error
 (unless C<errno> is sufficient), then return C<-1>.

+Note that C<.flush> can be called both by the client doing an explicit
+flush request, and by nbdkit when emulating Forced Unit Access (FUA)
+semantics after a write or trim where the plugin did not provide FUA
+callbacks (C<.pwrite_fua>, C<.zero_fua>, and C<.trim_fua>).
+
 =head2 C<.trim>

  int trim (void *handle, uint32_t count, uint64_t offset);
@@ -467,6 +522,21 @@ If there is an error, C<.trim> should call C<nbdkit_error> with an
 error message, and C<nbdkit_set_error> to record an appropriate error
 (unless C<errno> is sufficient), then return C<-1>.

+If the plugin can provide efficient Forced Unit Access (FUA) semantics,
+it should define C<.trim_fua> instead.
+
+=head2 C<.trim_fua>
+
+ int trim_fua (void *handle, uint32_t count, uint64_t offset, int fua);
+
+This callback has the same requirements as C<.trim>, with the
+additional parameter C<fua> set to a non-zero value if the client
+wants FUA semantics (where the command must not return until the
+actions of the trim have landed in persistent storage).  If the plugin
+cannot provide efficient FUA, but C<.can_flush> returns true and
+C<.can_fua> does not return false, then client requests for FUA
+semantics are emulated by nbdkit calling C<.flush>.
+
 =head2 C<.zero>

  int zero (void *handle, uint32_t count, uint64_t offset, int may_trim);
@@ -488,6 +558,21 @@ If there is an error, C<.zero> should call C<nbdkit_error> with an
 error message, and C<nbdkit_set_error> to record an appropriate error
 (unless C<errno> is sufficient), then return C<-1>.

+If the plugin can provide efficient Forced Unit Access (FUA) semantics,
+it should define C<.zero_fua> instead.
+
+=head2 C<.zero_fua>
+
+ int zero_fua (void *handle, uint32_t count, uint64_t offset, int may_trim, int fua);
+
+This callback has the same requirements as C<.zero>, with the
+additional parameter C<fua> set to a non-zero value if the client
+wants FUA semantics (where the command must not return until the
+actions of the write have landed in persistent storage).  If the
+plugin cannot provide efficient FUA, but C<.can_flush> returns true
+and C<.can_fua> does not return false, then client requests for FUA
+semantics are emulated by nbdkit calling C<.flush>.
+
 =head1 THREADS

 Each nbdkit plugin must declare its thread safety model by defining
diff --git a/docs/nbdkit.pod b/docs/nbdkit.pod
index 636eedc..eaa638b 100644
--- a/docs/nbdkit.pod
+++ b/docs/nbdkit.pod
@@ -804,7 +804,12 @@ information about that plugin, eg:
  [etc]

 Plugins which ship with nbdkit usually have the same version as the
-corresponding nbdkit binary.
+corresponding nbdkit binary.  The nbdkit binary will always be able
+to utilize plugins compiled against an older version of the header;
+however, there are cases where a newer plugin may not be fully
+supported by an older nbdkit binary (for example, a plugin that
+supplies C<.pwrite_fua> but not C<.pwrite> may not support writes
+when loaded by the older nbdkit).

 =head2 Detect if a plugin is installed

diff --git a/include/nbdkit-plugin.h b/include/nbdkit-plugin.h
index 13541e5..b67d343 100644
--- a/include/nbdkit-plugin.h
+++ b/include/nbdkit-plugin.h
@@ -1,5 +1,5 @@
 /* nbdkit
- * Copyright (C) 2013-2017 Red Hat Inc.
+ * Copyright (C) 2013-2018 Red Hat Inc.
  * All rights reserved.
  *
  * Redistribution and use in source and binary forms, with or without
@@ -48,7 +48,16 @@ extern "C" {
 #define NBDKIT_THREAD_MODEL_SERIALIZE_REQUESTS        2
 #define NBDKIT_THREAD_MODEL_PARALLEL                  3

+#define NBDKIT_FLAG_MAY_TRIM           (1<<0)
+#define NBDKIT_FLAG_FUA                (1<<1)
+
+/* By default, a plugin gets API version 1; but you may request
+ * version 2 prior to including this header */
+#ifndef NBDKIT_API_VERSION
 #define NBDKIT_API_VERSION                            1
+#elif (NBDKIT_API_VERSION - 0) < 1 || NBDKIT_API_VERSION > 2)
+#error Unsupported API version
+#endif

 struct nbdkit_plugin {
   /* Do not set these fields directly; use NBDKIT_REGISTER_PLUGIN.
@@ -87,15 +96,35 @@ struct nbdkit_plugin {
   int (*can_trim) (void *handle);

   int (*pread) (void *handle, void *buf, uint32_t count, uint64_t offset);
+#if NBDKIT_API_VERSION == 1
   int (*pwrite) (void *handle, const void *buf, uint32_t count, uint64_t offset);
+#else
+  int (*pwrite_old) (void *handle, const void *buf, uint32_t count, uint64_t offset);
+#endif
   int (*flush) (void *handle);
+#if NBDKIT_API_VERSION == 1
   int (*trim) (void *handle, uint32_t count, uint64_t offset);
   int (*zero) (void *handle, uint32_t count, uint64_t offset, int may_trim);
+#else
+  int (*trim_old) (void *handle, uint32_t count, uint64_t offset);
+  int (*zero_old) (void *handle, uint32_t count, uint64_t offset, int may_trim);
+#endif

   int errno_is_preserved;

   void (*dump_plugin) (void);

+  int (*can_fua) (void *handle);
+#if NBDKIT_API_VERSION == 1
+  int (*_unused1) (void *, const void *, uint32_t, uint64_t, uint32_t);
+  int (*_unused2) (void *, uint32_t, uint64_t, uint32_t);
+  int (*_unused3) (void *, uint32_t, uint64_t, uint32_t);
+#else
+  int (*pwrite) (void *handle, const void *buf, uint32_t count,
+                 uint64_t offset, uint32_t flags);
+  int (*zero) (void *handle, uint32_t count, uint64_t offset, uint32_t flags);
+  int (*trim) (void *handle, uint32_t count, uint64_t offset, uint32_t flags);
+#endif
   /* int (*set_exportname) (void *handle, const char *exportname); */
 };

diff --git a/src/internal.h b/src/internal.h
index 7fd52a2..c76c0d3 100644
--- a/src/internal.h
+++ b/src/internal.h
@@ -40,6 +40,7 @@
 #include <sys/socket.h>
 #include <pthread.h>

+#define NBDKIT_API_VERSION 2
 #include "nbdkit-plugin.h"
 #include "nbdkit-filter.h"

@@ -98,9 +99,6 @@
       (type *) ((char *) __mptr - offsetof(type, member));       \
     })

-#define NBDKIT_FLAG_MAY_TRIM (1<<0) /* Maps to !NBD_CMD_FLAG_NO_HOLE */
-#define NBDKIT_FLAG_FUA      (1<<1) /* Maps to NBD_CMD_FLAG_FUA */
-
 /* main.c */
 extern const char *exportname;
 extern const char *ipaddr;
diff --git a/src/plugins.c b/src/plugins.c
index 1de2ba2..fd5e843 100644
--- a/src/plugins.c
+++ b/src/plugins.c
@@ -553,7 +553,7 @@ plugin_register (size_t index, const char *filename,
   }

   /* Check for incompatible future versions. */
-  if (plugin->_api_version != 1) {
+  if (plugin->_api_version < 0 || plugin->_api_version > 2) {
     fprintf (stderr, "%s: %s: plugin is incompatible with this version of nbdkit (_api_version = %d)\n",
              program_name, p->filename, plugin->_api_version);
     exit (EXIT_FAILURE);
--
2.14.3

_______________________________________________
Libguestfs mailing list
Libguestfs@redhat.com
https://www.redhat.com/mailman/listinfo/libguestfs