On 6/4/19 4:59 AM, Richard W.M. Jones wrote:
Callers may now optionally set up a concurrent writer thread. The
outline of this idea is explained here:
https://www.redhat.com/archives/libguestfs/2019-June/msg00010.html
The change is quite small, but here are some points which are true but
may not be obvious:
* All writes return immediately with success (unless we run out of
memory), and writes never block.
* When going down the READY -> ISSUE_COMMAND.START (etc) path, because
writes never block, we never stop for a read notification, so we
always send all commands in the issue queue before starting to read
any replies. (However this only means that the raw commands are
enqueued in the writer thread, which can still take its merry time
writing the requests to the socket.)
* Commands can be counted as "in flight" when they haven't been
written to the socket yet. This is another reason for increasing
the in flight limits. Does this matter? Probably not. The kernel
might also queue up packets before sending them, or they might be in
flight across the internet or queued at the receiver. Nothing about
"in flight" ever meant that the server has received and is
processing those packets.
* Even with this change, full parallelism isn't quite achievable.
It's still possible to be in a state such as
REPLY.STRUCTURED_REPLY.RECV_* waiting for an unusually fast writer /
slow reader server. If you then decide that you want to send yet
more commands then those commands will only be enqueued in the
handle, not dispatched to the writer thread. To avoid this it is
best to send as many commands as possible as soon as possible before
entering poll, but to a certain extent this is unavoidable with
having only one state machine.

Or, more likely, this happens when NBD_CMD_READ is so large that the
server can't send it all in one TCP window, in which case we are
guaranteed to have to wait for POLLIN between reading the first few
packets of the chunk (which clears up the server's ability to send)
and the arrival of the second half of the chunk. As long as we are
waiting in REPLY.STRUCTURED_REPLY.RECV_OFFSET_DATA or
REPLY.SIMPLE_REPLY.RECV_READ_PAYLOAD, we are locking out our ability to
send more commands unless we rework the state machine a bit.

If the state machine can make decisions based on whether there is a
writer callback, perhaps we can allow more states to (conditionally)
process a pending request immediately, any time we are otherwise blocked
waiting for POLLIN. I'll have to think more about this.
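To make the commit message's advice concrete (hand every command you can
to the handle before blocking in poll), here is the sort of caller-side
loop I have in mind. This is only a sketch: work_remaining() and
issue_pending_requests() are hypothetical application helpers, and the
exact nbd_aio_* signatures depend on the libnbd version in use.

  #include <poll.h>
  #include <libnbd.h>

  extern int work_remaining (void);                          /* hypothetical */
  extern void issue_pending_requests (struct nbd_handle *);  /* hypothetical */

  static void
  run_loop (struct nbd_handle *nbd)
  {
    while (work_remaining ()) {
      /* Issue every command we can first, so the concurrent writer
       * thread can start transmitting before we block in poll(2).
       */
      issue_pending_requests (nbd);

      /* nbd_aio_notify_write handling omitted; see item 4 in the docs
       * below.
       */
      struct pollfd fd = { .fd = nbd_aio_get_fd (nbd), .events = POLLIN };
      if (poll (&fd, 1, -1) == -1)
        break;
      if (fd.revents & POLLIN)
        nbd_aio_notify_read (nbd);
    }
  }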
---
docs/libnbd.pod | 73 +++++++++++++++++++++++++++++++++++++++++++++
generator/generator | 29 ++++++++++++++++++
lib/handle.c | 32 ++++++++++++++++++++
lib/internal.h | 7 +++++
lib/socket.c | 27 ++++++++++++++---
podwrapper.pl.in | 3 +-
6 files changed, 166 insertions(+), 5 deletions(-)
diff --git a/docs/libnbd.pod b/docs/libnbd.pod
index ede2539..07d259f 100644
--- a/docs/libnbd.pod
+++ b/docs/libnbd.pod
@@ -400,6 +400,79 @@ If you are issuing multiple in-flight requests (see above) and
limiting the number, then the limit should be applied to each
individual NBD connection.
+=head2 Concurrent writer thread
+
+To achieve the maximum possible performance from libnbd and NBD
+servers, in addition to the above techniques you must also use a
+concurrent writer thread. This feature allows requests to be issued
+on the NBD socket at the same time that replies are being read from
+the socket. In other words L<send(2)> and L<recv(2)> calls will be
+running at the same time on the same socket from different threads.
+
+There is a full example using a concurrent writer available at
+L<https://github.com/libguestfs/libnbd/blob/master/examples/concurrent-writer.c>
+
+To implement this, you change your ordinary AIO code in four ways:
+
+=over 4
+
+=item 1. Call nbd_set_concurrent_writer
+
+ struct writer_data {
+ struct nbd_handle *nbd;
+ /* other data here as required */
+ } data;
+
+ nbd_set_concurrent_writer (nbd, &data, writer);
+
+This function can be called on the handle at any time, either after
+the handle is created or after the connection and handshaking have
+completed.
+
+=item 2. Implement non-blocking writer callback
+
+C<writer> is a I<non-blocking> callback which enqueues the buffer into
+a ring or similar FIFO structure:
+
+ struct ring_item {
+ struct writer_data *data;
+ void *buf;
+ size_t len;
+ };
+
+ void writer (void *data, const void *buf, size_t len)
Update to cover the int return type, and whether we add a flags argument.
+ {
+ struct ring_item item;
+
+ /* add (data, buf, len) to a shared ring */
+ item.data = data;
+ item.buf = malloc (len);
+ memcpy (item.buf, buf, len);
+ item.len = len;
+ ring_add (&item);
+
+ writer_signal (); /* kick the writer thread */
Update to document immediate error return (ENOMEM)
+ }
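For what it's worth, the ring_add() and writer_signal() calls used above
can be as simple as a fixed array protected by a mutex plus a condition
variable to wake the writer thread. A minimal sketch, with hypothetical
names, no overflow handling, and reusing struct ring_item from the
example:

  #include <pthread.h>

  #define RING_SIZE 1024                  /* arbitrary */

  static pthread_mutex_t ring_lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t ring_cond = PTHREAD_COND_INITIALIZER;
  static struct ring_item ring[RING_SIZE];
  static size_t ring_head, ring_tail;     /* head == tail means empty */

  static void
  ring_add (const struct ring_item *item)
  {
    pthread_mutex_lock (&ring_lock);
    ring[ring_head] = *item;
    ring_head = (ring_head + 1) % RING_SIZE;
    pthread_mutex_unlock (&ring_lock);
  }

  static void
  writer_signal (void)
  {
    pthread_cond_signal (&ring_cond);
  }

  /* Called by the writer thread; blocks until an item is available. */
  static struct ring_item
  ring_get (void)
  {
    struct ring_item item;

    pthread_mutex_lock (&ring_lock);
    while (ring_tail == ring_head)
      pthread_cond_wait (&ring_cond, &ring_lock);
    item = ring[ring_tail];
    ring_tail = (ring_tail + 1) % RING_SIZE;
    pthread_mutex_unlock (&ring_lock);
    return item;
  }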
+
+=item 3. Implement writer thread
+
+You must also supply another thread which picks up data off the ring
+and writes it to the socket (see C<nbd_aio_get_fd>). If there is an
+error when writing to the socket, call C<nbd_concurrent_writer_error>
+with the C<errno>.
+
+You have a choice of whether to implement one thread per nbd_handle or
+one thread shared between all handles.
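The writer thread from item 3 then boils down to a loop that drains the
ring and writes each buffer to the socket, reporting any failure through
nbd_concurrent_writer_error. Again just a sketch: one thread per
nbd_handle, ring_get() from the hypothetical ring above, and clean
shutdown handling omitted (which ties into my question about shutdown
further down).

  #include <sys/socket.h>
  #include <errno.h>
  #include <stdlib.h>

  static void *
  writer_thread (void *vp)
  {
    struct nbd_handle *nbd = vp;
    int fd = nbd_aio_get_fd (nbd);

    for (;;) {
      struct ring_item item = ring_get (); /* blocks until data is queued */
      const char *p = item.buf;
      size_t n = item.len;

      /* Loop because send(2) may accept only part of the buffer. */
      while (n > 0) {
        ssize_t r = send (fd, p, n, 0);
        if (r == -1) {
          if (errno == EINTR)
            continue;
          nbd_concurrent_writer_error (nbd, errno);
          free (item.buf);
          return NULL;
        }
        p += r;
        n -= r;
      }
      free (item.buf);
    }
  }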
+
+=item 4. Modify main loop
+
+Finally your main loop can unconditionally call
+C<nbd_aio_notify_write> when C<nbd_aio_get_direction> returns C<WRITE>
+or C<BOTH> (since the concurrent thread can always enqueue more data
+and so is always "ready to write").
Should it likewise unconditionally poll for POLLIN, even if
aio_get_direction does not currently request reads (in the case where
our atomic read of the current state spots a transient condition caused by
some other thread progressing through emitting a request)? Or are we trying
to beef up the state machine so that h->state never exposes transient
states to other threads?
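Put differently, I read item 4 as roughly the loop below. It is only a
sketch: the LIBNBD_AIO_DIRECTION_* names assume the current libnbd.h
constants, command issue and termination are omitted, and whether the
POLLIN poll should be unconditional is exactly the question above.

  for (;;) {
    unsigned dir = nbd_aio_get_direction (nbd);

    if (dir == 0)
      break;    /* idle: nothing in flight and nothing queued to send */

    /* The writer thread can always accept more data, so there is no
     * need to wait for POLLOUT: kick the state machine whenever it
     * wants to write.
     */
    if (dir & LIBNBD_AIO_DIRECTION_WRITE)
      nbd_aio_notify_write (nbd);

    if (dir & LIBNBD_AIO_DIRECTION_READ) {
      struct pollfd fd = { .fd = nbd_aio_get_fd (nbd), .events = POLLIN };
      if (poll (&fd, 1, -1) == -1)
        break;
      if (fd.revents & POLLIN)
        nbd_aio_notify_read (nbd);
    }
  }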
+
+=back
+
=head1 ENCRYPTION AND AUTHENTICATION
The NBD protocol and libnbd supports TLS (sometimes incorrectly called
diff --git a/generator/generator b/generator/generator
index ff6075d..718e253 100755
--- a/generator/generator
+++ b/generator/generator
@@ -1094,6 +1094,35 @@ C<\"qemu:dirty-bitmap:...\"> for qemu-nbd
(see qemu-nbd I<-B> option). See also C<nbd_block_status>.";
};
+ "set_concurrent_writer", {
+ default_call with
+ args = [ Opaque "data";
+ CallbackPersist ("writer", [Opaque "data";
+ BytesIn ("buf", "len")]) ];
+ ret = RErr;
+ permitted_states = [ Created; Connecting; Connected ];
+ shortdesc = "set a concurrent writer thread";
+ longdesc = "\
+Provide an optional concurrent writer thread for better performance.
+See L<libnbd(3)/Concurrent writer thread> for how to use this.";
+ };
+
+ "concurrent_writer_error", {
+ default_call with
+ args = [ Int "err" ]; ret = RErr;
+ shortdesc = "signal an error from the concurrent writer thread";
+ longdesc = "\
+This can be called from the concurrent writer thread to signal
+that there was an error writing to the socket. As there is no
+way to recover from such errors, the connection will move to the
+dead state soon after.
+
+The parameter is the C<errno> returned by the failed L<send(2)> call.
+It must be non-zero.
+
+See L<libnbd(3)/Concurrent writer thread> for how to use this.";
+ };
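If I am reading these generator entries right, the generated public
prototypes come out roughly as below, assuming the usual RErr-to-int
mapping and modulo the questions above about an int return type and a
flags argument for the writer callback:

  int nbd_set_concurrent_writer (struct nbd_handle *h, void *data,
                                 void (*writer) (void *data,
                                                 const void *buf,
                                                 size_t len));
  int nbd_concurrent_writer_error (struct nbd_handle *h, int err);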
Do we also need a function for the writer thread to call when it is
confirming that the writer callback was passed a flag stating that no
further writes are needed? I'm trying to figure out if
nbd_shutdown/nbd_close should wait for acknowledgment that the writer
thread has reached clean shutdown; it may especially matter for clean
TLS shutdown.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org