On 11/21/2017 03:21 PM, Richard W.M. Jones wrote:
This works OK on x86_64, but fails on our fast new Amberwing (aarch64)
machine. I've attached the test-suite.log file, but I'm not very sure
what's going wrong from that.
I'll see what I can spot...
FAIL: test-parallel-file.sh
===========================
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0.0007 sec (1.340 KiB/sec and 1371.7421 ops/sec)
wrote 1/1 bytes at offset 1
1 bytes, 1 ops; 0.0203 sec (49.193231 bytes/sec and 49.1932 ops/sec)
/var/tmp/nbdkit-1.1.18/src/nbdkit -v -t 1 -U -
/var/tmp/nbdkit-1.1.18/plugins/file/.libs/nbdkit-file-plugin.so file=file-data rdelay=2
wdelay=1 --run
$QEMU_IO -f raw -c "aio_read 0 1" -c "aio_write -P 2 1 1" -c
aio_flush $nbd
nbdkit: file[1]: debug: handshake complete, processing requests
serially
When run serially,
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0:00:02.00 (0.499933 bytes/sec and 0.4999 ops/sec)
wrote 1/1 bytes at offset 1
1 bytes, 1 ops; 0:00:03.01 (0.331378 bytes/sec and 0.3314 ops/sec)
the read takes 2 seconds, and the write takes an additional second.
nbdkit: debug:
/var/tmp/nbdkit-1.1.18/plugins/file/.libs/nbdkit-file-plugin.so: unload
/var/tmp/nbdkit-1.1.18/src/nbdkit -v -U -
/var/tmp/nbdkit-1.1.18/plugins/file/.libs/nbdkit-file-plugin.so file=file-data rdelay=2
wdelay=1 --run
$QEMU_IO -f raw -c "aio_read 0 1" -c "aio_write -P 2 1 1" -c
aio_flush $nbd
When run with threads,
nbdkit: file[1]: debug: handshake complete, processing requests with
16 threads
nbdkit: debug: starting worker thread file.0
nbdkit: debug: starting worker thread file.1
nbdkit: file.0: debug: acquire unload prevention lock
nbdkit: file.0: debug: pread count=1 offset=0
nbdkit: debug: starting worker thread file.2
nbdkit: file.1: debug: acquire unload prevention lock
nbdkit: file.1: debug: pwrite count=1 offset=1
there are definitely two threads that both obtain the unload prevention
lock at once (so we do have parallel processing),
nbdkit: file.1: debug: flush
nbdkit: file.0: debug: release unload prevention lock
nbdkit: file.1: debug: release unload prevention lock
nbdkit: file.2: debug: acquire unload prevention lock
nbdkit: file.2: debug: flush
nbdkit: file.2: debug: release unload prevention lock
nbdkit: file.4: debug: acquire unload prevention lock
nbdkit: file.4: debug: flush
nbdkit: file.4: debug: release unload prevention lock
and then, because we have not wired up nbdkit to expose FUA support to
the plugins, we are calling the flush callback in parallel from three
threads at once (probably far more flushing than we need, but that
doesn't seem to be the cause of the failure). I don't know how many of
those flushes were triggered by nbdkit's connections.c and how many were
triggered by NBD_CMD_FLUSH from qemu-io.
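For reference, this is how I picture the unload prevention lock - a
sketch that assumes it is an ordinary pthread rwlock, with invented
function names rather than the real code from plugins.c: request
handlers take it shared, so any number of pread/pwrite/flush calls can
hold it at once, while unload takes it exclusive and therefore waits
for them all:

  /* Sketch only: models the unload prevention lock as a pthread rwlock.
   * The function names are invented for illustration. */
  #include <pthread.h>

  extern int plugin_flush (void *handle);      /* hypothetical callback */
  extern void plugin_unload (void);            /* hypothetical */

  static pthread_rwlock_t unload_prevention_lock = PTHREAD_RWLOCK_INITIALIZER;

  /* Each worker thread wraps a plugin callback like this; shared (read)
   * locks do not exclude each other, so several threads can be inside
   * pread/pwrite/flush at the same time. */
  static int
  call_plugin_flush (void *handle)
  {
    int r;

    pthread_rwlock_rdlock (&unload_prevention_lock);
    r = plugin_flush (handle);
    pthread_rwlock_unlock (&unload_prevention_lock);
    return r;
  }

  /* Unload takes the lock exclusively, so it waits for every in-flight
   * request to finish first. */
  static void
  unload (void)
  {
    pthread_rwlock_wrlock (&unload_prevention_lock);
    plugin_unload ();
    pthread_rwlock_unlock (&unload_prevention_lock);
  }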
nbdkit: file.0: debug: client closed input socket, closing
connection
nbdkit: file.0: debug: exiting worker thread file.0
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0:00:02.47 (0.403374 bytes/sec and 0.4034 ops/sec)
wrote 1/1 bytes at offset 1
1 bytes, 1 ops; 0:00:02.55 (0.391644 bytes/sec and 0.3916 ops/sec)
qemu-io finally records the replies it receives, and reports that the
read took 2.4 seconds, then the write took an additional .08 seconds
beyond the read. The two commands were obviously in flight in parallel
(the wdelay is only 1 second, as the earlier serial results show), but
there must be enough lock contention or other interaction between the
threads that the write did not complete before the read. When I run the
same test locally and get a PASS result, my qemu-io results look more
like:
wrote 1/1 bytes at offset 1
1 bytes, 1 ops; 0:00:01.00 (0.994919 bytes/sec and 0.9949 ops/sec)
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0:00:02.00 (0.499855 bytes/sec and 0.4999 ops/sec)
Hmm - I wonder - could the extra delay induced by the flush be what
makes the write take longer than the read? Should I swap the commands
(have read be 2 seconds, write be 1 second, so that any flushing tied to
the write is more likely to still finish before the read)?
Also, this makes it VERY obvious that without timestamps in our nbdkit
-v output, it's hard to tell WHEN things are happening or to see where
the bottlenecks are.
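If we do add timestamps, I imagine something as small as this would do
the job - a sketch only, not actual errors.c code, using an invented
debug_timestamp() helper that prefixes each line with a CLOCK_MONOTONIC
reading:

  /* Sketch: prepend seconds.microseconds from CLOCK_MONOTONIC to each
   * debug line so that relative timing between events is visible.
   * Invented helper, not current errors.c code. */
  #define _POSIX_C_SOURCE 200809L
  #include <stdio.h>
  #include <time.h>

  static void
  debug_timestamp (FILE *fp)
  {
    struct timespec ts;

    if (clock_gettime (CLOCK_MONOTONIC, &ts) == 0)
      fprintf (fp, "[%ld.%06ld] ", (long) ts.tv_sec, ts.tv_nsec / 1000);
  }

  /* Usage in the debug path would be roughly:
   *   debug_timestamp (stderr);
   *   followed by the existing "nbdkit: <thread>: debug: ..." output.
   */

Of course every extra stdio call is also another chance for output from
two processes to interleave, which leads into the next point.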
FAIL: test-parallel-nbd.sh
==========================
nbdkit: file[1]: debug: release unload prevention lock
nbdkit: file[1]: debug: handshake complete, processing requests with 16 threads
nbdkit: nbd[1]: nbdkit: debug: debug: acquire unload prevention lockstarting worker
thread file.0
nbdkit: nbd[1]: nbdkit: debug: debug: get_sizestarting worker thread file.2
nbdkit: nbd[1]: debug: can_write
Here, we have a nice demonstration that commit d02d9c9d works for
messages from one process (my debugging was worse without the mutex in
errors.c), but makes no difference when interleaving messages from
multiple processes. If we're that worried about it, we could make
errors.c format the entire message into a single buffer (prologue, body,
and \n), and only then make a single stdio call on that buffer, to make
cross-process interleaving less likely, at the expense of more memory
manipulation to build up the error message in a temporary buffer. I'm
not too fussed, though - you aren't often trying to debug two nbdkit
processes writing to the same file at once.
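If we ever do want it, the single-buffer version would look something
like this - a sketch only, with invented names and a simplified
prologue, just to show the one-stdio-call shape:

  #include <stdarg.h>
  #include <stdio.h>

  void
  debug_single_write (const char *context, const char *fmt, ...)
  {
    char msg[1024];
    size_t n = 0;
    va_list args;

    /* Prologue (simplified; the real one includes the program name and
     * thread name, as seen in the -v output above). */
    n += snprintf (msg, sizeof msg, "nbdkit: %s: debug: ", context);

    /* Body. */
    va_start (args, fmt);
    if (n < sizeof msg)
      n += vsnprintf (msg + n, sizeof msg - n, fmt, args);
    va_end (args);

    /* Trailing \n, truncating an over-long message. */
    if (n > sizeof msg - 2)
      n = sizeof msg - 2;
    msg[n] = '\n';
    msg[n+1] = '\0';

    /* A single stdio call on the assembled buffer, so the whole line
     * normally reaches the kernel as one write(2), making cross-process
     * interleaving much less likely (though still not guaranteed). */
    fputs (msg, stderr);
  }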
/var/tmp/nbdkit-1.1.18/src/nbdkit -v -U -
/var/tmp/nbdkit-1.1.18/plugins/nbd/.libs/nbdkit-nbd-plugin.so
socket=test-parallel-nbd.sock --run
$QEMU_IO -f raw -c "aio_read 0 1" -c "aio_write -P 2 1 1" -c
aio_flush $nbd
And when we run the parallel nbd plugin over the parallel file plugin, we see:
nbdkit: nbd[1]: debug: handshake complete, processing requests with
16 threads
nbdkit: debug: starting worker thread file.1
nbdkit: debug: nbdkit: starting worker thread file.10debug:
starting worker thread nbd.0
nbdkit: debug: starting worker thread nbd.2
nbdkit: debug: starting worker thread nbd.3
nbdkit: nbdkit: debug: debug: starting worker thread file.4starting worker thread nbd.1
nbdkit: nbd.3: debug: acquire unload prevention lock
nbdkit: nbd.3nbdkit: : debug: debug: starting worker thread file.12pread count=1
offset=0
nbdkit: nbd.3: debug: sending request with type 0 and cookie 0xa00000009
the nbd plugin sends the read request first,
nbdkit: debug: nbdkit: starting worker thread file.6debug:
starting worker thread nbd.4
nbdkit: nbdkit: file.0debug: : starting worker thread nbd.6debug:
acquire unload prevention lock
nbdkit: file.0: debug: pread count=1 offset=0
which the file plugin starts processing;
nbdkit: nbd.0: debug: acquire unload prevention lock
nbdkit: nbdkit: debug: nbd.0starting worker thread file.15:
debug: pwrite count=1 offset=1
nbdkit: debug: starting worker thread nbd.5
nbdkit: debug: starting worker thread file.8nbdkit:
nbd.0: debug: sending request with type 1 and cookie 0xc0000000b
then in another thread the nbd plugin sends the write request,
nbdkit: debug: starting worker thread file.14
nbdkit: file.6: nbdkit: debug: debug: acquire unload prevention lockstarting worker
thread nbd.7
nbdkit: file.6: debug: pwrite count=1 offset=1
which the file plugin starts processing. So the requests are indeed
interleaved across both processes, as desired.
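As an aside, the cookies in those "sending request" / "received reply"
lines are what make this work: each worker thread tags the request it
sends, and whichever thread reads the socket hands the reply back to the
waiter with the matching cookie. A sketch of that general pattern
(invented names and data structure, not the nbd plugin's actual code):

  #include <pthread.h>
  #include <stdint.h>
  #include <stdbool.h>

  struct transaction {
    uint64_t cookie;              /* sent in the NBD request header */
    bool done;                    /* set once the reply is seen */
    int error;                    /* error from the reply, 0 on success */
    struct transaction *next;     /* singly-linked list of in-flight requests */
  };

  static struct transaction *inflight;   /* list head, protected by lock */
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
  static pthread_cond_t reply_cond = PTHREAD_COND_INITIALIZER;

  /* Whoever reads the socket: mark the matching transaction done and
   * wake up the waiters. */
  void
  reply_received (uint64_t cookie, int error)
  {
    pthread_mutex_lock (&lock);
    for (struct transaction *t = inflight; t; t = t->next) {
      if (t->cookie == cookie) {
        t->done = true;
        t->error = error;
        break;
      }
    }
    pthread_cond_broadcast (&reply_cond);
    pthread_mutex_unlock (&lock);
  }

  /* Worker thread: after sending its request, block until the reply with
   * its cookie has arrived.  Other workers keep sending and receiving in
   * the meantime, which is how a later, shorter request can complete
   * before an earlier, longer one. */
  int
  wait_for_reply (struct transaction *t)
  {
    pthread_mutex_lock (&lock);
    while (!t->done)
      pthread_cond_wait (&reply_cond, &lock);
    pthread_mutex_unlock (&lock);
    return t->error;
  }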
nbdkit: debug: starting worker thread nbd.8
nbdkit: debug: starting worker thread nbd.9
nbdkit: debug: starting worker thread nbd.10
nbdkit: debug: starting worker thread nbd.12
nbdkit: debug: starting worker thread nbd.13
nbdkit: debug: starting worker thread nbd.14
nbdkit: debug: starting worker thread nbd.11
nbdkit: debug: starting worker thread nbd.15
nbdkit: file.6: debug: release unload prevention lock
nbdkit: debug: received reply for cookie 0xc0000000b
And the write completes first, as desired.
nbdkit: nbd.0: debug: flush
nbdkit: nbd.0: debug: sending request with type 3 and cookie 0xc0000000b
nbdkit: file.5: debug: acquire unload prevention lock
nbdkit: file.5: debug: flush
nbdkit: file.0: debug: release unload prevention lock
nbdkit: debug: received reply for cookie 0xa00000009
nbdkit: nbd.3: debug: release unload prevention lock
nbdkit: file.5: debug: release unload prevention lock
nbdkit: debug: received reply for cookie 0xc0000000b
nbdkit: nbd.0: debug: release unload prevention lock
nbdkit: nbd.6: debug: acquire unload prevention lock
nbdkit: nbd.6: debug: flush
nbdkit: nbd.6: debug: sending request with type 3 and cookie 0xa00000009
nbdkit: file.7: debug: acquire unload prevention lock
nbdkit: file.7: debug: flush
nbdkit: file.7: debug: release unload prevention lock
nbdkit: debug: received reply for cookie 0xa00000009
nbdkit: nbd.6: debug: release unload prevention lock
nbdkit: nbd.2: debug: acquire unload prevention lock
nbdkit: nbd.2: debug: flush
nbdkit: nbd.2: debug: sending request with type 3 and cookie 0xa00000009
nbdkit: file.9: debug: acquire unload prevention lock
nbdkit: file.9: debug: flush
nbdkit: file.9: debug: release unload prevention lock
nbdkit: debug: received reply for cookie 0xa00000009
but we have a lot of traffic on those flush requests,
nbdkit: nbd.2: debug: release unload prevention lock
nbdkit: nbd.4: debug: client sent disconnect command, closing connection
nbdkit: nbd.4: debug: exiting worker thread nbd.4
nbdkit: nbd.1: debug: client closed input socket, closing connection
read 1/1 bytes at offset 0
1 bytes, 1 ops; 0:00:02.47 (0.403408 bytes/sec and 0.4034 ops/sec)
wrote 1/1 bytes at offset 1
1 bytes, 1 ops; 0:00:02.56 (0.390498 bytes/sec and 0.3905 ops/sec)
so that qemu-io sees the read take 2.4 seconds and the write take .09
seconds longer.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org