On Sat, Mar 06, 2021 at 11:21:22PM +0200, Nir Soffer wrote:
> Handling extents is complicated, in particular when using async block
> status. But I could not get this working using sync block status; it
> seems that libnbd eats the events on the source, and then we hang
> forever waiting for inflight reads.
Is this a bug in libnbd?
Maybe; I can try to post a simple reproducer later. Even if this is not a bug,
it would be nice to show how sync calls can be mixed with async calls when
using libev (or another event loop).
Also, is libev going to be faster than poll(2) [as used by nbdcopy] for
copying? I would imagine that it should make hardly any difference
since we're only using a couple of file descriptors with poll(2) in
nbdcopy, and it's my understanding that poll(2) is only pessimal when
you have to poll large numbers of FDs.
I think that poll or select should be faster for 2 file descriptors, but the
difference is negligible in this context.
Some data when copying a Fedora 32 qcow2 image to the nbdkit memory plugin:
$ strace -c ./copy-libev $SRC $DST
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
50.07 0.271421 17 15728 4 recvfrom
39.10 0.211987 30 6928 sendto
8.25 0.044734 4 9114 epoll_wait
1.13 0.006102 4 1335 epoll_ctl
1.01 0.005484 457 12 6 wait4
0.10 0.000522 4 122 mmap
0.06 0.000311 3 78 48 openat
0.05 0.000298 49 6 clone
0.04 0.000227 5 43 read
0.03 0.000143 3 42 close
0.02 0.000135 2 52 rt_sigprocmask
0.02 0.000132 3 37 rt_sigaction
0.02 0.000105 3 32 mprotect
0.02 0.000100 3 30 15 stat
0.02 0.000095 47 2 execve
0.01 0.000061 2 29 fstat
0.01 0.000044 4 9 poll
0.01 0.000038 2 16 pread64
0.00 0.000027 5 5 pipe
0.00 0.000021 1 13 brk
0.00 0.000019 3 6 rt_sigreturn
0.00 0.000017 1 11 2 ioctl
0.00 0.000015 7 2 munmap
0.00 0.000015 7 2 connect
0.00 0.000010 5 2 socket
0.00 0.000007 3 2 shutdown
0.00 0.000007 7 1 sysinfo
0.00 0.000005 2 2 getgid
0.00 0.000004 0 5 4 access
0.00 0.000004 2 2 getsockopt
0.00 0.000004 4 1 uname
0.00 0.000004 2 2 getuid
0.00 0.000004 2 2 geteuid
0.00 0.000004 2 2 getegid
0.00 0.000004 1 4 2 arch_prctl
0.00 0.000003 1 2 2 setsockopt
0.00 0.000002 0 5 lseek
0.00 0.000000 0 1 dup2
0.00 0.000000 0 2 getpid
0.00 0.000000 0 5 1 fcntl
0.00 0.000000 0 1 getppid
0.00 0.000000 0 1 getpgrp
0.00 0.000000 0 1 set_tid_address
0.00 0.000000 0 1 set_robust_list
0.00 0.000000 0 1 eventfd2
0.00 0.000000 0 1 epoll_create1
0.00 0.000000 0 3 prlimit64
0.00 0.000000 0 1 getrandom
------ ----------- ----------- --------- --------- ----------------
100.00 0.542115 16 33704 84 total
$ strace -f -c ../copy/nbdcopy --sparse=1048576 --request-size=1048576 --requests=16 --connections=1 $SRC $DST
strace: Process 1094611 attached
strace: Process 1094612 attached
strace: Process 1094613 attached
strace: Process 1094614 attached
strace: Process 1094615 attached
strace: Process 1094616 attached
strace: Process 1094617 attached
strace: Process 1094618 attached
strace: Process 1094619 attached
strace: Process 1094620 attached
strace: Process 1094621 attached
strace: Process 1094622 attached
strace: Process 1094623 attached
strace: Process 1094641 attached
(Not sure why we start so many threads with --connections=1 - bug?)
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
58.10 0.881783 146963 6 futex
18.70 0.283841 17 16634 2 recvfrom
17.42 0.264326 31 8398 4 sendto
2.73 0.041494 4 8877 poll
1.97 0.029895 1299 23 10 wait4
0.46 0.006970 240 29 madvise
0.13 0.001903 6 311 mmap
0.09 0.001330 190 7 execve
0.07 0.001041 40 26 munmap
0.05 0.000825 5 157 67 openat
0.04 0.000557 5 100 mprotect
0.04 0.000548 4 113 read
0.04 0.000545 38 14 clone
0.03 0.000440 2 151 6 close
0.02 0.000329 3 97 fstat
0.02 0.000267 2 125 rt_sigprocmask
0.02 0.000251 3 66 pread64
0.01 0.000214 1 148 rt_sigaction
0.01 0.000188 18 10 statfs
0.01 0.000145 5 28 brk
0.01 0.000143 3 37 write
0.01 0.000140 5 24 10 access
0.01 0.000083 1 69 31 stat
0.00 0.000063 10 6 socket
0.00 0.000049 3 15 6 ioctl
0.00 0.000047 15 3 1 lstat
0.00 0.000045 7 6 4 connect
0.00 0.000032 4 8 pipe
0.00 0.000031 2 14 7 arch_prctl
0.00 0.000028 2 10 rt_sigreturn
0.00 0.000023 2 8 prlimit64
0.00 0.000021 3 7 set_robust_list
0.00 0.000019 3 6 set_tid_address
0.00 0.000011 11 1 lgetxattr
0.00 0.000009 4 2 shutdown
0.00 0.000008 1 6 lseek
0.00 0.000007 7 1 1 getxattr
0.00 0.000006 0 15 getpid
0.00 0.000006 6 1 sysinfo
0.00 0.000005 5 1 uname
0.00 0.000004 0 16 dup2
0.00 0.000000 0 2 2 setsockopt
0.00 0.000000 0 2 getsockopt
0.00 0.000000 0 4 1 fcntl
0.00 0.000000 0 1 chdir
0.00 0.000000 0 9 getuid
0.00 0.000000 0 9 getgid
0.00 0.000000 0 9 geteuid
0.00 0.000000 0 9 getegid
0.00 0.000000 0 1 getppid
0.00 0.000000 0 1 getpgrp
0.00 0.000000 0 2 getdents64
0.00 0.000000 0 1 getrandom
------ ----------- ----------- --------- --------- ----------------
100.00 1.517672 42 35626 152 total
If you want to use poll in libev, you can:
diff --git a/examples/copy-libev.c b/examples/copy-libev.c
index 6e6cbcb..1878f27 100644
--- a/examples/copy-libev.c
+++ b/examples/copy-libev.c
@@ -530,7 +530,7 @@ main (int argc, char *argv[])
{
int i;
- loop = EV_DEFAULT;
+ loop = ev_loop_new (EVBACKEND_POLL);
if (argc != 3)
FAIL ("Usage: %s src-uri dst-uri", PROG);
But it does not look useful:
$ strace -c ./copy-libev $SRC $DST
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
48.50 0.261526 16 15626 1 recvfrom
39.01 0.210365 30 7007 sendto
8.24 0.044413 4 9214 poll
The advantage of using an event loop is that it makes the application easier
to extend; for example, I'm using async block status, which would be much
harder to do in nbdcopy. Another advantage is adding more features like
handling signals, timeouts, child processes, or other network protocols
(e.g. nbd <-> http).
Anyway, the patch series is fine, ACK.
Rich.
> Since getting extents is asynchronous, and starting a new request depends
> on being able to get the next extent, requests now have a new waiting
> state. When a request starts while extents are not available, it is
> marked as waiting. When the extents request completes, we start all
> waiting requests.
>
> Here is an example log showing what's going on while copying a Fedora 32
> image:
>
> The first request detects that we don't have extents yet, so it starts
> an async block status request.
>
> copy-libev: r0: start extents offset=0 count=134217728
> copy-libev: r0: received 14 extents for base:allocation
> copy-libev: r0: extents completed time=0.001098
>
> When the extents request completes, we start all waiting requests. The
> first request (r0) looked at the first extent (e0) and consumed all of it:
>
> copy-libev: r0: start request offset=0
> copy-libev: e0: offset=0 len=65536 zero=0
> copy-libev: r0: extent offset=0 len=65536 zero=0
> copy-libev: r0: start read offset=0 len=65536
>
> The second request (r15) looks at the second extent (e1) and consumes
> all of it, starting a zero request:
>
> copy-libev: r15: start request offset=65536
> copy-libev: e1: offset=65536 len=983040 zero=1
> copy-libev: r15: extent offset=65536 len=983040 zero=1
> copy-libev: r15: start zero offset=65536 len=983040
>
> ...
>
> Request (r12) looked at the fifth extent (e4), but since this extent
> was large (10747904 bytes), it consumed only 1 MiB of it:
>
> copy-libev: r12: start request offset=2097152
> copy-libev: e4: offset=2097152 len=10747904 zero=0
> copy-libev: r12: extent offset=2097152 len=1048576 zero=0
> copy-libev: r12: start read offset=2097152 len=1048576
>
> The next request consumed the next 1 MiB of the same extent (e4):
>
> copy-libev: r11: start request offset=3145728
> copy-libev: e4: offset=3145728 len=9699328 zero=0
> copy-libev: r11: extent offset=3145728 len=1048576 zero=0
> copy-libev: r11: start read offset=3145728 len=1048576
> copy-libev: r10: start request offset=4194304
>
> ..
>
> The last part of extent e4 was consumed, and we switched to extent e5:
>
> copy-libev: r2: start request offset=12582912
> copy-libev: e4: offset=12582912 len=262144 zero=0
> copy-libev: r2: extent offset=12582912 len=262144 zero=0
> copy-libev: r2: start read offset=12582912 len=262144
> copy-libev: r1: start request offset=12845056
> copy-libev: e5: offset=12845056 len=131072 zero=1
> copy-libev: r1: extent offset=12845056 len=131072 zero=1
> copy-libev: r1: start zero offset=12845056 len=131072
>
> ...
>
> Request (r11) consumed the last extent (e13), starting a zero request.
> This frees the extents array:
>
> copy-libev: r11: start request offset=133955584
> copy-libev: e13: offset=133955584 len=262144 zero=1
> copy-libev: r11: extent offset=133955584 len=262144 zero=1
> copy-libev: r11: consumed all extents offset=134217728
> copy-libev: r11: start zero offset=133955584 len=262144
>
> ...
>
> Request (r12) started after the extents array was cleared, so it started
> a new block status request:
>
> copy-libev: r12: start extents offset=134217728 count=134217728
> ...
> copy-libev: r12: received 3 extents for base:allocation
> copy-libev: r12: extents completed time=0.003027
>
> ...
>
> The rest of the flow is the same as before. When all requests are done,
> we shut down the event loop and flush:
>
> copy-libev: r4: request completed offset=6438256640 len=1048576 time=0.000132
> copy-libev: r1: read completed offset=6442385408 len=65536
> copy-libev: r1: start write offset=6442385408 len=65536
> copy-libev: r14: request completed offset=6439305216 len=1048576 time=0.000126
> copy-libev: r8: request completed offset=6440353792 len=1048576 time=0.000151
> copy-libev: r2: request completed offset=6441402368 len=983040 time=0.000143
> copy-libev: r1: request completed offset=6442385408 len=65536 time=0.000142
> copy-libev: flush
>
> Signed-off-by: Nir Soffer <nsoffer@redhat.com>
> ---
> examples/copy-libev.c | 232 ++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 224 insertions(+), 8 deletions(-)
>
> diff --git a/examples/copy-libev.c b/examples/copy-libev.c
> index 84d5c03..3030955 100644
> --- a/examples/copy-libev.c
> +++ b/examples/copy-libev.c
> @@ -41,6 +41,7 @@
> */
> #define MAX_REQUESTS 16
> #define REQUEST_SIZE (1024 * 1024)
> +#define EXTENTS_SIZE (128 * 1024 * 1024)
>
> #define MIN(a,b) (a) < (b) ? (a) : (b)
>
> @@ -62,14 +63,18 @@ struct connection {
> ev_io watcher;
> struct nbd_handle *nbd;
> bool can_zero;
> + bool can_extents;
> };
>
> struct request {
> + ev_timer watcher; /* For starting on next loop iteration. */
> int64_t offset;
> size_t length;
> + bool zero;
> unsigned char *data;
> size_t index;
> ev_tstamp started;
> + bool waiting; /* Waiting for extents completion. */
> };
>
> static struct ev_loop *loop;
> @@ -77,11 +82,29 @@ static ev_prepare prepare;
> static struct connection src;
> static struct connection dst;
> static struct request requests[MAX_REQUESTS];
> +
> +/* List of extents received from source server. Using the same format returned
> + * by libnbd, array of uint32_t pairs. The first item is the length of the
> + * extent, and the second is the extent flags.
> + *
> + * The number of extents is extents_len / 2. extents_pos is the index of the
> + * current extent.
> + *
> + * extents_in_progress flag is set when we start asynchronous block status
> + * request.
> + */
> +static uint32_t *extents;
> +static size_t extents_len;
> +static size_t extents_pos;
> +static bool extents_in_progress;
> +
> static int64_t size;
> static int64_t offset;
> static int64_t written;
> static bool debug;
>
> +static inline void start_request_soon (struct request *r);
> +static void start_request_cb (struct ev_loop *loop, ev_timer *w, int revents);
> static void start_request(struct request *r);
> static void start_read(struct request *r);
> static void start_write(struct request *r);
> @@ -133,23 +156,206 @@ get_events(struct connection *c)
> default:
> return 0;
> }
> +}
> +
> +static int
> +extent_callback (void *user_data, const char *metacontext, uint64_t offset,
> + uint32_t *entries, size_t nr_entries, int *error)
> +{
> + struct request *r = user_data;
> +
> + if (strcmp (metacontext, LIBNBD_CONTEXT_BASE_ALLOCATION) != 0) {
> + DEBUG ("Unexpected meta context: %s", metacontext);
> + return 1;
> + }
> +
> + extents = malloc (nr_entries * sizeof *extents);
> + if (extents == NULL)
> + FAIL ("Cannot allocate extents: %s", strerror (errno));
> +
> + memcpy (extents, entries, nr_entries * sizeof *extents);
> + extents_len = nr_entries;
> +
> + DEBUG ("r%d: received %d extents for %s",
> + r->index, nr_entries / 2, metacontext);
> +
> + return 1;
> +}
> +
> +static int
> +extents_completed (void *user_data, int *error)
> +{
> + struct request *r = (struct request *)user_data;
> + int i;
> +
> + DEBUG ("r%d: extents completed time=%.6f",
> + r->index, ev_now (loop) - r->started);
> +
> + extents_in_progress = false;
> +
> + if (extents == NULL) {
> + DEBUG ("r%d: received no extents, disabling extents", r->index);
> + src.can_extents = false;
> + }
>
> + /* Start requests waiting for extents completion on the next loop
> + * iteration, to avoid deadlock if we need to start a read.
> + */
> + for (i = 0; i < MAX_REQUESTS; i++) {
> + struct request *r = &requests[i];
> + if (r->waiting) {
> + r->waiting = false;
> + start_request_soon (r);
> + }
> + }
>
> + return 1;
> +}
> +
> +static bool
> +start_extents (struct request *r)
> +{
> + size_t count = MIN (EXTENTS_SIZE, size - offset);
> + int64_t cookie;
> +
> + if (extents_in_progress) {
> + r->waiting = true;
> + return true;
> + }
> +
> + DEBUG ("r%d: start extents offset=%ld count=%ld", r->index, offset, count);
> +
> + cookie = nbd_aio_block_status (
> + src.nbd, count, offset,
> + (nbd_extent_callback) { .callback=extent_callback,
> + .user_data=r },
> + (nbd_completion_callback) { .callback=extents_completed,
> + .user_data=r },
> + 0);
> + if (cookie == -1) {
> + DEBUG ("Cannot get extents: %s", nbd_get_error ());
> + src.can_extents = false;
> + return false;
> + }
> +
> + r->waiting = true;
> + extents_in_progress = true;
> +
> + return true;
> +}
> +
> +/* Return next extent to process. */
> +static void
> +next_extent (struct request *r)
> +{
> + uint32_t limit = MIN (REQUEST_SIZE, size - offset);
> + uint32_t length = 0;
> + bool is_zero;
> +
> + assert (extents);
> +
> + is_zero = extents[extents_pos + 1] & LIBNBD_STATE_ZERO;
> +
> + while (length < limit) {
> + DEBUG ("e%d: offset=%ld len=%ld zero=%d",
> + extents_pos / 2, offset, extents[extents_pos], is_zero);
> +
> + /* If this extent is too large, steal some data from it to
> + * complete the request.
> + */
> + if (length + extents[extents_pos] > limit) {
> + uint32_t stolen = limit - length;
> +
> + extents[extents_pos] -= stolen;
> + length += stolen;
> + break;
> + }
> +
> + /* Consume the entire extent and start looking at the next one. */
> + length += extents[extents_pos];
> + extents[extents_pos] = 0;
> +
> + if (extents_pos + 2 == extents_len)
> + break;
> +
> + extents_pos += 2;
> +
> + /* If next extent is different, we are done. */
> + if ((extents[extents_pos + 1] & LIBNBD_STATE_ZERO) != is_zero)
> + break;
> + }
> +
> + assert (length > 0 && length <= limit);
> +
> + r->offset = offset;
> + r->length = length;
> + r->zero = is_zero;
> +
> + DEBUG ("r%d: extent offset=%ld len=%ld zero=%d",
> + r->index, r->offset, r->length, r->zero);
> +
> + offset += length;
> +
> + if (extents_pos + 2 == extents_len && extents[extents_pos] == 0) {
> + /* Processed all extents, clear extents. */
> + DEBUG ("r%d: consumed all extents offset=%ld", r->index, offset);
> + free (extents);
> + extents = NULL;
> + extents_pos = 0;
> + extents_len = 0;
> + }
> +}
> +
> +static inline void
> +start_request_soon (struct request *r)
> +{
> + ev_timer_init (&r->watcher, start_request_cb, 0, 0);
> + ev_timer_start (loop, &r->watcher);
> +}
> +
> +static void
> +start_request_cb (struct ev_loop *loop, ev_timer *w, int revents)
> +{
> + struct request *r = (struct request *)w;
> + start_request (r);
> }
>
> /* Start async copy or zero request. */
> static void
> start_request(struct request *r)
> {
> - assert (offset < size);
> + /* Cancel the request if we are done. */
> + if (offset == size)
> + return;
>
> r->started = ev_now (loop);
> - r->length = MIN (REQUEST_SIZE, size - offset);
> - r->offset = offset;
>
> - start_read (r);
> -
> - offset += r->length;
> + /* If needed, get more extents from server. */
> + if (src.can_extents && extents == NULL && start_extents (r))
> + return;
> +
> + DEBUG ("r%d: start request offset=%ld", r->index, offset);
> +
> + if (src.can_extents) {
> + /* Handle the next extent. */
> + next_extent (r);
> + if (r->zero) {
> + if (dst.can_zero) {
> + start_zero (r);
> + } else {
> + memset (r->data, 0, r->length);
> + start_write (r);
> + }
> + } else {
> + start_read (r);
> + }
> + } else {
> + /* Extents not available. */
> + r->length = MIN (REQUEST_SIZE, size - offset);
> + r->offset = offset;
> + start_read (r);
> + offset += r->length;
> + }
> }
>
> static void
> @@ -240,9 +446,11 @@ request_completed (void *user_data, int *error)
> ev_break (loop, EVBREAK_ALL);
> }
>
> - /* If we have data to read, start a new read. */
> + /* If we have more work, start a new request on the next loop
> + * iteration, to avoid deadlock if we need to start a zero or write.
> + */
> if (offset < size)
> - start_request(r);
> + start_request_soon(r);
>
> return 1;
> }
> @@ -304,11 +512,19 @@ main (int argc, char *argv[])
>
> debug = getenv ("COPY_LIBEV_DEBUG") != NULL;
>
> + /* Configure source to report extents. */
> +
> + if (nbd_add_meta_context (src.nbd, LIBNBD_CONTEXT_BASE_ALLOCATION))
> + FAIL ("Cannot add base:allocation: %s", nbd_get_error ());
> +
> /* Connecting is fast, so use the synchronous API. */
>
> if (nbd_connect_uri (src.nbd, argv[1]))
> FAIL ("Cannot connect to source: %s", nbd_get_error ());
>
> + src.can_extents = nbd_can_meta_context (
> + src.nbd, LIBNBD_CONTEXT_BASE_ALLOCATION) > 0;
> +
> if (nbd_connect_uri (dst.nbd, argv[2]))
> FAIL ("Cannot connect to destination: %s", nbd_get_error ());
>
> --
> 2.26.2
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html