On Wed, May 22, 2019 at 09:21:30PM -0500, Eric Blake wrote:
> On 5/22/19 4:50 AM, Richard W.M. Jones wrote:
> > If not using multi-conn then obviously the synchronous connection
> > calls ‘nbd_connect_unix’, ‘nbd_connect_tcp’ and ‘nbd_connect_command’
> > should only return when the (one) connection object is connected.
> >
> > In the multi-conn case it's not very clear what these synchronous
> > calls should do. Previously I had it so that they would return as
> > soon as at least one connection was connected. However this is a
> > problem if you are using them as a convenient way to set up a
> > multi-threaded main loop, because it can be that some of them have not
> > finished connecting, but then you issue commands on those connections
> > and that will fail. The failure is pernicious because most of the
> > time you won't see it, only if one connection is slow. So it's
> > (probably) better that the synchronous ‘nbd_connect_unix’ and
> > ‘nbd_connect_tcp’ should connect every connection object before
> > returning.
> >
> > For ‘nbd_connect_command’, it essentially ignored multi-conn anyway,
> > and I changed it so that it waits for conn[0] to get connected and
> > returns, the other connections (if they exist) being ignored. It
> > should probably be an error for the user to enable multi-conn on the
> > handle and then use nbd_connect_command, but I did not make that
> > change yet.
> Ouch - I just realized that we created a different problem. If the
> server does not add multi-conn because it does NOT permit parallel
> connections (such as 'nbdkit --filter=noparallel memory 1m
> serialize=connections'), then a client requesting
> nbd_set_multi_conn(nbd, 2) will now hang because we will never be able
> to move all our connections through handshake. While we can still try to
> fire off multiple connections at once, we need to add some logic in
> nbd_connect_unix() and nbd_connect_tcp() to check can_multi_conn after
> the FIRST connection completes (whichever one wins), and if
> !can_multi_conn(), it would be nice if we could automatically kill off
> the losers, swap the winning connection into conns[0], and then behave
> as if we implicitly called set_multi_conn(nbd, 1), so that the
> synchronous connect command will succeed after all. In such a case, the
> user can learn after the fact via nbd_can_multi_conn and
> nbd_get_multi_conn that things were changed because multi_conn was not
> supported after all.
It's somewhat unavoidable given the way that the NBD protocol itself
works. While it's a pain, below is the logic I came up with in the
forthcoming nbdtool. nr_multi_conn is a command-line parameter
selecting the number of connections, with the desired behaviour being
to fall back to a single connection if multi-conn is not supported.
Also this code will become shorter and simpler when we get URL support:
static void
connect_to (struct nbd_handle *nbd, char *arg)
{
  char *p = strchr (arg, ':');
  int r;

  if (p) {
    *p = '\0';
    r = nbd_connect_tcp (nbd, arg, p+1);
  }
  else
    r = nbd_connect_unix (nbd, arg);
  if (r == -1) NBD_ERROR_EXIT;

  /* Check if the connected handle supports multi-conn, and enable or
   * disable it.
   */
  if (nr_multi_conn == 1)
    return;
  if (nbd_can_multi_conn (nbd) == 1) {
    if (nbd_set_multi_conn (nbd, nr_multi_conn) == -1)
      NBD_ERROR_EXIT;
    /* Now make the other connections on the handle. */
    if (p)
      r = nbd_connect_tcp (nbd, arg, p+1);
    else
      r = nbd_connect_unix (nbd, arg);
    if (r == -1) NBD_ERROR_EXIT;
  }
  else
    nr_multi_conn = 1;
}
(In fact it was this code which provoked the error that required this
fix:
https://www.redhat.com/archives/libguestfs/2019-May/msg00134.html )
However that still leaves this problem:
> Or maybe we want both 'set_multi_conn' (connect hard-fails if the FIRST
> connection reports !can_multi_conn) vs. 'request_multi_conn' (connect
> downgrades gracefully), similar to how request_meta_context downgrades
> gracefully if the server lacks SR or a particular meta context.
IOW:
* What if the user sets multi-conn > 1, does NOT check can_multi_conn
after connecting, and the server allows multiple connections but
does not give the true multi-conn guarantees?
That results in a data corruption issue. So I think that connections
should fail if we get to the point where we have read the eflags,
multi-conn is not set, but h->multi_conn > 1. I've put this on my
to-do list.
Requesting multi-conn optionally is harder: We could adjust
h->multi_conn == 1, swap the winning connection object into
h->conns[0] and close the others, but I think that's very tricksy.
> On IRC we were chatting about how multi-conn > 1 makes no sense with
> connect_command() (where we are limited to exactly one channel over
> stdin/stdout), so it may also be worth adding some logic to have
> connect_command fail if set_multi_conn is > 1,
Also added to my to-do list.
> and to likewise have
> set_multi_conn(2) fail if the existing connection is already started in
> command mode (do we even track which mode a connection started in, and/or
> insist that all connections used the same nbd_connect_*? Or can you
> mix-and-match a TCP and a Unix socket into a multi-conn struct nbd_handle?)
We don't track any of this, and yes it's possible.
> On the other hand, a user that is aware of multi-conn being optional may
> manually try to make one connection, check can_multi_conn, and THEN call
> set_multi_conn(2), with some expectation of being able to then connect
> the remaining slots (again, if we stored the nbd_connect_* used by the
> FIRST connection, then all the remaining connections requested could
> automatically be fired off to start connecting to the same server - and
> you'd need a synchronous way to wait for them all to reach stable
> state). Perhaps that's even a reason to have two separate APIs - one
> that is callable only while you are in created state (a request of what
> to try if the server supports it), and the other callable only after
> your single connection has completed handshake (to then expand to
> actually get the additional connections now that it is known to be
> safe). At any rate, I don't think we're done with the design in this area.
That's what the code does above.
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top