Thoughts on nbdkit automatic reconnection

Wednesday, 18 September 2019

We have a running problem with the nbdkit VDDK plugin where the VDDK
side apparently disconnects or the network connection is interrupted.
During a virt-v2v conversion this causes the entire operation to fail,
and since v2v conversions take many hours that's not a happy outcome.

(Aside: I should say that we see many cases where it's claimed that
the connection was dropped, but often when we examine them in detail
the cause is something else.  But it seems like this disconnection
thing does happen sometimes.)

To put this isn't concrete terms which don't involve v2v, let's say
you were doing something like:

  nbdkit ssh host=remote /var/tmp/test.iso \
    --run 'qemu-img convert -p -f raw $nbd -O qcow2 test.qcow2'

which copies a file over ssh to local.  If /var/tmp/test.iso is very
large and/or the connection is very slow, and the network connection
is interrupted then the whole operation fails.  If nbdkit could
retry/reconnect on failure then the operation might succeed.

There are lots of parameters associated with retrying, eg:

 - how often should you retry before giving up?

 - how long should you wait between retries?

 - which errors should cause a retry, which are a hard failure?

So I had an idea we could implement this as a generic "retry" filter,
like:

  nbdkit ssh ... --filter=retry retries=5 retry-delay=5 retry-exponential=yes

This cannot be implemented with the current design of filters because
a filter would have to call the plugin .close and .open methods, but
filters don't have access to those from regular data functions, and in
any case this would cause a new plugin handle to be allocated.

We could probably do it if we added a special .reopen method to
plugins.  We could either require plugins which support the concept of
retrying to implement this, or we could have a generic implementation
in server/backend.c which would call .close, .open and cope with the
new handle.

Another way to do this would be to modify each plugin to add the
feature.  nbdkit-nbd-plugin has this for a very limited case, but no
others do, and it's quite complex to implement in plugins.  As far as
I can see it involves checking the return value of any data call that
the plugin makes and performing the reconnection logic, while not
changing the handle (so just calling self->close, self->open isn't
going to work).

If anyone has any thoughts about this I'd be happy to hear them.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009