On Sun, Feb 13, 2022 at 11:41 AM Richard W.M. Jones <rjones@redhat.com> wrote:
> On Sat, Feb 12, 2022 at 10:49:42PM +0200, Nir Soffer wrote:
> > The rhv-upload plugin translates every NBD command to an HTTP request,
> > which is translated back to an NBD command on the imageio server. The
> > HTTP client and server, and the NBD client on the imageio server side,
> > are synchronous and implemented in Python, so they have high overhead
> > per request. To get good performance we need to use a larger request
> > size.
> >
> > Testing shows that a request size of 8MiB is best, speeding up the copy
> > disk phase from 14.7 seconds to 7.7 seconds (1.9x faster).
> Unfortunately this will break VDDK since it cannot handle very large
> requests (I think 4M is about the max without reconfiguring the
> server).
Are you sure it will break VDDK?

The request size limit is in the VDDK API, not in the nbdkit plugin. When
you send an 8M request, the VDDK plugin should allocate an 8M buffer and
issue multiple calls to the VDDK API, using VDDK's maximum request size,
to fill the buffer.

If the VDDK plugin does not do this, it is a bug in the plugin, since the
plugin must respect the underlying API.

If 8M does break the vddk plugin, we can use 4M; it is only a little
slower than 8M but still much faster than 256k.
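To illustrate the kind of splitting I mean, here is a minimal sketch in
Python-style pseudocode. The real VDDK plugin is written in C;
vddk_read() and MAX_VDDK_REQUEST are placeholders, not actual plugin
code, and 4M is just the limit Richard mentioned:

    # Python-style pseudocode; vddk_read() and MAX_VDDK_REQUEST are
    # placeholders for the real (C) VDDK calls and limit.
    MAX_VDDK_REQUEST = 4 * 1024 * 1024

    def pread(h, buf, offset, flags):
        # Fill one large NBD request with several VDDK-sized reads.
        done = 0
        while done < len(buf):
            n = min(MAX_VDDK_REQUEST, len(buf) - done)
            buf[done:done + n] = vddk_read(h, offset + done, n)
            done += n

The NBD request size seen by nbdcopy and the request size sent to the
underlying API do not have to be the same.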
> Also larger requests have adverse performance effects in
> other configurations, although I understand this patch tries to
> restrict the change to when the output mode is rhv-upload.
Yes, this affects only -o rhv-upload.
> We need to think of some other approach, but I'm not sure what it
> is.
> I'd really like to be able to talk to imageio's NBD server directly!
We have an RFE to implement a local nbd socket, which should be easy,
but the only attention we got so far was an attempt to close it ;-)

Even if we have a local-only nbd socket, let's say in oVirt 4.6, it will
not help existing users.
This looks like a nicer way to implement this change.
[...]
> > This is an ugly hack; the preferred request size should be a function of
> > the output module that only output_rhv_upload will override, but I don't
> > know how to implement this with the current code.
> Just add a new value to output/output.ml{,i}. There is no
> superclassing (this is not OO) so you'll have to add the value to
> every output module implementation, defaulting to None.
Sounds reasonable, we have only a few outputs.
However I'd like to think of another approach first.
> - Have nbdcopy split and combine requests so request size for input
> and output can be different? Sounds complicated but might be
> necessary one day to support minimum block size.
I think this is not needed for the case of large request size,
since on the nbd client/server side it is easy to handle large requests
regardless of the underlying API limits.
> - More efficient Python plugin that might combine requests? Also
> complicated ...
This will be slower and complicated.

First we need to solve the issue of getting the right connection to handle
the nbd command - now every command is handled by a random connection from
the pool.

Assuming we solved connection pooling, we can keep a buffer for the next
request and fill it on every pwrite(). When the buffer is full, or when
receiving a zero()/flush()/close() request, we need to send the buffer.
This introduces an additional copy which will slow down the transfer.

Finally there is the issue of reporting errors - since we buffer nbd
command data, we cannot report errors for every command; we need to report
errors later, either in the middle of a later command (when the buffer
becomes full) or when the next command is received.

So this will be slower, very hard to implement, and hard to debug.
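To make the problem concrete, here is a rough sketch of what such
buffering could look like in an nbdkit Python plugin. It assumes a
single dedicated connection per handle (which the current pooled design
does not give us), and send_put_request() is a hypothetical helper, not
the real rhv-upload plugin code:

    # Illustrative only - assumes one HTTP connection per NBD handle and
    # a hypothetical send_put_request() helper.
    BUF_SIZE = 8 * 1024 * 1024

    pending = bytearray()   # buffered data not yet sent to imageio
    pending_offset = 0      # disk offset of the first byte in pending

    def pwrite(h, buf, offset, flags):
        global pending, pending_offset
        if offset != pending_offset + len(pending):
            flush_pending()             # non-contiguous write, flush first
            pending_offset = offset
        pending += buf                  # extra copy on every request
        if len(pending) >= BUF_SIZE:
            flush_pending()             # earlier pwrite errors surface here

    def flush(h, flags):
        flush_pending()                 # zero() and close() need the same

    def flush_pending():
        global pending, pending_offset
        if pending:
            send_put_request(pending_offset, pending)
            pending_offset += len(pending)
            pending = bytearray()

Note that an error from send_put_request() surfaces only on some later
NBD command, so it cannot be attributed to the pwrite() that provided the
data - exactly the error reporting problem described above.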
The way we solved this in the imageio client - copying data between nbd
and http backends - is to implement the copy loop in the http client.
In imageio we have this (simplified) copy loop:

    for extent in extents:
        if extent.zero:
            dst.zero(extent)
        else:
            copy(src, dst, extent)
copy() checks if the src or dst can do an efficient copy, and if not it
falls back to a generic copy:

    if hasattr(dst, "read_from"):
        dst.read_from(src, extent)
    elif hasattr(src, "write_to"):
        src.write_to(dst, extent)
    else:
        for chunk in extent:
            data = src.read(chunk)
            dst.write(data)
The http backend implements read_from like this:

    send PUT request headers
    for chunk in extent:
        data = src.read(chunk)
        socket.write(data)

In this way we can use the most efficient request size for the nbd input
(256k) and for the http socket (256k is good), while minimizing the
number of http requests.
Of course this will be harder with the async API, but it is possible.
We can keep the pread requests in a FIFO queue, and send each
chunk in the right order. The Go aio_copy example does this:
https://gitlab.com/nbdkit/libnbd/-/blob/master/golang/examples/aio_copy/a...
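The same idea can be sketched in Python. This is illustrative only:
start_read() stands for issuing an async NBD pread and returning a
future-like object, and send_put_headers() for starting the HTTP PUT;
neither is a real libnbd or imageio call:

    # Illustrative sketch - start_read() and send_put_headers() are
    # hypothetical helpers, not real libnbd or imageio APIs.
    from collections import deque

    QUEUE_DEPTH = 4
    CHUNK_SIZE = 256 * 1024

    def copy_extent(nbd_src, http_sock, offset, length):
        # Send the PUT headers once, then stream the data in order while
        # keeping several reads in flight.
        send_put_headers(http_sock, offset, length)
        inflight = deque()
        next_offset = offset
        end = offset + length
        while inflight or next_offset < end:
            # Keep the queue full.
            while next_offset < end and len(inflight) < QUEUE_DEPTH:
                n = min(CHUNK_SIZE, end - next_offset)
                inflight.append(start_read(nbd_src, next_offset, n))
                next_offset += n
            # Wait for the oldest read so chunks go out in FIFO order.
            req = inflight.popleft()
            http_sock.sendall(req.result())

The deque preserves the order in which reads were issued, so the HTTP
body is written sequentially even though several reads are in flight.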
This can be done in nbdcopy by adding http-ops. The pipeline in the
rhv-upload case will be:

nbdkit (any input) -> nbdcopy (nbd-ops/http-ops) -> imageio server
Another way is to drop nbdcopy for rhv-upload - if we provide nbd input,
we can use the imageio client to read from the nbd input and write to the
imageio server. The pipeline will be:

nbdkit (any input) -> imageio client -> imageio server

This means that the v2v output can optionally do the copy phase,
if it knows how to read from the nbdkit input.
Yet another way is to rewrite imageio in Go and use the libnbd async API,
which will eliminate most of the overhead of the http requests. I started
this work on the imageio client side. You can check what we have so far
here:
https://github.com/oVirt/ovirt-imageio/tree/master/ovirt-img
For the long term, I think improving nbdcopy to support additional
input/output backends is the right way and will make it more useful
for other purposes.
For now I will post a better v2. It will be nice to have this in RHEL 9.0.
Nir