On Tue, Jul 27, 2021 at 12:16:59PM +0100, Richard W.M. Jones wrote:
Hi Eric, a couple of questions below about nbdkit performance.
Modular virt-v2v will use disk pipelines everywhere. The input
pipeline looks something like this:
socket <- cow filter <- cache filter <- nbdkit
                                        curl|vddk
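For concreteness, the nbdkit half of that pipeline can be assembled
along these lines (the plugin choice, URL and socket path here are
only illustrative, not virt-v2v's actual command line):

  $ nbdkit -U /tmp/in.sock --filter=cow --filter=cache \
      curl url=https://example.com/source-disk.img

(Filters listed first on the command line sit closest to the client,
so this matches the diagram above.)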
We found there's a notable slowdown in at least one case: when the
source plugin is very slow (e.g. the curl plugin talking to a slow,
remote website, or VDDK in general), everything runs very slowly.
I made a simple test case to demonstrate this:
$ virt-builder fedora-33
$ time ./nbdkit --filter=cache --filter=delay file /var/tmp/fedora-33.img \
    delay-read=500ms --run 'virt-inspector --format=raw -a "$uri" -vx'
This uses a local file with the delay filter on top, injecting
half-second delays into every read. It "feels" a lot like the slow case we
were observing. Virt-v2v also does inspection as a first step when
converting an image, so using virt-inspector is somewhat realistic.
Unfortunately this actually runs far too slowly for me to wait around
- at least 30 mins, and probably a lot longer. This compares to only
7 seconds if you remove the delay filter.
Reducing the delay to 50ms means at least it finishes in a reasonable time:
$ time ./nbdkit --filter=cache --filter=delay file /var/tmp/fedora-33.img \
    delay-read=50ms \
    --run 'virt-inspector --format=raw -a "$uri"'

real    5m16.298s
user    0m0.509s
sys     0m2.894s
Sounds like the reads are rather serialized (the application is not
proceeding to do a second read until after it has the result of the
first read) rather than highly parallel (where the application would
be reading multiple sites in the image at once, possibly by requesting
the start of a read at two different offsets before knowing which of
those two offsets is even useful). There's also a question of how
frequently a given portion of the disk image is re-read (caching will
speed things up if data is revisited multiple times, but just adds
overhead if the reads are truly once-only access for the life of the
process).
In the above scenario the cache filter is not actually doing anything
(since virt-inspector does not write). Adding cache-on-read=true lets
us cache the reads, avoiding going through the "slow" plugin in many
cases, and the result is a lot better:
$ time ./nbdkit --filter=cache --filter=delay file /var/tmp/fedora-33.img \
    delay-read=50ms cache-on-read=true \
    --run 'virt-inspector --format=raw -a "$uri"'

real    0m27.731s
user    0m0.304s
sys     0m1.771s
Okay, that sounds like there is indeed frequent re-reading of portions
of the disk (or at least reading of nearby smaller offsets that fall
within the same larger granularity used by the cache).
However this is still slower than the old method which used qcow2 +
qemu's copy-on-read. It's harder to demonstrate this, but I modified
virt-inspector to use the copy-on-read setting (which it doesn't do
normally). On top of nbdkit with 50ms delay and no other filters:
qemu + copy-on-read backed by nbdkit delay-read=50ms file:
real 0m23.251s
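Roughly the same setup can be reproduced by hand with guestfish,
which exposes the libguestfs copyonread drive flag directly (this is
only a sketch of the idea, with option spellings from memory, not the
actual virt-inspector modification):

  $ ./nbdkit --filter=delay file /var/tmp/fedora-33.img delay-read=50ms \
      --run 'guestfish add-drive "" format:raw protocol:nbd \
               server:"unix:$unixsocket" readonly:true copyonread:true \
               : run : inspect-os'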
qemu's copy-on-read creates a qcow2 image backed by a read-only base
image; any read that the qcow2 can't satisfy causes the entire cluster
to be read from the backing image into the qcow2 file, even if that
cluster is larger than what the client was actually reading. It will
benefit from the same speedups of only hitting a given region of the
backing file once in the life of the process.
But it also assumes the presence of a backing chain. If you try to
use copy-on-read on something that does not have a backing chain (such
as a direct use of an NBD link), the performance suffers (as we
discussed on IRC). My understanding is that for every read operation,
the COR code does a block status query to see whether the data was
local or came from the backing chain; but in the case of an NBD image
which does not have a backing chain from qemu's point of view, EVERY
block status operation comes back as being local, and the COR has
nothing further to do - so the performance penalty is because of the
extra time spent on that block status call, particularly if that
results in another round trip NBD command over the wire before any
reading happens.
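For illustration, the way to give the COR code a real backing chain
is to put a scratch qcow2 overlay on top of the NBD source, along the
lines of the following (the URI, socket path and file name are
illustrative):

  $ qemu-img create -f qcow2 -F raw \
      -b 'nbd+unix:///?socket=/tmp/nbd.sock' overlay.qcow2

and then open overlay.qcow2 with copy-on-read=on, so that unsatisfied
reads are copied into the overlay instead of hitting the no-op
block-status path described above.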
So 23s is the time to beat. (I believe that with longer delays, the
gap between qemu and nbdkit increases in favour of qemu.)
Q1: What other ideas could we explore to improve performance?
Have you played with block sizing? (Reading the git log, you have...)
Part of qemu's COR behavior is that for any read not found in the
qcow2 active layer, the entire cluster is copied up the backing chain;
a 512-byte client read becomes a 32k cluster read for the default
sizing. Other block sizes may be more efficient, such as 64k or 1M
per request actually sent over the wire.
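One way to experiment with that purely on the nbdkit side might be to
put the blocksize filter on the client side of the cache, so that
small reads are widened before they populate the cache. This is
untested, and the parameter value is just an example (the filter has
its own limits on minblock):

  $ time ./nbdkit --filter=blocksize --filter=cache --filter=delay \
      file /var/tmp/fedora-33.img \
      minblock=64K cache-on-read=true delay-read=50ms \
      --run 'virt-inspector --format=raw -a "$uri"'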
- - -
In real scenarios we'll actually want to combine cow + cache, where
cow is caching writes, and cache is caching reads.
socket <- cow filter <- cache filter <- nbdkit
                        cache-on-read=true  curl|vddk
The cow filter is necessary to prevent changes being written back to
the pristine source image.
This is actually surprisingly efficient, making no noticeable
difference in this test:
$ time ./nbdkit --filter=cow --filter=cache --filter=delay \
    file /var/tmp/fedora-33.img \
    delay-read=50ms cache-on-read=true \
    --run 'virt-inspector --format=raw -a "$uri"'

real    0m27.193s
user    0m0.283s
sys     0m1.776s
Q2: Should we consider a "cow-on-read" flag to the cow filter (thus
removing the need to use the cache filter at all)?
Since cow is already a form of caching (anything we touched now lives
locally, so we don't have to re-visit the original data source), yes,
it makes sense to have a cow-on-read mode that stores even reads
locally.
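For illustration, the pipeline could then drop the cache filter
entirely and look something like this, where cow-on-read=true is the
hypothetical new flag rather than anything the cow filter accepts
today:

  $ time ./nbdkit --filter=cow --filter=delay file /var/tmp/fedora-33.img \
      delay-read=50ms cow-on-read=true \
      --run 'virt-inspector --format=raw -a "$uri"'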
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization:  qemu.org | libvirt.org