On 3/22/19 2:42 PM, Nir Soffer wrote:
>> Add a protocol flag and corresponding transmission advertisement flag
>> to make it easier for clients to inform the server of their intent. If
>> the server advertises NBD_FLAG_SEND_FAST_ZERO, then it promises two
>> things: to perform a fallback to write when the client does not
>> request NBD_CMD_FLAG_FAST_ZERO (so that the client benefits from the
>> lower network overhead); and to fail quickly with ENOTSUP if the
>> client requested the flag but the server cannot write zeroes more
>> efficiently than a normal write (so that the client is not penalized
>> with the time of writing data areas of the disk twice).
>>
> I think the issue is not that zeroing is as slow as a normal write, but
> that it is not fast enough to make zeroing the entire disk before
> writing data worthwhile.
In an image copy where you don't know if the destination already
started life all zero, you HAVE to copy zeros into the image for the
holes; the only question is whether also pre-filling the entire image
(with fewer calls) and then overwriting the prefill is faster than
just writing the data areas once. So there is a tradeoff to see how
much time you add with the overhead of lots of small-length
WRITE_ZEROES for the holes, vs. the overhead of one large-length
WRITE_ZEROES for the entire image. There's ALSO a factor of how much of
the image is holes vs. data - a pre-fill of only 10% of the image (which
is mostly sparse) is less wasteful than a pre-fill of 90% of the image
(which is mostly dense) - but that waste doesn't cost anything if
prefill is O(1) regardless of size; vs. being painful if it is O(n)
based on size. There are definitely heuristics at play, and I don't
know that the NBD spec can go into any strong advice on what type of
speedups are in play, only whether the write zero is on par with normal
writes.
And, given the uncertainties on what speedups (or slowdowns) a pre-fill
might cause, it DOES show that knowing if an image started life all zero
is an even better optimization, because then you don't have to waste any
time on overwriting holes. But having another way to speed things up
does not necessarily render this proposal as useless.
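To make the pre-fill tradeoff concrete, here's roughly the client
strategy this flag enables (a sketch only: every identifier except the
proposed flag name is a hypothetical stand-in, not an existing NBD
client API):

/* Sketch only: every identifier below except the proposed flag name is a
 * hypothetical stand-in, not an existing NBD client API. */
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define NBD_CMD_FLAG_FAST_ZERO (1u << 4)  /* bit value assumed for illustration */

struct nbd_conn;                 /* connection handle, details elided */
struct source;                   /* source image with extent information */

extern bool nbd_can_fast_zero(struct nbd_conn *nbd);
extern int nbd_write_zeroes(struct nbd_conn *nbd, uint64_t off,
                            uint64_t len, uint32_t flags);
extern int copy_allocated_extents(struct nbd_conn *nbd, struct source *src);
extern int zero_hole_extents(struct nbd_conn *nbd, struct source *src);

int copy_image(struct nbd_conn *nbd, struct source *src, uint64_t size)
{
    int r = -ENOTSUP;

    /* Attempt the whole-image pre-fill only if the server advertised
     * NBD_FLAG_SEND_FAST_ZERO, and ask for a fast failure. */
    if (nbd_can_fast_zero(nbd))
        r = nbd_write_zeroes(nbd, 0, size, NBD_CMD_FLAG_FAST_ZERO);

    if (r == 0)                  /* pre-fill was genuinely fast */
        return copy_allocated_extents(nbd, src);

    /* ENOTSUP (or no advertisement): pre-filling would cost as much as
     * writing, so write zeroes only for the holes, alongside the data. */
    if (copy_allocated_extents(nbd, src) < 0)
        return -1;
    return zero_hole_extents(nbd, src);
}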
>> Note that the Linux fallocate(2) interface may or may not be powerful
>> enough to easily determine if zeroing will be efficient - in
>> particular, FALLOC_FL_ZERO_RANGE in isolation does NOT give that
>> insight; for block devices, it is known that ioctl(BLKZEROOUT) does
>> NOT have a way for userspace to probe if it is efficient or slow. But
>> with enough demand, the kernel may add another FALLOC_FL_ flag to use
>> with FALLOC_FL_ZERO_RANGE, and/or appropriate ioctls with guaranteed
>> ENOTSUP failures if a fast path cannot be taken. If a server cannot
>> easily determine if write zeroes will be efficient, it is better off
>> not advertising NBD_FLAG_SEND_FAST_ZERO.
>>
> I think this can work for file based images. If fallocate() fails, the
> client will get ENOTSUP quickly after the first call.
The negative case is fast, but that doesn't say anything about the
positive case. Unless Linux adds a new FALLOC_FL_ bit, you have no
guarantee that a fallocate() reporting success didn't get there because
the kernel fell back to a slow write. If fallocate() comes
back quickly, you got lucky; but if it takes the full time of a write(),
you lost your window of opportunity to report ENOTSUP quickly. Hence,
my hope that the kernel folks add a new FALLOC_FL_ flag to give us the
semantics we want (of a guaranteed way to avoid slow fallbacks).
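To spell out what a file-backed server can learn today, something like
this (FALLOC_FL_ZERO_RANGE and EOPNOTSUPP are real; the
FALLOC_FL_NO_FALLBACK name in the comment is purely hypothetical,
invented here to show the semantics I'm hoping for):

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <linux/falloc.h>

/* Attempt a zero that the server could advertise as "fast".  Today the
 * negative answer is quick, but a success tells us nothing about speed. */
static int zero_range_fast(int fd, off_t off, off_t len)
{
    /* Hoped-for kernel extension (name invented for illustration):
     *   fallocate(fd, FALLOC_FL_ZERO_RANGE | FALLOC_FL_NO_FALLBACK, ...)
     * failing with EOPNOTSUPP whenever the fast path is unavailable, so
     * that success would be a guarantee. */
    if (fallocate(fd, FALLOC_FL_ZERO_RANGE, off, len) == 0)
        return 0;             /* fast?  slow fallback?  no way to tell */
    if (errno == EOPNOTSUPP)
        return -ENOTSUP;      /* the quick negative answer */
    return -errno;
}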
> For block devices we don't have any way to know if a fallocate() or
> BLKZEROOUT will be fast, so I guess servers will never advertise
> FAST_ZERO.
As I said, you don't know that with BLKZEROOUT, but the kernel might
give us another ioctl that DOES know.
> Generally this new flag's usefulness is limited. It will only help
> qemu-img to convert faster to file based images.
Limited use case is still a use case. If there are cases where you can
optimize by a simple extension to the protocol, and where either side
lacking the extension is not fatal to the protocol, then it is worth
doing. And so far, that is what this feels like to me.
> Do we have performance measurements showing significant speed up when
> zeroing the entire image before copying data, compared with zeroing
> only the unallocated ranges?
Kevin may have more of an idea based on the patches he wrote for
qemu-img, which spurred me into writing this proposal; maybe he can
share numbers for his testing on regular files and/or block devices to
at least get a feel for whether a speedup is likely with a sufficient
NBD server.
> For example if the best speedup we can get in a real world scenario is
> 2%, is it worth complicating the protocol and using another bit?
Gaining 2% of an hour may still be worth it.
>> + set. Servers SHOULD NOT set this transmission flag if there is no
>> + quick way to determine whether a particular write zeroes request
>> + will be efficient, but the lack of an efficient write zero
>>
> I think we should use "fast" instead of "efficient". For example, when
> the kernel falls back to manual zeroing it is probably the most
> efficient way it can be done, but it is not fast.
Seems like a simple enough wording change.
>> @@ -2114,6 +2151,7 @@ The following error values are defined:
>> * `EINVAL` (22), Invalid argument.
>> * `ENOSPC` (28), No space left on device.
>> * `EOVERFLOW` (75), Value too large.
>> +* `ENOTSUP` (95), Operation not supported.
>> * `ESHUTDOWN` (108), Server is in the process of being shut down.
>>
>> The server SHOULD return `ENOSPC` if it receives a write request
>> @@ -2125,6 +2163,10 @@ request is not aligned to advertised minimum block sizes. Finally, it
>> SHOULD return `EPERM` if it receives a write or trim request on a
>> read-only export.
>>
>> +The server SHOULD NOT return `ENOTSUP` except as documented in
>> +response to `NBD_CMD_WRITE_ZEROES` when `NBD_CMD_FLAG_FAST_ZERO` is
>> +supported.
>>
> This makes ENOTSUP less useful. I think it should be allowed to return
> ENOTSUP as a response to other commands if needed.
Sorry, but we have the problem of back-compat to worry about. Remember,
the error values permitted in the NBD protocol are system-agnostic (they
_happen_ to match Linux errno values, but not all the world uses the
same values for those errors in their libc, so portable implementations
HAVE to map between NBD_EINVAL sent over the wire and libc EINVAL used
internally, even if the mapping is 1:1 on Linux). Since the NBD
protocol has documented only a finite subset of valid errors, and
portable clients have to implement a mapping, it's very probable that
there exist clients written against the current NBD spec that will choke
hard (and probably hang up the connection) on receiving an unexpected
error number from the server which was not pre-compiled into their
mapping. ANY server that replies with ENOTSUP at the moment is in
violation of the existing server requirements, whether or not clients
have a high quality of implementation and manage to tolerate the
server's noncompliance.
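For illustration, the mapping such a portable client carries looks
something like the sketch below (the numeric values are the ones the
spec documents; the function itself is made up):

#include <errno.h>
#include <stdint.h>

/* Sketch of the wire-value-to-local-errno mapping a portable NBD client
 * has to carry; values are the ones documented by the spec. */
static int nbd_errno_to_local(uint32_t wire)
{
    switch (wire) {
    case 1:   return EPERM;
    case 5:   return EIO;
    case 12:  return ENOMEM;
    case 22:  return EINVAL;
    case 28:  return ENOSPC;
    case 75:  return EOVERFLOW;
    case 95:  return ENOTSUP;   /* new with this proposal */
    case 108: return ESHUTDOWN;
    default:  return EINVAL;    /* unknown wire value: a robust client
                                   picks a fallback instead of choking */
    }
}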
Thus, when we add new errno values as being valid returns, we have to
take care that servers SHOULD NOT send the new errno except to clients
that are prepared for the error - a server merely advertising
NBD_FLAG_SEND_FAST_ZERO is _still_ insufficient to give the server
rights to send ENOTSUP (since the server can't know if the client
recognized the advertisement, at least until the client finally sends a
NBD_CMD_FLAG_FAST_ZERO flag). (Note, I said SHOULD NOT, not MUST NOT -
if your server goofs and leaks ENOTSUP to a client on any other command,
most clients will still be okay, and so you probably won't have people
complaining that your server is broken. The only MUST NOT send ENOTSUP
is for the case where the server advertised FAST_ZERO probing and the
client did not request FAST_ZERO, because then the server has to assume the
client is relying on the server to do fallback handling for reduced
network traffic.)
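In code form, the server-side decision I'm describing is roughly the
following sketch (all helper names are made up; only the flag handling
matters):

#include <stdbool.h>
#include <stdint.h>

#define NBD_CMD_FLAG_FAST_ZERO (1u << 4)   /* bit value assumed for illustration */
#define NBD_ENOTSUP 95                     /* wire error value from the diff */

struct export;                             /* server back-end, details elided */

/* Hypothetical back-end hooks; a real server would wire these up to
 * fallocate(), BLKZEROOUT, and friends. */
extern bool can_zero_quickly(struct export *e, uint64_t off, uint32_t len);
extern uint32_t do_fast_zero(struct export *e, uint64_t off, uint32_t len);
extern uint32_t slow_write_zeroes(struct export *e, uint64_t off, uint32_t len);

static uint32_t handle_write_zeroes(struct export *e, uint64_t off,
                                    uint32_t len, uint16_t flags)
{
    if (can_zero_quickly(e, off, len))
        return do_fast_zero(e, off, len);

    if (flags & NBD_CMD_FLAG_FAST_ZERO)
        return NBD_ENOTSUP;                /* client opted in: fail fast */

    /* Client did not opt in, so it is relying on the server to do the
     * slow fallback rather than leak ENOTSUP. */
    return slow_write_zeroes(e, off, len);
}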
> I think this makes sense, and should work, but we need more data
> supporting that this is useful in practice.
Fair enough - since Kevin has already got patches proposed against qemu
to wire up a qemu flag BDRV_REQ_NO_FALLBACK, which should map in a
rather straightforward manner to my NBD proposal (any qemu request sent
with the BDRV_REQ_NO_FALLBACK bit set turns into an NBD_CMD_WRITE_ZEROES
with the NBD_CMD_FLAG_FAST_ZERO set), it should be pretty easy for me to
demonstrate a timing analysis of the proposed reference implementation,
to prove that it either makes a noticeable difference or is in the
noise. But it may be a couple of weeks before I work on a reference
implementation - even if Kevin's patches are qemu 4.0 material to fix a
speed regression, getting a new NBD protocol extension included during
feature freeze is too much of a stretch.
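For what it's worth, the client-side mapping should be little more than
this sketch (written from memory, not the actual qemu code; field names
and the nbd_co_request() call may differ in detail):

/* Sketch: inside qemu's NBD client, when the block layer hands down a
 * write-zeroes request.  Written from memory; not the real tree. */
int nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
                                int bytes, BdrvRequestFlags flags)
{
    NBDRequest request = {
        .type = NBD_CMD_WRITE_ZEROES,
        .from = offset,
        .len  = bytes,
    };

    if (!(flags & BDRV_REQ_MAY_UNMAP)) {
        request.flags |= NBD_CMD_FLAG_NO_HOLE;
    }
    if (flags & BDRV_REQ_NO_FALLBACK) {
        /* Only valid once the server advertised NBD_FLAG_SEND_FAST_ZERO. */
        request.flags |= NBD_CMD_FLAG_FAST_ZERO;
    }

    return nbd_co_request(bs, &request, NULL);
}

plus the usual check that the server actually advertised the
corresponding transmission flag before setting the bit.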
-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org