On 04/10/2018 09:07 AM, Nir Soffer wrote:
On Tue, Apr 10, 2018 at 4:48 PM Kevin Wolf <kwolf(a)redhat.com>
wrote:
> Am 10.04.2018 um 15:03 hat Nir Soffer geschrieben:
>> On Tue, Apr 10, 2018 at 1:44 PM Richard W.M. Jones <rjones(a)redhat.com>
>> wrote:
>>
>>> We now have true zeroing support in oVirt imageio, thanks for that.
>>>
>>> However a problem is that ‘qemu-img convert’ issues zero requests for
>>> the whole disk before starting the transfer. It does this using 32 MB
>>> requests which take approx. 1 second each to execute on the oVirt side.
>>
>>
>>> Two problems therefore:
>>>
>>> (1) Zeroing the disk can take a long time (e.g. 40 GB is approx.
>>> 20 minutes).  Furthermore there is no progress indication while
>>> this is happening.
This is going to be true whether you write zeroes in 32M chunks
or in 2G chunks - it takes a long time to write actual zeroes to a block
device if you are unsure whether the device already contains zeroes.
There is more overhead in sending 64 requests of 32M each than one
request for 2G; the question is whether that's in the noise
(slightly more data sent over the wire) or impactful (because you have
to wait for more round trips, where the time spent waiting for traffic
is on par with the time spent writing zeroes for a single request).
The only way that a write zeroes request is not going to be slower than
a normal write is if the block device itself supports an efficient way
to guarantee that the sectors of the disk will read as zero (for
example, using things like WRITE_SAME on iSCSI devices).
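
As a concrete example of such an efficient path: a server backed by a
Linux block device could use the BLKZEROOUT ioctl, which asks the
kernel/device to make a byte range read back as zeroes without sending
a data payload. A minimal sketch (the ioctl number is _IO(0x12, 127);
verify it against linux/fs.h before relying on it):

import fcntl
import struct

BLKZEROOUT = 0x127F  # Linux-specific; check linux/fs.h

def zero_range(fd, offset, length):
    # The argument is a struct of two u64 fields: start offset and length.
    fcntl.ioctl(fd, BLKZEROOUT, struct.pack('@QQ', offset, length))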
>>>
>>
>>> Nothing bad happens: because it is making frequent requests there
>>> is no timeout.
>>>
>>> (2) I suspect that because we don't have trim support that this is
>>> actually causing the disk to get fully allocated on the target.
>>>
>>> The NBD requests are sent with may_trim=1 so we could turn these
>>> into trim requests, but obviously cannot do that while there is no
>>> trim support.
In fact, if a trim request guarantees that you can read back zeroes
regardless of what was previously on the block device, then that is
precisely what you SHOULD be doing to make write zeroes more efficient
(but only when may_trim=1).
>>>
>>
>> It sounds like nbdkit is emulating trim with zero instead of noop.
No, qemu-img is NOT requesting trim, it is requesting write zeroes. You
can implement write zeroes with a trim if the trim will read back as
zeroes. But while trim is advisory, write zeroes has mandatory
semantics on what you read back (may_trim is the determining factor in
whether the write MUST allocate or MAY trim: ignoring may_trim and
always allocating is semantically correct but may be slower, while
trimming is correct only when may_trim=1).
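
In an nbdkit Python plugin, that decision might look roughly like this
(a sketch only: punch_hole() and write_zeroes_fallback() are
hypothetical backend helpers, and holes are assumed to read back as
zero on this backend):

def zero(h, count, offset, may_trim):
    # may_trim=1: the client permits trimming, provided the range still
    # reads back as zeroes afterwards.  Holes in the backing file read
    # as zero, so punching a hole is the fast path.
    if may_trim:
        punch_hole(h, offset, count)             # hypothetical helper
    else:
        # may_trim=0: the write MUST allocate; fall back to real writes.
        write_zeroes_fallback(h, offset, count)  # hypothetical helper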
>>
>> I'm not sure what qemu-img is trying to do; I hope the NBD maintainer
>> on the qemu side can explain this.
>
> qemu-img tries to efficiently zero out the whole device at once so that
> it doesn't have to use individual small write requests for unallocated
> parts of the image later on.
At one point, there was a proposal to have the NBD protocol add
something where the server could advertise to the client if it is known
at initial connection time that the export is starting life with ALL
sectors zeroed. (Easy to prove for a just-created sparse file, a bit
harder to prove for a block device, although at least some iSCSI devices
do have queries to learn if the entire device is unallocated).
This has not yet been implemented in the NBD protocol, but may be worth
doing; it is slightly redundant with the NBD_CMD_BLOCK_STATUS that
qemu 2.12 is introducing (in that the client can perform that sort of
query itself rather than the server advertising it at initial
connection), but may be easy enough to implement even where
NBD_CMD_BLOCK_STATUS is difficult, and would still allow qemu-img to
operate more efficiently in some situations. qemu-img DOES know how to
skip zeroing a block device if it knows up front that the device
already reads as all zeroes, so the missing piece of information is
getting NBD to tell that to qemu-img.
Meanwhile, NBD_CMD_BLOCK_STATUS is still quite a ways from being
supported in nbdkit, so that's not anything that rhv-upload can exploit
any time soon.
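
In pseudocode, the copy loop qemu-img would like to run is roughly the
following (hypothetical method names, just to show where the missing
hint would fit):

def convert(src, dst):
    # If the destination is known to start life all zeroes (the hint
    # NBD cannot yet provide), skip the expensive up-front pass.
    if not dst.known_all_zeroes():         # hypothetical query
        dst.write_zeroes(0, dst.size)      # otherwise zero everything
    for offset, length in src.allocated_extents():
        dst.pwrite(src.pread(length, offset), offset)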
>
This makes sense if the device is backed by a block device on the oVirt
side, and the NBD server supports efficient zeroing. But in this case
the device is backed by an empty sparse file on NFS, and oVirt does not
yet support efficient zeroing; we just write zeros manually.
I think this should be handled on the virt-v2v plugin side. When zeroing
a raw file image, you can ignore zero requests after the highest write
offset, since the plugin created a new image, and we know that the image
is empty.
Didn't Rich already try to do that?
+def emulate_zero(h, count, offset):
+    # qemu-img convert starts by trying to zero/trim the whole device.
+    # Since we've just created a new disk it's safe to ignore these
+    # requests as long as they are smaller than the highest write seen.
+    # After that we must emulate them with writes.
+    if offset+count < h['highestwrite']:
Or is the problem that emulate_zero() is only being called if:
+    # Unlike the trim and flush calls, there is no 'can_zero' method
+    # so nbdkit could call this even if the server doesn't support
+    # zeroing.  If this is the case we must emulate.
+    if not h['can_zero']:
+        emulate_zero(h, count, offset)
+        return
rather than doing the 'highestwrite' check unconditionally even when
oVirt supports zero requests?
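
In other words, something along these lines (a sketch reusing the
handle keys quoted above; do_zero() is a hypothetical stand-in for the
plugin's native zero request to imageio):

def zero(h, count, offset, may_trim):
    # The plugin created this disk, so offsets at or above the highest
    # write so far still read as zero and the request can be skipped,
    # whether or not the server supports efficient zeroing.
    if offset >= h['highestwrite']:
        return
    if h['can_zero']:
        do_zero(h, count, offset)        # hypothetical native zero call
    else:
        emulate_zero(h, count, offset)   # fall back to writing zeroes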
When the destination is a block device we cannot avoid zeroing, since a
block device may contain junk data (we usually get dirty empty images
from our local XtremIO server).
And that's why qemu-img is starting life with write zeroes requests -
because it needs to guarantee that the image either already started as
all zeroes, or that zeroes are written to overwrite junk data.
> The problem is that the NBD block driver has max_pwrite_zeroes = 32 MB,
> so it's not that efficient after all. I'm not sure if there is a real
> reason for this, but Eric should know.
>
Yes, I do know. But it missed qemu 2.12; it's another NBD spec proposal
where I'm also going to submit a qemu patch:
https://lists.debian.org/nbd/2018/03/msg00017.html
Right now, the NBD protocol has no clean distinction between the
maximum size of a data request (a hard limit of 32M for NBD_CMD_WRITE
in qemu-img) and the maximum length of a request with no accompanying
data (NBD_CMD_WRITE_ZEROES). Once we add NBD_INFO_ZERO_SIZE, then it becomes
obvious that sending a 2G NBD_CMD_WRITE_ZEROES request makes sense, even
when 32M is the maximum for a normal write; but until that point, qemu
is being conservative and capping EVERYTHING to the 32M limit. There's
also talk about enhancing NBD to support larger than 4G by adding an
extension that permits 64-bit lengths, but that's further off in the
"nice idea, but not yet documented or implemented" category.
We support zero with unlimited size without sending any payload to
oVirt, so there is no reason to limit zero requests by
max_pwrite_zeroes. This limit may make sense when zero is emulated
using pwrite.
Even when write zeroes is emulated by falling back to pwrite, the pwrite
can be done in a loop (however, then you get into the game of whether
writing 2G of zeroes takes long enough that you really DO want to
enforce a write zero maximum smaller than 4G, if only to guarantee more
frequent traffic to avoid timing out).
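
A loop of that shape might look like the following sketch (the 32M
chunk size is just an assumed bound, chosen to keep any single request
short enough to avoid timeouts):

import os

MAX_CHUNK = 32 * 1024 * 1024  # assumed per-request cap

def emulate_write_zeroes(fd, offset, length):
    # Emulate one large write-zeroes request with bounded pwrite calls.
    while length > 0:
        n = min(length, MAX_CHUNK)
        os.pwrite(fd, b'\0' * n, offset)
        offset += n
        length -= n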
>
>> However, since you suggest that we could use "trim" request for these
>> requests, it means that these requests are advisory (since trim is), and
>> we can just ignore them if the server does not support trim.
>
> What qemu-img sends shouldn't be a NBD_CMD_TRIM request (which is indeed
> advisory), but a NBD_CMD_WRITE_ZEROES request. qemu-img relies on the
> image actually being zeroed after this.
>
So it seems that may_trim=1 is wrong, since trim cannot replace zero.
No, 'may_trim=1' means you may trim, IF you can guarantee that you can
read back as zero. If trim can't guarantee a read back as zero, then
may_trim=1 must be ignored and the server must do a write instead. The
client should always be able to request may_trim=1, whether or not the
server can actually do a trim as an optimization.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3266
Virtualization: qemu.org | libvirt.org