On 3/20/19 6:50 AM, Richard W.M. Jones wrote:
On Wed, Mar 20, 2019 at 12:35:33AM -0500, Eric Blake wrote:
> On 3/19/19 11:34 AM, Richard W.M. Jones wrote:
>> This pair of calls allows plugins to describe which extents in the
>> virtual disk are allocated, holes or zeroes.
>> ---
>
>
>> + void nbdkit_extents_free (struct nbdkit_extents_map *);
>> +
>> +Frees an existing extents map.
>
> Is this like free() where it is safe to call on NULL?
Yes - otherwise CLEANUP_EXTENTS_FREE wouldn't work. Do we need to
document this?
Documenting it makes it obvious that this is valid:
myextents (...) {
nbdkit_extents_map *map2 = NULL;
if (allocate_something_else() < 0)
goto cleanup;
map2 = nbdkit_extents_new ();
...
cleanup:
nbdkit_extents_free (map2);
rather than requiring an 'if (map2)' in the cleanup code. On the other
hand, only filters (and our core code) call it, so maybe just a comment
in the source code is sufficient (we don't have to worry about
out-of-tree filters the way we do about out-of-tree plugins).
>> +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
>> +
>> +If this flag is included then the C<range_offset> and
C<range_length>
>> +parameters are used, so only extents overlapping
>> +C<[range_offset...range_offset+range_length-1]> are returned.
>
> Must range_offset+range_length-1 lie within the reported
> next_opts->get_size(), or can range extend beyond the end of the plugin
> (that is, can I pass range == -1 to get from a given offset to the end
> of the file, or do I have to compute the end range to avoid internal
> errors)?
Extent maps themselves are independent of the size of the plugin, and
the foreach function will let you put any range there you want.
However the server won't accept any NBD_CMD_BLOCK_STATUS request with
a range beyond the end of the export size (valid_range in
server/protocol.c), and so the core server code should not use a range
which is outside the file (block_status_final_map in the same file).
>From the point of view of filters that might call the foreach
function, they shouldn't choose a range which is outside the
offset/count passed to the layer below, since the layer below is only
guaranteed to set the extents correctly for that range (except in the
case of REQ_ONE where the rules are even more complicated
unfortunately).
Or even only the initial subset of the range.
>> +++ b/docs/nbdkit-plugin.pod
>
>> +The callback should detect and return the list of extents overlapping
>> +the range C<[offset...offset+count-1]>. The C<extents_map>
parameter
>> +points to an opaque object which the callback should fill in by
>> +calling C<nbdkit_extent_add> etc. See L</Extents map> below.
>> +
>> +The C<flags> parameter may contain the flag C<NBDKIT_FLAG_REQ_ONE>
>> +which means that the client is only requesting information about the
>> +extent overlapping C<offset>. The plugin may ignore this flag, or as
>> +an optimization it may return just a single extent for C<offset>.
>> +(Note that implementing this optimization correctly is difficult and
>> +subtle. You must ensure that the upper limit of the single extent
>> +returned is correct. If unsure then the safest option is to ignore
>> +this flag.)
>
> Indeed. Or in other words, since the extent map coalesces adjacent
> regions with the same type, merely calling nbdkit_extent_add(0) exactly
> once won't work, because that coalesces with the extent map's default of
> the entire image already being reported as data; so to report a single
> data extent that ends sooner than the client's requested range, it is
> necessary to report a different extent starting at the next byte.
Unfortunately the current implementation doesn't coalesce type = 0
with the "background" extent ... It probably should do to reduce
surprises [see below].
As you say though, rules around REQ_ONE are very complex - I
implemented it for the file plugin, but my initial implementation for
the VDDK plugin was incorrect and I removed it. Unfortunately because
VDDK QueryAllocatedBlocks is so slow we will need to implement this
optimization at some point and try to get it right.
The more I think about it, the more I like a fifth type END as an
explicit way to guarantee an extent beyond which the plugin did not
provide information, and that nbdkit should just do short replies when
encountering END. (The NBD protocol requires that NBD_CMD_BLOCK_STATUS
makes progress on success - it must either fail with an error, or report
at least one extent successfully - so if a plugin creates an END extent
directly on the requested offset, nbdkit has to treat that as an error
return back to the client; all other offsets for END are merely an early
termination hint).
> Technically, aren't gaps in the map treated as implicit
extents of type
> 0? Then again, I guess you return the next lower extent (rather than the
> gap) to make the coalescing nicer.
... but another possible way to implement the structure would be to
have an explicit extent covering every part of the range (either
[0 to INT64_MAX], or [offset to offset+length-1]). This would be a
bit like the common/range struct.
I'll see if that makes the implementation easier.
In other words, if nbdkit_extents_new() prepopulates an extent, then
there are no gaps to worry about, and you are always guaranteed to find
an existing extent that coincides with the offset given to
nbdkit_extents_add().
> For worst-case guest behavior, this can result in several memmove()
> calls during a single _add(). I guess the way to be more efficient is to
> use a different data structure rather than an array; I also suspect that
> we want to cater to the typical case of a plugin calling _add() in
> ascending order (so actions at the end of the list should be fast, no
> matter whether we use array or some other structure).
I believe the structure we really want for efficiency is:
https://en.wikipedia.org/wiki/Interval_tree
I tried implementing one of these a while back and it's very complex.
A full-fledged interval tree allows overlapping intervals; you at least
have the benefit that due to splitting/coalescing you always have a
linear list with no overlap. Still, a balanced binary heap with O(log n)
performance is probably better than an array with O(n) performance, if n
can get large. But as I said earlier, it's okay to get the interface
right with a slower (but easier) implementation, then later switch the
implementation without breaking the interface for speedups.
> The NBD protocol allows for short replies (whether or not
FLAG_ONE was
> requested) - should we have a fifth type that the plugin can call
> _add(END) as a way of specifically marking that they did not provide
> extent information beyond that point, and therefore nbdkit must do a
> short reply (instead of treating the rest of the range as data)? (For
> example, in qemu with the qcow2 file, a block status operation never
> reads more than one L2 cluster; requests that overlap regions of the
> disk that would require reading a second L2 cluster are instead
> truncated). The END type would not be transmitted over the NBD wire,
> but may make some other operations easier (such as your addressing your
> comment about a plugin honoring REQ_ONE having subtle semantics if it
> doesn't take coalescing of subsequent offsets into account).
OK I didn't know this about the protocol but will see if it can be
implemented.
The protocol also allows returning information beyond the end if it does
not create any more extents, but I think that part of the protocol is
harder to give as a contract to plugins; maybe you do it by
pre-allocating the map with two extents: the default for the range, and
an unlimited END extent immediately beyond the range, so that it is
obvious if the user overwrite END to provide more data.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization:
qemu.org |
libvirt.org