On Wed, Mar 20, 2019 at 12:35:33AM -0500, Eric Blake wrote:
On 3/19/19 11:34 AM, Richard W.M. Jones wrote:
> This pair of calls allows plugins to describe which extents in the
> virtual disk are allocated, holes or zeroes.
> ---
> + void nbdkit_extents_free (struct nbdkit_extents_map *);
> +
> +Frees an existing extents map.
Is this like free() where it is safe to call on NULL?
Yes - otherwise CLEANUP_EXTENTS_FREE wouldn't work. Do we need to
document this?
> +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
> +
> +If this flag is included then the C<range_offset> and C<range_length>
> +parameters are used, so only extents overlapping
> +C<[range_offset...range_offset+range_length-1]> are returned.
Must range_offset+range_length-1 lie within the reported
next_opts->get_size(), or can range extend beyond the end of the plugin
(that is, can I pass range == -1 to get from a given offset to the end
of the file, or do I have to compute the end range to avoid internal
errors)?
Extent maps themselves are independent of the size of the plugin, and
the foreach function will let you put any range there you want.
However the server won't accept any NBD_CMD_BLOCK_STATUS request with
a range beyond the end of the export size (valid_range in
server/protocol.c), and so the core server code should not use a range
which is outside the file (block_status_final_map in the same file).
From the point of view of filters that might call the foreach
function, they shouldn't choose a range which is outside the
offset/count passed to the layer below, since the layer below is only
guaranteed to set the extents correctly for that range (except in the
case of REQ_ONE where the rules are even more complicated
unfortunately).
> +++ b/docs/nbdkit-plugin.pod
> +The callback should detect and return the list of extents overlapping
> +the range C<[offset...offset+count-1]>. The C<extents_map> parameter
> +points to an opaque object which the callback should fill in by
> +calling C<nbdkit_extent_add> etc. See L</Extents map> below.
> +
> +The C<flags> parameter may contain the flag C<NBDKIT_FLAG_REQ_ONE>
> +which means that the client is only requesting information about the
> +extent overlapping C<offset>. The plugin may ignore this flag, or as
> +an optimization it may return just a single extent for C<offset>.
> +(Note that implementing this optimization correctly is difficult and
> +subtle. You must ensure that the upper limit of the single extent
> +returned is correct. If unsure then the safest option is to ignore
> +this flag.)
Indeed. Or in other words, since the extent map coalesces adjacent
regions with the same type, merely calling nbdkit_extent_add(0) exactly
once won't work, because that coalesces with the extent map's default of
the entire image already being reported as data; so to report a single
data extent that ends sooner than the client's requested range, it is
necessary to report a different extent starting at the next byte.
Unfortunately the current implementation doesn't coalesce type = 0
with the "background" extent ... It probably should do to reduce
surprises [see below].
As you say though, rules around REQ_ONE are very complex - I
implemented it for the file plugin, but my initial implementation for
the VDDK plugin was incorrect and I removed it. Unfortunately because
VDDK QueryAllocatedBlocks is so slow we will need to implement this
optimization at some point and try to get it right.
> +If there is an error, C<.extents> should call
C<nbdkit_error> with an
> +error message, and C<nbdkit_set_error> to record an appropriate error
> +(unless C<errno> is sufficient), then return C<-1>.
> +
> +=head3 Extents map
> +
> +The plugin C<extents> callback is passed an opaque pointer C<struct
> +nbdkit_extents_map *extents_map>. This structure represents a list of
> +L<filesystem
extents|https://en.wikipedia.org/wiki/Extent_(file_systems)>
> +describing which areas of the disk are allocated, which are sparse
> +(“holes”), and, if supported, which are zeroes. Initially the map
> +describes a single allocated data extent covering the entire virtual
> +disk.
> +
> +The C<extents> callback has to fill in any holes or zeroes areas by
> +calling C<nbdkit_extent*> functions, to describe all extents
> +overlapping the range C<[offset...offset+count-1]>. (nbdkit will
> +ignore extents outside this range so plugins may, if it is easier to
> +implement, return all extent information for the whole disk.)
Worth emphasizing that reporting DATA is always a safe fallback, and
that plugins should favor speed over precision? (If the time it takes
you to report ZERO or SPARSE is not significantly faster than the time
for .pread to just read the extent in question, then reporting DATA is
the better option).
Yes I'll add this observation.
> +++ b/server/extents.c
> +/* Trivial implementation as an ordered list of extents. We will
> + * probably have to change this to a more efficient data structure if
> + * we have plugins with lots of extents.
> + *
> + * Invariants:
> + * - extents are stored in ascending order
> + * - extents must not overlap
> + * - adjacent extents must not have the same type
> + *
Any invariant about not exceeding .get_size()? (I guess that's not
easily possible unless nbdkit_extents_new() were to take a parameter, as
the max size may differ for a filter compared to what the filter passes
to the plugin). I'm slightly worried about a malicious plugin that
purposefully slams us with add() calls way beyond the end of the image.
Maybe you want a magic type for marking EOF when the map is first
created, with different rules on how things coalesce? (Then again, if
the plugin crashes because it manages memory poorly, that's not really
the end of the world; we're more worried about a client triggering bad
behavior, not a plugin).
Hmm, what if we had the signature nbdkit_extents_new(offset, length) to
bound the extents we bother to track from the user - we still allow the
user to call nbdkit_extents_add() with information outside of the bounds
we care about, but merely return 0 on those cases without allocating any
memory? Then we are guaranteed that the guest cannot allocate more
memory than proportional to the size of the client's request (which is
currently capped at 32 bits by the protocol).
It sounds like it would make the implementation more complex ...
> +/* Find the extent before or containing offset. If offset is
inside
> + * the extent, the extent index is returned. If offset is not in any
> + * extent, but there is an extent with a lower offset, then the index
> + * of the nearest lower extent is returned. If offset is before the
> + * zeroth extent then -1 is returned.
Technically, aren't gaps in the map treated as implicit extents of type
0? Then again, I guess you return the next lower extent (rather than the
gap) to make the coalescing nicer.
... but another possible way to implement the structure would be to
have an explicit extent covering every part of the range (either
[0 to INT64_MAX], or [offset to offset+length-1]). This would be a
bit like the common/range struct.
I'll see if that makes the implementation easier.
For worst-case guest behavior, this can result in several memmove()
calls during a single _add(). I guess the way to be more efficient is to
use a different data structure rather than an array; I also suspect that
we want to cater to the typical case of a plugin calling _add() in
ascending order (so actions at the end of the list should be fast, no
matter whether we use array or some other structure).
I believe the structure we really want for efficiency is:
https://en.wikipedia.org/wiki/Interval_tree
I tried implementing one of these a while back and it's very complex.
I've cut some other comments about the structure because I want to see
if implementing this using an array which always contains extents
covering the full range is simpler.
[...]
The NBD protocol allows for short replies (whether or not FLAG_ONE
was
requested) - should we have a fifth type that the plugin can call
_add(END) as a way of specifically marking that they did not provide
extent information beyond that point, and therefore nbdkit must do a
short reply (instead of treating the rest of the range as data)? (For
example, in qemu with the qcow2 file, a block status operation never
reads more than one L2 cluster; requests that overlap regions of the
disk that would require reading a second L2 cluster are instead
truncated). The END type would not be transmitted over the NBD wire,
but may make some other operations easier (such as your addressing your
comment about a plugin honoring REQ_ONE having subtle semantics if it
doesn't take coalescing of subsequent offsets into account).
OK I didn't know this about the protocol but will see if it can be
implemented.
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines. Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v