Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.

Wednesday, 20 March 2019

On Wed, Mar 20, 2019 at 12:35:33AM -0500, Eric Blake wrote:
...
 On 3/19/19 11:34 AM, Richard W.M. Jones wrote:
 > This pair of calls allows plugins to describe which extents in the
 > virtual disk are allocated, holes or zeroes.
 > ---

 > + void nbdkit_extents_free (struct nbdkit_extents_map *);
 > +
 > +Frees an existing extents map.

 Is this like free() where it is safe to call on NULL? 
Yes - otherwise CLEANUP_EXTENTS_FREE wouldn't work.  Do we need to
document this?

...
 > +=item C<NBDKIT_EXTENTS_FOREACH_FLAG_RANGE>
 > +
 > +If this flag is included then the C<range_offset> and C<range_length>
 > +parameters are used, so only extents overlapping
 > +C<[range_offset...range_offset+range_length-1]> are returned.

 Must range_offset+range_length-1 lie within the reported
 next_opts->get_size(), or can range extend beyond the end of the plugin
 (that is, can I pass range == -1 to get from a given offset to the end
 of the file, or do I have to compute the end range to avoid internal
 errors)? 
Extent maps themselves are independent of the size of the plugin, and
the foreach function will let you put any range there you want.

However the server won't accept any NBD_CMD_BLOCK_STATUS request with
a range beyond the end of the export size (valid_range in
server/protocol.c), and so the core server code should not use a range
which is outside the file (block_status_final_map in the same file).

...
From the point of view of filters that might call the foreach
function, they shouldn't choose a range which is outside the
offset/count passed to the layer below, since the layer below is only
guaranteed to set the extents correctly for that range (except in the
case of REQ_ONE where the rules are even more complicated
unfortunately).

...
 > +++ b/docs/nbdkit-plugin.pod

 > +The callback should detect and return the list of extents overlapping
 > +the range C<[offset...offset+count-1]>.  The C<extents_map> parameter
 > +points to an opaque object which the callback should fill in by
 > +calling C<nbdkit_extent_add> etc.  See L</Extents map> below.
 > +
 > +The C<flags> parameter may contain the flag C<NBDKIT_FLAG_REQ_ONE>
 > +which means that the client is only requesting information about the
 > +extent overlapping C<offset>.  The plugin may ignore this flag, or as
 > +an optimization it may return just a single extent for C<offset>.
 > +(Note that implementing this optimization correctly is difficult and
 > +subtle.  You must ensure that the upper limit of the single extent
 > +returned is correct.  If unsure then the safest option is to ignore
 > +this flag.)

 Indeed. Or in other words, since the extent map coalesces adjacent
 regions with the same type, merely calling nbdkit_extent_add(0) exactly
 once won't work, because that coalesces with the extent map's default of
 the entire image already being reported as data; so to report a single
 data extent that ends sooner than the client's requested range, it is
 necessary to report a different extent starting at the next byte. 
Unfortunately the current implementation doesn't coalesce type = 0
with the "background" extent ...  It probably should do to reduce
surprises [see below].

As you say though, rules around REQ_ONE are very complex - I
implemented it for the file plugin, but my initial implementation for
the VDDK plugin was incorrect and I removed it.  Unfortunately because
VDDK QueryAllocatedBlocks is so slow we will need to implement this
optimization at some point and try to get it right.

...
 > +If there is an error, C<.extents> should call
C<nbdkit_error> with an
 > +error message, and C<nbdkit_set_error> to record an appropriate error
 > +(unless C<errno> is sufficient), then return C<-1>.
 > +
 > +=head3 Extents map
 > +
 > +The plugin C<extents> callback is passed an opaque pointer C<struct
 > +nbdkit_extents_map *extents_map>.  This structure represents a list of
 > +L<filesystem extents|https://en.wikipedia.org/wiki/Extent_(file_systems)>
 > +describing which areas of the disk are allocated, which are sparse
 > +(“holes”), and, if supported, which are zeroes.  Initially the map
 > +describes a single allocated data extent covering the entire virtual
 > +disk.
 > +
 > +The C<extents> callback has to fill in any holes or zeroes areas by
 > +calling C<nbdkit_extent*> functions, to describe all extents
 > +overlapping the range C<[offset...offset+count-1]>.  (nbdkit will
 > +ignore extents outside this range so plugins may, if it is easier to
 > +implement, return all extent information for the whole disk.)

 Worth emphasizing that reporting DATA is always a safe fallback, and
 that plugins should favor speed over precision?  (If the time it takes
 you to report ZERO or SPARSE is not significantly faster than the time
 for .pread to just read the extent in question, then reporting DATA is
 the better option). 
Yes I'll add this observation.

...
 > +++ b/server/extents.c

 > +/* Trivial implementation as an ordered list of extents.  We will
 > + * probably have to change this to a more efficient data structure if
 > + * we have plugins with lots of extents.
 > + *
 > + * Invariants:
 > + * - extents are stored in ascending order
 > + * - extents must not overlap
 > + * - adjacent extents must not have the same type
 > + *

 Any invariant about not exceeding .get_size()?  (I guess that's not
 easily possible unless nbdkit_extents_new() were to take a parameter, as
 the max size may differ for a filter compared to what the filter passes
 to the plugin).  I'm slightly worried about a malicious plugin that
 purposefully slams us with add() calls way beyond the end of the image.
 Maybe you want a magic type for marking EOF when the map is first
 created, with different rules on how things coalesce?  (Then again, if
 the plugin crashes because it manages memory poorly, that's not really
 the end of the world; we're more worried about a client triggering bad
 behavior, not a plugin).

 Hmm, what if we had the signature nbdkit_extents_new(offset, length) to
 bound the extents we bother to track from the user - we still allow the
 user to call nbdkit_extents_add() with information outside of the bounds
 we care about, but merely return 0 on those cases without allocating any
 memory?  Then we are guaranteed that the guest cannot allocate more
 memory than proportional to the size of the client's request (which is
 currently capped at 32 bits by the protocol). 
It sounds like it would make the implementation more complex ...

...
 > +/* Find the extent before or containing offset.  If offset is
inside
 > + * the extent, the extent index is returned.  If offset is not in any
 > + * extent, but there is an extent with a lower offset, then the index
 > + * of the nearest lower extent is returned.  If offset is before the
 > + * zeroth extent then -1 is returned.

 Technically, aren't gaps in the map treated as implicit extents of type
 0? Then again, I guess you return the next lower extent (rather than the
 gap) to make the coalescing nicer. 
... but another possible way to implement the structure would be to
have an explicit extent covering every part of the range (either
[0 to INT64_MAX], or [offset to offset+length-1]).  This would be a
bit like the common/range struct.

I'll see if that makes the implementation easier.

...
 For worst-case guest behavior, this can result in several memmove()
 calls during a single _add(). I guess the way to be more efficient is to
 use a different data structure rather than an array; I also suspect that
 we want to cater to the typical case of a plugin calling _add() in
 ascending order (so actions at the end of the list should be fast, no
 matter whether we use array or some other structure). 
I believe the structure we really want for efficiency is:

https://en.wikipedia.org/wiki/Interval_tree

I tried implementing one of these a while back and it's very complex.

I've cut some other comments about the structure because I want to see
if implementing this using an array which always contains extents
covering the full range is simpler.

[...]

...
 The NBD protocol allows for short replies (whether or not FLAG_ONE
was
 requested) - should we have a fifth type that the plugin can call
 _add(END) as a way of specifically marking that they did not provide
 extent information beyond that point, and therefore nbdkit must do a
 short reply (instead of treating the rest of the range as data)?  (For
 example, in qemu with the qcow2 file, a block status operation never
 reads more than one L2 cluster; requests that overlap regions of the
 disk that would require reading a second L2 cluster are instead
 truncated).  The END type would not be transmitted over the NBD wire,
 but may make some other operations easier (such as your addressing your
 comment about a plugin honoring REQ_ONE having subtle semantics if it
 doesn't take coalescing of subsequent offsets into account). 
OK I didn't know this about the protocol but will see if it can be
implemented.

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

Re: [Libguestfs] [PATCH nbdkit 1/9] server: Implement extents/can_extents calls for plugins and filters.