nbdcopy currently reads the extent map from the source (in pieces).
This extent map is a flat, contiguous array describing data and
holes.
The size of those regions is whatever the source gives us, and we
don't do any further processing on it[1]. Importantly it is unrelated
to the block size of the destination.
This map drives the work loop (multi-thread-copying.c). A hole starts
an asynchronous zero command immediately. Data causes an asynchronous
read of the source, and when that command completes (other commands
are running at the same time in the same thread) we do sparseness
detection and, based on that, issue one or more asynchronous write and
asynchronous zero commands to the destination.
As a further complication there are several threads working in
parallel on large blocks (128M) of the source.
The result of all this is that the destination driver (eg. nbd-ops.c)
sees a mix of zero and write requests, mostly out of order, and with
no particular block size. [This is different from what I said before -
I wrongly said that the driver always saw requests in order.]
It should never write to the same byte in the output twice, and each
byte should be written (or zeroed) exactly once.
My proposal is to forward all requests that are aligned to the
preferred block size straight to the destination. Non-aligned requests
that contain whole blocks have to be split at block boundaries into
fragment + aligned part + fragment subrequests, and the aligned part
is forwarded too.
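
To make the split concrete, here is a rough sketch of how one request
could be cut at block boundaries. None of this exists in nbdcopy: the
struct split type and the split_request function are hypothetical
names, and blksize stands for the destination's preferred block size.
It also assumes offset + len does not overflow.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical type: one request cut at destination block boundaries.
 * A length of 0 means that part is absent.
 */
struct split {
  uint64_t head_offset, head_len;   /* leading fragment (partial block) */
  uint64_t mid_offset, mid_len;     /* whole blocks, forwarded directly */
  uint64_t tail_offset, tail_len;   /* trailing fragment (partial block) */
};

static struct split
split_request (uint64_t offset, uint64_t len, uint64_t blksize)
{
  struct split s = { 0 };
  uint64_t end = offset + len;
  uint64_t first_aligned, last_aligned;

  assert (blksize > 0);
  assert (end >= offset);       /* assumption: no overflow */

  /* Round the start up and the end down to a block boundary. */
  first_aligned = (offset + blksize - 1) / blksize * blksize;
  last_aligned = end / blksize * blksize;

  if (first_aligned > last_aligned) {
    /* The request lies strictly inside one block: a single fragment. */
    s.head_offset = offset;
    s.head_len = len;
    return s;
  }

  s.head_offset = offset;
  s.head_len = first_aligned - offset;
  s.mid_offset = first_aligned;
  s.mid_len = last_aligned - first_aligned;
  s.tail_offset = last_aligned;
  s.tail_len = end - last_aligned;
  return s;
}
```

With a 4096 byte block size, a request at offset 1000 of length 10000
would give a 3096 byte head fragment, one whole block at offset 4096,
and a 2808 byte tail fragment at offset 8192.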
For the fragments it's tricky. My first idea was to calloc a block
and copy each fragment's data into this block (that's what I meant
when I was talking about read-modify-write), but it's hard to know
when a block is complete. (Bitmap?)
I think what we can do instead is to save the fragments, indexed by
destination block number and offset. It should be relatively easy to
test, when a new fragment arrives, whether we have all of the
fragments for a particular block, so we can then write out the whole
block to the destination.
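
One possible shape for the fragment store is sketched below. It is
only an illustration: struct fragment, struct pending_block,
save_fragment and assemble_block are hypothetical names, and a linear
list stands in for a real index. Since no byte is ever written twice,
the sketch assumes that a block is complete once the lengths of its
saved fragments add up to the block size, so no bitmap is needed. A
NULL data pointer is used here to mean a zero fragment.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct fragment {
  struct fragment *next;
  uint64_t off;                 /* offset within the block */
  uint64_t len;
  const void *data;             /* NULL means this fragment is zeroes */
};

struct pending_block {
  struct pending_block *next;
  uint64_t blknum;              /* destination block number */
  uint64_t bytes;               /* bytes covered by saved fragments */
  struct fragment *frags;
};

/* Find or create the entry for blknum (linear search for brevity). */
static struct pending_block *
get_block (struct pending_block **head, uint64_t blknum)
{
  struct pending_block *b;

  for (b = *head; b != NULL; b = b->next)
    if (b->blknum == blknum)
      return b;
  b = calloc (1, sizeof *b);
  if (b == NULL)
    return NULL;
  b->blknum = blknum;
  b->next = *head;
  *head = b;
  return b;
}

/* Save one fragment which must not cross a block boundary.
 * Returns 1 if its block is now complete, 0 if not, -1 on allocation
 * failure.
 */
static int
save_fragment (struct pending_block **head, uint64_t blksize,
               uint64_t offset, uint64_t len, const void *data)
{
  struct pending_block *b;
  struct fragment *f;

  assert (len > 0 && offset % blksize + len <= blksize);
  b = get_block (head, offset / blksize);
  f = malloc (sizeof *f);
  if (b == NULL || f == NULL) {
    free (f);
    return -1;
  }
  f->off = offset % blksize;
  f->len = len;
  f->data = data;
  f->next = b->frags;
  b->frags = f;
  b->bytes += len;
  assert (b->bytes <= blksize); /* relies on no byte being written twice */
  return b->bytes == blksize;
}

/* Copy the saved fragments into buf, which must hold one block. */
static void
assemble_block (const struct pending_block *b, void *buf)
{
  const struct fragment *f;

  for (f = b->frags; f != NULL; f = f->next) {
    if (f->data != NULL)
      memcpy ((char *) buf + f->off, f->data, f->len);
    else
      memset ((char *) buf + f->off, 0, f->len);
  }
}
```

The sketch leaves out freeing completed entries, and it compares
against the full block size, so a short final block would need its
own expected length.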
Particular attention also has to be paid to the final block which
might be a short block (or can it not be?). Also if we get to the end
and discover we have fragments left over then it would seem to
indicate a bug in the program.
I've not yet implemented any of this. Given how much other stuff I've
got to do it seems like RHEL 9.1 material.
Rich.
[1] We really ought to coalesce adjacent regions of the same type, and
unconditionally throwing away very small holes (eg. < 4K) might be
worth doing in case the source is not well-behaved.
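
A rough sketch of what that could look like, not taken from the
nbdcopy sources: struct region and coalesce_regions are hypothetical
names, and min_hole is an assumed threshold parameter. Holes shorter
than min_hole are treated as data, and adjacent regions of the same
type are then merged in place.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region type: one entry of the extent map. */
struct region {
  uint64_t offset;
  uint64_t length;
  bool zero;                    /* true = hole, false = data */
};

/* Merge adjacent regions of the same type in place, treating holes
 * shorter than min_hole as data.  Returns the new number of regions.
 */
static size_t
coalesce_regions (struct region *r, size_t n, uint64_t min_hole)
{
  size_t i, out = 0;

  for (i = 0; i < n; ++i) {
    struct region e = r[i];

    if (e.zero && e.length < min_hole)
      e.zero = false;           /* too small to be worth a zero request */

    if (out > 0 && r[out-1].zero == e.zero &&
        r[out-1].offset + r[out-1].length == e.offset)
      r[out-1].length += e.length;   /* extend the previous region */
    else
      r[out++] = e;
  }
  return out;
}
```

Calling it once with min_hole = 0 and then again with min_hole = 4096
avoids throwing away a large hole that the source happened to report
as several small adjacent pieces.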
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones