Splitting up virt-v2v
by Richard W.M. Jones
For a long time I've wanted to split up virt-v2v into smaller
components to make it easier to consume. It's never been clear how to
do this, but I think I have a workable plan now, described in this email.
----------------------------------------------------------------------
First, the AIMS, which are:
(a) Preserve current functionality, including copying conversion,
in-place conversion, and the virt-v2v command line.
(b) Allow warm migration to use virt-v2v without requiring the
"--debug-overlays hack".
(c) Allow threads, multi-conn, and parallel copying of guest disks, all
for better copying performance.
(d) Allow an alternate supervisor to convert and copy many guests in
parallel, given that the supervisor has a global view of the
system/network (I'm not intending to implement this, only to make
it possible).
(e) Better progress bars.
(f) Better logging.
(g) Reuse as much existing code as possible. This is NOT a rewrite!
----------------------------------------------------------------------
Here's my PLAN:
/usr/bin/virt-v2v still exists, but it's now a supervisor program
(possibly even a shell script) that runs the steps below:
(1) Set up the input side by running "helper-v2v-input-<type>". For
all input types this creates a temporary directory containing:
/tmp/XXXXXX/in1          NBD endpoints overlaying the source disk(s)
/tmp/XXXXXX/in2          (these are actually Unix domain sockets)
/tmp/XXXXXX/in3
/tmp/XXXXXX/metadata.in  Metadata parsed from the source.
Currently for most inputs we have a running nbdkit process for
each source disk, and we'd do the same here, except we add
nbdkit-cow-filter on top so that the source disk is protected from
being modified. Another small difference is that for -i disk
(local input) we would need an active nbdkit process on top of the
disk, whereas currently we set the disk as a qcow2 backing file.
(2) Perform the conversion by running "helper-v2v-convert". This does
the conversion and sparsification. It writes directly to the NBD
endpoints (in*) above. The writes are stored in the COW overlay
so the source disk is not modified.
Conversion will also create an output metadata file:
/tmp/XXXXXX/metadata.out Target metadata
Exact format of the metadata files is to be decided, but some kind
of not-quite-libvirt-XML may be suitable. It's also not clear if
the metadata format is an internal detail of virt-v2v, or if we
document it as a stable API.
(3) Set up the output side by running "helper-v2v-output-<type>
setup". This will read the output metadata and do whatever is
needed to set up the empty output disks (perhaps by creating a
guest on the target, but also this could be done in step (5)
below).
This will create:
/tmp/XXXXXX/out1         NBD endpoints overlaying the target disk(s)
/tmp/XXXXXX/out2         (these are actually Unix domain sockets)
/tmp/XXXXXX/out3
(4) Do the copy. By default this will run either nbdcopy or qemu-img
convert from in* -> out*.
Copying could be done in parallel; currently it is done serially.
(5) Finalize the output by running "helper-v2v-output-<type> final".
This might create the target guest and whatever else is needed.
(6) Kill the NBD servers and clean up the temporary directory.
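The six steps above could be driven by a supervisor as small as the following shell sketch. It is only an illustration. The helper names come from the plan, but their arguments (the temporary directory, the "setup"/"final" subcommand placement), the `supervise` and `run` functions, the `DRY_RUN` variable and the `*.pid` files used for cleanup are all assumptions made for this example. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a supervisor for steps (1)-(6).  The helper arguments and
# the pid-file convention are assumptions, not a decided interface.

DRY_RUN=${DRY_RUN:-1}

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# supervise INPUT_TYPE OUTPUT_TYPE TMPDIR NDISKS
# Returns the number of the step that failed, or 0 on success.
supervise() {
    itype=$1 otype=$2 dir=$3 ndisks=$4

    run "helper-v2v-input-$itype" "$dir" || return 1          # (1)
    run helper-v2v-convert "$dir" || return 2                 # (2)
    run "helper-v2v-output-$otype" setup "$dir" || return 3   # (3)

    i=1                                                       # (4)
    while [ "$i" -le "$ndisks" ]; do
        # Serial copy, one disk at a time, as virt-v2v does today.
        run nbdcopy "nbd+unix:///?socket=$dir/in$i" \
                    "nbd+unix:///?socket=$dir/out$i" || return 4
        i=$((i + 1))
    done

    run "helper-v2v-output-$otype" final "$dir" || return 5   # (5)

    # (6) Assumes each helper left a pid file for its NBD server.
    for pidfile in "$dir"/*.pid; do
        if [ -e "$pidfile" ]; then
            run kill "$(cat "$pidfile")"
        fi
    done
    run rm -rf "$dir"
}
```

Because each step maps to a distinct return code, a caller can tell which stage failed without parsing any logs. Note that this sketch returns early on failure and leaves cleanup to the caller.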
----------------------------------------------------------------------
Let's see how this plan matches the aims.
Aim (a):
Copying conversion works as outlined above. In-place conversion
works by placing an NBD server on top of the files you want to
convert and running helper-v2v-convert (virt-v2v --in-place would
also still work for backwards compat).
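The input-side difference between copying and in-place conversion can be sketched as a choice of nbdkit command line. This is a guess at how it might look, not a settled design: the `nbd_server_cmd` function and its mode names are made up for the example, and it only prints the command rather than starting a server. It assumes the file plugin is used for local disks and that in-place conversion simply omits the cow filter so that writes reach the file.

```shell
#!/bin/sh
# nbd_server_cmd MODE DISK SOCKET
# Print (not run) an nbdkit command line that serves DISK on the Unix
# domain socket SOCKET.  MODE names are hypothetical.
nbd_server_cmd() {
    mode=$1 disk=$2 socket=$3
    case $mode in
    copy)
        # Protect the source: writes land in a temporary COW overlay.
        echo "nbdkit -U $socket --filter=cow file $disk" ;;
    in-place)
        # No overlay: conversion writes go straight through to the file.
        echo "nbdkit -U $socket file $disk" ;;
    *)
        echo "unknown mode: $mode" >&2
        return 1 ;;
    esac
}
```

For example, `nbd_server_cmd copy disk.img /tmp/XXXXXX/in1` prints the command with `--filter=cow`, while `nbd_server_cmd in-place disk.img /tmp/XXXXXX/in1` prints it without.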
Aim (b):
Warm migration: It should be fairly clear that this can work in the
same way as in-place conversion, but I'll discuss this further with
Martin K and Tomas to make sure I'm not missing anything.
Aims (c), (d):
Threads etc for performance: Although I don't plan to implement
this, it's clear that an alternate supervisor program could improve
performance here, either by copying the disks of a single guest in
parallel or, even better, by using a global view of the system to
copy multiple guests' disks in parallel.
This is outside the scope of the virt-v2v project, but in scope for
something like MTV.
Aim (e):
Better progress bars: nbdcopy should have support for
machine-readable progress bars, once I push the changes. It will
mean no more need to parse debug logs.
Aim (f):
Better logging: I hope we can log each step separately.
A custom supervisor program would also be able to tell which
particular step failed (e.g. did conversion fail? Did copying a disk
fail, and if so, which one?)
Aim (g):
This works by splitting up the existing v2v code base into separate
binaries. It is already broadly structured (internally) like this.
So it's not a rewrite, it's a big refactoring.
However I'd probably write a new virt-v2v supervisor binary, because
the existing command line parsing code is extremely complex.
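As a rough illustration of aims (c) and (f) together, an alternate supervisor could start one copy per disk in the background and then report which disk, if any, failed. In this sketch `copy_all` is a hypothetical function, and the copy command is passed in as a parameter so that it could be nbdcopy or anything else that takes a source and a destination NBD URI.

```shell
#!/bin/sh
# copy_all COPY_CMD DIR NDISKS
# Start one copy per disk in parallel (in$i -> out$i), then wait for
# each and print a per-disk result.  Returns 1 if any copy failed.
copy_all() {
    copy_cmd=$1 dir=$2 ndisks=$3

    pids=""
    i=1
    while [ "$i" -le "$ndisks" ]; do
        "$copy_cmd" "nbd+unix:///?socket=$dir/in$i" \
                    "nbd+unix:///?socket=$dir/out$i" &
        pids="$pids $!"
        i=$((i + 1))
    done

    # Collect exit statuses in the same order the copies were started.
    failed=0
    i=1
    for pid in $pids; do
        if wait "$pid"; then
            echo "disk $i: ok"
        else
            echo "disk $i: FAILED"
            failed=1
        fi
        i=$((i + 1))
    done
    return "$failed"
}
```

A real supervisor would call something like `copy_all nbdcopy /tmp/XXXXXX 3`. One with a global view of the system would also want to limit how many copies run at once, which this sketch does not attempt.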
Rich.
--
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top
[PATCH] v2v: support configuration of viosock driver
by Valeriy Vdovin
Check that the install_drivers function has copied the viosock driver
files to the Windows guest file system. If it has, the drivers have
passed the minor/major version check and the guest is able to use them.
Once we know that the drivers are on the guest, we can enable the
virtio socket option in the configuration and start-up script files.
Signed-off-by: Valeriy Vdovin <valeriy.vdovin(a)virtuozzo.com>
---
v2v/convert_linux.ml | 1 +
v2v/convert_windows.ml | 4 +++-
v2v/create_json.ml | 1 +
v2v/create_libvirt_xml.ml | 6 ++++++
v2v/linux_kernels.ml | 4 ++++
v2v/linux_kernels.mli | 1 +
v2v/output_qemu.ml | 4 ++++
v2v/types.ml | 1 +
v2v/types.mli | 2 +-
v2v/windows_virtio.ml | 5 +++--
v2v/windows_virtio.mli | 2 +-
11 files changed, 26 insertions(+), 5 deletions(-)
diff --git a/v2v/convert_linux.ml b/v2v/convert_linux.ml
index 86d387f1..9f22fe3c 100644
--- a/v2v/convert_linux.ml
+++ b/v2v/convert_linux.ml
@@ -154,6 +154,7 @@ let convert (g : G.guestfs) inspect source_disks output rcaps _ =
gcaps_virtio_rng = kernel.ki_supports_virtio_rng;
gcaps_virtio_balloon = kernel.ki_supports_virtio_balloon;
gcaps_isa_pvpanic = kernel.ki_supports_isa_pvpanic;
+ gcaps_virtio_socket = kernel.ki_supports_virtio_socket;
gcaps_machine = machine;
gcaps_arch = Utils.kvm_arch inspect.i_arch;
gcaps_acpi = acpi;
diff --git a/v2v/convert_windows.ml b/v2v/convert_windows.ml
index b452c09b..7842f443 100644
--- a/v2v/convert_windows.ml
+++ b/v2v/convert_windows.ml
@@ -214,7 +214,8 @@ let convert (g : G.guestfs) inspect _ output rcaps static_ips =
video_driver,
virtio_rng_supported,
virtio_ballon_supported,
- isa_pvpanic_supported =
+ isa_pvpanic_supported,
+ virtio_socket_supported =
Registry.with_hive_write g inspect.i_windows_system_hive
update_system_hive in
@@ -256,6 +257,7 @@ let convert (g : G.guestfs) inspect _ output rcaps static_ips =
gcaps_virtio_rng = virtio_rng_supported;
gcaps_virtio_balloon = virtio_ballon_supported;
gcaps_isa_pvpanic = isa_pvpanic_supported;
+ gcaps_virtio_socket = virtio_socket_supported;
gcaps_machine = machine;
gcaps_arch = Utils.kvm_arch inspect.i_arch;
gcaps_acpi = true;
diff --git a/v2v/create_json.ml b/v2v/create_json.ml
index fdf7b12f..316a5536 100644
--- a/v2v/create_json.ml
+++ b/v2v/create_json.ml
@@ -229,6 +229,7 @@ let create_json_metadata source targets target_buses
"virtio-rng", JSON.Bool guestcaps.gcaps_virtio_rng;
"virtio-balloon", JSON.Bool guestcaps.gcaps_virtio_balloon;
"isa-pvpanic", JSON.Bool guestcaps.gcaps_isa_pvpanic;
+ "virtio-socket", JSON.Bool guestcaps.gcaps_virtio_socket;
"acpi", JSON.Bool guestcaps.gcaps_acpi;
] in
List.push_back doc ("guestcaps", JSON.Dict guestcaps_dict);
diff --git a/v2v/create_libvirt_xml.ml b/v2v/create_libvirt_xml.ml
index 212ace2d..6a764cb2 100644
--- a/v2v/create_libvirt_xml.ml
+++ b/v2v/create_libvirt_xml.ml
@@ -521,6 +521,12 @@ let create_libvirt_xml ?pool source targets target_buses guestcaps
e "address" ["type", "isa"; "iobase", "0x505"] []
]
);
+ List.push_back devices (
+ e "viosock"
+ ["model",
+ if guestcaps.gcaps_virtio_socket then "virtio" else "none"]
+ []
+ );
(* Standard devices added to every guest. *)
List.push_back_list devices [
diff --git a/v2v/linux_kernels.ml b/v2v/linux_kernels.ml
index 7e171eae..6dead217 100644
--- a/v2v/linux_kernels.ml
+++ b/v2v/linux_kernels.ml
@@ -44,6 +44,7 @@ type kernel_info = {
ki_supports_virtio_rng : bool;
ki_supports_virtio_balloon : bool;
ki_supports_isa_pvpanic : bool;
+ ki_supports_virtio_socket : bool;
ki_is_xen_pv_only_kernel : bool;
ki_is_debug : bool;
ki_config_file : string option;
@@ -246,6 +247,8 @@ let detect_kernels (g : G.guestfs) inspect family bootloader =
kernel_supports "virtio_balloon" "VIRTIO_BALLOON" in
let supports_isa_pvpanic =
kernel_supports "pvpanic" "PVPANIC" in
+ let supports_virtio_socket =
+ kernel_supports "virtio_socket" "VIRTIO_SOCKET" in
let is_xen_pv_only_kernel =
check_config "X86_XEN" config_file ||
check_config "X86_64_XEN" config_file in
@@ -272,6 +275,7 @@ let detect_kernels (g : G.guestfs) inspect family bootloader =
ki_supports_virtio_rng = supports_virtio_rng;
ki_supports_virtio_balloon = supports_virtio_balloon;
ki_supports_isa_pvpanic = supports_isa_pvpanic;
+ ki_supports_virtio_socket = supports_virtio_socket;
ki_is_xen_pv_only_kernel = is_xen_pv_only_kernel;
ki_is_debug = is_debug;
ki_config_file = config_file;
diff --git a/v2v/linux_kernels.mli b/v2v/linux_kernels.mli
index 028eba81..fe81a036 100644
--- a/v2v/linux_kernels.mli
+++ b/v2v/linux_kernels.mli
@@ -33,6 +33,7 @@ type kernel_info = {
ki_supports_virtio_rng : bool; (** Kernel supports virtio-rng? *)
ki_supports_virtio_balloon : bool; (** Kernel supports memory balloon? *)
ki_supports_isa_pvpanic : bool; (** Kernel supports ISA pvpanic device? *)
+ ki_supports_virtio_socket : bool; (** Kernel supports virtio-socket? *)
ki_is_xen_pv_only_kernel : bool; (** Is a Xen paravirt-only kernel? *)
ki_is_debug : bool; (** Is debug kernel? *)
ki_config_file : string option; (** Path of config file, if found. *)
diff --git a/v2v/output_qemu.ml b/v2v/output_qemu.ml
index be3a3c5e..d6d70c23 100644
--- a/v2v/output_qemu.ml
+++ b/v2v/output_qemu.ml
@@ -247,6 +247,10 @@ object
arg "-balloon" "none";
if guestcaps.gcaps_isa_pvpanic then
arg_list "-device" ["pvpanic"; "ioport=0x505"];
+ if guestcaps.gcaps_virtio_socket then
+ arg "-viosock" "virtio"
+ else
+ arg "-viosock" "none";
(* Add a serial console to Linux guests. *)
if inspect.i_type = "linux" then
diff --git a/v2v/types.ml b/v2v/types.ml
index a8949e4b..4c7ee864 100644
--- a/v2v/types.ml
+++ b/v2v/types.ml
@@ -411,6 +411,7 @@ type guestcaps = {
gcaps_virtio_rng : bool;
gcaps_virtio_balloon : bool;
gcaps_isa_pvpanic : bool;
+ gcaps_virtio_socket : bool;
gcaps_machine : guestcaps_machine;
gcaps_arch : string;
gcaps_acpi : bool;
diff --git a/v2v/types.mli b/v2v/types.mli
index f474dcaa..42a80d9d 100644
--- a/v2v/types.mli
+++ b/v2v/types.mli
@@ -252,7 +252,7 @@ type guestcaps = {
gcaps_virtio_rng : bool; (** Guest supports virtio-rng. *)
gcaps_virtio_balloon : bool; (** Guest supports virtio balloon. *)
gcaps_isa_pvpanic : bool; (** Guest supports ISA pvpanic device. *)
-
+ gcaps_virtio_socket : bool; (** Guest supports virtio socket. *)
gcaps_machine : guestcaps_machine; (** Machine model. *)
gcaps_arch : string; (** Architecture that KVM must emulate. *)
gcaps_acpi : bool; (** True if guest supports acpi. *)
diff --git a/v2v/windows_virtio.ml b/v2v/windows_virtio.ml
index 74a43cc7..cf417a45 100644
--- a/v2v/windows_virtio.ml
+++ b/v2v/windows_virtio.ml
@@ -68,7 +68,7 @@ let rec install_drivers ((g, _) as reg) inspect rcaps =
match net_type with
| Some model -> model
| None -> RTL8139 in
- (IDE, net_type, Cirrus, false, false, false)
+ (IDE, net_type, Cirrus, false, false, false, false)
)
else (
(* Can we install the block driver? *)
@@ -178,9 +178,10 @@ let rec install_drivers ((g, _) as reg) inspect rcaps =
let virtio_rng_supported = g#exists (driverdir // "viorng.inf") in
let virtio_ballon_supported = g#exists (driverdir // "balloon.inf") in
let isa_pvpanic_supported = g#exists (driverdir // "pvpanic.inf") in
+ let virtio_socket_supported = g#exists (driverdir // "viosock.inf") in
(block, net, video,
- virtio_rng_supported, virtio_ballon_supported, isa_pvpanic_supported)
+ virtio_rng_supported, virtio_ballon_supported, isa_pvpanic_supported, virtio_socket_supported)
)
and install_linux_tools g inspect =
diff --git a/v2v/windows_virtio.mli b/v2v/windows_virtio.mli
index c063af3f..642317b1 100644
--- a/v2v/windows_virtio.mli
+++ b/v2v/windows_virtio.mli
@@ -20,7 +20,7 @@
val install_drivers
: Registry.t -> Types.inspect -> Types.requested_guestcaps ->
- Types.guestcaps_block_type * Types.guestcaps_net_type * Types.guestcaps_video_type * bool * bool * bool
+ Types.guestcaps_block_type * Types.guestcaps_net_type * Types.guestcaps_video_type * bool * bool * bool * bool
(** [install_drivers reg inspect rcaps]
installs virtio drivers from the driver directory or driver
ISO into the guest driver directory and updates the registry
--
2.27.0
[nbdkit PATCH v2 0/6] Add multi-conn filter
by Eric Blake
In v2:
- lots more patches to allow cross-connection plugin calls
- added patch for exportname filtering
- working testsuite, including bug fixes in v1 that it uncovered
- more cross-references in man pages
Eric Blake (6):
backend: Split out new backend_handle_* functions
Revert "filters: Remove most next_* wrappers"
Revert "server: filters: Remove struct b_h."
filters: Track next handle alongside next backend
multi-conn: New filter
multi-conn: Add knob to limit consistency emulation by export name
docs/nbdkit-filter.pod | 4 +-
filters/cache/nbdkit-cache-filter.pod | 5 +-
filters/fua/nbdkit-fua-filter.pod | 7 +
.../multi-conn/nbdkit-multi-conn-filter.pod | 201 ++++++++
filters/nocache/nbdkit-nocache-filter.pod | 1 +
filters/noextents/nbdkit-noextents-filter.pod | 1 +
.../noparallel/nbdkit-noparallel-filter.pod | 1 +
filters/nozero/nbdkit-nozero-filter.pod | 1 +
include/nbdkit-filter.h | 145 +++---
configure.ac | 4 +-
filters/multi-conn/Makefile.am | 68 +++
tests/Makefile.am | 13 +-
server/internal.h | 60 ++-
server/backend.c | 147 +++++-
server/extents.c | 7 +-
server/filters.c | 424 +++++++++++++---
filters/checkwrite/checkwrite.c | 2 +-
filters/exitlast/exitlast.c | 2 +-
filters/exitwhen/exitwhen.c | 2 +-
filters/gzip/gzip.c | 2 +-
filters/limit/limit.c | 6 +-
filters/log/log.c | 2 +-
filters/multi-conn/multi-conn.c | 470 ++++++++++++++++++
filters/tar/tar.c | 2 +-
tests/test-multi-conn-name.sh | 88 ++++
tests/test-multi-conn-plugin.sh | 141 ++++++
tests/test-multi-conn.sh | 293 +++++++++++
TODO | 7 -
28 files changed, 1921 insertions(+), 185 deletions(-)
create mode 100644 filters/multi-conn/nbdkit-multi-conn-filter.pod
create mode 100644 filters/multi-conn/Makefile.am
create mode 100644 filters/multi-conn/multi-conn.c
create mode 100755 tests/test-multi-conn-name.sh
create mode 100755 tests/test-multi-conn-plugin.sh
create mode 100755 tests/test-multi-conn.sh
--
2.30.1
[RFC nbdkit PATCH] multi-conn: New filter
by Eric Blake
Implement a TODO item of emulating multi-connection consistency via
multiple plugin flush calls to allow a client to assume that a flush
on a single connection is good enough. This also gives us some
fine-tuning over whether to advertise the bit, including some setups
that are unsafe but may be useful in timing tests.
Testing is interesting: I used the sh plugin to implement a server
that intentionally keeps a per-connection cache.
Note that this filter assumes that multiple connections will still
share the same data (other than caching effects); effects are not
guaranteed when trying to mix it with more exotic plugins like info
that violate that premise.
---
I'm still working on the test; the sh plugin is good enough that it
does what I want when playing with it manually, but I still need to
write up various scenarios in test-multi-conn.sh to match what I've
played with manually.
I'm open to feedback on the set of options I've exposed during .config
(too many, not enough, better names?) Right now, it is:
multi-conn-mode=auto|plugin|disable|emulate|unsafe
multi-conn-track-dirty=fast|connection|off
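To give a feel for why the track-dirty knob matters, here is a toy model of how many plugin flushes result in multi-conn-mode=emulate when a client writes once and then flushes on every connection itself. It is a simplified reading of the documentation in this patch, not the filter's actual code: `count_flushes` is a made-up function, it covers only the "off" and "fast" levels, and it ignores reads and FUA writes.

```shell
#!/bin/sh
# count_flushes TRACK NCONNS
# Toy model, mode=emulate: one write dirties the image, then the client
# flushes once on each of NCONNS connections.  Prints the number of
# flush calls that reach the plugin.
count_flushes() {
    track=$1 nconns=$2
    dirty=1      # the write that preceded the first flush
    total=0
    c=1
    while [ "$c" -le "$nconns" ]; do
        case $track in
        off)
            # Every client flush is replicated to every connection.
            total=$((total + nconns)) ;;
        fast)
            # Replicate only while something is dirty; after that all
            # connections count as clean and later flushes are ignored.
            if [ "$dirty" -eq 1 ]; then
                total=$((total + nconns))
                dirty=0
            fi ;;
        *)
            echo "unsupported track level: $track" >&2
            return 1 ;;
        esac
        c=$((c + 1))
    done
    echo "$total"
}
```

Under this model, `count_flushes off 4` prints 16 (the quadratic case) and `count_flushes fast 4` prints 4.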
.../multi-conn/nbdkit-multi-conn-filter.pod | 169 +++++++
configure.ac | 4 +-
filters/multi-conn/Makefile.am | 68 +++
tests/Makefile.am | 11 +-
filters/multi-conn/multi-conn.c | 467 ++++++++++++++++++
tests/test-multi-conn-plugin.sh | 121 +++++
tests/test-multi-conn.sh | 85 ++++
TODO | 7 -
8 files changed, 923 insertions(+), 9 deletions(-)
create mode 100644 filters/multi-conn/nbdkit-multi-conn-filter.pod
create mode 100644 filters/multi-conn/Makefile.am
create mode 100644 filters/multi-conn/multi-conn.c
create mode 100755 tests/test-multi-conn-plugin.sh
create mode 100755 tests/test-multi-conn.sh
diff --git a/filters/multi-conn/nbdkit-multi-conn-filter.pod b/filters/multi-conn/nbdkit-multi-conn-filter.pod
new file mode 100644
index 00000000..ae2873df
--- /dev/null
+++ b/filters/multi-conn/nbdkit-multi-conn-filter.pod
@@ -0,0 +1,169 @@
+=head1 NAME
+
+nbdkit-multi-conn-filter - nbdkit multi-conn filter
+
+=head1 SYNOPSIS
+
+ nbdkit --filter=multi-conn plugin \
+ [multi-conn-mode=MODE] [multi-conn-track-dirty=LEVEL] [plugin-args...]
+
+=head1 DESCRIPTION
+
+C<nbdkit-multi-conn-filter> is a filter that enables alterations to
+the server's advertisement of NBD_FLAG_MULTI_CONN. When a server
+permits multiple simultaneous clients, and sets this flag, a client
+may assume that all connections see a consistent view (after getting
+the server reply from a write in one connection, sending a flush
+command on a single connection and waiting for that reply then
+guarantees that all connections will then see the just-written data).
+If the flag is not advertised, a client must presume that separate
+connections may have utilized independent caches, so that a flush on
+one connection does not affect the cache of a second connection.
+
+The main use of this filter is to emulate consistent semantics across
+multiple connections when not already provided by a plugin, although
+it also has additional modes useful for evaluating performance and
+correctness of client and plugin multi-conn behaviors. This filter
+assumes that multiple connections to a plugin will eventually share
+data, other than any caching effects; it is not suitable for use with
+a plugin that produces completely independent data per connection.
+
+Additional control over the behavior of client flush commands is
+possible by combining this filter with L<nbdkit-fua-filter(1)>.
+
+=head1 PARAMETERS
+
+=over 4
+
+=item B<multi-conn-mode=auto>
+
+This filter defaults to B<auto> mode. If the plugin advertises
+multi-conn, then this filter behaves the same as B<plugin> mode;
+otherwise, this filter behaves the same as B<emulate> mode. Either
+way, this mode advertises NBD_FLAG_MULTI_CONN to the client.
+
+=item B<multi-conn-mode=emulate>
+
+When B<emulate> mode is chosen, then this filter tracks all parallel
+connections. When a client issues a flush command over any one
+connection (including a write command with the FUA (force unit access)
+flag set), the filter then replicates that flush across each
+connection to the plugin (although the number of plugin calls can be
+tuned by adjusting B<multi-conn-track-dirty>). This assumes that
+flushing each connection is enough to clear any per-connection cached
+data, in order to give each connection a consistent view of the image;
+therefore, this mode advertises NBD_FLAG_MULTI_CONN to the client.
+
+Note that in this mode, a client will be unable to connect if the
+plugin lacks support for flush, as there would be no way to emulate
+cross-connection consistency.
+
+=item B<multi-conn-mode=disable>
+
+When B<disable> mode is chosen, this filter disables advertisement of
+NBD_FLAG_MULTI_CONN to the client, even if the plugin supports it, and
+does not replicate flush commands across connections. This is useful
+for testing whether a client with multiple connections properly sends
+multiple flushes in order to overcome per-connection caching.
+
+=item B<multi-conn-mode=plugin>
+
+When B<plugin> mode is chosen, the filter does not change whether
+NBD_FLAG_MULTI_CONN is advertised by the plugin, and does not
+replicate flush commands across connections; but still honors
+B<multi-conn-track-dirty> for minimizing the number of flush commands
+passed on to the plugin.
+
+=item B<multi-conn-mode=unsafe>
+
+When B<unsafe> mode is chosen, this filter blindly advertises
+NBD_FLAG_MULTI_CONN to the client even if the plugin lacks support.
+This is dangerous, and risks data corruption if the client makes
+assumptions about data consistency that were not actually met.
+
+=item B<multi-conn-track-dirty=fast>
+
+When dirty tracking is set to B<fast>, the filter tracks whether any
+connection has caused the image to be dirty (any write, zero, or trim
+commands since the last flush, regardless of connection); if all
+connections are clean, a client flush command is ignored rather than
+sent on to the plugin. In this mode, a flush action on one connection
+marks all other connections as clean, regardless of whether the filter
+actually advertised NBD_FLAG_MULTI_CONN, which can result in less
+activity when a client sends multiple flushes rather than taking
+advantage of multi-conn semantics. This is safe with
+B<multi-conn-mode=emulate>, but potentially unsafe with
+B<multi-conn-mode=plugin> when the plugin did not advertise
+multi-conn, as it does not track whether a read may have cached stale
+data prior to a flush.
+
+=item B<multi-conn-track-dirty=connection>
+
+Dirty tracking is set to B<connection> by default, where the filter
+tracks whether a given connection is dirty (any write, zero, or trim
+commands since the last flush on the given connection, and any read
+since the last flush on any other connection); if the connection is
+clean, a flush command to that connection (whether directly from the
+client, or replicated by B<multi-conn-mode=emulate>) is ignored rather
+than sent on to the plugin. This mode may result in more flush calls
+than B<multi-conn-track-dirty=fast>, but in turn is safe to use with
+B<multi-conn-mode=plugin>.
+
+=item B<multi-conn-track-dirty=off>
+
+When dirty tracking is set to B<off>, all flush commands from the
+client are passed on to the plugin, regardless of whether the flush
+would be needed for consistency. Note that when combined with
+B<multi-conn-mode=emulate>, a client which disregards
+NBD_FLAG_MULTI_CONN by flushing on each connection itself results in a
+quadratic number of flush operations on the plugin.
+
+=back
+
+=head1 EXAMPLES
+
+Provide consistent cross-connection flush semantics on top of a plugin
+that lacks it natively:
+
+ nbdkit --filter=multi-conn split file.part1 file.part2
+
+Minimize the number of expensive flush operations performed when
+utilizing a plugin that has multi-conn consistency from a client that
+blindly flushes across every connection:
+
+ nbdkit --filter=multi-conn file multi-conn-mode=plugin \
+ multi-conn-track-dirty=fast disk.img
+
+=head1 FILES
+
+=over 4
+
+=item F<$filterdir/nbdkit-multi-conn-filter.so>
+
+The filter.
+
+Use C<nbdkit --dump-config> to find the location of C<$filterdir>.
+
+=back
+
+=head1 VERSION
+
+C<nbdkit-multi-conn-filter> first appeared in nbdkit 1.26.
+
+=head1 SEE ALSO
+
+L<nbdkit(1)>,
+L<nbdkit-file-plugin(1)>,
+L<nbdkit-filter(3)>,
+L<nbdkit-fua-filter(1)>,
+L<nbdkit-nocache-filter(1)>,
+L<nbdkit-noextents-filter(1)>,
+L<nbdkit-nozero-filter(1)>.
+
+=head1 AUTHORS
+
+Eric Blake
+
+=head1 COPYRIGHT
+
+Copyright (C) 2018-2021 Red Hat Inc.
diff --git a/configure.ac b/configure.ac
index cb18dd88..2b3e214e 100644
--- a/configure.ac
+++ b/configure.ac
@@ -1,5 +1,5 @@
# nbdkit
-# Copyright (C) 2013-2020 Red Hat Inc.
+# Copyright (C) 2013-2021 Red Hat Inc.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
@@ -128,6 +128,7 @@ filters="\
ip \
limit \
log \
+ multi-conn \
nocache \
noextents \
nofilter \
@@ -1259,6 +1260,7 @@ AC_CONFIG_FILES([Makefile
filters/ip/Makefile
filters/limit/Makefile
filters/log/Makefile
+ filters/multi-conn/Makefile
filters/nocache/Makefile
filters/noextents/Makefile
filters/nofilter/Makefile
diff --git a/filters/multi-conn/Makefile.am b/filters/multi-conn/Makefile.am
new file mode 100644
index 00000000..778b8947
--- /dev/null
+++ b/filters/multi-conn/Makefile.am
@@ -0,0 +1,68 @@
+# nbdkit
+# Copyright (C) 2021 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+include $(top_srcdir)/common-rules.mk
+
+EXTRA_DIST = nbdkit-multi-conn-filter.pod
+
+filter_LTLIBRARIES = nbdkit-multi-conn-filter.la
+
+nbdkit_multi_conn_filter_la_SOURCES = \
+ multi-conn.c \
+ $(top_srcdir)/include/nbdkit-filter.h \
+ $(NULL)
+
+nbdkit_multi_conn_filter_la_CPPFLAGS = \
+ -I$(top_srcdir)/include \
+ -I$(top_srcdir)/common/include \
+ -I$(top_srcdir)/common/utils \
+ $(NULL)
+nbdkit_multi_conn_filter_la_CFLAGS = $(WARNINGS_CFLAGS)
+nbdkit_multi_conn_filter_la_LIBADD = \
+ $(top_builddir)/common/utils/libutils.la \
+ $(IMPORT_LIBRARY_ON_WINDOWS) \
+ $(NULL)
+nbdkit_multi_conn_filter_la_LDFLAGS = \
+ -module -avoid-version -shared $(NO_UNDEFINED_ON_WINDOWS) \
+ -Wl,--version-script=$(top_srcdir)/filters/filters.syms \
+ $(NULL)
+
+if HAVE_POD
+
+man_MANS = nbdkit-multi-conn-filter.1
+CLEANFILES += $(man_MANS)
+
+nbdkit-multi-conn-filter.1: nbdkit-multi-conn-filter.pod
+ $(PODWRAPPER) --section=1 --man $@ \
+ --html $(top_builddir)/html/$@.html \
+ $<
+
+endif HAVE_POD
diff --git a/tests/Makefile.am b/tests/Makefile.am
index 70898f20..4b3ee65c 100644
--- a/tests/Makefile.am
+++ b/tests/Makefile.am
@@ -1,5 +1,5 @@
# nbdkit
-# Copyright (C) 2013-2020 Red Hat Inc.
+# Copyright (C) 2013-2021 Red Hat Inc.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted provided that the following conditions are
@@ -1538,6 +1538,15 @@ EXTRA_DIST += \
test-log-script-info.sh \
$(NULL)
+# multi-conn filter test.
+TESTS += \
+ test-multi-conn.sh \
+ $(NULL)
+EXTRA_DIST += \
+ test-multi-conn-plugin.sh \
+ test-multi-conn.sh \
+ $(NULL)
+
# nofilter test.
TESTS += test-nofilter.sh
EXTRA_DIST += test-nofilter.sh
diff --git a/filters/multi-conn/multi-conn.c b/filters/multi-conn/multi-conn.c
new file mode 100644
index 00000000..3b244cb7
--- /dev/null
+++ b/filters/multi-conn/multi-conn.c
@@ -0,0 +1,467 @@
+/* nbdkit
+ * Copyright (C) 2021 Red Hat Inc.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions are
+ * met:
+ *
+ * * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ *
+ * * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in the
+ * documentation and/or other materials provided with the distribution.
+ *
+ * * Neither the name of Red Hat nor the names of its contributors may be
+ * used to endorse or promote products derived from this software without
+ * specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+ * THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+ * PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+ * CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+ * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+ * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+ * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+ * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include <config.h>
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <string.h>
+#include <stdbool.h>
+#include <assert.h>
+#include <pthread.h>
+
+#include <nbdkit-filter.h>
+
+#include "cleanup.h"
+#include "vector.h"
+
+/* Track results of .config */
+static enum MultiConnMode {
+ AUTO,
+ EMULATE,
+ PLUGIN,
+ UNSAFE,
+ DISABLE,
+} mode;
+
+static enum TrackDirtyMode {
+ CONN,
+ FAST,
+ OFF,
+} track;
+
+enum dirty {
+ WRITE = 1, /* A write may have populated a cache */
+ READ = 2, /* A read may have populated a cache */
+};
+
+/* Coordination between connections. */
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+
+/* The list of handles to active connections. */
+struct handle {
+ struct nbdkit_next_ops *next_ops;
+ void *nxdata;
+ enum MultiConnMode mode; /* Runtime resolution of mode==AUTO */
+ enum dirty dirty; /* What aspects of this connection are dirty */
+};
+DEFINE_VECTOR_TYPE(conns_vector, struct handle *);
+static conns_vector conns = empty_vector;
+static bool dirty; /* True if any connection is dirty */
+
+/* Accept 'multi-conn-mode=mode' and 'multi-conn-track-dirty=level' */
+static int
+multi_conn_config (nbdkit_next_config *next, void *nxdata,
+ const char *key, const char *value)
+{
+ if (strcmp (key, "multi-conn-mode") == 0) {
+ if (strcmp (value, "auto") == 0)
+ mode = AUTO;
+ else if (strcmp (value, "emulate") == 0)
+ mode = EMULATE;
+ else if (strcmp (value, "plugin") == 0)
+ mode = PLUGIN;
+ else if (strcmp (value, "disable") == 0)
+ mode = DISABLE;
+ else if (strcmp (value, "unsafe") == 0)
+ mode = UNSAFE;
+ else {
+ nbdkit_error ("unknown multi-conn mode '%s'", value);
+ return -1;
+ }
+ return 0;
+ }
+ else if (strcmp (key, "multi-conn-track-dirty") == 0) {
+ if (strcmp (value, "connection") == 0 ||
+ strcmp (value, "conn") == 0)
+ track = CONN;
+ else if (strcmp (value, "fast") == 0)
+ track = FAST;
+ else if (strcmp (value, "off") == 0)
+ track = OFF;
+ else {
+ nbdkit_error ("unknown multi-conn track-dirty setting '%s'", value);
+ return -1;
+ }
+ return 0;
+ }
+ return next (nxdata, key, value);
+}
+
+#define multi_conn_config_help \
+ "multi-conn-mode=<MODE> 'auto' (default), 'emulate', 'plugin',\n" \
+ " 'disable', or 'unsafe'.\n" \
+ "multi-conn-track-dirty=<LEVEL> 'conn' (default), 'fast', or 'off'.\n"
+
+static void *
+multi_conn_open (nbdkit_next_open *next, void *nxdata,
+ int readonly, const char *exportname, int is_tls)
+{
+ struct handle *h;
+
+ if (next (nxdata, readonly, exportname) == -1)
+ return NULL;
+
+ /* Allocate here, but populate and insert into list in .prepare */
+ h = calloc (1, sizeof *h);
+ if (h == NULL) {
+ nbdkit_error ("calloc: %m");
+ return NULL;
+ }
+ return h;
+}
+
+static int
+multi_conn_prepare (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, int readonly)
+{
+ struct handle *h = handle;
+ int r;
+
+ h->next_ops = next_ops;
+ h->nxdata = nxdata;
+ if (mode == AUTO) {
+ r = next_ops->can_multi_conn (nxdata);
+ if (r == -1)
+ return -1;
+ if (r == 0)
+ h->mode = EMULATE;
+ else
+ h->mode = PLUGIN;
+ }
+ else
+ h->mode = mode;
+ if (h->mode == EMULATE && next_ops->can_flush (nxdata) != 1) {
+ nbdkit_error ("emulating multi-conn requires working flush");
+ return -1;
+ }
+
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ conns_vector_append (&conns, h);
+ return 0;
+}
+
+static int
+multi_conn_finalize (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ struct handle *h = handle;
+
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ assert (h->next_ops == next_ops);
+ assert (h->nxdata == nxdata);
+
+ /* XXX should we add a config param to flush if the client forgot? */
+ for (size_t i = 0; i < conns.size; i++) {
+ if (conns.ptr[i] == h) {
+ conns_vector_remove (&conns, i);
+ break;
+ }
+ }
+ return 0;
+}
+
+static void
+multi_conn_close (void *handle)
+{
+ free (handle);
+}
+
+static int
+multi_conn_can_fua (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ /* If the backend has native FUA support but is not multi-conn
+ * consistent, and we have to flush on every connection, then we are
+ * better off advertising emulated fua rather than native.
+ */
+ struct handle *h = handle;
+ int fua = next_ops->can_fua (nxdata);
+
+ assert (h->mode != AUTO);
+ if (fua == NBDKIT_FUA_NATIVE && h->mode == EMULATE)
+ return NBDKIT_FUA_EMULATE;
+ return fua;
+}
+
+static int
+multi_conn_can_multi_conn (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle)
+{
+ struct handle *h = handle;
+
+ switch (h->mode) {
+ case EMULATE:
+ return 1;
+ case PLUGIN:
+ return next_ops->can_multi_conn (nxdata);
+ case DISABLE:
+ return 0;
+ case UNSAFE:
+ return 1;
+ case AUTO: /* Not possible, see .prepare */
+ default:
+ abort ();
+ }
+}
+
+static void
+mark_dirty (struct handle *h, bool is_read)
+{
+ /* No need to grab lock here: the NBD spec is clear that a client
+ * must wait for the response to a flush before sending the next
+ * command that expects to see the result of that flush, so any race
+ * in accessing dirty can be traced back to the client improperly
+ * sending a flush in parallel with other live commands.
+ */
+ switch (track) {
+ case CONN:
+ h->dirty |= is_read ? READ : WRITE;
+ break;
+ case FAST:
+ if (!is_read)
+ dirty = true;
+ break;
+ case OFF:
+ break;
+ default:
+ abort ();
+ }
+}
+
+static int
+multi_conn_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, uint32_t flags, int *err);
+
+static int
+multi_conn_pread (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, void *buf, uint32_t count, uint64_t offs,
+ uint32_t flags, int *err)
+{
+ struct handle *h = handle;
+
+ mark_dirty (h, true);
+ return next_ops->pread (nxdata, buf, count, offs, flags, err);
+}
+
+static int
+multi_conn_pwrite (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, const void *buf, uint32_t count,
+ uint64_t offs, uint32_t flags, int *err)
+{
+ struct handle *h = handle;
+ bool need_flush = false;
+
+ if (flags & NBDKIT_FLAG_FUA) {
+ if (h->mode == EMULATE) {
+ mark_dirty (h, false);
+ need_flush = true;
+ flags &= ~NBDKIT_FLAG_FUA;
+ }
+ }
+ else
+ mark_dirty (h, false);
+
+ if (next_ops->pwrite (nxdata, buf, count, offs, flags, err) == -1)
+ return -1;
+ if (need_flush)
+ return multi_conn_flush (next_ops, nxdata, h, 0, err);
+ return 0;
+}
+
+static int
+multi_conn_zero (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+ int *err)
+{
+ struct handle *h = handle;
+ bool need_flush = false;
+
+ if (flags & NBDKIT_FLAG_FUA) {
+ if (h->mode == EMULATE) {
+ mark_dirty (h, false);
+ need_flush = true;
+ flags &= ~NBDKIT_FLAG_FUA;
+ }
+ }
+ else
+ mark_dirty (h, false);
+
+ if (next_ops->zero (nxdata, count, offs, flags, err) == -1)
+ return -1;
+ if (need_flush)
+ return multi_conn_flush (next_ops, nxdata, h, 0, err);
+ return 0;
+}
+
+static int
+multi_conn_trim (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+ int *err)
+{
+ struct handle *h = handle;
+ bool need_flush = false;
+
+ if (flags & NBDKIT_FLAG_FUA) {
+ if (h->mode == EMULATE) {
+ mark_dirty (h, false);
+ need_flush = true;
+ flags &= ~NBDKIT_FLAG_FUA;
+ }
+ }
+ else
+ mark_dirty (h, false);
+
+ if (next_ops->trim (nxdata, count, offs, flags, err) == -1)
+ return -1;
+ if (need_flush)
+ return multi_conn_flush (next_ops, nxdata, h, 0, err);
+ return 0;
+}
+
+static int
+multi_conn_cache (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, uint32_t count, uint64_t offs, uint32_t flags,
+ int *err)
+{
+ struct handle *h = handle;
+
+ mark_dirty (h, true);
+ return next_ops->cache (nxdata, count, offs, flags, err);
+}
+
+static int
+multi_conn_flush (struct nbdkit_next_ops *next_ops, void *nxdata,
+ void *handle, uint32_t flags, int *err)
+{
+ struct handle *h = handle, *h2;
+ size_t i;
+
+ if (h->mode == EMULATE) {
+ /* Optimize for the common case of a single connection: flush all
+ * writes on other connections, then flush both read and write on
+ * the current connection, then finally flush all other
+ * connections to avoid reads seeing stale data, skipping the
+ * flushes that make no difference according to dirty tracking.
+ */
+ bool updated = h->dirty & WRITE;
+
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ for (i = 0; i < conns.size; i++) {
+ h2 = conns.ptr[i];
+ if (h == h2)
+ continue;
+ if (dirty || h2->dirty & WRITE) {
+ if (h2->next_ops->flush (h2->nxdata, flags, err) == -1)
+ return -1;
+ h2->dirty &= ~WRITE;
+ updated = true;
+ }
+ }
+ if (dirty || updated) {
+ if (next_ops->flush (nxdata, flags, err) == -1)
+ return -1;
+ }
+ h->dirty = 0;
+ dirty = false;
+ for (i = 0; i < conns.size; i++) {
+ h2 = conns.ptr[i];
+ if (updated && h2->dirty & READ) {
+ assert (h != h2);
+ if (h2->next_ops->flush (h2->nxdata, flags, err) == -1)
+ return -1;
+ }
+ h2->dirty &= ~READ;
+ }
+ }
+ else {
+ /* !EMULATE: Check if the image is clean, allowing us to skip a flush. */
+ switch (track) {
+ case CONN:
+ if (!h->dirty)
+ return 0;
+ break;
+ case FAST:
+ if (!dirty)
+ return 0;
+ break;
+ case OFF:
+ break;
+ default:
+ abort ();
+ }
+ /* Perform the flush, then update dirty tracking. */
+ if (next_ops->flush (nxdata, flags, err) == -1)
+ return -1;
+ switch (track) {
+ case CONN:
+ if (next_ops->can_multi_conn (nxdata) == 1) {
+ ACQUIRE_LOCK_FOR_CURRENT_SCOPE (&lock);
+ for (i = 0; i < conns.size; i++)
+ conns.ptr[i]->dirty = 0;
+ }
+ else
+ h->dirty = 0;
+ break;
+ case FAST:
+ dirty = false;
+ break;
+ case OFF:
+ break;
+ default:
+ abort ();
+ }
+ }
+ return 0;
+}
+
+static struct nbdkit_filter filter = {
+ .name = "multi-conn",
+ .longname = "nbdkit multi-conn filter",
+ .config = multi_conn_config,
+ .config_help = multi_conn_config_help,
+ .open = multi_conn_open,
+ .prepare = multi_conn_prepare,
+ .finalize = multi_conn_finalize,
+ .close = multi_conn_close,
+ .can_fua = multi_conn_can_fua,
+ .can_multi_conn = multi_conn_can_multi_conn,
+ .pread = multi_conn_pread,
+ .pwrite = multi_conn_pwrite,
+ .trim = multi_conn_trim,
+ .zero = multi_conn_zero,
+ .cache = multi_conn_cache,
+ .flush = multi_conn_flush,
+};
+
+NBDKIT_REGISTER_FILTER(filter)
diff --git a/tests/test-multi-conn-plugin.sh b/tests/test-multi-conn-plugin.sh
new file mode 100755
index 00000000..c580b89a
--- /dev/null
+++ b/tests/test-multi-conn-plugin.sh
@@ -0,0 +1,121 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2021 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+# Test plugin used by test-multi-conn.sh.
+# This plugin purposefully maintains a per-connection cache.
+# An optional parameter strictfua=true controls whether FUA acts on
+# just the given region, or on all pending ops in the current connection.
+# Note that an earlier cached write on one connection can overwrite a later
+# FUA write on another connection - this is okay (the client is buggy if
+# it ever sends overlapping writes without coordinating flushes and still
+# expects any particular write to occur last).
+
+fill_cache() {
+ if test ! -f "$tmpdir/$1"; then
+ cp "$tmpdir/0" "$tmpdir/$1"
+ fi
+}
+do_fua() {
+ case ,$4, in
+ *,fua,*)
+ if test -f "$tmpdir/strictfua"; then
+ dd of="$tmpdir/0" if="$tmpdir/$1" skip=$3 seek=$3 count=$2 \
+ conv=notrunc iflag=count_bytes,skip_bytes oflag=seek_bytes
+ else
+ do_flush $1
+ fi ;;
+ esac
+}
+do_flush() {
+ if test -f "$tmpdir/$1-replay"; then
+ while read cnt off; do
+ dd of="$tmpdir/0" if="$tmpdir/$1" skip=$off seek=$off count=$cnt \
+ conv=notrunc iflag=count_bytes,skip_bytes oflag=seek_bytes
+ done < "$tmpdir/$1-replay"
+ fi
+ rm -f "$tmpdir/$1" "$tmpdir/$1-replay"
+}
+case "$1" in
+ config)
+ case $2 in
+ strictfua)
+ case $3 in
+ true | on | 1) touch "$tmpdir/strictfua" ;;
+ false | off | 0) ;;
+ *) echo "unknown value for strictfua $3" >&2; exit 1 ;;
+ esac ;;
+ *) echo "unknown config key $2" >&2; exit 1 ;;
+ esac
+ ;;
+ get_ready)
+ printf "%-32s" 'Initial contents' > "$tmpdir/0"
+ echo 0 > "$tmpdir/counter"
+ ;;
+ get_size)
+ echo 32
+ ;;
+ can_write | can_zero | can_trim | can_flush)
+ exit 0
+ ;;
+ can_fua | can_cache)
+ echo native
+ ;;
+ open)
+ read i < "$tmpdir/counter"
+ echo $((i+1)) | tee "$tmpdir/counter"
+ ;;
+ pread)
+ fill_cache $2
+ dd if="$tmpdir/$2" skip=$4 count=$3 iflag=count_bytes,skip_bytes
+ ;;
+ cache)
+ fill_cache $2
+ ;;
+ pwrite)
+ fill_cache $2
+ dd of="$tmpdir/$2" seek=$4 conv=notrunc oflag=seek_bytes
+ echo $3 $4 >> "$tmpdir/$2-replay"
+ do_fua $2 $3 $4 $5
+ ;;
+ zero | trim)
+ fill_cache $2
+ dd of="$tmpdir/$2" if="/dev/zero" seek=$4 conv=notrunc oflag=seek_bytes
+ echo $3 $4 >> "$tmpdir/$2-replay"
+ do_fua $2 $3 $4 $5
+ ;;
+ flush)
+ do_flush $2
+ ;;
+ *)
+ exit 2
+ ;;
+esac
diff --git a/tests/test-multi-conn.sh b/tests/test-multi-conn.sh
new file mode 100755
index 00000000..01efd108
--- /dev/null
+++ b/tests/test-multi-conn.sh
@@ -0,0 +1,85 @@
+#!/usr/bin/env bash
+# nbdkit
+# Copyright (C) 2018-2021 Red Hat Inc.
+#
+# Redistribution and use in source and binary forms, with or without
+# modification, are permitted provided that the following conditions are
+# met:
+#
+# * Redistributions of source code must retain the above copyright
+# notice, this list of conditions and the following disclaimer.
+#
+# * Redistributions in binary form must reproduce the above copyright
+# notice, this list of conditions and the following disclaimer in the
+# documentation and/or other materials provided with the distribution.
+#
+# * Neither the name of Red Hat nor the names of its contributors may be
+# used to endorse or promote products derived from this software without
+# specific prior written permission.
+#
+# THIS SOFTWARE IS PROVIDED BY RED HAT AND CONTRIBUTORS ''AS IS'' AND
+# ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
+# THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+# PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL RED HAT OR
+# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
+# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
+# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
+# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+# SUCH DAMAGE.
+
+# Demonstrate various multi-conn filter behaviors.
+
+source ./functions.sh
+set -e
+set -x
+
+requires_plugin sh
+requires nbdsh -u "nbd://nosuch" --version
+requires dd iflag=count_bytes </dev/null
+
+files="test-multi-conn.out test-multi-conn.stat"
+rm -f $files
+cleanup_fn rm -f $files
+
+fail=0
+export handles preamble uri
+uri= # will be set by --run later
+handles=2
+preamble='
+import os
+
+uri = os.environ["uri"]
+handles = int(os.environ["handles"])
+h = []
+for i in range(handles):
+ h.append(nbd.NBD())
+ h[i].connect_uri(uri)
+print(h[0].can_multi_conn())
+'
+
+# Demonstrate the caching required with plugin alone
+nbdkit -vf -U - sh test-multi-conn-plugin.sh \
+ --run 'nbdsh -c "$preamble" -c "
+"' > test-multi-conn.out || fail=1
+diff -u <(cat <<\EOF
+False
+EOF
+ ) test-multi-conn.out || fail=1
+
+# Demonstrate that FUA alone does not have to sync full disk
+ # TODO
+# Demonstrate multi-conn defaults
+ # TODO
+# Use --filter=stats to show track-dirty effects
+nbdkit -vf -U - sh test-multi-conn-plugin.sh \
+ --filter=stats statsfile=test-multi-conn.stat \
+ --run 'nbdsh -c "$preamble" -c "
+h[0].flush()
+"' > test-multi-conn.out || fail=1
+cat test-multi-conn.stat
+grep 'flush: 1 ops' test-multi-conn.stat || fail=1
+
+exit $fail
diff --git a/TODO b/TODO
index d8dd7ef2..e41e38e8 100644
--- a/TODO
+++ b/TODO
@@ -206,13 +206,6 @@ Suggestions for filters
* masking plugin features for testing clients (see 'nozero' and 'fua'
filters for examples)
-* multi-conn filter to adjust advertisement of multi-conn bit. In
- particular, if the plugin lacks .can_multi_conn, then .open/.close
- track all open connections, and .flush and FUA flag will call
- next_ops->flush() on all of them. Conversely, if plugin supports
- multi-conn, we can cache whether the image is dirty, and avoid
- expense of next_ops->flush when it is clean.
-
* "bandwidth quota" filter which would close a connection after it
exceeded a certain amount of bandwidth up or down.
--
2.30.1
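The emulate-mode flush ordering in the filter above can be modeled outside nbdkit. The following Python sketch is only an illustration and not part of the patch. It assumes multi-conn-track-dirty=conn; the class names Backend and EmulateFilter and the helper demo are hypothetical. Backend imitates the per-connection cache of test-multi-conn-plugin.sh, and EmulateFilter.flush follows the three passes described in the comment in multi_conn_flush.

```python
WRITE, READ = 1, 2  # same bit values as "enum dirty" in the filter


class Backend:
    """Hypothetical plugin stand-in: one private cache per connection."""

    def __init__(self, size):
        self.disk = bytearray(size)
        self.caches = {}    # connection id -> private copy of the disk
        self.pending = {}   # connection id -> list of (offset, length)
        self.flushes = 0

    def _fill(self, conn):
        if conn not in self.caches:
            self.caches[conn] = bytearray(self.disk)
            self.pending[conn] = []

    def pread(self, conn, offset, count):
        self._fill(conn)
        return bytes(self.caches[conn][offset:offset + count])

    def pwrite(self, conn, offset, data):
        self._fill(conn)
        self.caches[conn][offset:offset + len(data)] = data
        self.pending[conn].append((offset, len(data)))

    def flush(self, conn):
        # Replay this connection's writes to the disk, then drop its cache.
        self.flushes += 1
        cache = self.caches.pop(conn, None)
        for offset, length in self.pending.pop(conn, []):
            self.disk[offset:offset + length] = cache[offset:offset + length]


class EmulateFilter:
    """Hypothetical model of the filter in emulate mode, tracking per connection."""

    def __init__(self, backend, nconns):
        self.backend = backend
        self.dirty = {conn: 0 for conn in range(nconns)}

    def pread(self, conn, offset, count):
        self.dirty[conn] |= READ
        return self.backend.pread(conn, offset, count)

    def pwrite(self, conn, offset, data):
        self.dirty[conn] |= WRITE
        self.backend.pwrite(conn, offset, data)

    def flush(self, conn):
        updated = bool(self.dirty[conn] & WRITE)
        # Pass 1: push out writes cached on the other connections.
        for other in self.dirty:
            if other != conn and self.dirty[other] & WRITE:
                self.backend.flush(other)
                self.dirty[other] &= ~WRITE
                updated = True
        # Pass 2: flush the current connection if anything was written.
        if updated:
            self.backend.flush(conn)
        self.dirty[conn] = 0
        # Pass 3: drop read caches elsewhere that may now be stale.
        for other in self.dirty:
            if updated and self.dirty[other] & READ:
                self.backend.flush(other)
            self.dirty[other] &= ~READ


def demo():
    backend = Backend(8)
    flt = EmulateFilter(backend, 2)
    flt.pread(1, 0, 4)              # connection 1 caches the initial zeroes
    flt.pwrite(0, 0, b"abcd")       # connection 0 writes into its own cache
    stale = flt.pread(1, 0, 4)      # connection 1 still sees its old cache
    flt.flush(0)                    # emulated multi-conn flush
    fresh = flt.pread(1, 0, 4)      # connection 1 now refills from the disk
    return stale, fresh, backend.flushes
```

In this model a single client flush on connection 0 turns into two backend flushes: one to commit the write and one to discard the read cache on connection 1. That is the cost which the dirty tracking in the filter tries to avoid when no connection has pending data.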
Crafted OVAs and handling non-existing (empty) disks in OVF
by Tomáš Golembiovský
Hi,
from time to time we get a request [1] to import an appliance with a
crafted OVF (from Cisco or Gigamon) into oVirt. The common denominator
is that some disks are missing and are supposed to be created during
import of the appliance.
Doing the import properly would require not only solving the problem
with missing disks, but also implementing multiple concepts -- notably
the concept of configurations [2] (for the Cisco appliance) or the
concept of properties [3] (for the Gigamon appliance). This would be
quite a complex change to oVirt as well as to virt-v2v, and at the
moment we don't feel that such an effort is justifiable. But by solving
the problem with disks we can at least provide a helping hand to those
requiring the crafted appliances.
The idea here is that virt-v2v would ignore the non-existing disks and
the user would be required to add those manually after conversion. As
for OVFs using configurations, virt-v2v would pick some settings from
the OVF (random from the user's perspective) and the user would be
responsible for editing the VM's configuration after conversion (CPUs,
memory, etc.) to size the VM properly based on the expected use. We
could further constrain this to only work with -o vdsm, but this may
not be needed unless we hit some issues with implementing the feature.
It is also possible that ignoring the disks may not work for reasons
that we have not yet discovered and that we'll see once we try.
There is one more problem with the Cisco OVA: it contains a .cert file
which is not specified in the manifest. But the file is not needed
during conversion. So this could possibly be handled by making virt-v2v
use only the files listed in the manifest instead of complaining.
Before I invest any time into this I wanted to make sure that such an
approach would be acceptable to upstream. So would this be good
enough?
***
The topics for discussion are above. What follows are the technical
details for those interested in gaining a deeper understanding of the
problem. You may be wondering why we would want to ignore the empty
disks if we can create them for most of the output backends. The
problem is that we cannot. Either we don't know which disks are of
interest because not all should be used (configurations) or we have no
idea how big the disk should be (properties).
### Configurations
The OVF can define several configurations in the
DeploymentOptionSection. A short (simplified) example may look like
this:
  <DeploymentOptionSection>
    <Configuration ovf:default="true" ovf:id="Standard" />
    <Configuration ovf:id="Express" />
    ...
  </DeploymentOptionSection>
Then in the VirtualHardwareSection there can be duplicate settings
distinguished only by the ovf:configuration attribute. For example 2 different
vCPU configurations:
  <Item ovf:configuration="Express">
    ...
    <rasd:Description>Number of Virtual CPUs</rasd:Description>
    <rasd:ResourceType>3</rasd:ResourceType>
    <rasd:VirtualQuantity>4</rasd:VirtualQuantity>
  </Item>
  <Item ovf:configuration="Standard">
    ...
    <rasd:Description>Number of Virtual CPUs</rasd:Description>
    <rasd:ResourceType>3</rasd:ResourceType>
    <rasd:VirtualQuantity>16</rasd:VirtualQuantity>
  </Item>
In terms of disks this means that only some of the disks get used. Specifically
the Cisco Appliance has 4 disks listed in the DiskSection -- 1 system disk A
and 3 empty disks B,C,D. But the created VM never has all four. It has either
only A or it has A+B or A+C or A+D. Without understanding configurations we
cannot tell whether to use B, C, D or none.
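To make the selection step concrete, here is a minimal Python sketch of what honouring configurations would involve. It is an illustration only, not virt-v2v code. The sample document, the helper names, and the namespace URIs (taken to be the usual OVF 1.x envelope and RASD ones) are assumptions, as are two readings of the specification: that ovf:configuration may hold a space-separated list of ids, and that the first Configuration is the default when none is marked.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URIs for an OVF 1.x envelope and its RASD elements.
OVF = "http://schemas.dmtf.org/ovf/envelope/1"
RASD = ("http://schemas.dmtf.org/wbem/wscim/1/cim-schema/2/"
        "CIM_ResourceAllocationSettingData")
NS = {"ovf": OVF, "rasd": RASD}

# Hypothetical, heavily simplified OVF modeled on the snippets above.
SAMPLE = f"""<Envelope xmlns="{OVF}" xmlns:ovf="{OVF}" xmlns:rasd="{RASD}">
  <DeploymentOptionSection>
    <Configuration ovf:default="true" ovf:id="Standard"/>
    <Configuration ovf:id="Express"/>
  </DeploymentOptionSection>
  <VirtualSystem ovf:id="vm">
    <VirtualHardwareSection>
      <Item ovf:configuration="Express">
        <rasd:ResourceType>3</rasd:ResourceType>
        <rasd:VirtualQuantity>4</rasd:VirtualQuantity>
      </Item>
      <Item ovf:configuration="Standard">
        <rasd:ResourceType>3</rasd:ResourceType>
        <rasd:VirtualQuantity>16</rasd:VirtualQuantity>
      </Item>
    </VirtualHardwareSection>
  </VirtualSystem>
</Envelope>"""


def default_configuration(root):
    """Return the id of the configuration marked ovf:default="true"."""
    configs = root.findall(".//ovf:Configuration", NS)
    for config in configs:
        if config.get(f"{{{OVF}}}default") == "true":
            return config.get(f"{{{OVF}}}id")
    # Assumption: with no explicit default, the first one listed applies.
    return configs[0].get(f"{{{OVF}}}id") if configs else None


def items_for(root, config_id):
    """Return the Items that apply to the given configuration."""
    selected = []
    for item in root.iter(f"{{{OVF}}}Item"):
        wanted = item.get(f"{{{OVF}}}configuration")
        # Items without the attribute apply to every configuration.
        if wanted is None or config_id in wanted.split():
            selected.append(item)
    return selected


def vcpus(root, config_id):
    """Return the vCPU count (ResourceType 3) for one configuration."""
    for item in items_for(root, config_id):
        if item.findtext("rasd:ResourceType", namespaces=NS) == "3":
            return int(item.findtext("rasd:VirtualQuantity", namespaces=NS))
    return None
```

The same filter applied to disk Items is what would decide between A, A+B, A+C and A+D. Without it, every Item in the section looks equally valid.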
### Properties
The OVF can define various properties in the ProductSection element,
like this:
  <ProductSection>
    ...
    <Property ovf:key="datadisksize" ovf:qualifiers="MinValue(20),MaxValue(1000)" ovf:type="int" ovf:userConfigurable="true">
      <Label>02. Size of Data Disk</Label>
      <Description>The size of the Data Disk in gigabytes.</Description>
      <Value ovf:value="40"/>
    </Property>
    ...
  </ProductSection>
Then the ovf:key value of the property can be used as a variable in
other places in the OVF, for example like this:
  <DiskSection>
    ...
    <Disk ovf:capacity="${datadisksize}" ovf:fileRef="" ... />
    ...
  </DiskSection>
And as before, without understanding the concept, virt-v2v has no idea
how to size the disk. The Value element is optional (if the property is
userConfigurable), so relying on the default in the OVF may not work. I
can imagine some OVFs may even use a property to specify vCPU count or
memory, which raises the question of how to handle those. Possibly
default to 0 or 1 (where 1 may be the better default in my opinion).
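The substitution described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not virt-v2v code: the sample document, the ovf:diskId value, the helper names, and the namespace URI are hypothetical. The default is read from a Value child as in the example above; the sketch also checks an ovf:value attribute on the Property itself, which is an assumption about how some OVFs carry the default. Units are left aside.

```python
import re
import xml.etree.ElementTree as ET

OVF = "http://schemas.dmtf.org/ovf/envelope/1"  # assumed OVF 1.x namespace

# Hypothetical, simplified OVF modeled on the snippets above.
SAMPLE = """<Envelope xmlns="http://schemas.dmtf.org/ovf/envelope/1"
          xmlns:ovf="http://schemas.dmtf.org/ovf/envelope/1">
  <DiskSection>
    <Disk ovf:diskId="data" ovf:capacity="${datadisksize}" ovf:fileRef=""/>
  </DiskSection>
  <VirtualSystem ovf:id="vm">
    <ProductSection>
      <Property ovf:key="datadisksize" ovf:type="int"
                ovf:userConfigurable="true">
        <Value ovf:value="40"/>
      </Property>
    </ProductSection>
  </VirtualSystem>
</Envelope>"""

VARIABLE = re.compile(r"\$\{([^}]+)\}")


def property_defaults(root):
    """Map each property key to its default value, or None if it has none."""
    defaults = {}
    for prop in root.iter(f"{{{OVF}}}Property"):
        value = prop.get(f"{{{OVF}}}value")
        if value is None:
            child = prop.find(f"{{{OVF}}}Value")
            if child is not None:
                value = child.get(f"{{{OVF}}}value")
        defaults[prop.get(f"{{{OVF}}}key")] = value
    return defaults


def disk_capacity(disk, defaults):
    """Return the capacity as an int, or None when it cannot be resolved."""
    raw = disk.get(f"{{{OVF}}}capacity")
    if raw is None:
        return None
    match = VARIABLE.fullmatch(raw)
    if match is None:
        return int(raw)              # a literal capacity
    value = defaults.get(match.group(1))
    return None if value is None else int(value)
```

When disk_capacity returns None the property had no default, which is exactly the case where the converter has no way to size the disk and would have to skip it or ask the user.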
Tomas
[1] https://bugzilla.redhat.com/1705600
[2] Open Virtualization Format Specification (DSP0243) Version 2.1.1,
Chapter 9.8 -- DeploymentOptionSection
[3] Open Virtualization Format Specification (DSP0243) Version 2.1.1,
Chapter 9.5.1 -- Property elements
--
Tomáš Golembiovský <tgolembi(a)redhat.com>
[PATCH nbdcopy] copy: Implement extent metadata for efficient copying.
by Richard W.M. Jones
This implements these interconnected options:
--allocated
--destination-is-zero (alias: --target-is-zero)
--no-extents
---
TODO | 6 +-
copy/Makefile.am | 7 +
copy/copy-sparse-allocated.sh | 92 ++++++++++++++
copy/copy-sparse-no-extents.sh | 92 ++++++++++++++
copy/copy-sparse.sh | 97 ++++++++++++++
copy/file-ops.c | 181 ++++++++++++++++++++++++++
copy/main.c | 107 ++++++++++++++--
copy/multi-thread-copying.c | 226 ++++++++++++++++++++++++---------
copy/nbd-ops.c | 176 +++++++++++++++++++++++++
copy/nbdcopy.h | 50 ++++++++
copy/nbdcopy.pod | 32 ++++-
copy/pipe-ops.c | 30 ++++-
12 files changed, 1015 insertions(+), 81 deletions(-)
diff --git a/TODO b/TODO
index 9e5c821..8c0402e 100644
--- a/TODO
+++ b/TODO
@@ -30,9 +30,9 @@ Performance: Chart it over various buffer sizes and threads, as that
Examine other fuzzers: https://gitlab.com/akihe/radamsa
nbdcopy:
- - Properly handle extents/sparseness in input and output.
- - Write zeroes efficiently.
- - Detect zeroes (optionally) and turn into sparseness.
+ - --synchronous mode does not yet support extents.
+ - Detect zeroes (optionally) and turn into sparseness
+ (like qemu-img convert -S).
- Progress bar: allow it to be written to a file descriptor
and/or written in a machine-consumable format.
- Minimum/preferred/maximum block size.
diff --git a/copy/Makefile.am b/copy/Makefile.am
index 8f2d168..cfbc386 100644
--- a/copy/Makefile.am
+++ b/copy/Makefile.am
@@ -26,6 +26,9 @@ EXTRA_DIST = \
copy-nbd-to-small-block-error.sh \
copy-nbd-to-small-nbd-error.sh \
copy-nbd-to-stdout.sh \
+ copy-sparse.sh \
+ copy-sparse-allocated.sh \
+ copy-sparse-no-extents.sh \
copy-stdin-to-nbd.sh \
nbdcopy.pod \
$(NULL)
@@ -46,6 +49,7 @@ nbdcopy_SOURCES = \
$(NULL)
nbdcopy_CPPFLAGS = \
-I$(top_srcdir)/include \
+ -I$(top_srcdir)/common/include \
-I$(top_srcdir)/common/utils \
$(NULL)
nbdcopy_CFLAGS = \
@@ -88,6 +92,9 @@ TESTS += \
copy-nbd-to-small-nbd-error.sh \
copy-stdin-to-nbd.sh \
copy-nbd-to-stdout.sh \
+ copy-sparse.sh \
+ copy-sparse-allocated.sh \
+ copy-sparse-no-extents.sh \
$(ROOT_TESTS) \
$(NULL)
diff --git a/copy/copy-sparse-allocated.sh b/copy/copy-sparse-allocated.sh
new file mode 100755
index 0000000..203c3b9
--- /dev/null
+++ b/copy/copy-sparse-allocated.sh
@@ -0,0 +1,92 @@
+#!/usr/bin/env bash
+# nbd client library in userspace
+# Copyright (C) 2020 Red Hat Inc.
+#
+# This library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2 of the License, or (at your option) any later version.
+#
+# This library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with this library; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+# Adapted from copy-sparse.sh.
+#
+# This test depends on the nbdkit default sparse block size (32K).
+
+. ../tests/functions.sh
+
+set -e
+set -x
+
+requires nbdkit --version
+requires nbdkit --exit-with-parent --version
+requires nbdkit data --version
+requires nbdkit eval --version
+
+out=copy-sparse-allocated.out
+cleanup_fn rm -f $out
+
+$VG nbdcopy --allocated -- \
+ [ nbdkit --exit-with-parent data data='
+ 1
+ @1073741823 1
+ @4294967295 1
+ @4294967296 1
+ ' ] \
+ [ nbdkit --exit-with-parent eval \
+ get_size=' echo 7E ' \
+ pwrite=" echo \$@ >> $out " \
+ trim=" echo \$@ >> $out " \
+ zero=" echo \$@ >> $out " ]
+
+sort -o $out $out
+
+echo Output:
+cat $out
+
+if [ "$(cat $out)" != "pwrite 1 4294967296
+pwrite 32768 0
+pwrite 32768 1073709056
+pwrite 32768 4294934528
+zero 134184960 32768
+zero 134184960 4160749568
+zero 134184960 939524096
+zero 134217728 1073741824
+zero 134217728 1207959552
+zero 134217728 134217728
+zero 134217728 1342177280
+zero 134217728 1476395008
+zero 134217728 1610612736
+zero 134217728 1744830464
+zero 134217728 1879048192
+zero 134217728 2013265920
+zero 134217728 2147483648
+zero 134217728 2281701376
+zero 134217728 2415919104
+zero 134217728 2550136832
+zero 134217728 268435456
+zero 134217728 2684354560
+zero 134217728 2818572288
+zero 134217728 2952790016
+zero 134217728 3087007744
+zero 134217728 3221225472
+zero 134217728 3355443200
+zero 134217728 3489660928
+zero 134217728 3623878656
+zero 134217728 3758096384
+zero 134217728 3892314112
+zero 134217728 402653184
+zero 134217728 4026531840
+zero 134217728 536870912
+zero 134217728 671088640
+zero 134217728 805306368" ]; then
+ echo "$0: output does not match expected"
+ exit 1
+fi
diff --git a/copy/copy-sparse-no-extents.sh b/copy/copy-sparse-no-extents.sh
new file mode 100755
index 0000000..e976d55
--- /dev/null
+++ b/copy/copy-sparse-no-extents.sh
@@ -0,0 +1,92 @@
+#!/usr/bin/env bash
+# nbd client library in userspace
+# Copyright (C) 2020 Red Hat Inc.
+#
+# This library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2 of the License, or (at your option) any later version.
+#
+# This library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with this library; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+# Adapted from copy-sparse.sh
+#
+# This test depends on the nbdkit default sparse block size (32K).
+
+. ../tests/functions.sh
+
+set -e
+set -x
+
+# Skip this test under valgrind, it takes too long.
+if [ "x$LIBNBD_VALGRIND" = "x1" ]; then
+ echo "$0: test skipped under valgrind"
+ exit 77
+fi
+
+requires nbdkit --version
+requires nbdkit --exit-with-parent --version
+requires nbdkit data --version
+requires nbdkit eval --version
+
+out=copy-sparse-no-extents.out
+cleanup_fn rm -f $out
+
+$VG nbdcopy --no-extents -- \
+ [ nbdkit --exit-with-parent data data='
+ 1
+ @1073741823 1
+ ' ] \
+ [ nbdkit --exit-with-parent eval \
+ get_size=' echo 7E ' \
+ pwrite=" echo \$@ >> $out " \
+ trim=" echo \$@ >> $out " \
+ zero=" echo \$@ >> $out " ]
+
+sort -n -o $out $out
+
+echo Output:
+cat $out
+
+if [ "$(cat $out)" != "pwrite 33554432 0
+pwrite 33554432 100663296
+pwrite 33554432 1006632960
+pwrite 33554432 1040187392
+pwrite 33554432 134217728
+pwrite 33554432 167772160
+pwrite 33554432 201326592
+pwrite 33554432 234881024
+pwrite 33554432 268435456
+pwrite 33554432 301989888
+pwrite 33554432 33554432
+pwrite 33554432 335544320
+pwrite 33554432 369098752
+pwrite 33554432 402653184
+pwrite 33554432 436207616
+pwrite 33554432 469762048
+pwrite 33554432 503316480
+pwrite 33554432 536870912
+pwrite 33554432 570425344
+pwrite 33554432 603979776
+pwrite 33554432 637534208
+pwrite 33554432 67108864
+pwrite 33554432 671088640
+pwrite 33554432 704643072
+pwrite 33554432 738197504
+pwrite 33554432 771751936
+pwrite 33554432 805306368
+pwrite 33554432 838860800
+pwrite 33554432 872415232
+pwrite 33554432 905969664
+pwrite 33554432 939524096
+pwrite 33554432 973078528" ]; then
+ echo "$0: output does not match expected"
+ exit 1
+fi
diff --git a/copy/copy-sparse.sh b/copy/copy-sparse.sh
new file mode 100755
index 0000000..2fc4d9a
--- /dev/null
+++ b/copy/copy-sparse.sh
@@ -0,0 +1,97 @@
+#!/usr/bin/env bash
+# nbd client library in userspace
+# Copyright (C) 2020 Red Hat Inc.
+#
+# This library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2 of the License, or (at your option) any later version.
+#
+# This library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with this library; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+# This test depends on the nbdkit default sparse block size (32K).
+
+. ../tests/functions.sh
+
+set -e
+set -x
+
+requires nbdkit --version
+requires nbdkit --exit-with-parent --version
+requires nbdkit data --version
+requires nbdkit eval --version
+
+out=copy-sparse.out
+cleanup_fn rm -f $out
+
+# Copy from a sparse data disk to an nbdkit-eval-plugin instance which
+# is logging everything. This allows us to see exactly what nbdcopy
+# is writing, to ensure it is writing and trimming the target as
+# expected.
+$VG nbdcopy -- \
+ [ nbdkit --exit-with-parent data data='
+ 1
+ @1073741823 1
+ @4294967295 1
+ @4294967296 1
+ ' ] \
+ [ nbdkit --exit-with-parent eval \
+ get_size=' echo 7E ' \
+ pwrite=" echo \$@ >> $out " \
+ trim=" echo \$@ >> $out " \
+ zero=" echo \$@ >> $out " ]
+
+# Order of the output could vary because requests are sent in
+# parallel.
+sort -n -o $out $out
+
+echo Output:
+cat $out
+
+# Check the output matches expected.
+if [ "$(cat $out)" != "pwrite 1 4294967296
+pwrite 32768 0
+pwrite 32768 1073709056
+pwrite 32768 4294934528
+trim 134184960 32768
+trim 134184960 4160749568
+trim 134184960 939524096
+trim 134217728 1073741824
+trim 134217728 1207959552
+trim 134217728 134217728
+trim 134217728 1342177280
+trim 134217728 1476395008
+trim 134217728 1610612736
+trim 134217728 1744830464
+trim 134217728 1879048192
+trim 134217728 2013265920
+trim 134217728 2147483648
+trim 134217728 2281701376
+trim 134217728 2415919104
+trim 134217728 2550136832
+trim 134217728 268435456
+trim 134217728 2684354560
+trim 134217728 2818572288
+trim 134217728 2952790016
+trim 134217728 3087007744
+trim 134217728 3221225472
+trim 134217728 3355443200
+trim 134217728 3489660928
+trim 134217728 3623878656
+trim 134217728 3758096384
+trim 134217728 3892314112
+trim 134217728 402653184
+trim 134217728 4026531840
+trim 134217728 536870912
+trim 134217728 671088640
+trim 134217728 805306368" ]; then
+ echo "$0: output does not match expected"
+ exit 1
+fi
diff --git a/copy/file-ops.c b/copy/file-ops.c
index 9e94b30..cd19e81 100644
--- a/copy/file-ops.c
+++ b/copy/file-ops.c
@@ -24,7 +24,16 @@
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
+#include <sys/ioctl.h>
+#include <sys/types.h>
+#include <pthread.h>
+
+#if defined (__linux__)
+#include <linux/fs.h> /* For BLKZEROOUT */
+#endif
+
+#include "isaligned.h"
#include "nbdcopy.h"
static size_t
@@ -74,6 +83,64 @@ file_synch_write (struct rw *rw,
}
}
+static bool
+file_synch_trim (struct rw *rw, uint64_t offset, uint64_t count)
+{
+ assert (rw->t == LOCAL);
+
+#ifdef FALLOC_FL_PUNCH_HOLE
+ int fd = rw->u.local.fd;
+ int r;
+
+ r = fallocate (fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+ offset, count);
+ if (r == -1) {
+ perror ("fallocate: FALLOC_FL_PUNCH_HOLE");
+ exit (EXIT_FAILURE);
+ }
+ return true;
+#else /* !FALLOC_FL_PUNCH_HOLE */
+ return false;
+#endif
+}
+
+static bool
+file_synch_zero (struct rw *rw, uint64_t offset, uint64_t count)
+{
+ assert (rw->t == LOCAL);
+
+ if (S_ISREG (rw->u.local.stat.st_mode)) {
+#ifdef FALLOC_FL_ZERO_RANGE
+ int fd = rw->u.local.fd;
+ int r;
+
+ r = fallocate (fd, FALLOC_FL_ZERO_RANGE, offset, count);
+ if (r == -1) {
+ perror ("fallocate: FALLOC_FL_ZERO_RANGE");
+ exit (EXIT_FAILURE);
+ }
+ return true;
+#endif
+ }
+ else if (S_ISBLK (rw->u.local.stat.st_mode) &&
+ IS_ALIGNED (offset | count, rw->u.local.sector_size)) {
+#ifdef BLKZEROOUT
+ int fd = rw->u.local.fd;
+ int r;
+ uint64_t range[2] = {offset, count};
+
+ r = ioctl (fd, BLKZEROOUT, &range);
+ if (r == -1) {
+ perror ("ioctl: BLKZEROOUT");
+ exit (EXIT_FAILURE);
+ }
+ return true;
+#endif
+ }
+
+ return false;
+}
+
static void
file_asynch_read (struct rw *rw,
struct buffer *buffer,
@@ -104,9 +171,123 @@ file_asynch_write (struct rw *rw,
}
}
+static bool
+file_asynch_trim (struct rw *rw, struct buffer *buffer,
+ nbd_completion_callback cb)
+{
+ assert (rw->t == LOCAL);
+
+ if (!file_synch_trim (rw, buffer->offset, buffer->len))
+ return false;
+ errno = 0;
+ if (cb.callback (cb.user_data, &errno) == -1) {
+ perror (rw->name);
+ exit (EXIT_FAILURE);
+ }
+ return true;
+}
+
+static bool
+file_asynch_zero (struct rw *rw, struct buffer *buffer,
+ nbd_completion_callback cb)
+{
+ assert (rw->t == LOCAL);
+
+ if (!file_synch_zero (rw, buffer->offset, buffer->len))
+ return false;
+ errno = 0;
+ if (cb.callback (cb.user_data, &errno) == -1) {
+ perror (rw->name);
+ exit (EXIT_FAILURE);
+ }
+ return true;
+}
+
+static void
+file_get_extents (struct rw *rw, uintptr_t index,
+ uint64_t offset, uint64_t count,
+ extent_list *ret)
+{
+ assert (rw->t == LOCAL);
+
+ ret->size = 0;
+
+#ifdef SEEK_HOLE
+ static pthread_mutex_t lseek_lock = PTHREAD_MUTEX_INITIALIZER;
+
+ if (rw->u.local.seek_hole_supported) {
+ uint64_t end = offset + count;
+ int fd = rw->u.local.fd;
+ off_t pos;
+ struct extent e;
+
+ pthread_mutex_lock (&lseek_lock);
+
+ /* This loop is taken pretty much verbatim from nbdkit-file-plugin. */
+ do {
+ pos = lseek (fd, offset, SEEK_DATA);
+ if (pos == -1) {
+ if (errno == ENXIO)
+ pos = end;
+ else {
+ perror ("lseek: SEEK_DATA");
+ exit (EXIT_FAILURE);
+ }
+ }
+
+ /* We know there is a hole from offset to pos-1. */
+ if (pos > offset) {
+ e.offset = offset;
+ e.length = pos - offset;
+ e.hole = true;
+ if (extent_list_append (ret, e) == -1) {
+ perror ("realloc");
+ exit (EXIT_FAILURE);
+ }
+ }
+
+ offset = pos;
+ if (offset >= end)
+ break;
+
+ pos = lseek (fd, offset, SEEK_HOLE);
+ if (pos == -1) {
+ perror ("lseek: SEEK_HOLE");
+ exit (EXIT_FAILURE);
+ }
+
+ /* We know there is allocated data from offset to pos-1. */
+ if (pos > offset) {
+ e.offset = offset;
+ e.length = pos - offset;
+ e.hole = false;
+ if (extent_list_append (ret, e) == -1) {
+ perror ("realloc");
+ exit (EXIT_FAILURE);
+ }
+ }
+
+ offset = pos;
+ } while (offset < end);
+
+ pthread_mutex_unlock (&lseek_lock);
+ return;
+ }
+#endif
+
+ /* Otherwise return the default extent covering the whole range. */
+ default_get_extents (rw, index, offset, count, ret);
+}
+
+
struct rw_ops file_ops = {
.synch_read = file_synch_read,
.synch_write = file_synch_write,
+ .synch_trim = file_synch_trim,
+ .synch_zero = file_synch_zero,
.asynch_read = file_asynch_read,
.asynch_write = file_asynch_write,
+ .asynch_trim = file_asynch_trim,
+ .asynch_zero = file_asynch_zero,
+ .get_extents = file_get_extents,
};
diff --git a/copy/main.c b/copy/main.c
index 0b0589e..8187944 100644
--- a/copy/main.c
+++ b/copy/main.c
@@ -27,9 +27,11 @@
#include <limits.h>
#include <fcntl.h>
#include <unistd.h>
+#include <errno.h>
+#include <assert.h>
#include <sys/types.h>
#include <sys/stat.h>
-#include <assert.h>
+#include <sys/ioctl.h>
#include <pthread.h>
@@ -37,7 +39,10 @@
#include "nbdcopy.h"
+bool allocated; /* --allocated flag */
unsigned connections = 4; /* --connections */
+bool destination_is_zero; /* --destination-is-zero flag */
+bool extents = true; /* ! --no-extents flag */
bool flush; /* --flush flag */
unsigned max_requests = 64; /* --requests */
bool progress; /* -p flag */
@@ -46,13 +51,14 @@ unsigned threads; /* --threads */
struct rw src, dst; /* The source and destination. */
static bool is_nbd_uri (const char *s);
+static bool seek_hole_supported (int fd);
static int open_local (const char *prog,
const char *filename, bool writing, struct rw *rw);
static void open_nbd_uri (const char *prog,
- const char *uri, struct rw *rw);
+ const char *uri, bool writing, struct rw *rw);
static void open_nbd_subprocess (const char *prog,
const char **argv, size_t argc,
- struct rw *rw);
+ bool writing, struct rw *rw);
static void __attribute__((noreturn))
usage (FILE *fp, int exitcode)
@@ -85,19 +91,26 @@ main (int argc, char *argv[])
HELP_OPTION = CHAR_MAX + 1,
LONG_OPTIONS,
SHORT_OPTIONS,
+ ALLOCATED_OPTION,
+ DESTINATION_IS_ZERO_OPTION,
FLUSH_OPTION,
+ NO_EXTENTS_OPTION,
SYNCHRONOUS_OPTION,
};
const char *short_options = "C:pR:T:V";
const struct option long_options[] = {
{ "help", no_argument, NULL, HELP_OPTION },
{ "long-options", no_argument, NULL, LONG_OPTIONS },
+ { "allocated", no_argument, NULL, ALLOCATED_OPTION },
{ "connections", required_argument, NULL, 'C' },
+ { "destination-is-zero",no_argument, NULL, DESTINATION_IS_ZERO_OPTION },
{ "flush", no_argument, NULL, FLUSH_OPTION },
+ { "no-extents", no_argument, NULL, NO_EXTENTS_OPTION },
{ "progress", no_argument, NULL, 'p' },
{ "requests", required_argument, NULL, 'R' },
{ "short-options", no_argument, NULL, SHORT_OPTIONS },
{ "synchronous", no_argument, NULL, SYNCHRONOUS_OPTION },
+ { "target-is-zero", no_argument, NULL, DESTINATION_IS_ZERO_OPTION },
{ "threads", required_argument, NULL, 'T' },
{ "version", no_argument, NULL, 'V' },
{ NULL }
@@ -129,10 +142,22 @@ main (int argc, char *argv[])
}
exit (EXIT_SUCCESS);
+ case ALLOCATED_OPTION:
+ allocated = true;
+ break;
+
+ case DESTINATION_IS_ZERO_OPTION:
+ destination_is_zero = true;
+ break;
+
case FLUSH_OPTION:
flush = true;
break;
+ case NO_EXTENTS_OPTION:
+ extents = false;
+ break;
+
case SYNCHRONOUS_OPTION:
synchronous = true;
break;
@@ -191,7 +216,8 @@ main (int argc, char *argv[])
src.t = NBD;
src.name = argv[optind+1];
open_nbd_subprocess (argv[0],
- (const char **) &argv[optind+1], i-optind-1, &src);
+ (const char **) &argv[optind+1], i-optind-1,
+ false, &src);
optind = i+1;
}
else { /* Source is not [...]. */
@@ -201,7 +227,7 @@ main (int argc, char *argv[])
if (src.t == LOCAL)
src.u.local.fd = open_local (argv[0], src.name, false, &src);
else
- open_nbd_uri (argv[0], src.name, &src);
+ open_nbd_uri (argv[0], src.name, false, &src);
}
if (optind >= argc)
@@ -218,7 +244,8 @@ main (int argc, char *argv[])
dst.t = NBD;
dst.name = argv[optind+1];
open_nbd_subprocess (argv[0],
- (const char **) &argv[optind+1], i-optind-1, &dst);
+ (const char **) &argv[optind+1], i-optind-1,
+ true, &dst);
optind = i+1;
}
else { /* Destination is not [...] */
@@ -228,7 +255,7 @@ main (int argc, char *argv[])
if (dst.t == LOCAL)
dst.u.local.fd = open_local (argv[0], dst.name, true /* writing */, &dst);
else {
- open_nbd_uri (argv[0], dst.name, &dst);
+ open_nbd_uri (argv[0], dst.name, true, &dst);
/* Obviously this is not going to work if the server is
* advertising read-only, so fail early with a nice error message.
@@ -318,6 +345,7 @@ main (int argc, char *argv[])
perror ("truncate");
exit (EXIT_FAILURE);
}
+ destination_is_zero = true;
}
else if (dst.t == NBD) {
dst.size = nbd_get_size (dst.u.nbd.ptr[0]);
@@ -345,16 +373,23 @@ main (int argc, char *argv[])
if (src.t == NBD) {
for (i = 1; i < connections; ++i)
- open_nbd_uri (argv[0], src.name, &src);
+ open_nbd_uri (argv[0], src.name, false, &src);
assert (src.u.nbd.size == connections);
}
if (dst.t == NBD) {
for (i = 1; i < connections; ++i)
- open_nbd_uri (argv[0], dst.name, &dst);
+ open_nbd_uri (argv[0], dst.name, true, &dst);
assert (dst.u.nbd.size == connections);
}
}
+ /* If the source is NBD and we couldn't negotiate meta
+ * base:allocation then turn off extents.
+ */
+ if (src.t == NBD &&
+ !nbd_can_meta_context (src.u.nbd.ptr[0], "base:allocation"))
+ extents = false;
+
/* Start copying. */
if (synchronous)
synch_copying ();
@@ -483,11 +518,18 @@ open_local (const char *prog,
perror ("lseek");
exit (EXIT_FAILURE);
}
+ rw->u.local.seek_hole_supported = seek_hole_supported (fd);
+ rw->u.local.sector_size = 4096;
+#ifdef BLKSSZGET
+ if (ioctl (fd, BLKSSZGET, &rw->u.local.sector_size))
+ fprintf (stderr, "warning: cannot get sector size: %s: %m\n", rw->name);
+#endif
}
else if (S_ISREG (rw->u.local.stat.st_mode)) {
/* Regular file. */
rw->ops = &file_ops;
rw->size = rw->u.local.stat.st_size;
+ rw->u.local.seek_hole_supported = seek_hole_supported (fd);
}
else {
/* Probably stdin/stdout, a pipe or a socket. Set size == -1
@@ -496,14 +538,26 @@ open_local (const char *prog,
synchronous = true;
rw->ops = &pipe_ops;
rw->size = -1;
+ rw->u.local.seek_hole_supported = false;
}
return fd;
}
+static bool
+seek_hole_supported (int fd)
+{
+#ifndef SEEK_HOLE
+ return false;
+#else
+ off_t r = lseek (fd, 0, SEEK_HOLE);
+ return r >= 0;
+#endif
+}
+
static void
open_nbd_uri (const char *prog,
- const char *uri, struct rw *rw)
+ const char *uri, bool writing, struct rw *rw)
{
struct nbd_handle *nbd;
@@ -514,6 +568,11 @@ open_nbd_uri (const char *prog,
exit (EXIT_FAILURE);
}
nbd_set_uri_allow_local_file (nbd, true); /* Allow ?tls-psk-file. */
+ if (extents && !writing &&
+ nbd_add_meta_context (nbd, "base:allocation") == -1) {
+ fprintf (stderr, "%s: %s\n", prog, nbd_get_error ());
+ exit (EXIT_FAILURE);
+ }
if (handles_append (&rw->u.nbd, nbd) == -1) {
perror ("realloc");
@@ -531,7 +590,7 @@ DEFINE_VECTOR_TYPE (const_string_vector, const char *);
static void
open_nbd_subprocess (const char *prog,
const char **argv, size_t argc,
- struct rw *rw)
+ bool writing, struct rw *rw)
{
struct nbd_handle *nbd;
const_string_vector copy = empty_vector;
@@ -543,6 +602,11 @@ open_nbd_subprocess (const char *prog,
fprintf (stderr, "%s: %s\n", prog, nbd_get_error ());
exit (EXIT_FAILURE);
}
+ if (extents && !writing &&
+ nbd_add_meta_context (nbd, "base:allocation") == -1) {
+ fprintf (stderr, "%s: %s\n", prog, nbd_get_error ());
+ exit (EXIT_FAILURE);
+ }
if (handles_append (&rw->u.nbd, nbd) == -1) {
memory_error:
@@ -565,3 +629,24 @@ open_nbd_subprocess (const char *prog,
free (copy.ptr);
}
+
+/* Default implementation of rw->ops->get_extents for backends which
+ * don't/can't support extents. Also used for the --no-extents case.
+ */
+void
+default_get_extents (struct rw *rw, uintptr_t index,
+ uint64_t offset, uint64_t count,
+ extent_list *ret)
+{
+ struct extent e;
+
+ ret->size = 0;
+
+ e.offset = offset;
+ e.length = count;
+ e.hole = false;
+ if (extent_list_append (ret, e) == -1) {
+ perror ("realloc");
+ exit (EXIT_FAILURE);
+ }
+}
diff --git a/copy/multi-thread-copying.c b/copy/multi-thread-copying.c
index 3805daf..8081bb1 100644
--- a/copy/multi-thread-copying.c
+++ b/copy/multi-thread-copying.c
@@ -27,6 +27,7 @@
#include <poll.h>
#include <errno.h>
#include <assert.h>
+#include <sys/stat.h>
#include <pthread.h>
@@ -122,12 +123,14 @@ multi_thread_copying (void)
free (workers);
}
+static void wait_for_request_slots (uintptr_t index);
static unsigned in_flight (struct nbd_handle *src_nbd,
struct nbd_handle *dst_nbd);
static void poll_both_ends (struct nbd_handle *src_nbd,
struct nbd_handle *dst_nbd);
static int finished_read (void *vp, int *error);
-static int finished_write (void *vp, int *error);
+static int free_buffer (void *vp, int *error);
+static void fill_dst_range_with_zeroes (struct buffer *buffer);
/* There are 'threads' worker threads, each copying work ranges from
* src to dst until there are no more work ranges.
@@ -138,13 +141,7 @@ worker_thread (void *indexp)
uintptr_t index = (uintptr_t) indexp;
uint64_t offset, count;
struct nbd_handle *src_nbd, *dst_nbd;
- bool done = false;
-
- if (! get_next_offset (&offset, &count))
- /* No work to do, return immediately. Can happen for files which
- * are smaller than THREAD_WORK_SIZE where multi-conn is enabled.
- */
- return NULL;
+ extent_list exts = empty_vector;
/* In the case where src or dst is NBD, use
* {src|dst}.u.nbd.ptr[index] so that each thread is connected to
@@ -161,54 +158,77 @@ worker_thread (void *indexp)
else
dst_nbd = NULL;
- while (!done) {
- struct buffer *buffer;
- char *data;
- size_t len;
-
- if (count == 0) {
- /* Get another work range. */
- done = ! get_next_offset (&offset, &count);
- if (done) break;
- assert (0 < count && count <= THREAD_WORK_SIZE);
- }
-
- /* If the number of requests in flight exceeds the limit, poll
- * waiting for at least one request to finish. This enforces the
- * user --requests option.
- */
- while (in_flight (src_nbd, dst_nbd) >= max_requests)
- poll_both_ends (src_nbd, dst_nbd);
-
- /* Create a new buffer. This will be freed in a callback handler. */
- len = count;
- if (len > MAX_REQUEST_SIZE)
- len = MAX_REQUEST_SIZE;
- data = malloc (len);
- if (data == NULL) {
- perror ("malloc");
- exit (EXIT_FAILURE);
- }
- buffer = malloc (sizeof *buffer);
- if (buffer == NULL) {
- perror ("malloc");
- exit (EXIT_FAILURE);
- }
- buffer->offset = offset;
- buffer->len = len;
- buffer->data = data;
- buffer->free_data = free;
- buffer->index = index;
-
- /* Begin the asynch read operation. */
- src.ops->asynch_read (&src, buffer,
- (nbd_completion_callback) {
- .callback = finished_read,
- .user_data = buffer,
- });
-
- offset += len;
- count -= len;
+ while (get_next_offset (&offset, &count)) {
+ size_t i;
+
+ assert (0 < count && count <= THREAD_WORK_SIZE);
+ if (extents)
+ src.ops->get_extents (&src, index, offset, count, &exts);
+ else
+ default_get_extents (&src, index, offset, count, &exts);
+
+ for (i = 0; i < exts.size; ++i) {
+ struct buffer *buffer;
+ char *data;
+ size_t len;
+
+ if (exts.ptr[i].hole) {
+ /* The source is a hole so we can proceed directly to
+ * skipping, trimming or writing zeroes at the destination.
+ */
+ buffer = calloc (1, sizeof *buffer);
+ if (buffer == NULL) {
+ perror ("calloc");
+ exit (EXIT_FAILURE);
+ }
+ buffer->offset = exts.ptr[i].offset;
+ buffer->len = exts.ptr[i].length;
+ buffer->index = index;
+ fill_dst_range_with_zeroes (buffer);
+ }
+
+ else /* data */ {
+ /* As the extent might be larger than permitted for a single
+ * command, we may have to split this into multiple read
+ * requests.
+ */
+ while (exts.ptr[i].length > 0) {
+ len = exts.ptr[i].length;
+ if (len > MAX_REQUEST_SIZE)
+ len = MAX_REQUEST_SIZE;
+ data = malloc (len);
+ if (data == NULL) {
+ perror ("malloc");
+ exit (EXIT_FAILURE);
+ }
+ buffer = calloc (1, sizeof *buffer);
+ if (buffer == NULL) {
+ perror ("calloc");
+ exit (EXIT_FAILURE);
+ }
+ buffer->offset = exts.ptr[i].offset;
+ buffer->len = len;
+ buffer->data = data;
+ buffer->free_data = free;
+ buffer->index = index;
+
+ wait_for_request_slots (index);
+
+ /* Begin the asynch read operation. */
+ src.ops->asynch_read (&src, buffer,
+ (nbd_completion_callback) {
+ .callback = finished_read,
+ .user_data = buffer,
+ });
+
+ exts.ptr[i].offset += len;
+ exts.ptr[i].length -= len;
+ }
+ }
+
+ offset += count;
+ count = 0;
+ } /* for extents */
}
/* Wait for in flight NBD requests to finish. */
@@ -218,14 +238,37 @@ worker_thread (void *indexp)
if (progress)
progress_bar (1, 1);
+ free (exts.ptr);
return NULL;
}
+/* If the number of requests in flight exceeds the limit, poll
+ * waiting for at least one request to finish. This enforces
+ * the user --requests option.
+ */
+static void
+wait_for_request_slots (uintptr_t index)
+{
+ struct nbd_handle *src_nbd, *dst_nbd;
+
+ if (src.t == NBD)
+ src_nbd = src.u.nbd.ptr[index];
+ else
+ src_nbd = NULL;
+ if (dst.t == NBD)
+ dst_nbd = dst.u.nbd.ptr[index];
+ else
+ dst_nbd = NULL;
+
+ while (in_flight (src_nbd, dst_nbd) >= max_requests)
+ poll_both_ends (src_nbd, dst_nbd);
+}
+
/* Count the number of NBD commands in flight. Since the commands are
* auto-retired in the callbacks we don't need to count "done"
* commands.
*/
-static inline unsigned
+static unsigned
in_flight (struct nbd_handle *src_nbd, struct nbd_handle *dst_nbd)
{
return
@@ -335,18 +378,79 @@ finished_read (void *vp, int *error)
dst.ops->asynch_write (&dst, buffer,
(nbd_completion_callback) {
- .callback = finished_write,
+ .callback = free_buffer,
.user_data = buffer,
});
return 1; /* auto-retires the command */
}
-/* Callback called when dst has finished one write command. We can
- * now free the buffer.
+/* Fill a range in dst with zeroes. This is called from the copying
+ * loop when we see a hole in the source. Depending on the command
+ * line flags this could mean:
+ *
+ * --destination-is-zero:
+ * do nothing
+ *
+ * --allocated: we must write zeroes either using an efficient
+ * zeroing command or writing a buffer of zeroes
+ *
+ * (neither flag) try trimming if supported, else write zeroes
+ * as above
+ *
+ * This takes over ownership of the buffer and frees it eventually.
*/
+static void
+fill_dst_range_with_zeroes (struct buffer *buffer)
+{
+ char *data;
+
+ if (destination_is_zero)
+ goto free_and_return;
+
+ if (!allocated) {
+ /* Try trimming. */
+ wait_for_request_slots (buffer->index);
+ if (dst.ops->asynch_trim (&dst, buffer,
+ (nbd_completion_callback) {
+ .callback = free_buffer,
+ .user_data = buffer,
+ }))
+ return;
+ }
+
+ /* Try efficient zeroing. */
+ wait_for_request_slots (buffer->index);
+ if (dst.ops->asynch_zero (&dst, buffer,
+ (nbd_completion_callback) {
+ .callback = free_buffer,
+ .user_data = buffer,
+ }))
+ return;
+
+ /* Fall back to loop writing zeroes. This is going to be slow
+ * anyway, so do it synchronously. XXX
+ */
+ data = calloc (1, BUFSIZ);
+ if (!data) {
+ perror ("calloc");
+ exit (EXIT_FAILURE);
+ }
+ while (buffer->len > 0) {
+ size_t len = buffer->len > BUFSIZ ? BUFSIZ : buffer->len;
+
+ dst.ops->synch_write (&dst, data, len, buffer->offset);
+ buffer->len -= len;
+ buffer->offset += len;
+ }
+ free (data);
+
+ free_and_return:
+ free_buffer (buffer, &errno);
+}
+
static int
-finished_write (void *vp, int *error)
+free_buffer (void *vp, int *error)
{
struct buffer *buffer = vp;
diff --git a/copy/nbd-ops.c b/copy/nbd-ops.c
index 3ae01ad..6a8ac95 100644
--- a/copy/nbd-ops.c
+++ b/copy/nbd-ops.c
@@ -57,6 +57,37 @@ nbd_synch_write (struct rw *rw,
}
}
+static bool
+nbd_synch_trim (struct rw *rw, uint64_t offset, uint64_t count)
+{
+ assert (rw->t == NBD);
+
+ if (nbd_can_trim (rw->u.nbd.ptr[0]) == 0)
+ return false;
+
+ if (nbd_trim (rw->u.nbd.ptr[0], count, offset, 0) == -1) {
+ fprintf (stderr, "%s: %s\n", rw->name, nbd_get_error ());
+ exit (EXIT_FAILURE);
+ }
+ return true;
+}
+
+static bool
+nbd_synch_zero (struct rw *rw, uint64_t offset, uint64_t count)
+{
+ assert (rw->t == NBD);
+
+ if (nbd_can_zero (rw->u.nbd.ptr[0]) == 0)
+ return false;
+
+ if (nbd_zero (rw->u.nbd.ptr[0],
+ count, offset, LIBNBD_CMD_FLAG_NO_HOLE) == -1) {
+ fprintf (stderr, "%s: %s\n", rw->name, nbd_get_error ());
+ exit (EXIT_FAILURE);
+ }
+ return true;
+}
+
static void
nbd_asynch_read (struct rw *rw,
struct buffer *buffer,
@@ -87,9 +118,154 @@ nbd_asynch_write (struct rw *rw,
}
}
+static bool
+nbd_asynch_trim (struct rw *rw, struct buffer *buffer,
+ nbd_completion_callback cb)
+{
+ assert (rw->t == NBD);
+
+ if (nbd_can_trim (rw->u.nbd.ptr[0]) == 0)
+ return false;
+
+ if (nbd_aio_trim (rw->u.nbd.ptr[buffer->index],
+ buffer->len, buffer->offset,
+ cb, 0) == -1) {
+ fprintf (stderr, "%s: %s\n", rw->name, nbd_get_error ());
+ exit (EXIT_FAILURE);
+ }
+ return true;
+}
+
+static bool
+nbd_asynch_zero (struct rw *rw, struct buffer *buffer,
+ nbd_completion_callback cb)
+{
+ assert (rw->t == NBD);
+
+ if (nbd_can_zero (rw->u.nbd.ptr[0]) == 0)
+ return false;
+
+ if (nbd_aio_zero (rw->u.nbd.ptr[buffer->index],
+ buffer->len, buffer->offset,
+ cb, LIBNBD_CMD_FLAG_NO_HOLE) == -1) {
+ fprintf (stderr, "%s: %s\n", rw->name, nbd_get_error ());
+ exit (EXIT_FAILURE);
+ }
+ return true;
+}
+
+static int
+add_extent (void *vp, const char *metacontext,
+ uint64_t offset, uint32_t *entries, size_t nr_entries,
+ int *error)
+{
+ extent_list *ret = vp;
+ size_t i;
+
+ if (strcmp (metacontext, "base:allocation") != 0)
+ return 0;
+
+ for (i = 0; i < nr_entries; i += 2) {
+ struct extent e;
+
+ e.offset = offset;
+ e.length = entries[i];
+ /* Note we deliberately don't care about the ZERO flag. */
+ e.hole = (entries[i+1] & LIBNBD_STATE_HOLE) != 0;
+ if (extent_list_append (ret, e) == -1) {
+ perror ("realloc");
+ exit (EXIT_FAILURE);
+ }
+
+ offset += entries[i];
+ }
+
+ return 0;
+}
+
+/* This is done synchronously, but that's fine because commands from
+ * the previous work range in flight continue to run, it's difficult
+ * to (sanely) start new work until we have the full list of extents,
+ * and in almost every case the remote NBD server can answer our
+ * request for extents in a single round trip.
+ */
+static void
+nbd_get_extents (struct rw *rw, uintptr_t index,
+ uint64_t offset, uint64_t count,
+ extent_list *ret)
+{
+ extent_list exts = empty_vector;
+ struct nbd_handle *nbd;
+
+ assert (rw->t == NBD);
+ nbd = rw->u.nbd.ptr[index];
+
+ ret->size = 0;
+
+ while (count > 0) {
+ size_t i;
+
+ exts.size = 0;
+ if (nbd_block_status (nbd, count, offset,
+ (nbd_extent_callback) {
+ .user_data = &exts,
+ .callback = add_extent
+ }, 0) == -1) {
+ /* XXX We could call default_get_extents, but unclear if it's
+ * the right thing to do if the server is returning errors.
+ */
+ fprintf (stderr, "%s: %s\n", rw->name, nbd_get_error ());
+ exit (EXIT_FAILURE);
+ }
+
+ /* The server should always make progress. */
+ if (exts.size == 0) {
+ fprintf (stderr, "%s: NBD server is broken: it is not returning extent information.\nTry nbdcopy --no-extents as a workaround.\n",
+ rw->name);
+ exit (EXIT_FAILURE);
+ }
+
+ /* Copy the extents returned into the final list (ret). This is
+ * complicated because the extents returned by the server may
+ * begin earlier than the requested offset, or end later than the
+ * end of the requested range.
+ */
+ for (i = 0; i < exts.size; ++i) {
+ uint64_t d;
+
+ if (exts.ptr[i].offset + exts.ptr[i].length <= offset)
+ continue;
+ if (exts.ptr[i].offset < offset) {
+ d = offset - exts.ptr[i].offset;
+ exts.ptr[i].offset += d;
+ exts.ptr[i].length -= d;
+ assert (exts.ptr[i].offset == offset);
+ }
+ if (exts.ptr[i].offset + exts.ptr[i].length > offset + count) {
+ d = exts.ptr[i].offset + exts.ptr[i].length - offset - count;
+ exts.ptr[i].length -= d;
+ assert (exts.ptr[i].offset + exts.ptr[i].length == offset + count);
+ }
+ if (extent_list_append (ret, exts.ptr[i]) == -1) {
+ perror ("realloc");
+ exit (EXIT_FAILURE);
+ }
+
+ offset += exts.ptr[i].length;
+ count -= exts.ptr[i].length;
+ }
+ }
+
+ free (exts.ptr);
+}
+
struct rw_ops nbd_ops = {
.synch_read = nbd_synch_read,
.synch_write = nbd_synch_write,
+ .synch_trim = nbd_synch_trim,
+ .synch_zero = nbd_synch_zero,
.asynch_read = nbd_asynch_read,
.asynch_write = nbd_asynch_write,
+ .asynch_trim = nbd_asynch_trim,
+ .asynch_zero = nbd_asynch_zero,
+ .get_extents = nbd_get_extents,
};
diff --git a/copy/nbdcopy.h b/copy/nbdcopy.h
index 9e4fc19..d74abad 100644
--- a/copy/nbdcopy.h
+++ b/copy/nbdcopy.h
@@ -47,6 +47,8 @@ struct rw {
struct { /* For LOCAL. */
int fd;
struct stat stat;
+ bool seek_hole_supported;
+ int sector_size;
} local;
handles nbd; /* For NBD, one handle per connection. */
} u;
@@ -63,6 +65,14 @@ struct buffer {
uintptr_t index; /* Thread number. */
};
+/* List of extents for rw->ops->get_extents. */
+struct extent {
+ uint64_t offset;
+ uint64_t length;
+ bool hole;
+};
+DEFINE_VECTOR_TYPE(extent_list, struct extent);
+
/* The operations struct hides some of the differences between local
* file, NBD and pipes from the copying code.
*
@@ -80,6 +90,16 @@ struct rw_ops {
void (*synch_write) (struct rw *rw,
const void *data, size_t len, uint64_t offset);
+ /* Synchronously trim the range [offset, offset+count). No data
+ * buffer is involved. If not possible, returns false.
+ */
+ bool (*synch_trim) (struct rw *rw, uint64_t offset, uint64_t count);
+
+ /* Synchronously zero the range [offset, offset+count). No data
+ * buffer is involved. If not possible, returns false.
+ */
+ bool (*synch_zero) (struct rw *rw, uint64_t offset, uint64_t count);
+
/* Asynchronous I/O operations. These start the operation and call
* 'cb' on completion.
*
@@ -95,12 +115,42 @@ struct rw_ops {
void (*asynch_write) (struct rw *rw,
struct buffer *buffer,
nbd_completion_callback cb);
+
+ /* Asynchronously trim. buffer->data is not used. If not possible,
+ * returns false.
+ */
+ bool (*asynch_trim) (struct rw *rw, struct buffer *buffer,
+ nbd_completion_callback cb);
+
+ /* Asynchronously zero. buffer->data is not used. If not possible,
+ * returns false.
+ */
+ bool (*asynch_zero) (struct rw *rw, struct buffer *buffer,
+ nbd_completion_callback cb);
+
+ /* Read base:allocation extents metadata for a region of the source.
+ * For local files the same information is read from the kernel.
+ *
+ * Note that qemu-img fetches extents for the entire disk up front,
+ * and we want to avoid doing that because it had very negative
+ * behaviour for certain sources (ie. VDDK).
+ */
+ void (*get_extents) (struct rw *rw, uintptr_t index,
+ uint64_t offset, uint64_t count,
+ extent_list *ret);
};
extern struct rw_ops file_ops;
extern struct rw_ops nbd_ops;
extern struct rw_ops pipe_ops;
+extern void default_get_extents (struct rw *rw, uintptr_t index,
+ uint64_t offset, uint64_t count,
+ extent_list *ret);
+
+extern bool allocated;
extern unsigned connections;
+extern bool destination_is_zero;
+extern bool extents;
extern bool flush;
extern unsigned max_requests;
extern bool progress;
diff --git a/copy/nbdcopy.pod b/copy/nbdcopy.pod
index f654f65..5ff7434 100644
--- a/copy/nbdcopy.pod
+++ b/copy/nbdcopy.pod
@@ -4,7 +4,9 @@ nbdcopy - copy to and from an NBD server
=head1 SYNOPSIS
- nbdcopy [-C N|--connections=N] [--flush] [-p|--progress]
+ nbdcopy [--allocated] [-C N|--connections=N]
+ [--destination-is-zero|--target-is-zero]
+ [--flush] [--no-extents] [-p|--progress]
[-R N|--requests=N] [--synchronous]
[-T N|--threads=N]
SOURCE DESTINATION
@@ -74,6 +76,15 @@ formats use C<qemu-img convert>, see L<qemu-img(1)>.
Display brief command line help and exit.
+=item B<--allocated>
+
+Normally nbdcopy tries to create a sparse output (with holes), if the
+destination supports that. It does this in two ways: either using
+extent information from the source to copy holes (see
+I<--no-extents>), or by detecting runs of zeroes (see I<-S>). If you
+use I<--allocated> then nbdcopy creates a fully allocated, non-sparse
+output on the destination.
+
=item B<-C> N
=item B<--connections=>N
@@ -82,11 +93,30 @@ Set the maximum number of NBD connections ("multi-conn"). By default
nbdcopy will try to use multi-conn with up to 4 connections if the NBD
server supports it.
+=item B<--destination-is-zero>
+
+=item B<--target-is-zero>
+
+Assume the destination is already zeroed. This allows nbdcopy to skip
+copying blocks of zeroes from the source to the destination. This is
+not safe unless the destination device is already zeroed.
+(I<--target-is-zero> is provided for compatibility with
+L<qemu-img(1)>.)
+
=item B<--flush>
Flush writes to ensure that everything is written to persistent
storage before nbdcopy exits.
+=item B<--no-extents>
+
+Normally nbdcopy uses extent metadata to skip over parts of the source
+disk which contain holes. If you use this flag, nbdcopy ignores
+extent information and reads everything, which is usually slower. You
+might use this flag in two situations: the source NBD server has
+incorrect metadata information; or the source has very slow extent
+querying so it's faster to simply read all of the data.
+
=item B<-p>
=item B<--progress>
diff --git a/copy/pipe-ops.c b/copy/pipe-ops.c
index e10a31e..0788aae 100644
--- a/copy/pipe-ops.c
+++ b/copy/pipe-ops.c
@@ -61,6 +61,12 @@ pipe_synch_write (struct rw *rw,
}
}
+static bool
+pipe_synch_trim_zero (struct rw *rw, uint64_t offset, uint64_t count)
+{
+ return false; /* not supported by pipes */
+}
+
static void
pipe_asynch_read (struct rw *rw,
struct buffer *buffer,
@@ -77,16 +83,30 @@ pipe_asynch_write (struct rw *rw,
abort (); /* See comment below. */
}
+static bool
+pipe_asynch_trim_zero (struct rw *rw, struct buffer *buffer,
+ nbd_completion_callback cb)
+{
+ return false; /* not supported by pipes */
+}
+
struct rw_ops pipe_ops = {
.synch_read = pipe_synch_read,
.synch_write = pipe_synch_write,
+ .synch_trim = pipe_synch_trim_zero,
+ .synch_zero = pipe_synch_trim_zero,
- /* Asynch pipe operations are not defined. These should never be
- * called because pipes/streams/sockets force --synchronous.
- * Because calling a NULL pointer screws up the stack trace when
- * we're not using frame pointers, these are defined to functions
- * that call abort().
+ /* Asynch pipe read/write operations are not defined. These should
+ * never be called because pipes/streams/sockets force synchronous
+ * mode. Because calling a NULL pointer screws up the stack trace
+ * when we're not using frame pointers, these are defined to
+ * functions that call abort().
*/
.asynch_read = pipe_asynch_read,
.asynch_write = pipe_asynch_write,
+
+ .asynch_trim = pipe_asynch_trim_zero,
+ .asynch_zero = pipe_asynch_trim_zero,
+
+ .get_extents = default_get_extents,
};
--
2.29.0.rc2
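
A note on the extent handling in nbd_get_extents in the patch above: the
server may return an extent that starts before the requested offset or
runs past the end of the requested range, so each extent has to be
clamped before it is appended to the result. The following is only a
minimal standalone sketch of that clamping step, not code from the
patch. The helper name clamp_extent is hypothetical; struct extent
mirrors the definition added to copy/nbdcopy.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors struct extent from copy/nbdcopy.h. */
struct extent {
  uint64_t offset;
  uint64_t length;
  bool hole;
};

/* Hypothetical helper (not in the patch): clamp *e to the half-open
 * range [offset, offset+count).  Returns false if the extent lies
 * entirely outside the range, in which case the caller should skip it.
 */
static bool
clamp_extent (struct extent *e, uint64_t offset, uint64_t count)
{
  uint64_t end = offset + count;
  uint64_t e_end = e->offset + e->length;

  if (e_end <= offset || e->offset >= end)
    return false;

  /* Trim the part that begins before the requested offset. */
  if (e->offset < offset) {
    e->length -= offset - e->offset;
    e->offset = offset;
  }

  /* Trim the part that runs past the end of the requested range.
   * e_end is unchanged by the trim above.
   */
  if (e_end > end)
    e->length -= e_end - end;

  assert (e->offset >= offset && e->offset + e->length <= end);
  return true;
}
```

Computing the overhang as (extent end) minus (range end) keeps the
unsigned arithmetic from wrapping, which is the thing to watch for
when the subtraction is written the other way round.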
3 years, 10 months
[PATCH libnbd 0/2] copy: Preserve the host page cache when reading local files.
by Richard W.M. Jones
In nbdcopy we can preserve the page cache while reading from local
files. Unlike O_DIRECT, this lets us use the page cache to our
advantage when a file is already present in memory, while not
increasing the amount of the file which is cached as we read it, or at
least not by very much.
These two patches are an evolution of this patch:
https://listman.redhat.com/archives/libguestfs/2021-February/thread.html#...
Although the code is heavily conditional and so will only work on 64
bit Linux systems, I didn't bother adding a command line flag because
I feel the way this is written (modulo bugs) it should almost always
be advantageous.
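
To give a rough idea of the general technique (this is NOT the code
from the two patches, just a simplified sketch), a reader can ask the
kernel to drop each chunk from the page cache once it has been read,
using posix_fadvise with POSIX_FADV_DONTNEED. The helper name
read_and_evict and the chunk size are made up for illustration. Note
that this sketch evicts unconditionally, so it does not leave
already-cached pages alone as described above; doing that would need
an extra residency check before reading (one possibility, as an
assumption, is mincore(2)), which is not shown here.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define CHUNK (64 * 1024)       /* arbitrary chunk size for the sketch */

/* Hypothetical helper: read the whole of fd sequentially, advising
 * the kernel after each chunk that the pages are no longer needed.
 * Returns the total number of bytes read, or -1 on error with errno
 * set.
 */
static int64_t
read_and_evict (int fd)
{
  static char buf[CHUNK];
  off_t offset = 0;
  int64_t total = 0;

  assert (fd >= 0);

  for (;;) {
    ssize_t r = pread (fd, buf, sizeof buf, offset);
    if (r == -1) {
      if (errno == EINTR)
        continue;
      return -1;
    }
    if (r == 0)                 /* end of file */
      break;

    /* Advisory only, so a failure here is ignored. */
    (void) posix_fadvise (fd, offset, r, POSIX_FADV_DONTNEED);

    offset += r;
    total += r;
  }

  return total;
}
```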
Writes next -- but that's much more difficult.
Rich.
3 years, 10 months