On Mon, Jul 13, 2020 at 12:37 PM Richard W.M. Jones <rjones(a)redhat.com> wrote:
On Sun, Jul 12, 2020 at 11:16:01PM +0300, Nir Soffer wrote:
> On Sat, Jul 11, 2020 at 11:18 AM Richard W.M. Jones <rjones(a)redhat.com> wrote:
> >
> > KubeVirt is a custom resource (a kind of plugin) for Kubernetes which
> > adds support for running virtual machines. As part of this they have
> > the same problems as everyone else of how to import large disk images
> > into the system for pets, templates, etc.
> >
> > As part of the project they've defined a format for embedding a disk
> > image into a container (unclear why? perhaps so these can be
> > distributed using the existing container registry systems?):
> >
> > https://github.com/kubevirt/containerized-data-importer/blob/master/doc/i...
> >
> > An example of such a disk-in-a-container is here:
> >
> > https://hub.docker.com/r/kubevirt/fedora-cloud-container-disk-demo
> >
> > We've been asked if we can help with tools to efficiently import these
> > disk images, and I have suggested a few things with nbdkit and have
> > written a couple of filters (tar, gzip) to support this.
>
> I don't think gzip filter matches nbdkit very well. Having to decompress the
> entire disk before you can serve it does not sound right.
We do have existing plugins -- I'm thinking of
http://libguestfs.org/nbdkit-iso-plugin.1.html -- which are merely
convenient wrappers around what you could do with a bit of shell
scripting. You might question why we have them at all, but a reason
is that it just makes things simpler for the end user. They don't
have to worry about how to do the download and cleaning up the
temporary file afterwards.
So while you're right that the gzip filter isn't a good fit with
nbdkit for quite annoying technical reasons, I still think it's worth
having it. We're not forcing people to use it or preventing them from
using alternatives.
Yes, that is a good reason to provide it.
> > This email is my thoughts on further development work in this area.
> >
> > ----------------------------------------------------------------------
> >
> > (1) Accessing the disk image directly from the Docker Hub.
> >
> > When you get down to it, what this actually is:
> >
> > * There is a disk image in qcow2 format.
> >
> > * It is embedded as "./disk/downloaded" in a gzip-compressed tar
> > file. (This is a container with a single layer).
> >
> > * This tarball is uploaded to (in this case) the Docker Hub and can
> > be accessed over a URL. The URL can be constructed using a few
> > json requests.
> >
> > * The URL is served by nginx and this supports HTTP range requests.
> >
> > I encapsulated all of this in the attached script. This is an
> > existence proof that it is possible to access the image with nbdkit.
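
The attached script is not reproduced here. As a rough sketch of the steps
listed above, the following shows how the URLs and headers could be put
together for the Docker Hub. The endpoints (auth.docker.io,
registry-1.docker.io) are the public Docker Hub ones; the function names are
made up for illustration, the manifest is assumed to be in the Docker v2
schema with a single layer, and no network request is made.

```python
import json
from urllib.parse import urlencode

# Public Docker Hub endpoints; other registries use different hosts.
AUTH_URL = "https://auth.docker.io/token"
REGISTRY_URL = "https://registry-1.docker.io"


def token_url(repository):
    """URL that returns a JSON document containing a pull token."""
    query = urlencode({
        "service": "registry.docker.io",
        "scope": f"repository:{repository}:pull",
    })
    return f"{AUTH_URL}?{query}"


def manifest_url(repository, tag="latest"):
    """URL of the image manifest (JSON) for the given tag."""
    return f"{REGISTRY_URL}/v2/{repository}/manifests/{tag}"


def layer_url(repository, manifest_json):
    """URL of the first layer blob named in a v2 manifest.

    Assumes a single-layer image, as in the container disk example.
    """
    manifest = json.loads(manifest_json)
    digest = manifest["layers"][0]["digest"]
    return f"{REGISTRY_URL}/v2/{repository}/blobs/{digest}"


def range_headers(token, offset, length):
    """Headers for an authorized HTTP range request of `length` bytes."""
    return {
        "Authorization": f"Bearer {token}",
        "Range": f"bytes={offset}-{offset + length - 1}",
    }
```

The manifest request typically also needs an Accept header naming the
manifest media type, and the blob URL may answer with a redirect that the
HTTP client has to follow.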
> >
> > One problem is that the auth token only lasts for a limited time
> > (seems to be 5 minutes in my test), and it doesn't automatically renew
> > as you download the layer, so if the download takes longer than 5
> > minutes you'll suddenly get unrecoverable authorization failures.
> >
> > There seem to be two possible ways to solve this:
> >
> > (a) Write a new nbdkit-container-plugin which does the authorization
> > (essentially hiding most of the details in the attached script
> > from the user). It could deal with renewing the key as
> > required.
> >
> > (b) Modify nbdkit-curl-plugin so the user could provide a script for
> > renewing authorization. This would expose the full gory details
> > to the end user, but on the other hand might be useful in other
> > situations that require authorization.
>
> docker/podman already solved this, why should nbdkit solve it again?
Right, exactly my thoughts, and the reason for (3) below.
> Do you get timeouts while you download the image with a single request?
Do you mean a single massive curl request? I didn't try. You get a
401 authorization failure if you make a request more than ~5 minutes
after the token was issued. Unlike VMware's and RHV's disk-over-web
services, the token doesn't automatically extend when a request is made.
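
Whichever component ends up owning the authorization, the renewal logic
itself is small. A minimal sketch, assuming a 300 second token lifetime as
observed above and a caller-supplied fetch_token function (hypothetical; it
would perform the actual token request):

```python
import time


class RenewingToken:
    """Cache a short-lived token and fetch a new one before it expires."""

    def __init__(self, fetch_token, lifetime=300.0, margin=30.0,
                 clock=time.monotonic):
        self._fetch_token = fetch_token  # callable returning a fresh token
        self._lifetime = lifetime        # seconds the server honours a token
        self._margin = margin            # renew this many seconds early
        self._clock = clock              # injectable for testing
        self._token = None
        self._issued = 0.0

    def get(self):
        """Return a token that should still be valid for the next request."""
        now = self._clock()
        expired = now - self._issued >= self._lifetime - self._margin
        if self._token is None or expired:
            self._token = self._fetch_token()
            self._issued = now
        return self._token
```

Each range request would call get() just before sending, so a download that
runs longer than five minutes picks up a fresh token instead of failing with
401.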
> > (2) nbdkit-tar-filter exportname and listing files.
> >
> > This has already been covered by an email from Nir Soffer, so I'll
> > simply link to that:
> >
> > https://lists.gnu.org/archive/html/qemu-discuss/2020-06/msg00058.html
> >
> > It basically requires a fairly simple change to nbdkit-tar-filter to
> > map the tar filenames into export names, and a deeper change to nbdkit
> > core server to allow listing all export names. The end result would
> > be that an NBD client could query the list of files [ie exports] in
> > the tarball and choose one to download.
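
To illustrate the mapping only (this is not how nbdkit-tar-filter is
implemented), here is a small Python sketch that lists the regular files in
an uncompressed tar, which is the set of names a client would see as
exports. The helper names and the demo archive are made up for the example.

```python
import io
import tarfile


def list_export_names(fileobj):
    """Return the names of the regular files in an uncompressed tar."""
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        return [member.name for member in tar if member.isfile()]


def make_demo_tar(files):
    """Build an in-memory tar from a {name: bytes} dict (demo data only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


demo = make_demo_tar({"./disk/downloaded": b"disk", "./README": b"text"})
export_names = list_export_names(demo)
```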
>
> We know the tar member name upfront, so why do we need to list the contents?
AIUI we don't necessarily know the name up front. It might not always
be ./downloaded/disk. (I might be wrong on this.)
> > (3) gzip & tar require full downloads - why not "docker/podman save/export"?
>
> This looks like a better direction.
>
> The nice thing about embedding the disk in the container image is being able
> to use existing infrastructure (docker, quay) to host the images and to
> transfer them to the hosts. We don't need to write any code for this.
>
> Even better, we have automatic caching on the host by docker/podman, so we
> have to pull the image from the registry only once on every host. Then we can
> access the local cache.
>
> > Stepping back to get the bigger picture: Because the OCI standard uses
> > gzip for compression (https://stackoverflow.com/a/9213826), and
> > because the tar index is interspersed with the tar data, you always
> > need to download the whole container layer before you can access the
> > disk image inside.
>
> You need to download most of the tar, but you don't need to keep the tar
> in a temporary file. For example, in Python you can open a tarfile on
> the HTTP response object in streaming mode with transparent decompression
> ("r|*"), and stream the disk contents from the tar without a temporary
> file.
>
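>     import shutil
>     import sys
>     import tarfile
>
>     # response: file-like HTTP response object for the layer blob
>     with tarfile.open(mode="r|*", fileobj=response) as tar:
>         for member in tar:
>             if member.name == "./disk/downloaded":
>                 with tar.extractfile(member) as f:
>                     shutil.copyfileobj(f, sys.stdout.buffer)
>                 sys.exit(0)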
>
> I think this is what cdi import code does, and is the most efficient way
> to copy the disk directly from the registry with the current format.
>
> > Currently nbdkit-gzip-filter hides this from the
> > end user, but it's still downloading the whole thing to a temporary
> > file. There's no way round that unless OCI can be persuaded to use a
> > better format.
>
> The way is to use the container image downloaded by podman/docker.
>
> > But docker/podman already has a way to export container layers,
> > ie. the save and export commands. These also have the advantage that
> > it will cache the downloaded layers between runs. So why aren't we
> > using that?
> >
> > In this world, nbdkit-container-plugin would simply use docker/podman
> > save (or export?) to grab the container as a tar file, and we would
> > use the tar filter as above to expose the contents as an NBD endpoint
> > for further consumption. IOW:
> >
> > nbdkit container docker.io/kubevirt/fedora-cloud-container-disk-demo \
> > --filter=tar tar-entry=./downloaded/disk
>
> This will work but there are 2 issues:
>
> 1. podman save/export copy the tar locally. This is pretty fast for the example
> image but copying the tar and deleting it seems wasteful.
>
> 2. If we have the tar locally, why not use qemu-img directly? We can find the
> offset of the disk inside the tar and use:
>
> $ time podman save --format oci-dir -o demo-oci \
>     docker.io/kubevirt/fedora-cloud-container-disk-demo
>
> real 0m2.795s
> user 0m2.011s
> sys 0m0.878s
>
> $ time qemu-img convert -O raw 'json:{"file": {"driver": "raw",
>     "offset": 1536, "file": {"driver": "file", "filename":
>     "demo-oci/8162f3eda33d5a87df56e969dcd9777523bd53278a0701b2e53b93c33c01853e"}}}' \
>     out.raw
>
> real 0m1.036s
> user 0m3.237s
> sys 0m1.326s
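
The offset used above does not have to be worked out by hand. A possible
sketch using Python's tarfile module: TarInfo.offset_data gives the byte
offset of a member's data within the archive, and the json: source string
has the same shape as in the qemu-img command. The function names and the
demo archive are illustrative assumptions, and the layer is assumed to be an
uncompressed tar, as in the oci-dir output shown.

```python
import io
import json
import tarfile


def find_member(fileobj, name):
    """Return (data offset, size) of the named member of an uncompressed tar."""
    with tarfile.open(fileobj=fileobj, mode="r:") as tar:
        for member in tar:
            if member.name == name:
                return member.offset_data, member.size
    raise KeyError(name)


def qemu_img_source(filename, offset):
    """Build a json: pseudo-filename that opens the tar at the given offset."""
    spec = {
        "file": {
            "driver": "raw",
            "offset": offset,
            "file": {"driver": "file", "filename": filename},
        }
    }
    return "json:" + json.dumps(spec)


def make_demo_layer(name, payload):
    """Build an in-memory tar with one directory and one file (demo only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        directory = tarfile.TarInfo("./disk")
        directory.type = tarfile.DIRTYPE
        tar.addfile(directory)
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


payload = b"QFI\xfb" + b"\0" * 1000
layer = make_demo_layer("./disk/downloaded", payload)
offset, size = find_member(layer, "./disk/downloaded")
source = qemu_img_source("layer.tar", offset)
```

The resulting source string can be passed to qemu-img convert in place of
the hand-written one.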
>
> But I think we have a better way: using a self-extracting disk container.
> Start a container with a disk image, and run qemu-img inside this container
> to convert the disk to the target PV.
>
> It can work like this:
>
> 1. We create a base image - this will be used for all disk container images.
>
> $ cat Dockerfile.kubevirt-img
> FROM alpine
> RUN apk add qemu-img
>
> $ podman build -t kubevirt-img -f Dockerfile.kubevirt-img .
> ...
>
> You can pull this from quay.io/nirsof/kubevirt-img.
>
> 2. Create a disk container image, based on the base image
>
> $ cat Dockerfile.kubevirt-fedora-cloud-disk
> FROM quay.io/nirsof/kubevirt-img
> COPY disk.qcow2 /disk.qcow2
> CMD ["qemu-img", "convert", "-p", "-f", "qcow2", "-O", "raw", "/disk.qcow2", "/target/disk.img"]
>
> $ podman build -t kubevirt-fedora-cloud-disk \
>     -f Dockerfile.kubevirt-fedora-cloud-disk .
> ...
>
> This container is a little larger, but the common layer with qemu-img
> and its dependencies is shared between all disk container images. In
> this example it adds only 25 MiB.
>
> You can pull this from quay.io/nirsof/kubevirt-fedora-cloud-disk.
>
> With this we can create a copy of the disk using:
>
> $ time podman run --volume ./:/target:Z --rm -it \
>     quay.io/nirsof/kubevirt-fedora-cloud-disk
> Trying to pull quay.io/nirsof/kubevirt-fedora-cloud-disk...
> Getting image source signatures
> Copying blob 0d9094d70e9c skipped: already exists
> Copying blob a3ed95caeb02 done
> Copying blob a3ed95caeb02 done
> Copying blob 18717781bd09 done
> Copying blob fe5cd0d8bf32 done
> Writing manifest to image destination
> Storing signatures
> (100.00/100%)
>
> real 0m59.800s
> user 0m8.988s
> sys 0m7.437s
>
> $ ls -lhs disk.img
> 728M -rw-r--r--. 1 nsoffer nsoffer 4.0G Jul 12 21:45 disk.img
>
> $ podman images | grep fedora-cloud
> quay.io/nirsof/kubevirt-fedora-cloud-disk            latest  097ef06b6d71  About an hour ago  326 MB
> docker.io/kubevirt/fedora-cloud-container-disk-demo  latest  6494830c6dc7  50 years ago       303 MB
>
> The next time we run this we get the container from the cache:
>
> $ time podman run --volume ./:/target:Z --rm -it \
>     quay.io/nirsof/kubevirt-fedora-cloud-disk
> (100.00/100%)
>
> real 0m2.244s
> user 0m0.070s
> sys 0m0.253s
Interesting, yes. Although I guess this involves recreating all of
these disk-in-a-container images? Also I'd be a bit concerned from a
security angle: We're turning dumb data into a self-extracting
executable program.
We run the program in a container, and it is exposed only to the shared
volume we provide, so the damage it can do is limited.
Let's say you download an nginx container disk. Do you trust it enough to
run a VM from this disk? Then why don't you trust the container provided by
the same vendor?
If we want a less active container, we can just expose the image via NBD:
$ cat Dockerfile.kubevirt-fedora-cloud-nbd
FROM quay.io/nirsof/kubevirt-img
COPY disk.qcow2 /disk.qcow2
CMD ["qemu-nbd", "--socket=/shared/nbd.sock", "--persistent", "--read-only", "--format", "qcow2", "/disk.qcow2"]
$ podman run --volume ./:/shared:Z --rm -d kubevirt-fedora-cloud-nbd
a0ebf5616b1c05fc6fd4e3240b83ff0ed9aa187824386d42647c2d271b268419
$ qemu-img info nbd+unix://?socket=nbd.sock
image: nbd+unix://?socket=nbd.sock
file format: raw
virtual size: 4 GiB (4294967296 bytes)
disk size: unavailable
But in this case we need to run 2 containers to import a disk.