On Fri, Aug 07, 2020 at 07:53:13AM -0500, Eric Blake wrote:
> > $ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential
> > cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd
> > "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"'
> > ; free -m ; cachestats /var/tmp/random
> Hmm - the -W actually says that qemu-img is performing semi-random
> access (there is no guarantee that the 16 coroutines are serviced in
> linear order of the file), even though we really are making only one
> pass through the file in bulk.  I don't know if fadvise=normal would
> be any better; dropping -W but keeping -m 16 might also be an
> interesting number to check (where qemu-img tries harder to do
> in-order access, but still take advantage of parallel threads).
> >               total        used        free      shared  buff/cache   available
> > Mem:          32083        1188       27928           1        2966       30440
> > Swap:         16135          16       16119
> >     (100.00/100%)
> >
> > real    0m13.107s
> > user    0m2.051s
> > sys     0m37.556s
> >               total        used        free      shared  buff/cache   available
> > Mem:          32083        1196       27861           1        3024       30429
> > Swap:         16135          16       16119
> > pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
Without -W it's very similar:
$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run
'qemu-img convert -n -p -m 16 $nbd
"json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"'
; free -m ; cachestats /var/tmp/random
              total        used        free      shared  buff/cache   available
Mem:          32083        1184       26113           1        4785       30444
Swap:         16135          16       16119
(100.00/100%)
real 0m13.308s
user 0m1.961s
sys 0m40.455s
              total        used        free      shared  buff/cache   available
Mem:          32083        1188       26049           1        4845       30438
Swap:         16135          16       16119
pages in cache: 14808/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
With -W and fadvise=random it's also about the same:
$ free -m; time ./nbdkit file /var/tmp/random fadvise=random cache=none --run
'qemu-img convert -n -p -m 16 -W $nbd
"json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"'
; free -m ; cachestats /var/tmp/random
              total        used        free      shared  buff/cache   available
Mem:          32083        1187       26109           1        4785       30440
Swap:         16135          16       16119
(100.00/100%)
real 0m13.030s
user 0m1.986s
sys 0m37.498s
              total        used        free      shared  buff/cache   available
Mem:          32083        1187       26053           1        4842       30440
Swap:         16135          16       16119
pages in cache: 14336/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
I'm going to guess that for this case readahead doesn't have much time
to get ahead of qemu.
> >+=item B<fadvise=normal>
> >+
> >+=item B<fadvise=random>
> >+
> >+=item B<fadvise=sequential>
> >+
> >+This optional flag hints to the kernel that you will access the file
> >+normally, or in a random order, or sequentially.  The exact behaviour
> >+depends on your operating system, but for Linux using C<normal> causes
> >+the kernel to read-ahead, C<sequential> causes the kernel to
> >+read-ahead twice as much as C<normal>, and C<random> turns off
> >+read-ahead.
> Is it worth a mention of L<posix_fadvise(3)> here, to let the user
> get some idea of what their operating system supports?
Yes I had this at one point but I seem to have dropped it. Will
add it back, thanks.
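
For anyone reading along, the mapping is just a thin layer over
posix_fadvise(3).  A rough, untested sketch (the enum and function
names here are made up for illustration, not what the patch actually
uses):

  #include <fcntl.h>   /* posix_fadvise, POSIX_FADV_* */

  /* Hypothetical enum standing in for the parsed fadvise= parameter. */
  enum fadvise_mode { fadvise_normal, fadvise_random, fadvise_sequential };

  /* Apply the advice to the whole file (len == 0 means "to end of file").
   * Note posix_fadvise returns an errno value directly on failure, not -1.
   */
  static int
  apply_fadvise (int fd, enum fadvise_mode mode)
  {
    int advice = POSIX_FADV_NORMAL;

    switch (mode) {
    case fadvise_normal:     advice = POSIX_FADV_NORMAL; break;     /* default readahead */
    case fadvise_random:     advice = POSIX_FADV_RANDOM; break;     /* readahead off */
    case fadvise_sequential: advice = POSIX_FADV_SEQUENTIAL; break; /* larger readahead */
    }
    return posix_fadvise (fd, 0, 0, advice);
  }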
> >+=head2 Reducing evictions from the page cache
> >+
> >+If the file is very large and you know the client will only
> >+read/write the file sequentially one time (eg for making a single copy
> >+or backup) then this will stop other processes' data from being
> >+evicted from the page cache:
> >+
> >+ nbdkit file disk.img fadvise=sequential cache=none
> It's also possible to avoid polluting the page cache by using
> O_DIRECT, but that comes with harder guarantees (aligned access
> through aligned buffers), so we may add it as another mode later on.
> But in the meantime, cache=none is fairly nice while still avoiding
> O_DIRECT.
I'm not sure if or even how we could ever do a robust O_DIRECT
implementation, but my idea was that it might be an alternate
implementation of cache=none. But if we thought we might use O_DIRECT
as a separate mode, then maybe we should rename cache=none.
cache=advise? cache=dontneed? I can't think of a good name!
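
To show why I think a robust O_DIRECT mode is hard: every request
would have to satisfy the alignment rules on the buffer, the offset
and the count, roughly like this (untested sketch; 4096 is a
stand-in for the real logical block size, which we would have to
query per device):

  #define _GNU_SOURCE    /* O_DIRECT */
  #include <fcntl.h>
  #include <stdlib.h>
  #include <unistd.h>

  #define ALIGNMENT 4096  /* placeholder; must match the device's block size */

  /* Read 'count' bytes at 'offset'.  Both must already be multiples of
   * ALIGNMENT, otherwise the pread fails with EINVAL - that's the extra
   * requirement O_DIRECT imposes which arbitrary NBD requests won't meet.
   */
  static int
  read_direct (const char *path, off_t offset, size_t count, void **out)
  {
    void *buf;
    int fd = open (path, O_RDONLY | O_DIRECT);
    if (fd == -1)
      return -1;

    /* The buffer itself must be aligned too. */
    if (posix_memalign (&buf, ALIGNMENT, count) != 0) {
      close (fd);
      return -1;
    }

    if (pread (fd, buf, count, offset) == -1) {
      free (buf);
      close (fd);
      return -1;
    }

    close (fd);
    *out = buf;
    return 0;
  }

In practice we'd have to bounce-buffer and split or round every
unaligned request, which is why I'd rather keep cache=none as the
fadvise-based implementation for now.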
> >@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> > {
> >   struct handle *h = handle;
> >+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
> >+  uint32_t orig_count = count;
> >+  uint64_t orig_offset = offset;
> >+
> >+  /* If cache=none we want to force pages we have just written to the
> >+   * file to be flushed to disk so we can immediately evict them from
> >+   * the page cache.
> >+   */
> >+  if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;
> >+#endif
> >+
> >   while (count > 0) {
> >     ssize_t r = pwrite (h->fd, buf, count, offset);
> >     if (r == -1) {
> >@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> >   if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)
> >     return -1;
> >+#ifdef HAVE_POSIX_FADVISE
> >+  /* On Linux this will evict the pages we just wrote from the page cache. */
> >+  if (cache_mode == cache_none)
> >+    posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);
> >+#endif
> So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed
> doesn't help?  You did point out that the use of FUA for flushing
> slows things down, but that's a fair price to pay to keep the cache
> clean.
On Linux POSIX_FADV_DONTNEED won't flush dirty buffers. I expect (but
didn't actually measure) that just after a medium sized write the
buffers would all be dirty so the posix_fadvise(DONTNEED) call would
do nothing at all. The advice online does seem to be that you must
flush before calling this. (Linus advocates a complex
double-buffering solution so that you can be reading into one buffer
while flushing the other, so you don't have the overhead of waiting
for the flush).
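
So the ordering on the write path has to be write, then flush, then
DONTNEED.  Stripped right down (single write, no partial-write loop,
no double-buffering), the shape is something like:

  #include <fcntl.h>    /* posix_fadvise */
  #include <unistd.h>   /* pwrite, fdatasync */

  /* Assumes the whole buffer is written in one pwrite call; the real
   * plugin loops on short writes.
   */
  static int
  pwrite_and_evict (int fd, const void *buf, size_t count, off_t offset)
  {
    if (pwrite (fd, buf, count, offset) == -1)
      return -1;

    /* Equivalent of the FUA flush above: without this the pages are
     * still dirty and the DONTNEED below is a no-op on Linux.
     */
    if (fdatasync (fd) == -1)
      return -1;

    /* Now the pages are clean and can actually be dropped. */
    posix_fadvise (fd, offset, count, POSIX_FADV_DONTNEED);
    return 0;
  }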
I'm going to do a bit of benchmarking of the write side now.
Thanks,
Rich.
> Patch looks good to me.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages.
http://libguestfs.org