On 8/7/20 6:31 AM, Richard W.M. Jones wrote:
You can use these flags as described in the manual page to optimize
access patterns, and to get better behaviour with the page cache in
some scenarios.
And if you guess wrong, it is only a performance penalty, not a
correctness issue.
For my testing I used the cachedel and cachestats utilities written by
Julius Plenz (https://github.com/Feh/nocache). I started with a 32 GB
file of random data on a machine with about 32 GB of RAM. At the
beginning of the test I evicted the file from the page cache:
$ cachedel /var/tmp/random
$ cachestats /var/tmp/random
pages in cache: 0/8388608 (0.0%) [filesize=33554432.0K, pagesize=4K]
Performing a normal sequential copy of the file to /dev/null shows
that the file is almost entirely pulled into page cache (thus evicting
useful programs and data):
$ free -m; time ./nbdkit file /var/tmp/random --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
total used free shared buff/cache available
Mem: 32083 1193 27816 1 3073 30435
Swap: 16135 16 16119
(100.00/100%)
real 0m12.437s
user 0m2.005s
sys 0m31.339s
total used free shared buff/cache available
Mem: 32083 1190 313 1 30578 30433
Swap: 16135 16 16119
pages in cache: 7053276/8388608 (84.1%) [filesize=33554432.0K, pagesize=4K]
Now we repeat the test using fadvise=sequential cache=none:
$ cachedel /var/tmp/random
$ cachestats /var/tmp/random
pages in cache: 106/8388608 (0.0%) [filesize=33554432.0K, pagesize=4K]
$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
Hmm - the -W actually says that qemu-img is performing semi-random
access (there is no guarantee that the 16 coroutines are serviced in
linear order of the file), even though we really are making only one
pass through the file in bulk. I don't know if fadvise=normal would be
any better; dropping -W but keeping -m 16 might also be an interesting
case to measure (where qemu-img tries harder to do in-order access, but
still takes advantage of parallel threads).
total used free shared buff/cache available
Mem: 32083 1188 27928 1 2966 30440
Swap: 16135 16 16119
(100.00/100%)
real 0m13.107s
user 0m2.051s
sys 0m37.556s
total used free shared buff/cache available
Mem: 32083 1196 27861 1 3024 30429
Swap: 16135 16 16119
pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
In this case the file largely avoids being pulled into the page cache,
and we do not evict useful stuff.
Notice that the test takes slightly longer to run. This is expected
because page cache eviction happens synchronously. I expect the cost
when doing sequential writes to be higher. Linus outlined a technique
to do this without the overhead, but unfortunately it is considerably
more complex and dangerous than I am comfortable adding to the file
plugin:
http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
(See also scary warnings in the sync_file_range man page)
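For reference, the approach in those mails boils down to roughly the
sketch below. This is illustrative only: the 8 MB window size, the
helper name write_window and the lack of error handling are my own
simplifications, not code from the patch.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  #define WINDOW (8 * 1024 * 1024)   /* arbitrary window size */

  /* Write one window, start asynchronous writeback of it, then wait
   * for the *previous* window to reach disk and drop it from the page
   * cache.  Keeping eviction one window behind the writes is what
   * avoids the synchronous flush + DONTNEED cost per request.
   */
  static void
  write_window (int fd, const void *buf, size_t len, off_t offset)
  {
    pwrite (fd, buf, len, offset);   /* error handling omitted */

    /* Start writeback of the window just written (non-blocking). */
    sync_file_range (fd, offset, len, SYNC_FILE_RANGE_WRITE);

    if (offset >= WINDOW) {
      /* Wait for the previous window to reach disk, then evict it. */
      sync_file_range (fd, offset - WINDOW, WINDOW,
                       SYNC_FILE_RANGE_WAIT_BEFORE |
                       SYNC_FILE_RANGE_WRITE |
                       SYNC_FILE_RANGE_WAIT_AFTER);
      posix_fadvise (fd, offset - WINDOW, WINDOW, POSIX_FADV_DONTNEED);
    }
  }

As the man page warns, sync_file_range gives no durability guarantees,
which is part of why it is hard to rely on in a general-purpose plugin.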
We can always add more knobs later if someone has a use case and
benchmarks for them. I think what you have here is fine.
+
+=item B<fadvise=normal>
+
+=item B<fadvise=random>
+
+=item B<fadvise=sequential>
+
+This optional flag hints to the kernel that you will access the file
+normally, or in a random order, or sequentially. The exact behaviour
+depends on your operating system, but for Linux using C<normal> causes
+the kernel to read-ahead, C<sequential> causes the kernel to
+read-ahead twice as much as C<normal>, and C<random> turns off
+read-ahead.
Is it worth a mention of L<posix_fadvise(3)> here, to let the user get
some idea of what their operating system supports?
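For readers unfamiliar with it, the flag amounts to a single
posix_fadvise(3) call on the file descriptor right after open, along
these lines (a sketch only; the file name and the hard-coded
POSIX_FADV_SEQUENTIAL are just examples, not the plugin code):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int
  main (void)
  {
    int fd = open ("/var/tmp/random", O_RDONLY);
    if (fd == -1) { perror ("open"); return 1; }

    /* Advise the kernel about the expected access pattern for the
     * whole file (length 0 means "to the end of the file").  This only
     * tunes readahead; it never affects correctness.
     */
    int err = posix_fadvise (fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err != 0)
      fprintf (stderr, "posix_fadvise: %s\n", strerror (err));

    /* ... serve reads from fd ... */
    close (fd);
    return 0;
  }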
+=head2 Reducing evictions from the page cache
+
+If the file is very large and you know the client will only
+read/write the file sequentially one time (eg for making a single copy
+or backup) then this will stop other processes from being evicted from
+the page cache:
+
+ nbdkit file disk.img fadvise=sequential cache=none
It's also possible to avoid polluting the page cache by using O_DIRECT,
but that comes with stricter requirements (aligned access through
aligned buffers), so we may add it as another mode later on. But in the
meantime, cache=none is fairly nice while still avoiding O_DIRECT.
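To illustrate why those requirements are stricter: even a trivial
O_DIRECT read needs an aligned buffer, an aligned offset and an aligned
length, roughly as below (a sketch assuming a 4096-byte logical block
size; none of this is in the patch):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int
  main (void)
  {
    const size_t align = 4096;        /* assumed logical block size */
    void *buf;
    int fd;

    /* The buffer itself must be aligned ... */
    if (posix_memalign (&buf, align, align) != 0) return 1;

    fd = open ("/var/tmp/random", O_RDONLY | O_DIRECT);
    if (fd == -1) { perror ("open"); return 1; }

    /* ... and so must the offset and the count, otherwise the read
     * fails with EINVAL.  cache=none avoids all of this.
     */
    if (pread (fd, buf, align, 0) == -1)
      perror ("pread");

    close (fd);
    free (buf);
    return 0;
  }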
@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
{
struct handle *h = handle;
+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
+ uint32_t orig_count = count;
+ uint64_t orig_offset = offset;
+
+ /* If cache=none we want to force pages we have just written to the
+ * file to be flushed to disk so we can immediately evict them from
+ * the page cache.
+ */
+ if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;
+#endif
+
while (count > 0) {
ssize_t r = pwrite (h->fd, buf, count, offset);
if (r == -1) {
@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)
return -1;
+#ifdef HAVE_POSIX_FADVISE
+ /* On Linux this will evict the pages we just wrote from the page cache. */
+ if (cache_mode == cache_none)
+ posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);
+#endif
So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed
doesn't help? You did point out that the use of FUA for flushing slows
things down, but that's a fair price to pay to keep the cache clean.
Patch looks good to me.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org