On 8/7/20 6:31 AM, Richard W.M. Jones wrote:
You can use these flags as described in the manual page to optimize
access patterns, and to get better behaviour with the page cache in
some scenarios.
And if you guess wrong, it is only a performance penalty, not a
correctness issue.
For my testing I used the cachedel and cachestats utilities written by
Julius Plenz (https://github.com/Feh/nocache). I started with a 32 GB
file of random data on a machine with about 32 GB of RAM. At the
beginning of the test I evicted the file from the page cache:
$ cachedel /var/tmp/random
$ cachestats /var/tmp/random
pages in cache: 0/8388608 (0.0%) [filesize=33554432.0K, pagesize=4K]
Performing a normal sequential copy of the file to /dev/null shows
that the file is almost entirely pulled into page cache (thus evicting
useful programs and data):
$ free -m; time ./nbdkit file /var/tmp/random --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
total used free shared buff/cache available
Mem: 32083 1193 27816 1 3073 30435
Swap: 16135 16 16119
(100.00/100%)
real 0m12.437s
user 0m2.005s
sys 0m31.339s
total used free shared buff/cache available
Mem: 32083 1190 313 1 30578 30433
Swap: 16135 16 16119
pages in cache: 7053276/8388608 (84.1%) [filesize=33554432.0K, pagesize=4K]
Now we repeat the test using fadvise=sequential cache=none:
$ cachedel /var/tmp/random
$ cachestats /var/tmp/random
pages in cache: 106/8388608 (0.0%) [filesize=33554432.0K, pagesize=4K]
$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
Hmm - the -W actually says that qemu-img is performing semi-random
access (there is no guarantee that the 16 coroutines are serviced in
linear order of the file), even though we really are making only one
pass through the file in bulk. I don't know if fadvise=normal would be
any better; dropping -W but keeping -m 16 might also be an interesting
case to measure (where qemu-img tries harder to do in-order access, but
still takes advantage of parallel threads).
total used free shared buff/cache available
Mem: 32083 1188 27928 1 2966 30440
Swap: 16135 16 16119
(100.00/100%)
real 0m13.107s
user 0m2.051s
sys 0m37.556s
total used free shared buff/cache available
Mem: 32083 1196 27861 1 3024 30429
Swap: 16135 16 16119
pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
In this case the file largely avoids being pulled into the page cache,
and we do not evict useful stuff.
Notice that the test takes slightly longer to run. This is expected
because page cache eviction happens synchronously. I expect the cost
when doing sequential writes to be higher. Linus outlined a technique
to do this without the overhead, but unfortunately it is considerably
more complex and dangerous than I am comfortable adding to the file
plugin:
http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01845.html
http://lkml.iu.edu/hypermail/linux/kernel/1005.2/01953.html
(See also scary warnings in the sync_file_range man page)
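For reference, the approach in those mails boils down to roughly the
sketch below. This is illustrative only: the 8 MB window size, the
helper name write_window and the lack of error handling are my own
simplifications, not code from the patch.

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <unistd.h>

  #define WINDOW (8 * 1024 * 1024)   /* arbitrary window size */

  /* Write one window, start asynchronous writeback of it, then wait
   * for the *previous* window to reach disk and drop it from the page
   * cache.  Keeping eviction one window behind the writes is what
   * avoids the synchronous flush + DONTNEED cost per request.
   */
  static void
  write_window (int fd, const void *buf, size_t len, off_t offset)
  {
    pwrite (fd, buf, len, offset);   /* error handling omitted */

    /* Start writeback of the window just written (non-blocking). */
    sync_file_range (fd, offset, len, SYNC_FILE_RANGE_WRITE);

    if (offset >= WINDOW) {
      /* Wait for the previous window to reach disk, then evict it. */
      sync_file_range (fd, offset - WINDOW, WINDOW,
                       SYNC_FILE_RANGE_WAIT_BEFORE |
                       SYNC_FILE_RANGE_WRITE |
                       SYNC_FILE_RANGE_WAIT_AFTER);
      posix_fadvise (fd, offset - WINDOW, WINDOW, POSIX_FADV_DONTNEED);
    }
  }

As the man page warns, sync_file_range gives no durability guarantees,
which is part of why it is hard to rely on in a general-purpose plugin.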
We can always add more knobs later if someone has a use case and
benchmarks for them. I think what you have here is fine.
+
+=item B<fadvise=normal>
+
+=item B<fadvise=random>
+
+=item B<fadvise=sequential>
+
+This optional flag hints to the kernel that you will access the file
+normally, or in a random order, or sequentially. The exact behaviour
+depends on your operating system, but for Linux using C<normal> causes
+the kernel to read-ahead, C<sequential> causes the kernel to
+read-ahead twice as much as C<normal>, and C<random> turns off
+read-ahead.
Is it worth a mention of L<posix_fadvise(3)> here, to let the user get
some idea of what their operating system supports?
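For readers unfamiliar with it, the flag amounts to a single
posix_fadvise(3) call on the file descriptor right after open, along
these lines (a sketch only; the file name and the hard-coded
POSIX_FADV_SEQUENTIAL are just examples, not the plugin code):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  int
  main (void)
  {
    int fd = open ("/var/tmp/random", O_RDONLY);
    if (fd == -1) { perror ("open"); return 1; }

    /* Advise the kernel about the expected access pattern for the
     * whole file (length 0 means "to the end of the file").  This only
     * tunes readahead; it never affects correctness.
     */
    int err = posix_fadvise (fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (err != 0)
      fprintf (stderr, "posix_fadvise: %s\n", strerror (err));

    /* ... serve reads from fd ... */
    close (fd);
    return 0;
  }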
+=head2 Reducing evictions from the page cache
+
+If the file is very large and you know the client will only
+read/write the file sequentially one time (eg for making a single copy
+or backup) then this will stop other processes from being evicted from
+the page cache:
+
+ nbdkit file disk.img fadvise=sequential cache=none
It's also possible to avoid polluting the page cache by using O_DIRECT,
but that comes with stricter requirements (aligned access through
aligned buffers), so we may add it as another mode later on. But in the
meantime, cache=none is fairly nice while still avoiding O_DIRECT.
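To illustrate why those requirements are stricter: even a trivial
O_DIRECT read needs an aligned buffer, an aligned offset and an aligned
length, roughly as below (a sketch assuming a 4096-byte logical block
size; none of this is in the patch):

  #define _GNU_SOURCE
  #include <fcntl.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>

  int
  main (void)
  {
    const size_t align = 4096;        /* assumed logical block size */
    void *buf;
    int fd;

    /* The buffer itself must be aligned ... */
    if (posix_memalign (&buf, align, align) != 0) return 1;

    fd = open ("/var/tmp/random", O_RDONLY | O_DIRECT);
    if (fd == -1) { perror ("open"); return 1; }

    /* ... and so must the offset and the count, otherwise the read
     * fails with EINVAL.  cache=none avoids all of this.
     */
    if (pread (fd, buf, align, 0) == -1)
      perror ("pread");

    close (fd);
    free (buf);
    return 0;
  }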
@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
{
struct handle *h = handle;
+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
+ uint32_t orig_count = count;
+ uint64_t orig_offset = offset;
+
+ /* If cache=none we want to force pages we have just written to the
+ * file to be flushed to disk so we can immediately evict them from
+ * the page cache.
+ */
+ if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;
+#endif
+
while (count > 0) {
ssize_t r = pwrite (h->fd, buf, count, offset);
if (r == -1) {
@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)
return -1;
+#ifdef HAVE_POSIX_FADVISE
+ /* On Linux this will evict the pages we just wrote from the page cache. */
+ if (cache_mode == cache_none)
+ posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);
+#endif
So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed
doesn't help? You did point out that the use of FUA for flushing slows
things down, but that's a fair price to pay to keep the cache clean.
Patch looks good to me.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization: qemu.org | libvirt.org