On Fri, Aug 07, 2020 at 07:53:13AM -0500, Eric Blake wrote:
> >$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
>
> Hmm - the -W actually says that qemu-img is performing semi-random
> access (there is no guarantee that the 16 coroutines are serviced in
> linear order of the file), even though we really are making only one
> pass through the file in bulk. I don't know if fadvise=normal would
> be any better; dropping -W but keeping -m 16 might also be an
> interesting number to check (where qemu-img tries harder to do
> in-order access, but still take advantage of parallel threads).
>
> > total used free shared buff/cache available
> >Mem: 32083 1188 27928 1 2966 30440
> >Swap: 16135 16 16119
> > (100.00/100%)
> >
> >real 0m13.107s
> >user 0m2.051s
> >sys 0m37.556s
> > total used free shared buff/cache available
> >Mem: 32083 1196 27861 1 3024 30429
> >Swap: 16135 16 16119
> >pages in cache: 14533/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
Without -W it's very similar:
$ free -m; time ./nbdkit file /var/tmp/random fadvise=sequential cache=none --run 'qemu-img convert -n -p -m 16 $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
total used free shared buff/cache available
Mem: 32083 1184 26113 1 4785 30444
Swap: 16135 16 16119
(100.00/100%)
real 0m13.308s
user 0m1.961s
sys 0m40.455s
total used free shared buff/cache available
Mem: 32083 1188 26049 1 4845 30438
Swap: 16135 16 16119
pages in cache: 14808/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
With -W and fadvise=random it's also about the same:
$ free -m; time ./nbdkit file /var/tmp/random fadvise=random cache=none --run 'qemu-img convert -n -p -m 16 -W $nbd "json:{\"file.driver\":\"null-co\",\"file.size\":\"1E\"}"' ; free -m ; cachestats /var/tmp/random
total used free shared buff/cache available
Mem: 32083 1187 26109 1 4785 30440
Swap: 16135 16 16119
(100.00/100%)
real 0m13.030s
user 0m1.986s
sys 0m37.498s
total used free shared buff/cache available
Mem: 32083 1187 26053 1 4842 30440
Swap: 16135 16 16119
pages in cache: 14336/8388608 (0.2%) [filesize=33554432.0K, pagesize=4K]
I'm going to guess that for this case readahead doesn't have much time
to get ahead of qemu.
> >+=item B<fadvise=normal>
> >+
> >+=item B<fadvise=random>
> >+
> >+=item B<fadvise=sequential>
> >+
> >+This optional flag hints to the kernel that you will access the file
> >+normally, or in a random order, or sequentially. The exact behaviour
> >+depends on your operating system, but for Linux using C<normal> causes
> >+the kernel to read-ahead, C<sequential> causes the kernel to
> >+read-ahead twice as much as C<normal>, and C<random> turns off
> >+read-ahead.
>
> Is it worth a mention of L<posix_fadvise(3)> here, to let the user
> get some idea of what their operating system supports?
Yes I had this at one point but I seem to have dropped it. Will
add it back, thanks.
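
For anyone following along, the fadvise= modes more or less boil down
to a single posix_fadvise(2) call on the whole file when the handle is
opened.  A simplified sketch (the mode and variable names here are
illustrative, not the exact patch code):

  #ifdef HAVE_POSIX_FADVISE
    /* Apply the requested access-pattern hint to the whole file
     * (offset 0, length 0 means the entire file).  Return value
     * checking omitted for brevity.
     */
    switch (fadvise_mode) {
    case fadvise_normal:
      posix_fadvise (fd, 0, 0, POSIX_FADV_NORMAL);      /* default readahead */
      break;
    case fadvise_random:
      posix_fadvise (fd, 0, 0, POSIX_FADV_RANDOM);      /* readahead off */
      break;
    case fadvise_sequential:
      posix_fadvise (fd, 0, 0, POSIX_FADV_SEQUENTIAL);  /* ~2x readahead */
      break;
    }
  #endif
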
> >+=head2 Reducing evictions from the page cache
> >+
> >+If the file is very large and you know the client will only
> >+read/write the file sequentially one time (e.g. for making a single
> >+copy or backup) then this will stop data belonging to other
> >+processes from being evicted from the page cache:
> >+
> >+ nbdkit file disk.img fadvise=sequential cache=none
>
> It's also possible to avoid polluting the page cache by using
> O_DIRECT, but that comes with harder guarantees (aligned access
> through aligned buffers), so we may add it as another mode later on.
> But in the meantime, cache=none is fairly nice while still avoiding
> O_DIRECT.
I'm not sure if or even how we could ever do a robust O_DIRECT
implementation, but my idea was that it might be an alternate
implementation of cache=none.

We can let the plugin and a filter deal with that.  The simplest
solution is to drop it on the user and require aligned requests.
Maybe a filter can handle alignment?

But if we thought we might use O_DIRECT as a separate mode, then maybe
we should rename cache=none.  cache=advise?  cache=dontneed?  I can't
think of a good name!

Yes, don't call it none if you use the cache.  How about advise=?
I would keep cache semantics similar to qemu.
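
For comparison, a rough sketch of what an O_DIRECT mode would push onto
callers (assuming a 4096-byte block size here; the real requirement is
the device's logical block size, and error handling is elided):

  /* O_DIRECT requires the buffer address, file offset and length to
   * all be suitably aligned.  Needs _GNU_SOURCE for O_DIRECT on glibc.
   */
  #define ALIGNMENT 4096
  int fd = open ("disk.img", O_RDWR | O_DIRECT);
  void *buf;
  if (posix_memalign (&buf, ALIGNMENT, ALIGNMENT) != 0)
    abort ();
  if (pread (fd, buf, ALIGNMENT, 0) == -1)   /* aligned length and offset */
    perror ("pread");

If we did go that way, something like the existing blocksize filter
might be one way to present an aligned-request guarantee to the plugin.
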
> >@@ -355,6 +428,17 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> > {
> > struct handle *h = handle;
> >+#if defined (HAVE_POSIX_FADVISE) && defined (POSIX_FADV_DONTNEED)
> >+ uint32_t orig_count = count;
> >+ uint64_t orig_offset = offset;
> >+
> >+ /* If cache=none we want to force pages we have just written to the
> >+ * file to be flushed to disk so we can immediately evict them from
> >+ * the page cache.
> >+ */
> >+ if (cache_mode == cache_none) flags |= NBDKIT_FLAG_FUA;
> >+#endif
> >+
> > while (count > 0) {
> > ssize_t r = pwrite (h->fd, buf, count, offset);
> > if (r == -1) {
> >@@ -369,6 +453,12 @@ file_pwrite (void *handle, const void *buf, uint32_t count, uint64_t offset,
> > if ((flags & NBDKIT_FLAG_FUA) && file_flush (handle, 0) == -1)
> > return -1;
> >+#ifdef HAVE_POSIX_FADVISE
> >+ /* On Linux this will evict the pages we just wrote from the page cache. */
> >+ if (cache_mode == cache_none)
> >+ posix_fadvise (h->fd, orig_offset, orig_count, POSIX_FADV_DONTNEED);
> >+#endif
>
> So on Linux, POSIX_FADV_DONTNEED after a write that was not flushed
> doesn't help? You did point out that the use of FUA for flushing
> slows things down, but that's a fair price to pay to keep the cache
> clean.
On Linux POSIX_FADV_DONTNEED won't flush dirty buffers. I expect (but
didn't actually measure) that just after a medium sized write the
buffers would all be dirty so the posix_fadvise(DONTNEED) call would
do nothing at all. The advice online does seem to be that you must
flush before calling this. (Linus advocates a complex
double-buffering solution so that you can be reading into one buffer
while flushing the other, so you don't have the overhead of waiting
for the flush).
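
In other words the pattern is flush-then-drop.  A minimal sketch using
fdatasync(2); the double-buffered variant would use sync_file_range(2)
on just the written range instead of flushing the whole file:

  /* Dirty pages are not dropped by POSIX_FADV_DONTNEED, so the data
   * must reach the disk first for the hint to have any effect.
   * Short writes are ignored here for brevity.
   */
  if (pwrite (fd, buf, count, offset) == -1)
    return -1;
  if (fdatasync (fd) == -1)
    return -1;
  posix_fadvise (fd, offset, count, POSIX_FADV_DONTNEED);
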
I'm going to do a bit of benchmarking of the write side now.
We already tried this with dd and the results were not good.
Nir