On 1/3/19 6:37 AM, Richard W.M. Jones wrote:
The original plan was to have a background thread doing the reclaim.
However that cannot work given the design of filters, because a
background thread cannot access the next_ops struct which is only
available during requests.
Therefore we spread the work over the request threads. Each blk_*
function checks whether there is work to do, and if there is will
reclaim up to two blocks from the cache (thus ensuring that we always
make forward progress, since each blk_* function can only add at most
one block to the cache).
---
+=head1 CACHE MAXIMUM SIZE
+
+By default the cache can grow to any size (although not larger than
+the virtual size of the underlying plugin) and you have to ensure
+there is sufficient space in C<$TMPDIR> for it.
Might change in the future if we add support for automatic ENOSPC
handling - but no need to document that until we make such a change.
+
+Using the parameters C<cache-max-size>, C<cache-high-threshold> and
+C<cache-low-threshold> you can limit the maximum size of the cache.
+
+This requires kernel and filesystem support (for L<fallocate(2)>
+C<FALLOC_FL_PUNCH_HOLE>), so it may not work on all platforms.
+++ b/filters/cache/cache.h
@@ -34,6 +34,8 @@
#ifndef NBDKIT_CACHE_H
#define NBDKIT_CACHE_H
+#include <fcntl.h>
+
/* Size of a block in the cache. A 4K block size means that we need
* 64 MB of memory to store the bitmaps for a 1 TB underlying image.
* It is also smaller than the usual hole size for sparse files, which
Question - how on earth is this supposed to work when we run the cache
with BLKSIZE 4096 - punching holes typically is a no-op unless you have
a large enough block to actually warrant a hole in the file, and I don't
know that 4k is big enough. Does the kernel coalesce fallocate() calls
over consecutive regions of a file until it has enough to actually punch
a hole? In which case, your test might happen to show that it "works"
in that you end up punching some holes due to large blocks of
consecutive less-frequently-used clusters after several combined
fallocate() calls; but a different test that is truly random-access will
be dirtying enough sub-hole regions that you can't reclaim anything? Or
do you need the cache module to use a larger block size by default, or
even add a parameter to control the block size?
+
+/* If we are currently reclaiming blocks from the cache.
+ *
+ * The state machine starts in the NOT_RECLAIMING state. When the
+ * size of the cache exceeds the high threshold, we move to
+ * RECLAIMING_LRU. Once we have exhausted all LRU blocks, we move to
+ * RECLAIMING_ANY (reclaiming any blocks).
+ *
+ * If at any time the size of the cache goes below the low threshold
+ * we move back to the NOT_RECLAIMING state.
+ *
+ * A possible future enhancement is to add an extra state between LRU
+ * and ANY which reclaims blocks from lru.c:bm[1].
I'm trying to envision how much thrashing we might encounter (such as
reclaiming a block that the very next blk_*() call pulls back into the
cache), particularly if we don't try to sort out bm[1] first. But while
I can spot obvious one-shot inefficiencies, I haven't been able to
describe a case that will consistently lead to degraded performance due
to churn (since we cycle through the entire image before the next
attempt to reclaim a portion of the image that was just reclaimed). So
I'm okay with this patch for now (we can always improve things later if
profiling says we need to).
+/* Reclaim a single cache block. */
+static void
+reclaim_one (int fd, struct bitmap *bm)
+{
+ assert (reclaiming);
+
+ if (reclaiming == RECLAIMING_LRU)
+ reclaim_lru (fd, bm);
+ else
+ reclaim_any (fd, bm);
+}
+
+static void
+reclaim_lru (int fd, struct bitmap *bm)
+{
+ uint64_t old_reclaim_blk;
+
+ /* Find the next block in the cache. */
+ reclaim_blk = bitmap_next (bm, reclaim_blk+1);
+ old_reclaim_blk = reclaim_blk;
+
+ /* Search for an LRU block after this one. */
+ do {
+ if (! lru_has_been_recently_accessed (reclaim_blk)) {
+ reclaim_block (fd, bm);
+ return;
+ }
+
+ reclaim_blk = bitmap_next (bm, reclaim_blk+1);
+ if (reclaim_blk == -1) /* wrap around */
+ reclaim_blk = bitmap_next (bm, 0);
+ } while (reclaim_blk >= 0 && old_reclaim_blk != reclaim_blk);
+
+ if (old_reclaim_blk == reclaim_blk) {
+ /* Run out of LRU blocks, so start reclaiming any block in the cache. */
+ nbdkit_debug ("cache: reclaiming any blocks");
+ reclaiming = RECLAIMING_ANY;
+ reclaim_one (fd, bm);
Could also call reclaim_any() rather than reclaim_one, since that's all
the more that reclaim_one() will do.
+ return;
+ }
+}
That return doesn't hurt from the maintenance point of view (if you add
later code in the function), but it also isn't doing anything.
+
+static void
+reclaim_any (int fd, struct bitmap *bm)
+{
+ /* Find the next block in the cache. */
+ reclaim_blk = bitmap_next (bm, reclaim_blk+1);
+ if (reclaim_blk == -1) /* wrap around */
+ reclaim_blk = bitmap_next (bm, 0);
Is this code thread-safe? Now that you are using
NBDKIT_THREAD_MODEL_PARALLEL for the cache filter, shouldn't all
read/write access to the global reclaim_blk be locked, to avoid two
threads trying to act on the same block at once? I guess it is, because
you only call these functions from the blk_*() functions, and those in
turn are only called from cache.c with the lock held. But is it worth
documenting the locking constraint somewhere as a comment?
+
+ reclaim_block (fd, bm);
+}
+
+static void
+reclaim_block (int fd, struct bitmap *bm)
+{
+ if (reclaim_blk == -1) {
+ nbdkit_debug ("cache: run out of blocks to reclaim!");
+ return;
+ }
+
+ nbdkit_debug ("cache: reclaiming block %" PRIu64, reclaim_blk);
+#ifdef FALLOC_FL_PUNCH_HOLE
+ if (fallocate (fd, FALLOC_FL_PUNCH_HOLE|FALLOC_FL_KEEP_SIZE,
+ reclaim_blk * BLKSIZE, BLKSIZE) == -1) {
+ nbdkit_error ("cache: reclaiming cache blocks: "
+ "fallocate: FALLOC_FL_PUNCH_HOLE: %m");
fallocate() fails with EINVAL if BLKSIZE is not a multiple of the
filesystem block size, which gets back to my question of how this can
even work on something that mandates 4k or even 64k access, if the cache
block size is not configurable.
+source ./functions.sh
+set -e
+set -x
+
+# Check that this is a Linux-like system supporting /proc/pid/fd.
s/pid/PID/
+if ! test -d /proc/self/fd; then
+ echo "$0: not a Linux-like system supporting /proc/\$pid/fd"
+ exit 77
+fi
+
Looking better, but I still have enough questions that may require more
work before you commit.
--
Eric Blake, Principal Software Engineer
Red Hat, Inc. +1-919-301-3226
Virtualization:
qemu.org |
libvirt.org