On Wed, Jun 29, 2022 at 06:01:54PM +0200, Laszlo Ersek wrote:
On 06/29/22 15:36, Richard W.M. Jones wrote:
> Because the test previously used error rates of 50%, it could
> sometimes "fail to fail". This is noticeable if you run the test
> repeatedly:
>
> $ while make -C copy check TESTS=copy-nbd-error.sh >& /tmp/log ; do echo -n . ; done
>
> This now happens more often because of the larger requests made by the
> new multi-threaded loop, resulting in fewer calls to the error filter,
> so a greater chance that a series of 50% coin tosses will come up all
> heads in the test.
>
> Fix this by making the test non-stochastic.
>
> Fixes: commit 8d444b41d09a700c7ee6f9182a649f3f2d325abb
> ---
> copy/copy-nbd-error.sh | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/copy/copy-nbd-error.sh b/copy/copy-nbd-error.sh
> index 0088807f54..01524a890c 100755
> --- a/copy/copy-nbd-error.sh
> +++ b/copy/copy-nbd-error.sh
> @@ -40,7 +40,7 @@ $VG nbdcopy -- [ nbdkit --exit-with-parent -v --filter=error pattern 5M \
> # Failure to read should be fatal
> echo "Testing read failures on non-sparse source"
> $VG nbdcopy -- [ nbdkit --exit-with-parent -v --filter=error pattern 5M \
> - error-pread-rate=0.5 ] null: && fail=1
> + error-pread-rate=1 ] null: && fail=1
>
> # However, reliable block status on a sparse image can avoid the need to read
> echo "Testing read failures on sparse source"
> @@ -51,7 +51,7 @@ $VG nbdcopy -- [ nbdkit --exit-with-parent -v --filter=error null 5M \
> echo "Testing write data failures on arbitrary destination"
> $VG nbdcopy -- [ nbdkit --exit-with-parent -v pattern 5M ] \
> [ nbdkit --exit-with-parent -v --filter=error --filter=noextents \
> - memory 5M error-pwrite-rate=0.5 ] && fail=1
> + memory 5M error-pwrite-rate=1 ] && fail=1
>
> # However, writing zeroes can bypass the need for normal writes
> echo "Testing write data failures from sparse source"
>
> Wasn't the original intent of the 50% error rate that the first error
> would usually manifest at a different offset each time?  If we change
> the error rate to 1, the tests will fail on the very first access,
> which kind of breaks the original intent.
>
> I wonder if we could determine a random offset in advance, and make
> sure that the read or write access fails 100%, but only if the
> request covers that offset.
It was never very clear why the error rate was originally set to 50%,
but it could well be as you say.  It is possible to inject an error at
a specific offset (though not with the error filter), for example with
a small nbdkit sh plugin:
----------------------------------------------------------------------
#!/bin/bash
# Minimal nbdkit sh plugin: fail any read which covers byte offset 100000.
# nbdkit invokes the script as: script pread <handle> <count> <offset>
case "$1" in
    get_size) echo 64M ;;
    pread)
        # $3 = count, $4 = offset: fail iff the request spans offset 100000.
        if [ $4 -le 100000 ] && [ $(( $4+$3 )) -gt 100000 ]; then
            echo EIO Bad block >&2
            exit 1
        else
            dd if=/dev/zero count=$3 iflag=count_bytes
        fi ;;
    *) exit 2 ;;
esac
----------------------------------------------------------------------
$ nbdkit sh read-error.sh
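
To actually exercise it (an untested sketch), run nbdcopy against a
captive nbdkit using that script; the copy should fail as soon as a
read request covers offset 100000:

$ nbdkit -U - sh read-error.sh --run 'nbdcopy "$uri" null:'

And to get the "random offset chosen in advance" behaviour you
describe, the script could compare against an exported variable
(e.g. BAD_OFFSET) instead of the hard-coded 100000, assuming the
environment is passed through to the sh plugin script, which I
believe it is:

$ export BAD_OFFSET=$(shuf -i 0-67108863 -n 1)
$ nbdkit -U - sh read-error.sh --run 'nbdcopy "$uri" null:'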
- - -
Anyway, looking closer at the bug, I now understand why it happens
much more frequently after applying this patch series.
It's because the test uses the null: destination
(libnbd.git/copy/null-ops.c). This advertises a preferred block size
of 4M, because I reasoned that we should prefer large blocks for this
target (since it throws away the data).
This causes request-size to be increased to 4M, so we're copying 4M
blocks (versus 256K before).
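
One way to see this (just a sketch, and you need this series applied
to see the 4M requests; the exact log filter output format also varies
between nbdkit versions) is to put nbdkit's log filter on the source
and look at the size of the reads which nbdcopy issues:

$ nbdkit -U - --filter=log pattern 5M logfile=/dev/stderr \
    --run 'nbdcopy "$uri" null:' 2>&1 | grep Read

which should show a 4M (count=0x400000) read and a 1M
(count=0x100000) read.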
With a 5M total disk size we're reading 4M + 1M in two requests, so in
1 case in 4 both 50% coin tosses come up heads and the test will fail
to fail.
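(For comparison, with the old 256K request size the same 5M copy took
20 reads, so the chance of every read succeeding at a 50% error rate
was roughly 0.5^20, about 1 in a million, which is why the test almost
never failed to fail before this series.)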
Although there is still definitely a bug in this test, I think I will
adjust the null: target so it sets a small preferred block size,
e.g. 4K, so that copying to null: doesn't have this unexpected effect
on request size.
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages.
http://libguestfs.org