On Wed, Apr 01, 2020 at 08:46:53AM -0500, Eric Blake wrote:
On 4/1/20 5:12 AM, Richard W.M. Jones wrote:
>On Tue, Mar 31, 2020 at 04:00:22PM -0500, Eric Blake wrote:
>>The retry filter defaults to 5 retries, but when run with verbose
>>details produces some confusing output:
>>
>>$ rm -f /tmp/inject; (sleep 5s; touch /tmp/inject)& ./nbdkit -f -v -U - \
>> null 1G --filter=retry --filter=noextents --filter=error error-rate=100% \
>> error-file=/tmp/inject --filter=delay rdelay=1 \
>> --run 'qemu-img convert $nbd out.img'
>>...
>>nbdkit: null[1]: debug: noextents: pread count=2097152 offset=14680064
>>nbdkit: null[1]: debug: error: pread count=2097152 offset=14680064
>>nbdkit: null[1]: error: injecting EIO error into pread
>>nbdkit: null[1]: debug: retry 6: original errno = 5
>>nbdkit: null[1]: debug: could not recover after 5 retries
>>
>> again:
>> /* Log the original errno since it will be lost when we retry. */
>>- nbdkit_debug ("retry %d: original errno = %d", data->retry+1,
*err);
>>+ nbdkit_debug ("retry attempt %d: original errno = %d",
data->retry, *err);
>>
>> if (data->retry >= retries) {
>> nbdkit_debug ("could not recover after %d retries", retries);
>
>I think adding 1 was correct, so that the retries count from 1.
But the problem is the timing of when the messages are printed. And
remember, this code is only reached after a failed attempt, not on
success. Pre-patch, we have:
initial try succeeds - no message
or (here, with retries limited to 2):
initial try fails:
retry 1: original error
delay
retry 1 fails:
retry 2: original error
delay
retry 2 fails:
retry 3: original error
could not recover after 2 retries
>
>How about moving the message to make it conditional instead,
>like this?
>
>again:
> /* Log the original errno since it will be lost when we retry. */
> if (data->retry < retries)
> nbdkit_debug ("retry %d: original errno = %d", data->retry+1,
*err);
> else {
> nbdkit_debug ("could not recover after %d retries", retries);
> ...
with your suggestion, we'd have:
initial try fails:
retry 1: original error
delay
retry 1 fails:
retry 2: original error
delay
retry 2 fails:
could not recover after 2 retries
where we lost the log of the errno of retry 2.
It's not printed, but also it's not lost. do_retry will immediately
return false, causing the called function to return this final errno
(hopefully back to the client). The reason for printing the "original
errno" was because those intermediate errnos would otherwise be lost.
Attached is my suggested patch - it's a small evolution on top of
yours. The output is now:
$ rm -f /tmp/inject ; (sleep 5s; touch /tmp/inject ) & ./nbdkit -f -v -U - null
size=1G --filter=retry --filter=noextents --filter=error error-rate=100%
error-file=/tmp/inject --filter=delay rdelay=1 retries=3 --run 'qemu-img convert $nbd
/var/tmp/out.img'
nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 1: waiting 2 seconds before retrying
nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 2: waiting 4 seconds before retrying
nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: pread failed: original errno = 5
nbdkit: null[1]: debug: retry 3: waiting 8 seconds before retrying
nbdkit: null[1]: error: injecting EIO error into pread
nbdkit: null[1]: debug: could not recover after 3 retries
nbdkit: null[1]: debug: sending error reply: Input/output error
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top