I can't reproduce this; adding the bug reporter on CC. More below ...
On Wed, May 08, 2024 at 10:16:55AM -0500, Eric Blake wrote:
On Wed, May 08, 2024 at 09:16:19AM +0100, Richard W.M. Jones wrote:
> Hi Eric,
>
> ----- Forwarded message from notify(a)aur.archlinux.org -----
>
> solsticedhiver [1] added the following comment to nbdkit [2]:
>
> There is still one test failing now, test-blocksize-sharding.sh. I
> don't know if it is expected or not.
> The full test-suite.log:
https://0x0.st/X8TI.log
>
> [1]
https://aur.archlinux.org/account/solsticedhiver/
> [2]
https://aur.archlinux.org/pkgbase/nbdkit/
>
> ----- End forwarded message -----
>
> I've attached the fragment of the failure log that they sent. However
> I don't understand the test enough to see where the error happens.
>
> Additionally I think there's a problem with the various 'print'
> statements in the Python code. They probably need to flush stdout,
> otherwise they don't appear coherently with the error messages.
> Something like this throughout:
>
> -print("sanity check")
> +print("sanity check", flush=True)
That sounds like a useful change no matter what. It may also be worth
adding some more timestamps into the sanity check portion, to make
sure that the plugin is responding as fast as we expect, and not under
heavy load.
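For the record, a minimal sketch of why the flushes matter (this is my illustration, not part of the test): when stdout is a pipe, as under the test harness, Python block-buffers it, so unflushed print() output can show up long after interleaved stderr output -- which is presumably why "sanity check" appears *after* the traceback in the log below. The StringIO part just makes the buffering visible:

```python
import io
import sys

# flush=True pushes each line out immediately instead of waiting for
# the block buffer to fill or the process to exit:
print("sanity check", flush=True)

# An alternative to adding flush=True everywhere is to switch stdout to
# line buffering once at the top of the script (TextIOWrapper.reconfigure,
# Python 3.7+); guarded because sys.stdout may be a non-file wrapper:
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(line_buffering=True)

# Buffering made visible on an explicit buffer: the text sits in the
# StringIO until something reads it back.
buf = io.StringIO()
print("buffered line", file=buf)
captured = buf.getvalue()
```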
> # First pass: check that two aligned operations work in parallel
> # Total time should be closer to 2 seconds, rather than 4 if serialized
> print("sanity check")
> ba1 = bytearray(b"1"*16)
> ba2 = bytearray(b"2"*16)
> buf1 = nbd.Buffer.from_bytearray(ba1)
> buf2 = nbd.Buffer.from_bytearray(ba2)
> touch(witness)
> start_t = time.time()
> h.aio_pwrite(buf1, 0)
> h.aio_pwrite(buf2, 0)
>
> while h.aio_in_flight() > 0:
> h.poll(-1)
> end_t = time.time()
> os.unlink(witness)
>
> out = h.pread(16,0)
> print(out)
> t = end_t - start_t
> print(t)
> assert out in [b"1"*16, b"2"*16]
> assert t >= 2 and t <= 3
Based on the later message, it looks like t is higher than expected,
and that this was the assertion that is failing...
> libnbd: debug: nbdsh: nbd_connect_uri: leave: ret=0
> libnbd: debug: nbdsh: nbd_aio_pwrite: enter: buf="
> 0000: 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 31 |1111111111111111|
> " count=16 offset=0 completion=NULL flags=0x0
> libnbd: debug: nbdsh: nbd_aio_pwrite: leave: ret=1
> libnbd: debug: nbdsh: nbd_aio_pwrite: enter: buf="
> 0000: 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 32 |2222222222222222|
> " count=16 offset=0 completion=NULL flags=0x0
> libnbd: debug: nbdsh: nbd_aio_pwrite: leave: ret=2
> libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=2
We definitely kicked off two parallel aio_pwrite commands...
> libnbd: debug: nbdsh: nbd_poll: enter: timeout=-1
> libnbd: debug: nbdsh: nbd_poll: poll start: events=1
> libnbd: debug: nbdsh: nbd_poll: poll end: r=1 revents=1
> libnbd: debug: nbdsh: nbd_poll: leave: ret=1
> libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=1
> libnbd: debug: nbdsh: nbd_poll: enter: timeout=-1
> libnbd: debug: nbdsh: nbd_poll: poll start: events=1
> libnbd: debug: nbdsh: nbd_poll: poll end: r=1 revents=1
> libnbd: debug: nbdsh: nbd_poll: leave: ret=1
> libnbd: debug: nbdsh: nbd_aio_in_flight: enter:
> libnbd: debug: nbdsh: nbd_aio_in_flight: leave: ret=0
...and we were able to poll until both completed, but lack timestamps
to see how long we polled each time...
> libnbd: debug: nbdsh: nbd_pread: enter: buf=<buf> count=16 offset=0 flags=0x0
> libnbd: debug: nbdsh: nbd_pread: poll start: events=1
> libnbd: debug: nbdsh: nbd_pread: poll end: r=1 revents=1
> libnbd: debug: nbdsh: nbd_pread: leave: ret=0
> Traceback (most recent call last):
> File "<frozen runpy>", line 198, in _run_module_as_main
> File "<frozen runpy>", line 88, in _run_code
> File "/usr/lib/python3.12/site-packages/nbd.py", line 3986, in <module>
> nbdsh.shell()
> File "/usr/lib/python3.12/site-packages/nbdsh.py", line 149, in shell
> do_snippet[act](arg)
> File "/usr/lib/python3.12/site-packages/nbdsh.py", line 140, in <lambda>
> "command": lambda arg: exec(arg, d, d),
> ^^^^^^^^^^^^^^^
> File "<string>", line 32, in <module>
> AssertionError
> sanity check
> bytearray(b'2222222222222222')
> 5.224142789840698
...we then checked the contents (it looks like the write of "2"*16
landed last), and overall we spent 5.2 seconds on work we expected to
take between 2 and 3 seconds. Given that the server is set up with the
delay filter adding 2 seconds before each pwrite, we would expect > 2
seconds no matter what, but > 4 seconds only if the two writes were
serialized or if heavy load got in the way of fast execution; if the
two writes ran in parallel, the total should be < 4 seconds.
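The timing arithmetic can be modelled with a toy sketch (my illustration only, with the 2-second delay scaled down to 0.2 s; the threads stand in for the two in-flight aio_pwrite commands, not for anything nbdkit actually does):

```python
import threading
import time

DELAY = 0.2  # stand-in for the delay filter's 2 s per-pwrite delay

def fake_pwrite():
    # Models the server-side delay the filter inserts before each write.
    time.sleep(DELAY)

# Parallel: both writes overlap, so total time is ~DELAY, not 2*DELAY.
threads = [threading.Thread(target=fake_pwrite) for _ in range(2)]
start = time.monotonic()
for th in threads:
    th.start()
for th in threads:
    th.join()
parallel = time.monotonic() - start

# Serialized: one write strictly after the other takes ~2*DELAY.
start = time.monotonic()
fake_pwrite()
fake_pwrite()
serial = time.monotonic() - start

print("parallel %.2f, serial %.2f" % (parallel, serial), flush=True)
assert parallel < 2 * DELAY   # the test's "closer to 2 than 4" expectation
assert serial >= 2 * DELAY
```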
The assert that t <= 3 is probably stricter than necessary; checking
t < 4 instead would add a bit more leeway.
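While we're touching the timing code, an aside (my suggestion, not something the log points at): time.time() follows the wall clock and can jump under NTP adjustment, so time.monotonic() is the safer clock for elapsed-time assertions like these. A minimal sketch:

```python
import time

# time.monotonic() is unaffected by wall-clock steps (NTP, manual clock
# changes), so it is the right clock for measuring how long the two
# aio_pwrite calls took.
start_t = time.monotonic()
time.sleep(0.1)                # stand-in for the write + poll loop
t = time.monotonic() - start_t
# sleep() suspends for at least the requested time; allow a hair of
# clock-resolution slack in the lower bound.
assert t >= 0.09
```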
So how about re-running the test with these enhancements, to see if we
can learn more about the situation?
diff --git i/tests/test-blocksize-sharding.sh w/tests/test-blocksize-sharding.sh
index 7e926ad1..3711c49c 100755
--- i/tests/test-blocksize-sharding.sh
+++ w/tests/test-blocksize-sharding.sh
@@ -80,9 +80,13 @@ witness = os.getenv("witness")
def touch(path):
open(path, "a").close()
+def progress(prefix, start):
+ t = time.time() - start
+ print("%s: %g" % (prefix, t), flush=True)
+
# First pass: check that two aligned operations work in parallel
# Total time should be closer to 2 seconds, rather than 4 if serialized
-print("sanity check")
+print("sanity check", flush=True)
ba1 = bytearray(b"1"*16)
ba2 = bytearray(b"2"*16)
buf1 = nbd.Buffer.from_bytearray(ba1)
@@ -91,21 +95,24 @@ touch(witness)
start_t = time.time()
h.aio_pwrite(buf1, 0)
h.aio_pwrite(buf2, 0)
+progress("after writes", start_t)
while h.aio_in_flight() > 0:
h.poll(-1)
+ progress("after poll", start_t)
end_t = time.time()
os.unlink(witness)
out = h.pread(16,0)
-print(out)
+progress("after read", start_t)
+print(out, flush=True)
t = end_t - start_t
-print(t)
+print("elapsed write time: %g" % t, flush=True)
assert out in [b"1"*16, b"2"*16]
assert t >= 2 and t <= 3
# Next pass: try to kick off unaligned first
-print("unaligned first")
+print("unaligned first", flush=True)
h.zero(16, 0)
ba3 = bytearray(b"3"*12)
ba4 = bytearray(b"4"*16)
@@ -122,14 +129,14 @@ end_t = time.time()
os.unlink(witness)
out = h.pread(16,0)
-print(out)
+print(out, flush=True)
t = end_t - start_t
-print(t)
+print(t, flush=True)
assert out in [b"4"*4 + b"3"*12, b"4"*16]
assert t >= 8
# Next pass: try to kick off aligned first
-print("aligned first")
+print("aligned first", flush=True)
ba5 = bytearray(b"5"*16)
ba6 = bytearray(b"6"*12)
buf5 = nbd.Buffer.from_bytearray(ba5)
@@ -146,9 +153,9 @@ end_t = time.time()
os.unlink(witness)
out = h.pread(16,0)
-print(out)
+print(out, flush=True)
t = end_t - start_t
-print(t)
+print(t, flush=True)
assert out in [b"5"*4 + b"6"*12, b"5"*16]
assert t >= 8
'
I think we should add this patch upstream anyway, especially the
flushes. More debugging is better in case we can't reproduce this now
and it turns up again in future.
Rich.
--
Richard Jones, Virtualization Group, Red Hat
Fedora Windows cross-compiler. Compile Windows programs, test, and
build Windows installers. Over 100 libraries supported.