On Wed, May 26, 2021 at 10:32:08AM +0100, Richard W.M. Jones wrote:
 On Wed, May 26, 2021 at 11:40:11AM +0300, Nir Soffer wrote:
 > On Tue, May 25, 2021 at 9:06 PM Richard W.M. Jones <rjones(a)redhat.com> wrote:
 > > I ran perf as below.  Although nbdcopy and nbdkit themselves do not
 > > require root (and usually should _not_ be run as root), in this case
 > > perf must be run as root, so everything has to be run as root.
 > >
 > >   # perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"
 > 
 > This uses 64 requests with a request size of 32m. In my tests using
 > --requests 16 --request-size 1048576 is faster. Did you try to profile
 > this?
 
 Interesting!  No, I didn't.  In fact I just assumed that larger request
 sizes and more parallel requests would be better.

This is the topology of the machine I ran the tests on:

https://rwmj.files.wordpress.com/2019/09/screenshot_2019-09-04_11-08-41.png

Even a single 32MB buffer isn't going to fit in any cache, so reducing
buffer size should be a win, and once the buffers fit within the
L3 cache, reusing them should also be a win.
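
To put rough numbers on the cache argument, the sketch below multiplies
the number of in-flight requests by the request size, using the figures
quoted in this thread (64 requests of 32M, 64 requests of 1M, and the
suggested 16 requests of 1M).  It assumes one buffer per in-flight
request, which is a simplification rather than something checked against
the nbdcopy source, and working_set_mib is a made-up helper name.  The
getconf line is an optional check that only works on Linux with glibc.

```shell
#!/bin/sh
# Rough in-flight buffer footprint: requests * request size.
# Assumption: one buffer is held per in-flight request.

# working_set_mib REQUESTS REQUEST_SIZE_BYTES -> total MiB in flight
working_set_mib() {
    echo $(( $1 * $2 / 1048576 ))
}

echo "64 x 32M: $(working_set_mib 64 33554432) MiB"
echo "64 x 1M:  $(working_set_mib 64 1048576) MiB"
echo "16 x 1M:  $(working_set_mib 16 1048576) MiB"

# L3 size as reported by glibc on Linux; may be 0 or empty elsewhere.
l3=$(getconf LEVEL3_CACHE_SIZE 2>/dev/null || true)
echo "L3 cache: ${l3:-unknown} bytes"
```

The three footprints come out as 2048, 64 and 16 MiB, which can then be
compared against the L3 size of the machine under test.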

That's the theory anyway ...  Using --request-size=1048576 changes the
flamegraph quite dramatically (see new attachment).

[What is the meaning of the swapper stack traces?  Are they coming
from idle cores?]

The test runs about 8% faster (mean 43.8 s versus 47.4 s):

  $ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"'
  Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy \$uri \$uri"
    Time (mean ± σ):     47.407 s ±  0.953 s    [User: 347.982 s, System: 276.220 s]
    Range (min … max):   46.474 s … 49.373 s    10 runs
 
  $ hyperfine 'nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"'
  Benchmark #1: nbdkit -U - sparse-random size=1T --run "nbdcopy --request-size=1048576 \$uri \$uri"
    Time (mean ± σ):     43.796 s ±  0.799 s    [User: 328.134 s, System: 252.775 s]
    Range (min … max):   42.289 s … 44.917 s    10 runs
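
For reference, the relative improvement can be worked out from the two
hyperfine summaries above.  This is only arithmetic on the reported
means (pct_less is a made-up helper name); it says nothing about why the
time changed.  Note that the min/max ranges of the two runs do not
overlap.

```shell
#!/bin/sh
# pct_less OLD NEW -> percentage reduction going from OLD to NEW
pct_less() {
    awk -v old="$1" -v new="$2" \
        'BEGIN { printf "%.1f\n", (old - new) / old * 100 }'
}

# Mean wall-clock times (seconds) from the two runs.
echo "wall clock: $(pct_less 47.407 43.796)% less"

# User + system CPU seconds: 347.982 + 276.220 and 328.134 + 252.775.
echo "CPU time:   $(pct_less 624.202 580.909)% less"
```

That gives 7.6% less wall-clock time and 6.9% less total CPU time with
--request-size=1048576.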

(Note the buffers are still not being reused.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat 
http://people.redhat.com/~rjones
Read my programming and virtualization blog: 
http://rwmj.wordpress.com
virt-p2v converts physical machines to virtual machines.  Boot with a
live CD or over the network (PXE) and turn machines into KVM guests.
http://libguestfs.org/virt-v2v