Re: [Libguestfs] FYI: perf commands I'm using to benchmark nbdcopy

Wednesday, 26 May 2021

On Tue, May 25, 2021 at 9:06 PM Richard W.M. Jones <rjones(a)redhat.com> wrote:
...
 I ran perf as below.  Although nbdcopy and nbdkit themselves do not
 require root (and usually should _not_ be run as root), in this case
 perf must be run as root, so everything has to be run as root.

   # perf record -a -g --call-graph=dwarf ./nbdkit -U - sparse-random size=1T --run "MALLOC_CHECK_= ../libnbd/run nbdcopy \$uri \$uri"

This uses 64 requests with a request size of 32m. In my tests using
--requests 16 --request-size 1048576 is faster. Did you try to profile
this?
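
For a rough sense of how different the two settings are, here is a small Python sketch of the buffer memory tied up by in-flight requests. It assumes, without checking the nbdcopy source, that each in-flight request holds one buffer of request-size bytes and that "32m" means 32 MiB; the function name is made up for illustration.

```python
# Rough upper bound on buffer memory held by in-flight requests.
# ASSUMPTION: one buffer of request_size bytes per in-flight request.
MiB = 1024 * 1024


def inflight_bytes(requests, request_size):
    """Bytes of buffer space needed if every request slot is in use."""
    return requests * request_size


big = inflight_bytes(64, 32 * MiB)    # settings described for the quoted run
small = inflight_bytes(16, 1048576)   # --requests 16 --request-size 1048576

print(big // MiB, "MiB vs", small // MiB, "MiB")
```

Under those assumptions the first configuration keeps up to 2048 MiB of buffers busy and the second 16 MiB, so allocator behaviour may look quite different between the two.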

...
 Some things to explain:

  * The output is perf.data in the local directory.  This file may be
    huge (22GB for me!)

  * I am running this from the nbdkit directory, so ./nbdkit runs the
    locally compiled copy of nbdkit.  This allows me to make quick
    changes to nbdkit and see the effects immediately.

  * I am running nbdcopy using "../libnbd/run nbdcopy", so that's from
    the adjacent locally compiled libnbd directory.  Again the reason
    for this is so I can make changes, recompile libnbd, and see the
    effect quickly.

  * "MALLOC_CHECK_=" is needed because of complicated reasons to do
    with how the nbdkit wrapper enables malloc-checking.  We should
    probably provide a way to disable malloc-checking when benchmarking
    because it adds overhead for no benefit, but I've not done that yet
    (patches welcome!)

Why enable malloc checking in nbdkit when profiling nbdcopy?

...
  * The test harness is nbdkit-sparse-random-plugin, documented here:
    https://libguestfs.org/nbdkit-sparse-random-plugin.1.html

Does it create a similar pattern to real world images, or more like
the worst case?

In my tests, using the nbdkit memory and pattern plugins gave much more
stable results than real images served via qemu-nbd/nbdkit, but real
images give more realistic results :-)

Maybe we can extract the extents from a real image, and add a plugin
accepting json extents and inventing data for the data extents?
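
To make the idea concrete, here is a minimal Python sketch of the data path such a plugin might have. It is not an nbdkit plugin and not an existing tool; the JSON field names ("start", "length", "data") are an assumption modeled on what `qemu-img map --output=json` prints, and all function names are hypothetical.

```python
import json

# Hypothetical extent map. Field names are an ASSUMPTION, modeled on
# the output of `qemu-img map --output=json`.
EXTENTS_JSON = """
[
  {"start": 0,      "length": 65536,  "data": true},
  {"start": 65536,  "length": 131072, "data": false},
  {"start": 196608, "length": 65536,  "data": true}
]
"""


def load_extents(text):
    """Parse the JSON extent list and sort it by start offset."""
    extents = json.loads(text)
    extents.sort(key=lambda e: e["start"])
    return extents


def invent_byte(offset):
    """Deterministic, never-zero filler byte for a data extent."""
    return (offset % 255) + 1


def read_range(extents, offset, length):
    """Return length bytes at offset; holes and unmapped areas read as zeros."""
    buf = bytearray(length)
    end = offset + length
    for e in extents:
        if not e["data"]:
            continue
        lo = max(offset, e["start"])
        hi = min(end, e["start"] + e["length"])
        for pos in range(lo, hi):
            buf[pos - offset] = invent_byte(pos)
    return bytes(buf)


extents = load_extents(EXTENTS_JSON)
chunk = read_range(extents, 65530, 12)   # straddles a data/hole boundary
```

Because the filler depends only on the offset, repeated reads return the same bytes, which should keep benchmark runs comparable while still following the extent layout of a real image.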

...
  * I'm using DWARF debugging info to generate call stacks, which is
    more reliable than the default (frame pointers).

When I tried to use perf, I did not get proper call stacks; maybe this
was the reason.

...
  * The -a option means I'm measuring events on the whole machine.  You
    can read the perf manual to find out how to measure only a single
    process (eg. just nbdkit or just nbdcopy).  But actually measuring
    the whole machine gives a truer picture, I believe.

Why profile the whole machine? I would profile only nbdcopy or nbdkit,
depending on what we are trying to focus on.

Looking at the attached flame graph, if we focus on the nbdcopy worker_thread,
and sort by time:

poll_both_ends: 14.53% (58%)
malloc: 5.55% (22%)
nbd_ops_async_read: 4.34% (17%)
nbd_ops_get_extents: 0.52% (2%)

If we focus on poll_both_ends:

send: 10.17% (69%)
free: 4.53% (31%)
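
The figures in parentheses can be reproduced from the whole-profile percentages. This short Python sketch assumes the parenthesized numbers are each function's share of the sum of the functions listed in its group, which matches the values above; the helper name is made up.

```python
# Whole-profile percentages read off the flame graph (from the lists above).
worker_thread = {
    "poll_both_ends": 14.53,
    "malloc": 5.55,
    "nbd_ops_async_read": 4.34,
    "nbd_ops_get_extents": 0.52,
}
poll_both_ends = {
    "send": 10.17,
    "free": 4.53,
}


def shares(samples):
    """Each entry as a rounded percentage of the group total."""
    total = sum(samples.values())
    return {name: round(100 * value / total) for name, value in samples.items()}


worker_shares = shares(worker_thread)
poll_shares = shares(poll_both_ends)
print(worker_shares)
print(poll_shares)
```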

So we have a lot of opportunities to optimize by allocating all buffers up front
as done in examples/libev-copy. But I'm not sure we would see the same
picture when using smaller buffers (--request-size 1m).
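
As an illustration of what allocating up front means, here is a minimal buffer-pool sketch in Python. It is not the code from examples/libev-copy (which is C) and not how nbdcopy works today; the class and method names are hypothetical. The point is only that buffers are created once and then recycled, so the copy loop itself never calls the allocator.

```python
from collections import deque


class BufferPool:
    """Fixed set of buffers allocated once and reused for every request."""

    def __init__(self, count, size):
        # All allocation happens here, before the copy loop starts.
        self._free = deque(bytearray(size) for _ in range(count))

    def acquire(self):
        """Take a free buffer; raises IndexError if all are in flight."""
        return self._free.pop()

    def release(self, buf):
        """Return a buffer to the pool once its request has completed."""
        self._free.append(buf)

    def available(self):
        return len(self._free)


pool = BufferPool(count=16, size=1048576)
buf = pool.acquire()      # start a read into buf
pool.release(buf)         # write completed, buffer can be reused
```

With a pool like this the number of requests in flight is naturally capped by the number of buffers, so a worker would only start a new read when acquire() can succeed.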

nbd_ops_async_read is surprising - this is an async operation that should
consume no time. Why does it take 17% of the time?

Thanks for the info!

Nir
