On Tue, May 16, 2023 at 02:37:00PM +0200, Laszlo Ersek wrote:
Something is not adding up.
* I've run "ldd" on my locally built virt-v2v binary, to learn what shared
libraries it uses. Then I located all the packages (installed RPMs) providing those
libraries (symlinks in fact), using "rpm -qf". Then I installed the debuginfo
packages for each of those RPMs.
I've just tried it on RHEL 9 with upstream virt-v2v + commit c0bb624a151b.
I'm seeing some failures but they look quite different to yours and
all seem to be caused by a single leak in libvirt or how we use
libvirt (at least potentially, I've not investigated, and I don't see
this happening in Fedora).
I have:
glibc-2.34-67.el9.x86_64
glibc-debuginfo-2.34-67.el9.x86_64
glibc-debugsource-2.34-67.el9.x86_64
valgrind-3.19.0-3.el9.x86_64
valgrind-devel-3.19.0-3.el9.x86_64
libvirt-9.3.0-1.el9.x86_64
libvirt-debuginfo-9.3.0-1.el9.x86_64
libvirt-debugsource-9.3.0-1.el9.x86_64
How many of the tests fail for you? Just a small number or all of them?
If it's a small number, which ones?
Rich.
I *still* get stack dumps like the following (taken from
"tests/test-v2v-fedora-luks-on-lvm-conversion.sh.log"):
==34448== Conditional jump or move depends on uninitialised value(s)
==34448== at 0x40191DD: __GI___tunables_init (dl-tunables.c:211)
==34448== by 0x4020056: _dl_sysdep_start (dl-sysdep.c:110)
==34448== by 0x4021A07: _dl_start (rtld.c:502)
==34448== by 0x4020AD7: ??? (in /usr/lib64/ld-linux-x86-64.so.2)
==34448== by 0xE: ???
==34448== by 0x1FFEFFE352: ???
==34448== by 0x1FFEFFE35B: ???
==34448== by 0x1FFEFFE366: ???
==34448== by 0x1FFEFFE369: ???
==34448== by 0x1FFEFFE36E: ???
==34448== by 0x1FFEFFE39F: ???
==34448== by 0x1FFEFFE3A2: ???
==34448== by 0x1FFEFFE3A7: ???
==34448== by 0x1FFEFFE3AD: ???
==34448== by 0x1FFEFFE3CA: ???
==34448== by 0x1FFEFFE3D0: ???
==34448== by 0x1FFEFFE3EB: ???
==34448== by 0x1FFEFFE3F1: ???
==34448== by 0x1FFEFFE40C: ???
==34448== by 0x1FFEFFE412: ???
Note the address 0x4020AD7. Valgrind itself says that the address is somewhere inside
"/usr/lib64/ld-linux-x86-64.so.2". Problem is, I *do* have the debuginfo package
installed (with correct version) for that binary. The binary comes from
"glibc-2.34-40.el9_1.1.x86_64", and I've got the matching
"glibc-debuginfo-2.34-40.el9_1.1.x86_64" package installed.
* Now, from that kind of (useless) backtrace, I have four instances in this test case
log, in total. However, there's a different kind too (just one instance):
==34448== Conditional jump or move depends on uninitialised value(s)
==34448== at 0x484A608: strlen (vg_replace_strmem.c:495)
==34448== by 0x5443D32: strdup (strdup.c:41)
==34448== by 0x4F09819: guestfs_int_copy_string_list (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x4F091DD: guestfs_int_copy_environ (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x4EB6B67: run_command (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x4EB778D: guestfs_int_cmd_run (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x4EC7B10: qemu_img_supports_U_option (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x4EC775A: get_json_output (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x4EC745D: guestfs_impl_disk_format (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x4E8769C: guestfs_disk_format (in
/home/lacos/src/v2v/libguestfs/lib/.libs/libguestfs.so.0.513.0)
==34448== by 0x3B2A67: guestfs_int_ocaml_disk_format (in
/home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448== by 0x31B9D6: camlGuestfs__fun_12954 (guestfs.ml:1186)
==34448== by 0x334370: camlStdlib__list__map_233 (in
/home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448== by 0x2AE27A: camlInput_disk__detect_local_input_format_217
(input_disk.ml:142)
==34448== by 0x2ADE82: camlInput_disk__setup_216 (input_disk.ml:88)
==34448== by 0x28E671: camlV2v__main_202 (v2v.ml:552)
==34448== by 0x2DD3C1: camlTools_utils__run_main_and_handle_errors_510
(tools_utils.ml:228)
==34448== by 0x290D07: camlV2v__entry (v2v.ml:700)
==34448== by 0x27FB28: caml_program (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448== by 0x41AD53: caml_start_program (in
/home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448== by 0x41B166: caml_startup_common (in
/home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448== by 0x41B1AC: caml_startup (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
==34448== by 0x27F16F: main (in /home/lacos/src/v2v/virt-v2v/v2v/virt-v2v)
Here all addresses seem to be resolved, even those that point into my locally built
libguestfs. What I don't understand however are the topmost two frames. I *think*
those come from valgrind itself! So is valgrind complaining about... valgrind???
"vg_replace_strmem.c" is definitely a valgrind source file. I've cloned the
upstream git repo and checked -- it is "shared/vg_replace_strmem.c", and that
file has existed since November 2013. Yet, when I install valgrind-debugsource and
valgrind-debuginfo (matching the installed valgrind version --
"valgrind-3.19.0-3.el9.x86_64"), *none* of the files in those packages are
"vg_replace_strmem.c".
After downloading the SRPM from Brew and build-prepping it, I find, in
"shared/vg_replace_strmem.c":
476 /*---------------------- strlen ----------------------*/
477
478 // Note that this replacement often doesn't get used because gcc inlines
479 // calls to strlen() with its own built-in version. This can be very
480 // confusing if you aren't expecting it. Other small functions in
481 // this file may also be inline by gcc.
482
483 #define STRLEN(soname, fnname) \
484 SizeT VG_REPLACE_FUNCTION_EZU(20070,soname,fnname) \
485 ( const char* str ); \
486 SizeT VG_REPLACE_FUNCTION_EZU(20070,soname,fnname) \
487 ( const char* str ) \
488 { \
489 SizeT i = 0; \
490 while (str[i] != 0) i++; \
491 return i; \
492 }
493
494 #if defined(VGO_linux)
495 STRLEN(VG_Z_LIBC_SONAME, strlen)
So basically valgrind tries to preempt the strlen() symbol from glibc with its own
implementation.
Then, "strdup.c" is not a valgrind source file, but I found it from the glibc
debug packages -- "/usr/src/debug/glibc-2.34-40.el9_1.1.x86_64/string/strdup.c".
(How *incredibly* useful of valgrind *not* to print the *full* pathname of a source file.)
It goes like this:
37 /* Duplicate S, returning an identical malloc'd string. */
38 char *
39 __strdup (const char *s)
40 {
41 size_t len = strlen (s) + 1;
42 void *new = malloc (len);
43
44 if (new == NULL)
45 return NULL;
46
47 return (char *) memcpy (new, s, len);
48 }
So guestfs_int_copy_string_list() calls strdup() calls strlen(), with strdup coming from
glibc and strlen coming from valgrind itself. And then valgrind complains about its own
strlen implementation (fun!), which is BTW an incorrect complaint, because the
*C-language* code at lines 488-492 is proper.
This whole thing looks completely busted. I'll try to fool around with glibc
tunables.
Laszlo
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top