Hi Justin,
On 3/20/23 16:47, Justin Churchey wrote:
Hello Laszlo,
Thank you for the rundown. I enabled the
additional LIBGUESTFS_BACKEND_SETTINGS, and I have attached a follow up
to the libguestfs-test-tool output.
Your computer has faulty RAM.
Your libguestfs-test-tool log file contains the following line (read it
very carefully):
LIBGUESTFS_BANKEND_SETTINGS=force_tcg
I was staring at your log, not understanding why the "force_tcg"
setting wouldn't take effect -- because it didn't: the log file
confirms that the repeated test run still used KVM.
That was when I copied and pasted the above line (the part before the
equal sign) into a git-grep, and then into a "git log -S". It turns out
that the variable name captured in your libguestfs-test-tool log is one
that libguestfs never checks -- worse, libguestfs has *never* checked
it over its entire history.
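(For reference, roughly the kind of commands I mean, run in a
libguestfs checkout -- the exact options are not important:)

  $ git grep -n LIBGUESTFS_BANKEND_SETTINGS
  $ git log -S LIBGUESTFS_BANKEND_SETTINGS --oneline

Both come back empty.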
So then I thought, "aha, Justin must have typed the variable name from
memory, instead of using the clipboard". But that's not possible
either: even if you had mistyped the variable name when setting the
environment, libguestfs-test-tool would not look for that (misnamed)
variable, let alone log it!
So the only explanation is that your RAM is faulty; a single character
in the variable name got corrupted in this instance (C -> N):
LIBGUESTFS_BACKEND_SETTINGS
LIBGUESTFS_BANKEND_SETTINGS
             ^
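(Just to illustrate that it really is a single corrupted byte -- one
way to check, with arbitrary temporary file names:)

  $ printf LIBGUESTFS_BACKEND_SETTINGS > good
  $ printf LIBGUESTFS_BANKEND_SETTINGS > bad
  $ cmp -l good bad
  14 103 116

That is: byte 14 differs, octal 103 ('C') versus octal 116 ('N').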
With faulty RAM, there's nothing more to investigate here; the guest
kernel crash (page fault) can be trivially explained by a pointer
getting corrupted and pointing into outer space.
I suggest running MemTest86 or MemTest86+.
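(If rebooting into a dedicated memory tester is inconvenient, a
userspace tool such as "memtester" can give a first indication,
although it cannot exercise all of RAM; the size and pass count below
are just examples:)

  $ sudo memtester 2048M 2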
(NB, faulty RAM is not as infrequent as one would think. If I count
right, this is actually the third time in my life that I've diagnosed
faulty RAM for a user -- not necessarily via the same program /
misbehavior, of course. Also, I think a faulty disk is much less
likely: non-ECC RAM exists, but disks without redundancy checks don't /
shouldn't exist, as far as I know.)
Laszlo
I also checked my CPU settings (cat /proc/cpuinfo output attached),
and the host does appear to support PCLMULQDQ (AMD Ryzen 7 5700X). I
also checked the cpuinfo in one of the guests I have created (Ubuntu
18.04, unstable due to intermittent kernel panics), and it indicates
that this feature is passed down to the guest as well.
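(A quick way to check is, for example:)

  $ grep -m1 -o pclmulqdq /proc/cpuinfo
  pclmulqdq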
I noticed that libguestfs-test-tool didn't seem to like the qemu
settings it tried to boot with. So, I went back to basics: I built a
disk image with qemu-img (qcow2) and used qemu-system-x86_64 to do the
base install (Ubuntu 18.04). The resulting image boots, and I import it
with virt-install. However, the GUI/console tends to lock up shortly
after boot when I am using virt-tools. The guest seems more stable when
I boot it directly with `qemu-system`, and this may be my workaround
for now.
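(Roughly along these lines -- the image name, size, and ISO name here
are only illustrative:)

  $ qemu-img create -f qcow2 ubuntu1804.qcow2 20G
  $ qemu-system-x86_64 -enable-kvm -m 4096 \
      -cdrom ubuntu-18.04-server-amd64.iso \
      -drive file=ubuntu1804.qcow2,format=qcow2,if=virtio \
      -boot d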
In virt-tools, I can consistently get a panic on the guest by trying
to enable the qemu-guest-agent: `systemctl enable qemu-guest-agent`.
Unfortunately, I cannot get the full output from that panic (attached).
It would seem that this problem extends beyond libguestfs-tools. Is
there a KVM mailing list that this might be more appropriate for?
Sincerely,
On Mon, Mar 20, 2023 at 1:31 AM Laszlo Ersek <lersek@redhat.com> wrote:
On 3/17/23 16:10, Justin Churchey wrote:
> Hello Everyone,
>
> I was having some difficulties converting OVA images yesterday. At
> first, I thought it may have been a compatibility issue with
> VirtualBox 7.0. However, when I went to run libguestfs-test-tool, it
> began failing with the exact same error as the conversions, which
> leads me to believe the issue may lie with libguestfs and not the
> images themselves.
>
> To test further, I created a fresh install of Ubuntu 22.04, and the
> libguestfs-test-tool seems to fail with the same error, even on a
> fresh install. I am attaching the libguestfs-test-tool output for
> reference.
>
> Ubuntu 22.04 is running libguestfs-tools 1.46.2-10ubuntu3
>
> If anybody has any insight into the issue, or if you feel a bug report
> needs to be filed, please let me know.
Your appliance kernel crashes.
Here's my theory on why this might happen, based on your log.
The guestfish appliance runs with KVM acceleration.
The crash happens after/while inserting the modules crc32-pclmul.ko and
crct10dif-pclmul.ko.
The "pclmul" in the names of those modules indicates that these modules
calculate various (crc32) checksums with the PCLMULQDQ instruction. I
believe that PCLMULQDQ is an advanced / accelerated instruction and not
all CPUs may support it.
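(One way to confirm what those modules are, on a system that has the
matching kernel modules installed; note that the loaded module names
use underscores:)

  $ modinfo -F description crc32_pclmul crct10dif_pclmul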
Your appliance guest is started with "-cpu max" on the QEMU command line
(from libguestfs commit 30f74f38bd6e, "appliance: Use -cpu max.",
2021-01-28). This is probably why the appliance kernel thinks PCLMULQDQ
is available.
I think the PCLMULQDQ instruction may cause an issue here. I don't know
why it misbehaves under KVM, but that's my suspicion anyway.
Note that the kernel crash log provides the following instruction
(assembly binary) dump:
46 70 48 8b 56 68 48 03 97 90 01 00 00 48 c1 e0 06 48 03 46 20 48 89 97
08 02 00 00 48 be ab aa aa aa aa aa aa aa 48 8b 48 10 <48> 89 0a 48 8b
50 20 48 8b 8f 08 02 00 00 48 89 d0 48 f7 e6 48 c1
with the instruction starting at <48> causing the page fault, as the
direct symptom. Now, we can disassemble this:
$ printf '%b' \
    '\x46\x70\x48\x8b\x56\x68\x48\x03\x97\x90\x01\x00\x00\x48\xc1\xe0\x06\x48\x03\x46\x20\x48\x89\x97\x08\x02\x00\x00\x48\xbe\xab\xaa\xaa\xaa\xaa\xaa\xaa\xaa\x48\x8b\x48\x10\x48\x89\x0a\x48\x8b\x50\x20\x48\x8b\x8f\x08\x02\x00\x00\x48\x89\xd0\x48\xf7\xe6\x48\xc1' \
    > bin
$ ndisasm -b64 bin
00000000 467048 jo 0x4b
00000003 8B5668 mov edx,[rsi+0x68]
00000006 48039790010000 add rdx,[rdi+0x190]
0000000D 48C1E006 shl rax,byte 0x6
00000011 48034620 add rax,[rsi+0x20]
00000015 48899708020000 mov [rdi+0x208],rdx
0000001C 48BEABAAAAAAAAAA mov rsi,0xaaaaaaaaaaaaaaab
-AAAA
00000026 488B4810 mov rcx,[rax+0x10]
0000002A 48890A mov [rdx],rcx <----------- crash
0000002D 488B5020 mov rdx,[rax+0x20]
00000031 488B8F08020000 mov rcx,[rdi+0x208]
00000038 4889D0 mov rax,rdx
0000003B 48F7E6 mul rsi
0000003E 48 rex.w
0000003F C1 db 0xc1
Note the constant 0xaaaaaaaaaaaaaaab; that seems very special. We can
search the kernel tree for it (I'm not bothering to check out the
particular Ubuntu kernel version for now):
$ git grep -i aaaaaaaaaaaaaaab
arch/x86/math-emu/poly_atan.c:/* 0xaaaaaaaaaaaaaaabLL, transferred
to fixedpterm[] */
arch/x86/math-emu/poly_sin.c: 0xaaaaaaaaaaaaaaabLL,
arch/x86/math-emu/poly_tan.c:static const unsigned long long
twothirds = 0xaaaaaaaaaaaaaaabLL;
In particular, the last file (poly_tan.c) contains a snippet like
mul64_Xsig(&accum, &twothirds);
which seems vaguely related to
0000001C 48BEABAAAAAAAAAA mov rsi,0xaaaaaaaaaaaaaaab
-AAAA
...
0000003B 48F7E6 mul rsi
Now this does not seem connected to PCLMULQDQ, but it does somehow look
connected to multiplication.
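(One more observation, for whatever it's worth: 0xaaaaaaaaaaaaaaab is
the multiplicative inverse of 3 modulo 2^64 -- the kind of constant
compilers emit to turn a division by 3, or a fixed-point multiplication
by 2/3, into a plain multiplication. Easy to verify:)

  $ python3 -c 'print(hex(0xaaaaaaaaaaaaaaab * 3 % 2**64))'
  0x1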
I don't really know where to go with this, except for asking KVM
experts.
For now, can you try:

  export LIBGUESTFS_BACKEND_SETTINGS=force_tcg

from <https://libguestfs.org/guestfs.3.html#backend-settings>, and see
if that makes a difference?
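(Equivalently, as a one-shot invocation, assuming you just want to
re-run the test tool:)

  $ LIBGUESTFS_BACKEND_SETTINGS=force_tcg libguestfs-test-tool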
Laszlo