On Mon, Apr 05, 2021 at 04:47:30PM +0300, Sam Eiderman wrote:
> We also looked at the udev settle call points in the logs, and it
> seems it is called many times beforehand.
> The bug I mentioned is
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689; they also
> discuss there that udev settle may not be working as intended.
So I don't know, but it should be relatively easy to tell. First,
you can modify appliance/init to add very verbose debugging to udev
by uncommenting --debug here:
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb6...
Additionally, or instead, you could modify daemon/utils.c to do “ls -lR
/dev/” before and after the udevadm settle command, which should show
whether the additional device nodes are present before and/or after
settling. That would be a pretty good way to tell if udevadm settle is
having the effect we think it should.
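For illustration, this is the shape of that check as a standalone
program (an untested sketch: debug_udev_settle is a name I'm making up
here, and the real daemon code goes through its command helpers rather
than system()):

  #include <stdio.h>
  #include <stdlib.h>

  /* Dump /dev before and after "udevadm settle" so the log shows
   * whether the device nodes only appear once settle returns.
   * With a file argument we use settle -E, which exits as soon as
   * that path exists (the same flag visible in your logs). */
  static void
  debug_udev_settle (const char *file)
  {
    char cmd[256];

    if (system ("ls -lR /dev/") != 0)        /* state before settling */
      fprintf (stderr, "ls -lR /dev/ (before) failed\n");

    if (file)
      snprintf (cmd, sizeof cmd, "udevadm settle -E %s", file);
    else
      snprintf (cmd, sizeof cmd, "udevadm settle");
    if (system (cmd) != 0)
      fprintf (stderr, "%s failed\n", cmd);

    if (system ("ls -lR /dev/") != 0)        /* state after settling */
      fprintf (stderr, "ls -lR /dev/ (after) failed\n");
  }

  int
  main (int argc, char *argv[])
  {
    debug_udev_settle (argc > 1 ? argv[1] : NULL);
    return 0;
  }

You could run it against one of the missing paths, e.g.
/dev/rhel/swap from your second failure.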
> The kernel version of the appliance (as can be seen in the log) is
> 4.19.
> > Collecting the full logs is the right approach to diagnosing this.
> I added the full logs for the first failure; I think we can see from
> there that udev settle is called but the file does not exist yet.
Do you have the full logs from the second case?
> We thought that maybe if we explicitly add the following logic right
> after g.launch() it might help:
>
> 1. For each device returned by: lvm 'lvs' '-o' 'vg_name,lv_name'
>    '-S' 'lv_role=public && lv_skip_activation!=yes' '--noheadings'
>    '--separator' '/'
>    1.1. stat the device /dev/vg/lv
>    1.2. if stat fails because the device does not exist - wait
>    1.3. go back to 1
>
> If we wait for too long, relaunch guestfs.
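In terms of the C API, the loop you describe would look roughly like
this (an untested sketch: wait_for_lv_nodes and the 30 second timeout
are made up, but guestfs_lvs — which, judging from your log, runs that
same lvm command inside the appliance — and guestfs_exists are the
real calls):

  #include <stdio.h>
  #include <stdlib.h>
  #include <time.h>
  #include <unistd.h>
  #include <guestfs.h>

  /* Return 0 once every LV device node exists in the appliance,
   * or -1 on error/timeout (in which case the caller relaunches). */
  static int
  wait_for_lv_nodes (guestfs_h *g, int timeout_secs)
  {
    time_t deadline = time (NULL) + timeout_secs;

    for (;;) {
      char **lvs = guestfs_lvs (g);         /* "/dev/VG/LV" names */
      int missing = 0;
      size_t i;

      if (lvs == NULL)
        return -1;
      for (i = 0; lvs[i] != NULL; ++i) {
        if (guestfs_exists (g, lvs[i]) != 1)  /* node not there yet */
          missing = 1;
        free (lvs[i]);
      }
      free (lvs);

      if (!missing)
        return 0;
      if (time (NULL) >= deadline)
        return -1;
      sleep (1);
    }
  }

  int
  main (int argc, char *argv[])
  {
    guestfs_h *g = guestfs_create ();

    if (g == NULL || argc < 2)
      exit (EXIT_FAILURE);
    if (guestfs_add_drive (g, argv[1]) == -1 || guestfs_launch (g) == -1)
      exit (EXIT_FAILURE);

    if (wait_for_lv_nodes (g, 30) == -1)
      fprintf (stderr, "LV nodes still missing - relaunch here\n");

    guestfs_close (g);
    return 0;
  }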
It'd be a bit of a hack. Probably better to try to work out what's
going wrong first of all. It should be possible to tell from the
kernel, udev and libguestfs logs.
Rich.
> However it would be nicer to maybe implement this inside
> guestfs.launch() itself.
>
> Sam
On Mon, Apr 5, 2021 at 3:45 PM Richard W.M. Jones <rjones(a)redhat.com> wrote:
On Mon, Apr 05, 2021 at 02:47:51PM +0300, Sam Eiderman wrote:
> Hi,
>
> We are experiencing very rare LVM failures - 2 failures so far, in
> different OSs, in different libguestfs functions.
>
> The first failure is inspect_os() not finding the root operating
> system on rhel7.4. LVM volumes are returned by the lvm command but
> the files under /dev do not exist (yet?).
>
> The second failure is in is_lv(): it successfully enumerates all LVM
> volumes, but then an internal stat() fails, again on a /dev file
> that does not exist (yet?) (rhel8.0).
>
> All of our tests run in parallel, 1 guestfs instance per core on a 32
> core machine and they run on GCP (nested virtualization).
>
> What we think is happening here is that libguestfs' appliance is
> booting somewhat slower than usual and that the links to some
> devices have not appeared yet (even after multiple seconds).
> We found this old issue that might be connected to this behavior (in
> some way):
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689
I wonder if "udevadm settle" is not working? The daemon will use this
command at various times in order to ensure that all preceding udev
messages have been processed and all /dev changes have been made.
It is called once at appliance boot:
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/appliance/init#L109
And throughout the daemon code:
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb64e9d450613a27/daemon/utils.c#L732
$ git grep 'udev_settle ()' -- daemon
daemon/blockdev.c: udev_settle ();
daemon/cryptsetup.ml: udev_settle ()
daemon/cryptsetup.ml: udev_settle ()
daemon/file.c: udev_settle ();
daemon/guestfsd.c: udev_settle ();
daemon/hotplug.c: udev_settle ();
[etc etc]
It could be that udev_settle is not being called at the right points,
or is not working in the way we understand.
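One way to test the second possibility independently of the udevadm
binary is libudev's queue API, which (as far as I know) checks the
same condition "udevadm settle" waits on. If it reports the queue as
empty while the /dev nodes are still missing, then the events were
never queued at all and settle has nothing to wait for. A minimal
check (untested sketch, link with -ludev and run it inside the
appliance at the point of failure):

  #include <stdio.h>
  #include <libudev.h>

  int
  main (void)
  {
    struct udev *udev = udev_new ();
    struct udev_queue *queue;

    if (udev == NULL)
      return 1;
    queue = udev_queue_new (udev);
    if (queue == NULL) {
      udev_unref (udev);
      return 1;
    }

    /* Empty queue = all pending udev events have been handled. */
    printf ("udev queue is %s\n",
            udev_queue_get_queue_is_empty (queue) ? "empty" : "busy");

    udev_queue_unref (queue);
    udev_unref (udev);
    return 0;
  }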
...
> Short second failure logs (is_lv() only) - notice that is_lv() is
> invoked on /dev/vg_myvg/lv_var but it fails because /dev/rhel/swap
> does not exist.
>
> 2021-03-07 10:58:53 T libguestfs - 0 - enter - is_lv
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: =>
> aug_get (0x13) took 0.00 secs
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: <= is_lv
> (0x108) request length 64 bytes
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf:
> stdout=n stderr=y flags=0x0
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf: udevadm
> --debug settle -E /dev/vg_myvg/lv_var
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm 'lvs'
> '-o' 'vg_name,lv_name' '-S' 'lv_role=public &&
> lv_skip_activation!=yes' '--noheadings' '--separator' '/'
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm
> returned 0
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm: stdout:
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/root
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/swap
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - vg_myvg/lv_var
> 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: error:
> stat: /dev/rhel/swap: No such file or directory
You might want to look earlier in this log to see if udevadm settle
was called between the LVs being activated and this API function. If
it was not being called then possibly we need to insert a call after
activation. If it was being called then perhaps udev settle is not
working the way we understand it.
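If a settle call does turn out to be missing after activation, the
fix would have roughly this shape (a standalone illustration, not the
daemon's actual code - in the daemon it would be the existing
udev_settle () wrapper after the vgchange, via the command helpers
rather than system()):

  #include <stdio.h>
  #include <stdlib.h>

  int
  main (void)
  {
    /* Activate all volume groups ... */
    if (system ("lvm vgchange -ay") != 0) {
      fprintf (stderr, "vgchange failed\n");
      return 1;
    }
    /* ... and do not return to the caller until udev has created
     * the /dev/VG/LV symlinks for the newly activated LVs. */
    if (system ("udevadm settle") != 0) {
      fprintf (stderr, "udevadm settle failed\n");
      return 1;
    }
    return 0;
  }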
Collecting the full logs is the right approach to diagnosing this.
The only other issue I can think of is the change in kernel PCI device
enumeration code (starting in Linux 5.6,
https://bugzilla.redhat.com/show_bug.cgi?id=1804207). I suppose in
theory the underlying devices might not be ready at all before we run
udev settle in the appliance. However I have not seen this actually
happen.
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines. Supports shell scripting,
bindings from many languages.
http://libguestfs.org
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-df lists disk usage of guests without needing to install any
software inside the virtual machine. Supports Linux and Windows.
http://people.redhat.com/~rjones/virt-df/