We also looked at udev settle call points in the logs and it seems that it
is called a lot of times before.
The bug I mentioned is
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689 and they also
discuss that maybe udev settle is not working as intended.
The kernel version of the appliance (as can be seen in the log) is 4.19
Collecting the full logs is the right approach to diagnosing this.
I added the full log for the first failure logs, I think we can see from
there that udev settle is called but the file does not exist yet.
We thought that maybe if we explicitly add the following logic right after
g.launch() it might help:
1. For each device returned by: lvm 'lvs' '-o' 'vg_name,lv_name'
'-S'
'lv_role=public && lv_skip_activation!=yes' '--noheadings'
'--separator' '/'
1.1. stat the device /dev/vg/lv
1.2. if stat fails on device does not exist - wait
1.3. Go back to 1
If we wait for too long, relaunch guestfs.
However it would be nicer to maybe implement this inside guestfs.launch()
itself
Sam
On Mon, Apr 5, 2021 at 3:45 PM Richard W.M. Jones <rjones(a)redhat.com> wrote:
> On Mon, Apr 05, 2021 at 02:47:51PM +0300, Sam Eiderman wrote:
> > Hi,
> >
> > We are experiencing very rare LVM failures - 2 failures so far, in
> > different OSs, in different libguestfs functions.
> >
> > The first failure is inspect_os() not finding the root operating
> > system on rhel7.4.
> > LVM volumes are returned by lvm command but files under /dev do not
> exist (yet?)
> >
> > Second failure is in is_lv() - is_lv() successfully enumerates all lvm
> > volumes but then internal stat() command fails again on /dev file
> > since it does not exist (yet?) (rhel8.0)
> >
> > All of our tests run in parallel, 1 guestfs instance per core on a 32
> > core machine and they run on GCP (nested virtualization).
> >
> > What we think that is happening here is that libguestfs' appliance is
> > booting very somewhat slower than usual and that the links to some
> > devices do not appear yet (even after multiple seconds).
> > We found this old issue that might be connected to this behavior (in
> > some way):
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689
>
> I wonder if "udevadm settle" is not working? The daemon will use this
> command at various times in order to ensure that all preceeding udev
> messages have been processed and all /dev changes have been made.
>
> It is called once at appliance boot:
>
>
>
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb6...
>
> And throughout the daemon code:
>
>
>
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb6...
>
> $ git grep 'udev_settle ()' -- daemon
> daemon/blockdev.c: udev_settle ();
> daemon/cryptsetup.ml: udev_settle ()
> daemon/cryptsetup.ml: udev_settle ()
> daemon/file.c: udev_settle ();
> daemon/guestfsd.c: udev_settle ();
> daemon/hotplug.c: udev_settle ();
> [etc etc]
>
> It could be that udev_settle is not being called at the right points,
> or is not working in the way we understand.
>
> ...
> > Short second failure logs (is_lv() only) - notice that is_lv() is
> > invoked on /dev/vg_myvg/lv_var but it fails due to a problem in
> > /dev/rhel/swap not existing)
> >
> > 2021-03-07 10:58:53 T libguestfs - 0 - enter - is_lv
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: =>
> > aug_get (0x13) took 0.00 secs
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: <= is_lv
> > (0x108) request length 64 bytes
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf:
> > stdout=n stderr=y flags=0x0
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf: udevadm
> > --debug settle -E /dev/vg_myvg/lv_var
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm 'lvs'
> > '-o' 'vg_name,lv_name' '-S' 'lv_role=public
&&
> > lv_skip_activation!=yes' '--noheadings' '--separator'
'/'
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm returned
> 0
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm: stdout:
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/root
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/swap
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - vg_myvg/lv_var
> > 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: error:
> > stat: /dev/rhel/swap: No such file or directory
>
> You might want to look earlier in this log to see if udevadm settle
> was called between the LVs being activated and this API function. If
> it was not being called then possibly we need to insert a call after
> activation. If it was being called then perhaps udev settle is not
> working the way we understand it.
>
Collecting the full logs is the right approach to diagnosing this.
>
> The only other issue I can think of is the change in kernel PCI device
> enumeration code (starting in Linux 5.6,
>
https://bugzilla.redhat.com/show_bug.cgi?id=1804207). I suppose in
> theory the underlying devices might not be ready at all before we run
> udev settle in the appliance. However I have not seen this actually
> happen.
>
> Rich.
>
> --
> Richard Jones, Virtualization Group, Red Hat
>
http://people.redhat.com/~rjones
> Read my programming and virtualization blog:
http://rwmj.wordpress.com
> libguestfs lets you edit virtual machines. Supports shell scripting,
> bindings from many languages.
http://libguestfs.org
>
>