On Mon, Apr 05, 2021 at 02:47:51PM +0300, Sam Eiderman wrote:
 Hi,
 
 We are experiencing very rare LVM failures - 2 failures so far, in
 different OSs, in different libguestfs functions.
 
 The first failure is inspect_os() not finding the root operating
 system on rhel7.4.
 LVM volumes are returned by lvm command but files under /dev do not exist (yet?)
 
 Second failure is in is_lv() - is_lv() successfully enumerates all lvm
 volumes but then internal stat() command fails again on /dev file
 since it does not exist (yet?) (rhel8.0)
 
 All of our tests run in parallel, 1 guestfs instance per core on a 32
 core machine and they run on GCP (nested virtualization).
 
 What we think that is happening here is that libguestfs' appliance is
 booting very somewhat slower than usual and that the links to some
 devices do not appear yet (even after multiple seconds).
 We found this old issue that might be connected to this behavior (in
 some way): 
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=616689 
I wonder if "udevadm settle" is not working?  The daemon will use this
command at various times in order to ensure that all preceeding udev
messages have been processed and all /dev changes have been made.
It is called once at appliance boot:
 
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb6...
And throughout the daemon code:
 
https://github.com/libguestfs/libguestfs/blob/b18ac489db76a700f2168ae6eb6...
  $ git grep 'udev_settle ()' -- daemon
  daemon/blockdev.c:  udev_settle ();
  daemon/cryptsetup.ml:  udev_settle ()
  daemon/cryptsetup.ml:  udev_settle ()
  daemon/file.c:    udev_settle ();
  daemon/guestfsd.c:  udev_settle ();
  daemon/hotplug.c:    udev_settle ();
  [etc etc]
It could be that udev_settle is not being called at the right points,
or is not working in the way we understand.
...
 Short second failure logs (is_lv() only) - notice that is_lv() is
 invoked on /dev/vg_myvg/lv_var but it fails due to a problem in
 /dev/rhel/swap not existing)
 
 2021-03-07 10:58:53 T libguestfs - 0 - enter - is_lv
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: =>
 aug_get (0x13) took 0.00 secs
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: <= is_lv
 (0x108) request length 64 bytes
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf:
 stdout=n stderr=y flags=0x0
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - commandrvf: udevadm
 --debug settle -E /dev/vg_myvg/lv_var
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm 'lvs'
 '-o' 'vg_name,lv_name' '-S' 'lv_role=public &&
 lv_skip_activation!=yes' '--noheadings' '--separator' '/'
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm returned 0
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - command: lvm: stdout:
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/root
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - rhel/swap
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - vg_myvg/lv_var
 2021-03-07 10:58:53 T libguestfs - 0 - appliance - guestfsd: error:
 stat: /dev/rhel/swap: No such file or directory 
You might want to look earlier in this log to see if udevadm settle
was called between the LVs being activated and this API function.  If
it was not being called then possibly we need to insert a call after
activation.  If it was being called then perhaps udev settle is not
working the way we understand it.
Collecting the full logs is the right approach to diagnosing this.
The only other issue I can think of is the change in kernel PCI device
enumeration code (starting in Linux 5.6,
https://bugzilla.redhat.com/show_bug.cgi?id=1804207).  I suppose in
theory the underlying devices might not be ready at all before we run
udev settle in the appliance.  However I have not seen this actually
happen.
Rich.
-- 
Richard Jones, Virtualization Group, Red Hat 
http://people.redhat.com/~rjones
Read my programming and virtualization blog: 
http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  
http://libguestfs.org