I'll try to explain the scenario better:
I have several hosts running lots of VMs that are generated from a few base images. Say A, B and C are the base images (backing files), and A1, A2, ..., B1, B2, ... are the clones on top of which the newly spawned VMs run; I'll write A*, B*, C* for the clones of each base.
I need to collect the disk states of the A*, B*, C* machines and see what has been written there. I don't care about the whole content, as the content of the base images A, B and C is well known to me; the only thing that matters is the deltas in the new clones.
One more piece of the puzzle is that the inspection does not happen on the hosts running the VMs but on a dedicated server.
My idea was to collect those "snapshots" (in the generic sense, not the QEMU one) from the hosts and send them to my inspection server. As A, B and C are accessible from that server, the only thing I need to do is rebase those snapshots so that I can inspect them correctly through libguestfs, and this has proved to work (I'm using read-only mode, as I only care about reading the disks). I'm not really interested in a consistent point-in-time state of the disks: the operation is done several times a day, so I can cope with semi-consistent data, as it can easily be reconstructed.
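For reference, the rebase-and-inspect step on the inspection server looks roughly like the sketch below. It assumes the clones are qcow2 overlays and uses made-up file names; the helper names (`rebase_command`, `rebase`, `open_readonly`) are only for illustration, and a guest with several mount points would need more than mounting the root.

```python
import subprocess


def rebase_command(overlay, base, fmt="qcow2"):
    """Build the qemu-img call that repoints an overlay at a local base image."""
    # -u ("unsafe") only rewrites the backing-file reference in the overlay
    # header; no data is copied or compared. That is fine here because the
    # local base is assumed to be identical to the one used on the host.
    return ["qemu-img", "rebase", "-u", "-f", fmt, "-b", base, "-F", fmt, overlay]


def rebase(overlay, base, fmt="qcow2"):
    """Run the rebase; raises CalledProcessError if qemu-img fails."""
    subprocess.run(rebase_command(overlay, base, fmt), check=True)


def open_readonly(overlay, fmt="qcow2"):
    """Open the rebased overlay read-only through the libguestfs Python bindings."""
    import guestfs  # imported lazily so the helpers above work without libguestfs

    g = guestfs.GuestFS(python_return_dict=True)
    g.add_drive_opts(overlay, format=fmt, readonly=1)
    g.launch()
    roots = g.inspect_os()
    if not roots:
        raise RuntimeError("no operating system found in %s" % overlay)
    g.mount_ro(roots[0], "/")  # simplification: only the root filesystem
    return g
```

Usage would be something like `rebase("A1.qcow2", "/images/A.qcow2")` followed by `g = open_readonly("A1.qcow2")`, where both paths are placeholders.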
My real problem comes when I try to inspect a disk snapshot: libguestfs will, of course, let me see the whole content of the disk, which means A + A*. Apart from the CPU time wasted looking at files whose state I already know (the ones contained in A), this generates a lot of noise. A Linux base image with some libraries installed consists of 20K+ files; installing something extra (an Apache server, for example) adds just a few hundred new files, and those are the only ones I'm interested in.
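To make the goal concrete, here is a minimal sketch of the result I'm after. It assumes a hypothetical "manifest" per image, i.e. a dict mapping each path to some checksum; the paths and checksums below are made up. Building such a manifest for the clone currently means walking all of A + A*, which is exactly the cost I'd like to avoid.

```python
def delta(base_manifest, clone_manifest):
    """Return the paths that differ between a base image and one of its clones.

    Both arguments are dicts mapping a file path to a checksum string.
    """
    added = sorted(p for p in clone_manifest if p not in base_manifest)
    modified = sorted(
        p for p in clone_manifest
        if p in base_manifest and clone_manifest[p] != base_manifest[p]
    )
    removed = sorted(p for p in base_manifest if p not in clone_manifest)
    return {"added": added, "modified": modified, "removed": removed}


# Made-up example: A is the base image, A1 a clone with Apache installed.
manifest_a = {"/bin/ls": "c1", "/etc/hosts": "c2", "/etc/motd": "c3"}
manifest_a1 = {"/bin/ls": "c1", "/etc/hosts": "c9", "/usr/sbin/apache2": "c4"}

result = delta(manifest_a, manifest_a1)
```

Only the entries in `result` are interesting to me; everything that is unchanged from A is noise.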
So my real question is: is there a way to distinguish the files contained in the two different disk images (A and A1), or should I think about a totally different approach?
Thank you.