On Wed, May 17, 2023 at 03:28:11PM +0200, Laszlo Ersek wrote:
On 5/16/23 14:12, Richard W.M. Jones wrote:
> +/* This is the heart of the algorithm, the function which corrupts
> + * the buffer after reading it from the plugin.
> + *
> + * The observation is that if we have a block of (eg) size 10^6 bits
> + * and our probability of finding a corrupt bit is (eg) 1/10^4, then
> + * we expect approximately 100 bits in the block to be corrupted.
This is a binomial distribution (in the example, n=10^6, p=10^(-4)):
https://en.wikipedia.org/wiki/Binomial_distribution
The expected value is E = n*p = 10^2, so the comment looks correct in
that regard.
There is a different interpretation for "we expect" as well. Namely, the
"mode" -- the most probable outcome, the number of corrupted bits we're
most probably going to see. That's a different concept
<https://en.wikipedia.org/wiki/Mode_(statistics)>. However, per the WP
article, for the binomial distribution, it is essentially the same.
Namely, the WP article says, for this distribution, the mode M (an
integer) is given uniquely by
(n + 1) * p - 1 <= M < (n + 1) * p [1]
if ((n + 1) * p) is *not* an integer, and the M1 and M2 (equally
probable, integer) modes
M1 = (n + 1) * p [2]
M2 = (n + 1) * p - 1 [3]
if ((n + 1) * p) *is* an integer.
Now, if we distribute the multiplications over the addends (break the
parens open), we get:
n * p + p - 1 <= M < n * p + p [1]
M1 = n * p + p [2]
M2 = n * p + p - 1 [3]
But for the binomial distribution, E = n * p, so we can substitute (also
rewriting +(p-1) as -(1-p) in [1]):
E - (1 - p) <= M < E + p [1]
M1 = E + p [2]
M2 = E + p - 1 [3]
Using the values from the code comment:
n = 1,000,000
p = 0.0001
E = n * p = 100
(n + 1) * p = 100.0001 --> not an integer, so [1] applies:
E - (1 - p) <= M < E + p [1]
99.0001 <= M < 100.0001
Therefore the mode M is 100 -- it equals the expected value E, for the
particular "n" and "p" values!
In general, they need not be equal. (For example, the mode(s) are always
integers, but E need not be: if we change n to 1,000,001 then E is
100.0001, which is clearly impossible to interpret as a number of
corrupted bits. It's effectively an average, but never directly produced
by the distribution as a value.)
Either way, my point is that M, M1 and M2 (from [1], [2], [3]) are very
close to E in the general case too, so the "we expect" language applies
even when we interpret it as "mode", not as "expected value".
> + *
> + * For stuck bits we want the corrupted bits to be the same on each
> + * access, either relative to the backing disk (STUCK_BITS) or to the
> + * request (STUCK_WIRES).
> + *
> + * Instead of creating an expensive bitmap ahead of time covering the
> + * whole disk, we can use the random number generator with a fixed
> + * seed derived from the offset of the start of the block. We can
> + * then choose a random number uniformly in the range [0..2*(1/P)] (in
> + * the example [0..2*10^4]) as the distance to the next corrupt bit.
> + * We jump forwards, corrupt that bit, and repeat until we reach
> + * the end of the block.
Two points:
(1) I figure the idea is that the "average distance" will be 1/p bits --
the expected value being (0 + 2*(1/p))/2 == 1/p -- so we expect this
"average distance" to fit n/(1/p) = n*p times in the block size.
This looks sane, but I'm unequipped to make any arguments about
"expected value".
According to Wikipedia, this is the Irwin-Hall distribution
<https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution>: "the
sum of a number of independent random variables, each having a uniform
distribution". The individual random variables need to follow continuous
(not discrete) uniform distributions though, so I don't know if
Irwin-Hall "applies" here at all.
Wikipedia says that, as "n" increases -- that is, as we sample our
underlying uniform U(0, 1) distribution more and more times, and add up
the results --, the distribution of the sum (= the Irwin-Hall
distribution) approximates a Normal distribution, with both the expected
value and the mode being (s/2), where "s" is the number of samples
(variables) summed.
I don't know what happens if we scale the underlying U(0, 1)
distribution's domain by 2/p, so that it becomes U(0, 2/p). I guess we'd
*hope* that the expected value of the sum scales similarly, to:
E = s/2 * 2/p = s/p
Because in that case, we could wish for the expected value (and mode) of
the sum of the jumps to match our block size, that is:
E = s/p = n
or put differently,
s = n*p
Meaning that, in the average case, we'd expect to sum as many "samples",
i.e. take as many "jumps", as n*p (= 100 jumps with our numbers).
But this is entirely hand-waving; I don't know if this holds up at all!
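A quick-and-dirty simulation (again purely illustrative, not part of the
patch) at least suggests the hand-waving lands in the right ballpark:
drawing jump distances uniformly from [0, 2*(1/p)) and counting how many
corrupted bits land inside a 10^6-bit block gives an average close to
n*p = 100:

  /* Monte Carlo check: average number of corrupted bits per block when
   * jump distances are uniform in [0, 2*(1/p)).  Illustration only.
   */
  #include <stdio.h>
  #include <stdlib.h>

  int
  main (void)
  {
    const unsigned long n = 1000000;   /* block size in bits */
    const unsigned interval = 20000;   /* 2*(1/p) for p = 10^-4 */
    const int trials = 10000;
    long total = 0;

    srand (1);
    for (int t = 0; t < trials; t++) {
      unsigned long pos = 0;
      for (;;) {
        pos += (unsigned) rand () % interval;  /* jump (modulo bias ignored) */
        if (pos >= n)
          break;
        total++;                               /* one corrupted bit */
        pos++;                                 /* step past it */
      }
    }
    printf ("average corrupted bits per block = %.2f (n*p = 100)\n",
            (double) total / trials);
    return 0;
  }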
Yes, I don't think the bit picking method I chose gets the right
distribution exactly. However it has the advantage of being
(a) reproducible without needing to store any state except for the
small random number generator state, and (b) easy enough to implement.
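Roughly, the shape of the loop is something like this (a simplified
sketch only, not the actual filter code; the xorshift64 PRNG and the
names here are invented for illustration):

  #include <stdint.h>

  #define INTERVAL 20000          /* 2*(1/P) for P = 10^-4 */

  static uint64_t
  xorshift64 (uint64_t *state)
  {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
  }

  /* buf covers one block; blkstart is the bit offset of the block on
   * the backing disk.  Using it as a fixed seed means the same bits
   * come out corrupt on every access to this block.
   */
  static void
  corrupt_block (uint8_t *buf, uint64_t count_bits, uint64_t blkstart)
  {
    uint64_t state = blkstart | 1;     /* avoid an all-zero seed */
    uint64_t pos = 0;

    for (;;) {
      /* Distance to the next corrupt bit, uniform in [0, 2*(1/P)). */
      pos += xorshift64 (&state) % INTERVAL;
      if (pos >= count_bits)
        break;
      buf[pos/8] |= (uint8_t) (1u << (pos & 7));   /* stuck at 1 */
      pos++;                                       /* step past the bit */
    }
  }

The only persistent state is the seed derived from the block offset,
which is what makes the corruption reproducible.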
(2) I think the interval as written is off-by-one. The left hand side
(0) should indeed be inclusive, but the RHS 2*(1/p) should be
*excluded*. We're supposed to have 20,000 integers in the set, from 0 to
19,999 inclusive.
This is actually what we use - I'll update the comment.
> +# This will give an approximate count of the number of set bits.
> +
> +zbytes="$( od -A n -w1 -v -t x1 < $f | sort | uniq -c |
> +           $SED -n -E -e 's/([0-9]+)[[:space:]]+00[[:space:]]*$/\1/p' )"
> +nzbits=$(( 10000000 - zbytes )); # looks wrong but actually correct ...
Some explanation on this would be nice.
In v2 I killed this completely and use some Python code instead.
(That didn't fix the bias I was observing in this test, nor did adding
lots of debugging, so I still don't understand that.)
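
For the record, the arithmetic in the shell version works (assuming $f
is 10,000,000 bytes of zeroes read through the filter) because zbytes
counts the zero bytes, so 10000000 - zbytes counts the non-zero bytes,
and with corruption this sparse almost every non-zero byte holds
exactly one set bit.  An exact count is also easy, e.g. with this
throwaway C program (purely illustrative -- the v2 test really does
use Python):

  /* Exact count of set bits in a file.  Illustration only; uses the
   * GCC/Clang popcount builtin.
   */
  #include <stdio.h>

  int
  main (int argc, char *argv[])
  {
    FILE *fp;
    int c;
    unsigned long bits = 0;

    if (argc != 2) {
      fprintf (stderr, "usage: %s FILE\n", argv[0]);
      return 1;
    }
    fp = fopen (argv[1], "rb");
    if (fp == NULL) {
      perror (argv[1]);
      return 1;
    }
    while ((c = getc (fp)) != EOF)
      bits += __builtin_popcount ((unsigned) c);
    fclose (fp);
    printf ("%lu\n", bits);
    return 0;
  }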
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top