On Wed, May 17, 2023 at 03:28:11PM +0200, Laszlo Ersek wrote:
On 5/16/23 14:12, Richard W.M. Jones wrote:
> +/* This is the heart of the algorithm, the function which corrupts
> + * the buffer after reading it from the plugin.
> + *
> + * The observation is that if we have a block of (eg) size 10^6 bits
> + * and our probability of finding a corrupt bit is (eg) 1/10^4, then
> + * we expect approximately 100 bits in the block to be corrupted.
This is a binomial distribution (in the example, n=10^6, p=10^(-4)):
https://en.wikipedia.org/wiki/Binomial_distribution
The expected value is E = n*p = 10^2, so the comment looks correct in
that regard.
There is a different interpretation for "we expect" as well. Namely, the
"mode" -- the most probable outcome, the number of corrupted bits we're
most probably going to see. That's a different concept
<https://en.wikipedia.org/wiki/Mode_(statistics)>. However, per the WP
article, for the binomial distribution, it is essentially the same.
Namely, the WP article says, for this distribution, the mode M (an
integer) is given uniquely by
(n + 1) * p - 1 <= M < (n + 1) * p [1]
if ((n + 1) * p) is *not* an integer, and the M1 and M2 (equally
probable, integer) modes
M1 = (n + 1) * p [2]
M2 = (n + 1) * p - 1 [3]
if ((n + 1) * p) *is* an integer.
Now, if we distribute the multiplications over the addends (break the
parens open), we get:
n * p + p - 1 <= M < n * p + p [1]
M1 = n * p + p [2]
M2 = n * p + p - 1 [3]
But for the binomial distribution, E = n * p, so we can substitute (also
rewriting +(p-1) as -(1-p) in [1]):
E - (1 - p) <= M < E + p [1]
M1 = E + p [2]
M2 = E + p - 1 [3]
Using the values from the code comment:
n = 1,000,000
p = 0.0001
E = n * p = 100
(n + 1) * p = 100.0001 --> not an integer, so [1] applies:
E - (1 - p) <= M < E + p [1]
99.0001 <= M < 100.0001
Therefore the mode M is 100 -- it equals the expected value E, for the
particular "n" and "p" values!
In general, they need not be equal. (For example, the mode(s) are always
integers, but E need not be: if we change n to 1,000,001 then E is
100.0001, which is clearly impossible to interpret as a number of
corrupted bits. It's effectively an average, but never directly produced
by the distribution as a value.)
Either way, my point is that M, M1 and M2 (from [1], [2], [3]) are very
close to E in the general case too, so the "we expect" language applies
even when we interpret it as "mode", not as "expected value".
> + *
> + * For stuck bits we want the corrupted bits to be the same on each
> + * access, either relative to the backing disk (STUCK_BITS) or to the
> + * request (STUCK_WIRES).
> + *
> + * Instead of creating an expensive bitmap ahead of time covering the
> + * whole disk, we can use the random number generator with a fixed
> + * seed derived from the offset of the start of the block. We can
> + * then choose a random number uniformly in the range [0..2*(1/P)] (in
> + * the example [0..2*10^4]) as the distance to the next corrupt bit.
> + * We jump forwards, corrupt that bit, and repeat until we reach
> + * the end of the block.
Two points:
(1) I figure the idea is that the "average distance" will be 1/p bits --
the expected value being (0 + 2*(1/p))/2 == 1/p -- so we expect this
"average distance" to fit n/(1/p) = n*p times in the block size.
This looks sane, but I'm unequipped to make any arguments about
"expected value".
According to Wikipedia, this is the Irwin-Hall distribution
<https://en.wikipedia.org/wiki/Irwin%E2%80%93Hall_distribution>: "the
sum of a number of independent random variables, each having a uniform
distribution". The individual random variables need to follow continuous
(not discrete) uniform distributions though, so I don't know if
Irwin-Hall "applies" here at all.
Wikipedia says that, as "n" increases -- that is, as we sample our
underlying uniform U(0, 1) distribution more and more times, and add up
the results --, the distribution of the sum (= the Irwin-Hall
distribution) approximates a Normal distribution, with both the expected
value and the mode being (s/2), where "s" is the number of samples
(variables) summed.
I don't know what happens if we scale the underlying U(0, 1)
distribution's domain by 2/p, so that it becomes U(0, 2/p). I guess we'd
*hope* that the expected value of the sum scales similarly, to:
E = s/2 * 2/p = s/p
Because in that case, we could wish for the expected value (and mode) of
the sum of the jumps to match our block size, that is:
E = s/p = n
or put differently,
s = n*p
Meaning that, in the average case, we'd expect to sum as many "samples",
i.e. take as many "jumps", as n*p (= 100 jumps with our numbers).
But this is entirely hand-waving; I don't know if this holds up at all!
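A quick-and-dirty simulation (again purely illustrative, not part of the
patch) at least suggests the hand-waving lands in the right ballpark:
drawing jump distances uniformly from [0, 2*(1/p)) and counting how many
corrupted bits land inside a 10^6-bit block gives an average close to
n*p = 100:

  /* Monte Carlo check: average number of corrupted bits per block when
   * jump distances are uniform in [0, 2*(1/p)).  Illustration only.
   */
  #include <stdio.h>
  #include <stdlib.h>

  int
  main (void)
  {
    const unsigned long n = 1000000;   /* block size in bits */
    const unsigned interval = 20000;   /* 2*(1/p) for p = 10^-4 */
    const int trials = 10000;
    long total = 0;

    srand (1);
    for (int t = 0; t < trials; t++) {
      unsigned long pos = 0;
      for (;;) {
        pos += (unsigned) rand () % interval;  /* jump (modulo bias ignored) */
        if (pos >= n)
          break;
        total++;                               /* one corrupted bit */
        pos++;                                 /* step past it */
      }
    }
    printf ("average corrupted bits per block = %.2f (n*p = 100)\n",
            (double) total / trials);
    return 0;
  }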
Yes, I don't think the bit picking method I chose gets the right
distribution exactly. However it has the advantage of being
(a) reproducible without needing to store any state except for the
small random number generator state, and (b) easy enough to implement.
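Roughly, the shape of the loop is something like this (a simplified
sketch only, not the actual filter code; the xorshift64 PRNG and the
names here are invented for illustration):

  #include <stdint.h>

  #define INTERVAL 20000          /* 2*(1/P) for P = 10^-4 */

  static uint64_t
  xorshift64 (uint64_t *state)
  {
    uint64_t x = *state;
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    return *state = x;
  }

  /* buf covers one block; blkstart is the bit offset of the block on
   * the backing disk.  Using it as a fixed seed means the same bits
   * come out corrupt on every access to this block.
   */
  static void
  corrupt_block (uint8_t *buf, uint64_t count_bits, uint64_t blkstart)
  {
    uint64_t state = blkstart | 1;     /* avoid an all-zero seed */
    uint64_t pos = 0;

    for (;;) {
      /* Distance to the next corrupt bit, uniform in [0, 2*(1/P)). */
      pos += xorshift64 (&state) % INTERVAL;
      if (pos >= count_bits)
        break;
      buf[pos/8] |= (uint8_t) (1u << (pos & 7));   /* stuck at 1 */
      pos++;                                       /* step past the bit */
    }
  }

The only persistent state is the seed derived from the block offset,
which is what makes the corruption reproducible.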
(2) I think the interval as written is off-by-one. The left hand side
(0) should indeed be inclusive, but the RHS 2*(1/p) should be
*excluded*. We're supposed to have 20,000 integers in the set, from 0 to
19,999 inclusive.
This is actually what we use - I'll update the comment.
> +# This will give an approximate count of the number of set bits.
> +
> +zbytes="$( od -A n -w1 -v -t x1 < $f | sort | uniq -c |
> +           $SED -n -E -e 's/([0-9]+)[[:space:]]+00[[:space:]]*$/\1/p' )"
> +nzbits=$(( 10000000 - zbytes )); # looks wrong but actually correct ...
Some explanation on this would be nice.
In v2 I killed this completely and use some Python code instead.
(That didn't fix the bias I was observing in this test, nor did adding
lots of debugging, so I still don't understand that.)
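
For the record, the arithmetic in the shell version works (assuming $f
is 10,000,000 bytes of zeroes read through the filter) because zbytes
counts the zero bytes, so 10000000 - zbytes counts the non-zero bytes,
and with corruption this sparse almost every non-zero byte holds
exactly one set bit.  An exact count is also easy, e.g. with this
throwaway C program (purely illustrative -- the v2 test really does
use Python):

  /* Exact count of set bits in a file.  Illustration only; uses the
   * GCC/Clang popcount builtin.
   */
  #include <stdio.h>

  int
  main (int argc, char *argv[])
  {
    FILE *fp;
    int c;
    unsigned long bits = 0;

    if (argc != 2) {
      fprintf (stderr, "usage: %s FILE\n", argv[0]);
      return 1;
    }
    fp = fopen (argv[1], "rb");
    if (fp == NULL) {
      perror (argv[1]);
      return 1;
    }
    while ((c = getc (fp)) != EOF)
      bits += __builtin_popcount ((unsigned) c);
    fclose (fp);
    printf ("%lu\n", bits);
    return 0;
  }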
Rich.
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones
Read my programming and virtualization blog:
http://rwmj.wordpress.com
virt-top is 'top' for virtual machines. Tiny program with many
powerful monitoring features, net stats, disk stats, logging, etc.
http://people.redhat.com/~rjones/virt-top