tor-browser

The Tor Browser
git clone https://git.dasho.dev/tor-browser.git

FastBernoulliTrial.h (16995B)


      1 /* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
      2 /* vim: set ts=8 sts=2 et sw=2 tw=80: */
      3 /* This Source Code Form is subject to the terms of the Mozilla Public
      4 * License, v. 2.0. If a copy of the MPL was not distributed with this
      5 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
      6 
      7 #ifndef mozilla_FastBernoulliTrial_h
      8 #define mozilla_FastBernoulliTrial_h
      9 
     10 #include "mozilla/Assertions.h"
     11 #include "mozilla/XorShift128PlusRNG.h"
     12 
     13 #include <cmath>
     14 #include <stdint.h>
     15 
     16 namespace mozilla {
     17 
     18 /**
     19 * class FastBernoulliTrial: Efficient sampling with uniform probability
     20 *
     21 * When gathering statistics about a program's behavior, we may be observing
     22 * events that occur very frequently (e.g., function calls or memory
     23 * allocations) and we may be gathering information that is somewhat expensive
     24 * to produce (e.g., call stacks). Sampling all the events could have a
     25 * significant impact on the program's performance.
     26 *
     27 * Why not just sample every N'th event? This technique is called "systematic
     28 * sampling"; it's simple and efficient, and it's fine if we imagine a
     29 * patternless stream of events. But what if we're sampling allocations, and the
     30 * program happens to have a loop where each iteration does exactly N
     31 * allocations? You would end up sampling the same allocation every time through
     32 * the loop; the entire rest of the loop becomes invisible to your measurements!
     33 * More generally, if each iteration does M allocations, and M and N have any
     34 * common divisor at all, most allocation sites will never be sampled. If
     35 * they're both even, say, the odd-numbered allocations disappear from your
     36 * results.
     37 *
     38 * Ideally, we'd like each event to have some probability P of being sampled,
     39 * independent of its neighbors and of its position in the sequence. This is
     40 * called "Bernoulli sampling", and it doesn't suffer from any of the problems
     41 * mentioned above.
     42 *
     43 * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
     44 * many samples you'll get: technically, it's possible that you might sample
     45 * none of them, or all of them. But if the number of events N is large, these
     46 * aren't likely outcomes; you can generally expect somewhere around P * N
     47 * events to be sampled.
     48 *
     49 * The other disadvantage of Bernoulli sampling is that you have to generate a
     50 * random number for every event, which can be slow.
     51 *
     52 * [significant pause]
     53 *
     54 * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
     55 * sampling, while generating a fresh random number only when we do decide to
     56 * sample an event, not on every trial. When it decides not to sample, a call to
     57 * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
     58 * comparing it to zero. So the lower your sampling probability is, the less
     59 * overhead FastBernoulliTrial imposes.
     60 *
     61 * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
     62 * ever generate a random number at all.)
     63 *
     64 * The essential API:
     65 *
     66 * - FastBernoulliTrial(double P)
     67 *   Construct an instance that selects events with probability P.
     68 *
     69 * - FastBernoulliTrial::trial()
     70 *   Return true with probability P. Call this each time an event occurs, to
     71 *   decide whether to sample it or not.
     72 *
     73 * - FastBernoulliTrial::trial(size_t n)
     74 *   Equivalent to calling trial() |n| times, and returning true if any of those
     75 *   calls do. However, like trial, this runs in fast constant time.
     76 *
     77 *   What is this good for? In some applications, some events are "bigger" than
     78 *   others. For example, large allocations are more significant than small
     79 *   allocations. Perhaps we'd like to imagine that we're drawing allocations
     80 *   from a stream of bytes, and performing a separate Bernoulli trial on every
     81 *   byte from the stream. We can accomplish this by calling |t.trial(S)| for
     82 *   the number of bytes S, and sampling the event if that returns true.
     83 *
     84 *   Of course, this style of sampling needs to be paired with analysis and
     85 *   presentation that makes the size of the event apparent, lest trials with
     86 *   large values for |n| appear to be indistinguishable from those with small
     87 *   values for |n|.
     88 */
     89 class FastBernoulliTrial {
     90  /*
     91   * This comment should just read, "Generate skip counts with a geometric
     92   * distribution", and leave everyone to go look that up and see why it's the
     93   * right thing to do, if they don't know already.
     94   *
     95   * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
     96   *
     97   * Instead of generating a fresh random number for every trial, we can
     98   * randomly generate a count of how many times we should return false before
     99   * the next time we return true. We call this a "skip count". Once we've
    100   * returned true, we generate a fresh skip count, and begin counting down
    101   * again.
    102   *
    103   * Here's an awesome fact: by exercising a little care in the way we generate
    104   * skip counts, we can produce results indistinguishable from those we would
    105   * get "rolling the dice" afresh for every trial.
    106   *
    107   * In short, skip counts in Bernoulli trials of probability P obey a geometric
    108   * distribution. If a random variable X is uniformly distributed from [0..1),
    109   * then std::floor(std::log(X) / std::log(1-P)) has the appropriate geometric
    110   * distribution for the skip counts.
    111   *
    112   * Why that formula?
    113   *
    114   * Suppose we're to return |true| with some probability P, say, 0.3. Spread
    115   * all possible futures along a line segment of length 1. In portion P of
    116   * those cases, we'll return true on the next call to |trial|; the skip count
    117   * is 0. For the remaining portion 1-P of cases, the skip count is 1 or more.
    118   *
    119   * skip:                0                         1 or more
    120   *             |------------------^-----------------------------------------|
    121   * portion:            0.3                            0.7
    122   *                      P                             1-P
    123   *
    124   * But the "1 or more" section of the line is subdivided the same way: *within
    125   * that section*, in portion P the second call to |trial()| returns true, and
    126   * in portion 1-P it returns false a second time; the skip count is two or
    127   * more. So we return true on the second call in proportion 0.7 * 0.3, and
    128   * skip at least the first two in proportion 0.7 * 0.7.
    129   *
    130   * skip:                0                1              2 or more
    131   *             |------------------^------------^----------------------------|
    132   * portion:            0.3           0.7 * 0.3          0.7 * 0.7
    133   *                      P             (1-P)*P            (1-P)^2
    134   *
    135   * We can continue to subdivide:
    136   *
    137   * skip >= 0:  |------------------------------------------------- (1-P)^0 --|
    138   * skip >= 1:  |                  ------------------------------- (1-P)^1 --|
    139   * skip >= 2:  |                               ------------------ (1-P)^2 --|
    140   * skip >= 3:  |                                 ^     ---------- (1-P)^3 --|
    141   * skip >= 4:  |                                 .            --- (1-P)^4 --|
    142   *                                               .
    143   *                                               ^X, see below
    144   *
    145   * In other words, the likelihood of the next n calls to |trial| returning
    146   * false is (1-P)^n. The longer a run we require, the more the likelihood
    147   * drops. Further calls may return false too, but this is the probability
    148   * we'll skip at least n.
    149   *
    150   * This is interesting, because we can pick a point along this line segment
    151   * and see which skip count's range it falls within; the point X above, for
    152   * example, is within the ">= 2" range, but not within the ">= 3" range, so it
    153   * designates a skip count of 2. So if we pick points on the line at random
    154   * and use the skip counts they fall under, that will be indistinguishable
    155   * from generating a fresh random number between 0 and 1 for each trial and
    156   * comparing it to P.
    157   *
    158   * So to find the skip count for a point X, we must ask: To what whole power
    159   * must we raise 1-P such that we include X, but the next power would exclude
    160   * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
    161   *
    162   * Our algorithm is then, simply: When constructed, compute an initial skip
    163   * count. Return false from |trial| that many times, and then compute a new
    164   * skip count.
    165   *
    166   * For a call to |trial(n)|, if the skip count is greater than n, return false
    167   * and subtract n from the skip count. If the skip count is n or less,
    168   * return true and compute a new skip count. Since each trial is independent,
    169   * it doesn't matter by how much n overshoots the skip count; we can actually
    170   * compute a new skip count at *any* time without affecting the distribution.
    171   * This is really beautiful.
    172   */
    173 public:
    174  /**
    175   * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
    176   * with probability |aProbability|. Use |aState0| and |aState1| to seed the
    177   * random number generator; the two values must not both be zero.
    178   */
    179  FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
    180      : mProbability(0),
    181        mInvLogNotProbability(0),
    182        mGenerator(aState0, aState1),
    183        mSkipCount(0) {
    184    setProbability(aProbability);
    185  }
    186 
    187  /**
    188   * Return true with probability |mProbability|. Call this each time an event
    189   * occurs, to decide whether to sample it or not. The lower |mProbability| is,
    190   * the faster this function runs.
    191   */
    192  bool trial() {
    193    if (mSkipCount) {
    194      mSkipCount--;
    195      return false;
    196    }
    197 
    198    return chooseSkipCount();
    199  }
    200 
    201  /**
    202   * Equivalent to calling trial() |n| times, and returning true if any of those
    203   * calls do. However, like trial, this runs in fast constant time.
    204   *
    205   * What is this good for? In some applications, some events are "bigger" than
    206   * others. For example, large allocations are more significant than small
    207   * allocations. Perhaps we'd like to imagine that we're drawing allocations
    208   * from a stream of bytes, and performing a separate Bernoulli trial on every
    209   * byte from the stream. We can accomplish this by calling |t.trial(S)| for
    210   * the number of bytes S, and sampling the event if that returns true.
    211   *
    212   * Of course, this style of sampling needs to be paired with analysis and
    213   * presentation that makes the "size" of the event apparent, lest trials with
    214   * large values for |n| appear to be indistinguishable from those with small
    215   * values for |n|, despite being potentially much more likely to be sampled.
    216   */
    217  bool trial(size_t aCount) {
    218    if (mSkipCount > aCount) {
    219      mSkipCount -= aCount;
    220      return false;
    221    }
    222 
    223    return chooseSkipCount();
    224  }
    225 
    226  void setRandomState(uint64_t aState0, uint64_t aState1) {
    227    mGenerator.setState(aState0, aState1);
    228  }
    229 
    230  void setProbability(double aProbability) {
    231    MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
    232    mProbability = aProbability;
    233    if (0 < mProbability && mProbability < 1) {
    234      /*
    235       * Let's look carefully at how this calculation plays out in floating-
    236       * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
    237       * at would still be fine if our numbers were mathematically perfect. So,
    238       * while we've considered IEEE's edge cases, we haven't done anything that
    239       * should be actively bad when using other representations.
    240       *
    241       * (In the below, read comparisons as exact mathematical comparisons: when
    242       * we say something "equals 1", that means it's exactly equal to 1. We
    243       * treat approximation using intervals with open boundaries: saying a
    244       * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
    245       * When we use closed boundaries like [2**-53, 1], we're careful to ensure
    246       * the boundary values are actually representable.)
    247       *
    248       * - After the comparison above, we know mProbability is in (0,1).
    249       *
    250       * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
    251       *
    252       * - Because the floating-point gaps near 1 are wider than those near
    253       *   zero, there are many small positive doubles ε such that 1-ε rounds to
    254       *   exactly 1. However, 2**-53 can be represented exactly. So
    255       *   1-mProbability is in [2**-53, 1].
    256       *
    257       * - log(1 - mProbability) is thus in (-37, 0].
    258       *
    259       *   That range includes zero, but when we use mInvLogNotProbability, it
    260       *   would be helpful if we could trust that it's negative. So when log(1
    261       *   - mProbability) is 0, we'll just set mProbability to 0, so that
    262       *   mInvLogNotProbability is not used in chooseSkipCount.
    263       *
    264       * - How much of the range of mProbability does this cause us to ignore?
    265       *   The only value for which log returns 0 is exactly 1; the slope of log
    266    *   at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is about -ε,
    267       *   never 0. The gaps near one are larger than the gaps near zero, so if
    268       *   1 - ε wasn't 1, then -ε is representable. So if log(1 - mProbability)
    269       *   isn't 0, then 1 - mProbability isn't 1, which means that mProbability
    270       *   is at least 2**-53, as discussed earlier. This is a sampling
    271    *   likelihood of roughly 2**-53, about one in nine quadrillion, which is
    272    *   unlikely to be distinguishable from zero in practice.
    273       *
    274       *   So by forbidding zero, we've tightened our range to (-37, -2**-53].
    275       *
    276       * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
    277       *   falls readily within the range of an IEEE double.
    278       *
    279       * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
    280       */
    281      double logNotProbability = std::log(1 - mProbability);
    282      if (logNotProbability == 0.0)
    283        mProbability = 0.0;
    284      else
    285        mInvLogNotProbability = 1 / logNotProbability;
    286    }
    287 
    288    chooseSkipCount();
    289  }
    290 
    291 private:
    292  /* The likelihood that any given call to |trial| should return true. */
    293  double mProbability;
    294 
    295  /*
    296   * The value of 1/std::log(1 - mProbability), cached for repeated use.
    297   *
    298   * If mProbability is exactly 0 or exactly 1, we don't use this value.
    299   * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
    300   * definitely negative, as required by chooseSkipCount. See setProbability for
    301   * the details.
    302   */
    303  double mInvLogNotProbability;
    304 
    305  /* Our random number generator. */
    306  non_crypto::XorShift128PlusRNG mGenerator;
    307 
    308  /* The number of times |trial| should return false before next returning true.
    309   */
    310  size_t mSkipCount;
    311 
    312  /*
    313   * Choose the next skip count. This also returns the value that |trial| should
    314   * return, since we have to check for the extreme values for mProbability
    315   * anyway, and |trial| should never return true at all when mProbability is 0.
    316   */
    317  bool chooseSkipCount() {
    318    /*
    319     * If the probability is 1.0, every call to |trial| returns true. Make sure
    320     * mSkipCount is 0.
    321     */
    322    if (mProbability == 1.0) {
    323      mSkipCount = 0;
    324      return true;
    325    }
    326 
    327    /*
    328     * If the probability is zero, |trial| never returns true. Don't bother us
    329     * for a while.
    330     */
    331    if (mProbability == 0.0) {
    332      mSkipCount = SIZE_MAX;
    333      return false;
    334    }
    335 
    336    /*
    337     * What sorts of values can this call to std::floor produce?
    338     *
    339     * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
    340     * returns a value in the range [-infinity, -2**-53], all negative. Since
    341     * mInvLogNotProbability is negative (see its comments), the product is
    342     * positive and possibly infinite. std::floor returns +infinity unchanged.
    343     * So the result will always be positive.
    344     *
    345     * Converting a double to an integer that is out of range for that integer
    346     * is undefined behavior, so we must clamp our result to SIZE_MAX, to ensure
    347     * we get an acceptable value for mSkipCount.
    348     *
    349     * The clamp is written carefully. Note that if we had said:
    350     *
    351     *    if (skipCount > double(SIZE_MAX))
    352     *       mSkipCount = SIZE_MAX;
    353     *    else
    354     *       mSkipCount = skipCount;
    355     *
    356     * that leads to undefined behavior on 64-bit machines: SIZE_MAX coerced to
    357     * double can equal 2^64, so if skipCount equaled 2^64 converting it to
    358     * size_t would induce undefined behavior.
    359     *
    360     * Jakob Olesen cleverly suggested flipping the sense of the comparison to
    361     * skipCount < double(SIZE_MAX).  The conversion will evaluate to 2^64 or
    362     * the double just below it: either way, skipCount is guaranteed to have a
    363     * value that's safely convertible to size_t.
    364     *
    365     * (On 32-bit machines, all size_t values can be represented exactly in
    366     * double, so all is well.)
    367     */
    368    double skipCount =
    369        std::floor(std::log(mGenerator.nextDouble()) * mInvLogNotProbability);
    370    if (skipCount < double(SIZE_MAX))
    371      mSkipCount = skipCount;
    372    else
    373      mSkipCount = SIZE_MAX;
    374 
    375    return true;
    376  }
    377 };
    378 
    379 } /* namespace mozilla */
    380 
    381 #endif /* mozilla_FastBernoulliTrial_h */