FastBernoulliTrial.h (16995B)
/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
/* vim: set ts=8 sts=2 et sw=2 tw=80: */
/* This Source Code Form is subject to the terms of the Mozilla Public
 * License, v. 2.0. If a copy of the MPL was not distributed with this
 * file, You can obtain one at http://mozilla.org/MPL/2.0/. */

#ifndef mozilla_FastBernoulliTrial_h
#define mozilla_FastBernoulliTrial_h

#include "mozilla/Assertions.h"
#include "mozilla/XorShift128PlusRNG.h"

#include <cmath>
#include <stdint.h>

namespace mozilla {

/**
 * class FastBernoulliTrial: Efficient sampling with uniform probability
 *
 * When gathering statistics about a program's behavior, we may be observing
 * events that occur very frequently (e.g., function calls or memory
 * allocations) and we may be gathering information that is somewhat expensive
 * to produce (e.g., call stacks). Sampling all the events could have a
 * significant impact on the program's performance.
 *
 * Why not just sample every N'th event? This technique is called "systematic
 * sampling"; it's simple and efficient, and it's fine if we imagine a
 * patternless stream of events. But what if we're sampling allocations, and the
 * program happens to have a loop where each iteration does exactly N
 * allocations? You would end up sampling the same allocation every time through
 * the loop; the entire rest of the loop becomes invisible to your measurements!
 * More generally, if each iteration does M allocations, and M and N have any
 * common divisor at all, most allocation sites will never be sampled. If
 * they're both even, say, the odd-numbered allocations disappear from your
 * results.
 *
 * Ideally, we'd like each event to have some probability P of being sampled,
 * independent of its neighbors and of its position in the sequence.
 * This is called "Bernoulli sampling", and it doesn't suffer from any of the
 * problems mentioned above.
 *
 * One disadvantage of Bernoulli sampling is that you can't be sure exactly how
 * many samples you'll get: technically, it's possible that you might sample
 * none of them, or all of them. But if the number of events N is large, these
 * aren't likely outcomes; you can generally expect somewhere around P * N
 * events to be sampled.
 *
 * The other disadvantage of Bernoulli sampling is that you have to generate a
 * random number for every event, which can be slow.
 *
 * [significant pause]
 *
 * BUT NOT WITH THIS CLASS! FastBernoulliTrial lets you do true Bernoulli
 * sampling, while generating a fresh random number only when we do decide to
 * sample an event, not on every trial. When it decides not to sample, a call to
 * |FastBernoulliTrial::trial| is nothing but decrementing a counter and
 * comparing it to zero. So the lower your sampling probability is, the less
 * overhead FastBernoulliTrial imposes.
 *
 * Probabilities of 0 and 1 are handled efficiently. (In neither case need we
 * ever generate a random number at all.)
 *
 * The essential API:
 *
 * - FastBernoulliTrial(double P)
 *   Construct an instance that selects events with probability P.
 *
 * - FastBernoulliTrial::trial()
 *   Return true with probability P. Call this each time an event occurs, to
 *   decide whether to sample it or not.
 *
 * - FastBernoulliTrial::trial(size_t n)
 *   Equivalent to calling trial() |n| times, and returning true if any of those
 *   calls do. However, like trial, this runs in fast constant time.
 *
 * What is this good for? In some applications, some events are "bigger" than
 * others. For example, large allocations are more significant than small
 * allocations.
 * Perhaps we'd like to imagine that we're drawing allocations
 * from a stream of bytes, and performing a separate Bernoulli trial on every
 * byte from the stream. We can accomplish this by calling |t.trial(S)| for
 * the number of bytes S, and sampling the event if that returns true.
 *
 * Of course, this style of sampling needs to be paired with analysis and
 * presentation that makes the size of the event apparent, lest trials with
 * large values for |n| appear to be indistinguishable from those with small
 * values for |n|.
 */
class FastBernoulliTrial {
  /*
   * This comment should just read, "Generate skip counts with a geometric
   * distribution", and leave everyone to go look that up and see why it's the
   * right thing to do, if they don't know already.
   *
   * BUT IF YOU'RE CURIOUS, COMMENTS ARE FREE...
   *
   * Instead of generating a fresh random number for every trial, we can
   * randomly generate a count of how many times we should return false before
   * the next time we return true. We call this a "skip count". Once we've
   * returned true, we generate a fresh skip count, and begin counting down
   * again.
   *
   * Here's an awesome fact: by exercising a little care in the way we generate
   * skip counts, we can produce results indistinguishable from those we would
   * get "rolling the dice" afresh for every trial.
   *
   * In short, skip counts in Bernoulli trials of probability P obey a geometric
   * distribution. If a random variable X is uniformly distributed from [0..1),
   * then std::floor(std::log(X) / std::log(1-P)) has the appropriate geometric
   * distribution for the skip counts.
   *
   * Why that formula?
   *
   * Suppose we're to return |true| with some probability P, say, 0.3. Spread
   * all possible futures along a line segment of length 1. In portion P of
   * those cases, we'll return true on the next call to |trial|; the skip count
   * is 0.
   * For the remaining portion 1-P of cases, the skip count is 1 or more.
   *
   *    skip:             0                         1 or more
   *             |------------------^-----------------------------------------|
   * portion:           0.3                            0.7
   *                     P                             1-P
   *
   * But the "1 or more" section of the line is subdivided the same way: *within
   * that section*, in portion P the second call to |trial()| returns true, and
   * in portion 1-P it returns false a second time; the skip count is two or
   * more. So we return true on the second call in proportion 0.7 * 0.3, and
   * skip at least the first two in proportion 0.7 * 0.7.
   *
   *    skip:             0              1                2 or more
   *             |------------------^------------^----------------------------|
   * portion:           0.3           0.7 * 0.3            0.7 * 0.7
   *                     P             (1-P)*P              (1-P)^2
   *
   * We can continue to subdivide:
   *
   * skip >= 0: |------------------------------------------------- (1-P)^0 --|
   * skip >= 1: |                  ------------------------------- (1-P)^1 --|
   * skip >= 2: |                               ------------------ (1-P)^2 --|
   * skip >= 3: |                                 ^     ---------- (1-P)^3 --|
   * skip >= 4: |                                 .            --- (1-P)^4 --|
   *                                              .
   *                                              ^X, see below
   *
   * In other words, the likelihood of the next n calls to |trial| returning
   * false is (1-P)^n. The longer a run we require, the more the likelihood
   * drops. Further calls may return false too, but this is the probability
   * we'll skip at least n.
   *
   * This is interesting, because we can pick a point along this line segment
   * and see which skip count's range it falls within; the point X above, for
   * example, is within the ">= 2" range, but not within the ">= 3" range, so it
   * designates a skip count of 2. So if we pick points on the line at random
   * and use the skip counts they fall under, that will be indistinguishable
   * from generating a fresh random number between 0 and 1 for each trial and
   * comparing it to P.
   *
   * So to find the skip count for a point X, we must ask: To what whole power
   * must we raise 1-P such that we include X, but the next power would exclude
   * it? This is exactly std::floor(std::log(X) / std::log(1-P)).
   *
   * Our algorithm is then, simply: When constructed, compute an initial skip
   * count. Return false from |trial| that many times, and then compute a new
   * skip count.
   *
   * For a call to |trial(n)|, if the skip count is greater than n, return false
   * and subtract n from the skip count. If the skip count is n or less, return
   * true and compute a new skip count. Since each trial is independent, it
   * doesn't matter by how much n overshoots the skip count; we can actually
   * compute a new skip count at *any* time without affecting the distribution.
   * This is really beautiful.
   */
 public:
  /**
   * Construct a fast Bernoulli trial generator. Calls to |trial()| return true
   * with probability |aProbability|. Use |aState0| and |aState1| to seed the
   * random number generator; they must not both be zero.
   */
  FastBernoulliTrial(double aProbability, uint64_t aState0, uint64_t aState1)
      : mProbability(0),
        mInvLogNotProbability(0),
        mGenerator(aState0, aState1),
        mSkipCount(0) {
    setProbability(aProbability);
  }

  /**
   * Return true with probability |mProbability|. Call this each time an event
   * occurs, to decide whether to sample it or not. The lower |mProbability| is,
   * the faster this function runs.
   */
  bool trial() {
    if (mSkipCount) {
      mSkipCount--;
      return false;
    }

    return chooseSkipCount();
  }

  /**
   * Equivalent to calling trial() |n| times, and returning true if any of those
   * calls do. However, like trial, this runs in fast constant time.
   *
   * What is this good for? In some applications, some events are "bigger" than
   * others.
   * For example, large allocations are more significant than small
   * allocations. Perhaps we'd like to imagine that we're drawing allocations
   * from a stream of bytes, and performing a separate Bernoulli trial on every
   * byte from the stream. We can accomplish this by calling |t.trial(S)| for
   * the number of bytes S, and sampling the event if that returns true.
   *
   * Of course, this style of sampling needs to be paired with analysis and
   * presentation that makes the "size" of the event apparent, lest trials with
   * large values for |n| appear to be indistinguishable from those with small
   * values for |n|, despite being potentially much more likely to be sampled.
   */
  bool trial(size_t aCount) {
    if (mSkipCount > aCount) {
      mSkipCount -= aCount;
      return false;
    }

    return chooseSkipCount();
  }

  void setRandomState(uint64_t aState0, uint64_t aState1) {
    mGenerator.setState(aState0, aState1);
  }

  void setProbability(double aProbability) {
    MOZ_ASSERT(0 <= aProbability && aProbability <= 1);
    mProbability = aProbability;
    if (0 < mProbability && mProbability < 1) {
      /*
       * Let's look carefully at how this calculation plays out in floating-
       * point arithmetic. We'll assume IEEE, but the final C++ code we arrive
       * at would still be fine if our numbers were mathematically perfect. So,
       * while we've considered IEEE's edge cases, we haven't done anything that
       * should be actively bad when using other representations.
       *
       * (In the below, read comparisons as exact mathematical comparisons: when
       * we say something "equals 1", that means it's exactly equal to 1. We
       * treat approximation using intervals with open boundaries: saying a
       * value is in (0,1) doesn't specify how close to 0 or 1 the value gets.
       * When we use closed boundaries like [2**-53, 1], we're careful to ensure
       * the boundary values are actually representable.)
       *
       * - After the comparison above, we know mProbability is in (0,1).
       *
       * - The gaps below 1 are 2**-53, so that interval is (0, 1-2**-53].
       *
       * - Because the floating-point gaps near 1 are wider than those near
       *   zero, there are many small positive doubles ε such that 1-ε rounds to
       *   exactly 1. However, 2**-53 can be represented exactly. So
       *   1-mProbability is in [2**-53, 1].
       *
       * - log(1 - mProbability) is thus in (-37, 0].
       *
       *   That range includes zero, but when we use mInvLogNotProbability, it
       *   would be helpful if we could trust that it's negative. So when log(1
       *   - mProbability) is 0, we'll just set mProbability to 0, so that
       *   mInvLogNotProbability is not used in chooseSkipCount.
       *
       * - How much of the range of mProbability does this cause us to ignore?
       *   The only value for which log returns 0 is exactly 1; the slope of log
       *   at 1 is 1, so for small ε such that 1 - ε != 1, log(1 - ε) is -ε,
       *   never 0. The gaps near one are larger than the gaps near zero, so if
       *   1 - ε wasn't 1, then -ε is representable. So if log(1 - mProbability)
       *   isn't 0, then 1 - mProbability isn't 1, which means that mProbability
       *   is at least 2**-53, as discussed earlier. This is a sampling
       *   likelihood of roughly one in ten trillion, which is unlikely to be
       *   distinguishable from zero in practice.
       *
       *   So by forbidding zero, we've tightened our range to (-37, -2**-53].
       *
       * - Finally, 1 / log(1 - mProbability) is in [-2**53, -1/37). This all
       *   falls readily within the range of an IEEE double.
       *
       * ALL THAT HAVING BEEN SAID: here are the five lines of actual code:
       */
      double logNotProbability = std::log(1 - mProbability);
      if (logNotProbability == 0.0)
        mProbability = 0.0;
      else
        mInvLogNotProbability = 1 / logNotProbability;
    }

    chooseSkipCount();
  }

 private:
  /* The likelihood that any given call to |trial| should return true. */
  double mProbability;

  /*
   * The value of 1/std::log(1 - mProbability), cached for repeated use.
   *
   * If mProbability is exactly 0 or exactly 1, we don't use this value.
   * Otherwise, we guarantee this value is in the range [-2**53, -1/37), i.e.
   * definitely negative, as required by chooseSkipCount. See setProbability for
   * the details.
   */
  double mInvLogNotProbability;

  /* Our random number generator. */
  non_crypto::XorShift128PlusRNG mGenerator;

  /* The number of times |trial| should return false before next returning true.
   */
  size_t mSkipCount;

  /*
   * Choose the next skip count. This also returns the value that |trial| should
   * return, since we have to check for the extreme values for mProbability
   * anyway, and |trial| should never return true at all when mProbability is 0.
   */
  bool chooseSkipCount() {
    /*
     * If the probability is 1.0, every call to |trial| returns true. Make sure
     * mSkipCount is 0.
     */
    if (mProbability == 1.0) {
      mSkipCount = 0;
      return true;
    }

    /*
     * If the probability is zero, |trial| never returns true. Don't bother us
     * for a while.
     */
    if (mProbability == 0.0) {
      mSkipCount = SIZE_MAX;
      return false;
    }

    /*
     * What sorts of values can this call to std::floor produce?
     *
     * Since mGenerator.nextDouble returns a value in [0, 1-2**-53], std::log
     * returns a value in the range [-infinity, -2**-53], all negative.
     * Since mInvLogNotProbability is negative (see its comments), the product
     * is positive and possibly infinite. std::floor returns +infinity
     * unchanged. So the result will always be positive.
     *
     * Converting a double to an integer that is out of range for that integer
     * is undefined behavior, so we must clamp our result to SIZE_MAX, to ensure
     * we get an acceptable value for mSkipCount.
     *
     * The clamp is written carefully. Note that if we had said:
     *
     *    if (skipCount > double(SIZE_MAX))
     *      mSkipCount = SIZE_MAX;
     *    else
     *      mSkipCount = skipCount;
     *
     * that would lead to undefined behavior on 64-bit machines: SIZE_MAX
     * coerced to double can equal 2^64, so if skipCount equaled 2^64,
     * converting it to size_t would induce undefined behavior.
     *
     * Jakob Olesen cleverly suggested flipping the sense of the comparison to
     * skipCount < double(SIZE_MAX). The conversion will evaluate to 2^64 or
     * the double just below it: either way, skipCount is guaranteed to have a
     * value that's safely convertible to size_t.
     *
     * (On 32-bit machines, all size_t values can be represented exactly in
     * double, so all is well.)
     */
    double skipCount =
        std::floor(std::log(mGenerator.nextDouble()) * mInvLogNotProbability);
    if (skipCount < double(SIZE_MAX))
      mSkipCount = skipCount;
    else
      mSkipCount = SIZE_MAX;

    return true;
  }
};

} /* namespace mozilla */

#endif /* mozilla_FastBernoulliTrial_h */