Discrete staircase probability distribution

August 4, 2012 Math

Fixed-step discrete probability distributionThe Bernoulli distribution can be generalized to support more than two discrete states, where each state's relative probability is linearly interpolated between two given extremes, evenly dividing the difference from one extreme to the other over all states. This article describes the properties of this discrete staircase probability distribution, together with a method to generate random samples from it in constant time.

The discrete staircase probability distribution describes the probabilities of \(n\) integer states, with probability \(a c\) at state \(0\) (i.e. \(p[0] = ac\)) and probability \(b c\) at state \(n-1\) (i.e. \(p[n-1] = b c\)). The probabilities for the states in between (i.e. \(p[k]\), where \(1 < k < n - 1\)) evenly spread the difference between the first and last state's probability. This generates the distribution's staircase-like appearance. Note that, by definition \(\frac{p[n-1]}{p[0]} = \frac{bc}{ac} = \frac{b}{a}\), effectively letting \(a\) and \(b\) specify the relative probabilities of the first and last of the \(n\) states. The scale factor \(c\) in the dethfinition above is there to scale the distribution so that the sum of probabilities always equals one (i.e. \(\sum_{k=0}^{n-1} p_k = 1\)) for any choice of  \(a\), \(b\) and \(n\).

The distribution has been useful to me to be able to quickly select a random item from an ordered set of any size \(n\) in constant time, where the first item is exactly \(n\) times more likely to be randomly selected than the last one, and so \(a=n\) and \(b=1\) (or equivalently, \(a=1\) and \(b=1/n\)). One practical application of the distribution would be to generate random playlists from a user's top 1000, preferring the highest ranked songs, but mixing in less preferred songs once in a while as well. Another application would be to use it for selecting parents in a genetic algorithm, typically selecting the most fit individuals for breeding, but giving less fit individuals a chance as well.

As a side note, it's quite likely that the 'discrete staircase probability distribution' has already been described by other literature under a different name. But as I couldn't find it and had to derive it myself, it made sense to write a post about it as well, as I'm sure it could be used to solve many more practical problems.


Probability mass function (a=1, b=1/3, n=5)

Probability mass function (a=1, b=1/3, n=5)

In the image on the right, the probabilities of five integer states are represented (so, \(n=5\)). Note that the horizontal \(k\) axis is the state, where \(0 \leq k < n\) and \(k\in\mathbb{N} \) (that is, \(k\) is any non-negative integer smaller than \(n\)). The vertical axis represents the distribution's probability mass function, or PMF, stating the probability of a discrete random variable \(X\) to be equal to \(k\). In the image, the first state is 3 times more likely to be selected randomly than the last state, and thus, \(a/b = 3/1\). Again, as the distribution will normalize the sum of all probabilities through \(c\), the absolute values of \(a\) and \(b\) do not matter. So, for example, both \(\{a=6, b=2, n=5\}\) and \(\{a=1,b=1/3, n=5\}\), as shown in the image, will result in the probabilities \(p[0] = ac = 0.3\), \(p[1] = 0.25\), \(p[2] = 0.2\), \(p[3] = 0.15\) and \(p[4] = p[n-1] = bc = 0.1\).


Below is a list of the most important statistical properties of the discrete staircase probability distribution. How they have been derived is the topic of the last section of this article.

Probability mass distribution /
PMD (\(P(X = k)\))
Cumulative distribution function /
CDF (\(P(X \leq k)\))
Mean (\(\mu\), \(E[X]\)) \(\frac{1}{3}(n-1+\frac{bn-a}{a + b})\)
Variance (\(\sigma^2\), \(E[(X - \mu)^2]\)) \(\frac{n+1}{18} (n-2 + \frac{2 a b (n+1)}{(a + b)^2})\)
Uniform sample transform \(k = \lfloor\frac{b-a(2n-1)+\sqrt{(b-a(2n-1))^2+4(a + b)(b-a) (n-1) ny}}{2(b-a)}\rfloor\)

Generating samples

The above 'uniform sample transform' maps a sample from the continuous uniform distribution to a sample from the discrete staircase probability distribution. In other words, to generate a sample from the discrete staircase probability distribution, a continuous uniform \([0, 1)\) distribution can be sampled to generate a sample \(y\), which can then be mapped using the formula above to generate sample \(k\) from the discrete staircase probability distribution. Please note that this function is only defined when \(a \neq b\). If \(a = b\), the distribution is simply a discrete uniform probability distribution and the uniform sample transform can (and should) be simplified to \(k = \lfloor n x \rfloor\).

When the transform for \(a \neq b\) is implemented naively in software, floating point rounding errors could potentially result in out-of-range output. The following efficient C++ implementation handles these rounding issues robustly. However, if \(n\) is a lot larger than 1,000, be sure to use a Mersenne Twister implementation instead of rand() to prevent specific states to be generated more or less often than expected (or even never). Also, if \(n\) is larger than 100,000, use double instead of float for the same reason.

// Giliam de Carpentier, Copyright (c) 2012. Licensed under the MIT license.
#include   // needed for assert()
#include     // needed for sqrt() and abs()
#include   // needed for rand()

class DiscreteStaircaseSampler {
    DiscreteStaircaseSampler(float a, float b, unsigned int n) {
        assert(a > 0 && b > 0); // can't have negative probabilities
        assert(abs(a-b) > 1.0E-9f); // don't use class when distributed uniformly
        const float nFloat = static_cast(n);
        const float bMinusA = b - a;
        const float twoNMinusOne = nFloat + nFloat - 1.0f;
        const float twoMinusTwoN = 1.0f - twoNMinusOne;
        const float twoBMinusTwoA = bMinusA + bMinusA;
        const float p = b - a * twoNMinusOne;
        mMaxK = n - 1;
        mQ = 1.0f / twoBMinusTwoA;  // 1/(2(b-a))
        mPP = p * p;  // (b-a(2n-1))^2
        mPQ = p * mQ;   // (b-a(2n-1))/(2(b-a))
        mR = (a + b) * twoBMinusTwoA * twoMinusTwoN * nFloat;//-4(a+b)(b-a)(n-1)n
        if (mR > mPP) mR = mPP;  // prevent invalid results due to rounding

    int generateNextK() const {
        // if a fancier uniform [0,1] PRNG is preferred, replace the next line
        const float y = static_cast(rand()) / RAND_MAX; // 0.0 <= x <= 1.0

        const float  x = mPQ + mQ * sqrt(mPP - mR * y); // transform y to x
        const unsigned int k = static_cast(static_cast(x));
        return k <= mMaxK ? k : mMaxK; // prevent invalid results due to rounding

    unsigned int mMaxK;
    float mPQ, mQ, mPP, mR;


In the sections below, the mathematical derivation of the distribution's most important statistical properties is provided.


Again, let \(P[k]\) be the PMF (i.e. probability mass function) where \(0 \leq k < n, k \in\mathbb{N}\). Then, by the distribution's definition, \(P[0] = ac\) and \(P[n-1] = bc\). And as in-between states have a linearly interpolated probability, all the PMF's probabilies will lie on a line passing through the values \(ac\) and \(bc\) at \(0\) and \(n-1\), respectively. It's easy to verify that the following linear function fits this description.

\(P[k] = ac + \frac{bc-ac}{n-1}k = c P'[k]\), where

\(P'[k] = a + \frac{b-a}{n-1}k\).

The normalization factor \(c\) will be derived soon. But let's first define the distribution's discrete CDF (i.e. cumulative distribution function) denoted here as the function \(C[k]\).

\(C[k] = \sum_{i=0}^{k} P[i] = c\sum_{i=0}^{k} P'[i] = c C'[k]\), where

\(C'[k] = \sum_{i=0}^{k}P'[i]\).

As \(P'[i]\) is a linear function and \(C'[k]\) is summing it over \(i\), \(C'[k]\) must be a quadratic function. This quadratic function can be found by fitting the following three data points:

\(C'[0] = P'[0] = a\),
\(C'[1] = P'[0] + P'[1] = 2a + \frac{b-a}{n-1}\) and
\(C'[2] = P'[0] + P'[1] + P'[2] = 3a + 3\frac{b-a}{n-1}\).

By using a 2nd order Lagrange polynomial to fit these points, we get:

\(C'[k] = \frac{(k-1)(k-2)}{(0-1)(0-2)}a + \frac{(k-0)(k-2)}{(1-0)(1-2)}(2a + \frac{b-a}{n-1}) + \frac{(k-0)(k-1)}{(2-0)(2-1)}(3a + 3\frac{b-a}{n-1}) = \frac{k+1}{2}(2a+\frac{b-a}{n-1}k)\)

As a side note, \(C'[k]\) could also have been rewritten into a recursive definition, from which the explicit formula could then be sought by solving the recurrence relation. This would probably have led to a more formal derivation, but the method above leads to the same result and is more direct.

By definition \(\sum_{k=0}^{n-1} P[k] = 1\), from which follows that \(\sum_{k=0}^{n-1} P[k] = C[n-1] = cC'[n-1] = 1\) and so \(c = 1 / C'[n-1] = \frac{2}{(a+b)n}\). Plugging this back into \(P[k]\) and \(C[k]\) results in the formulas

\(P[k] = cP'[k] = \frac{2}{(a+b)n}P'[k] = \frac{2}{(a+b)n}(a+\frac{(b-a)k}{n-1})\) and

\(C[k] = cC'[k] = \frac{2}{(a+b)n}C'[k] = \frac{k+1}{(a+b)n}(2a+\frac{b-a}{n-1}k)\), respectively.


Now that an explicit formula for \(P[k]\) has been derived, the mean \(\mu\) can be derived as well. By definition,

\(\mu = \sum_{k=0}^{n-1} P[k]k = M[n-1]\), where

\(M[k] = \sum_{i=0}^{k} P[i]i\)

As \(P[i]i\) is quadratic in \(i\), summing \(P[i]i\) over \(i\) must result in a cubic function. And so, \(M[k]\) can be calculated by fitting the 3rd order Lagrange polynomial to the data points \(M[0] = 0\), \(M[1] = P[1]\), \(M[2] = P[1] + 2P[2]\) and \(M[3] = P[1] + 2P[2] + 3P[3]\), analogous to the method used earlier to derive \(C'[k]\). The intermediate steps are skipped here for brevity, but the result found through this method is \(M[k] = \frac{b+(3n-4)a+2(b-a)k}{3(a+b)(n-1)n}(k+1)k\), and so

\(\mu = M[n-1] = \frac{(n-2)a+(2n-1)b}{3(a+b)} = \frac{1}{3}(n-1+\frac{bn-a}{a + b})\).


The variance \(\sigma^2\) can be derived quite similarly. By definition,

\(\sigma^2= \sum_{k=0}^{n-1} P[k](k - \mu)^2 =\sum_{k=0}^{n-1} P[k]k^2 - \mu^2 = V[n-1] - \mu^2\), where \(V[k] = \sum_{i=0}^{k} P[i]i^2\).

Applying the same train of thought as before, \(P[i]i^2\) is cubic in \(i\) and \(V[k]\) is thus a quartic function that can be fitted using a 4th order Lagrange polynomial. Fitting the data points \(V[0]\), \(V[1]\), \(V[2]\), \(V[3]\) and \(V[4]\) leads to \(V[k] = \frac{(2(n-1)+(4n-3 k-7)k)a+3(k+1)kb}{6(a+b)(n-1)n}k(1+k)\). Consequently, \(V[n-1] = \frac{(n-2)a+3bn}{6(a+b)}(n-1)\), and so

\(\sigma^2 = V[n-1] - \mu^2= \frac{(n-2)a+3bn}{6(a+b)}(n-1) - \mu^2 = \frac{n+1}{18} (n-2 + \frac{2 a b (n+1)}{(a + b)^2})\).

Uniform sample transform

The transform that maps a sample from the uniform \([0, 1)\) probability distribution to a sample of a target distribution is directly related to the CDF. The following shows why this is and how this can be used for the discrete staircase probability distribution.

By definition, the probability of generating state \(0\) (i.e. \(k=0\)) is PMF \(P[0]\), which also is equal to the CDF \(C[0]\). Also, the probability of generating some sample \(y\) under the uniform \([0,1)\) distribution which lies in the interval \([0,P[0])\) is obviously \(P[0]\). Consequently, mapping any uniformly generated sample \(y\) that lies in the interval \([0,P[0])\) to state \(0\) (that is, \(k=0\) for \(y \in [0,P[0]) = [0, C[0])\)) will generate state \(0\) with the correct probability. Likewise, state \(1\) may be selected for samples in the next interval of 'width' \(P[1]\). That is, \(k=1\) for \(y \in [P[0],P[0]+P[1]) = [C[0],C[1])\). This observation can be generalized to selecting state \(k\) for \(y \in [C[k-1], C[k])\) where \(0 < k < n\). Furthermore, it turns out that \(C[-1] = 0\), and so the generalized observation is even true for \(k = 0\).

Function D(x) for (a=1, b=1/3, n=5)

Function D(x) for (a=1, b=1/3, n=5)

Now, suppose some continuous function \(y=D(x)\) has the property that \(D(k) = C[k-1]\). That is, \(y=D(x)\) is a continuous function that is fitted to match certain discrete points defined by \(C[x]\). Note that, because \(C[x]\) is a quadratic function, \(y=D(x)\) can be as well, making it a strictly monotonic function. Consequently, \(k \leq x < k+1\) when \(D(k) \leq y=D(x) < D(k+1)\) for any sample \(y\). In other words, each integer interval  \(k \leq x < k+1\) (representing state \(k = \lfloor x \rfloor\)) in the domain of function \(D(x)\) maps to a specific range, which will contain any random uniformly generated sample \(y\) with probability \(P[k]\). And conversely, that means that \(k=\lfloor D^{-1}(y)\rfloor\) maps any uniform sample \(y\) to state \(k\) with probability \(P[k]\).

Inversed D(x) (in red) and its floor (in blue) for (a=1, b=1/3, n=5)

Inversed D(x) (in red) and its floor (in blue) for (a=1, b=1/3, n=5)

What remains is finding \(D(x)\) and inverting it. As \(D(x) = C[k-1]\), and both are quadratic functions, we can trivially interpret the formula for \(C[x]\) as a continuous function, and substitute \(x+1\) for \(k\) in \(C[k]\) to get \(D(x)\). This leads to \(D(x) = \frac{(a(2n-1)-b)x+(b-a)x^2}{(a+b)(n-1)n}\). Solving \(y=D(x)\) for \(x\) results in the two roots \(x\) \(=\) \(\frac{b-a(2n-1) \pm \sqrt{(b-a(2n-1))^2+4(a + b)(b-a) (n-1) ny}}{2(b-a)}\). One of the roots is the inverse we're looking for and a simple analysis shows which one. For the case of \( a < b \), the \(D(x)\)'s minimum is (close) to the left of the domain \([0,n)\), and we need the 'plus' root to return the greater of the two. For the case of \(a>b\), the \(D(x)\)'s maximum is (close) to the right of the domain, and we need the 'plus root' (together with the now negative denominator) to return the smaller of the two. So, in both cases, \(k = \lfloor\frac{b-a(2n-1)+\sqrt{(b-a(2n-1))^2+4(a + b)(b-a) (n-1) ny}}{2(b-a)}\rfloor\).

Comments (3)

October 19, 2012

Yes, it's an interesting poblrem, but I much prefer the 3-door game show puzzle, mostly because 99% of people (it's good to throw in an arbitrary and made up statistic in the middle of a probability poblrem, right?) get it wrong and don't understand why, even when you tell them the answer. While I'm sure you know the poblrem and solution, I'll write the poblrem here for the sake of your many loyal readers. So suppose you're on a game show, and you're presented with 3 doors to choose from, one of which has a new car behind it, the other 2 have goats (or whatever, the point is you don't want those doors). So you make your choice, but the host, instead of opening the door you chose, opens one of the other doors he knows the car is not behind. Then he present you with a choice, you can either keep your first choice or change your pick to the other unopened door. What should you do?Most people come to this conclusion, at the opening of the door, the probability that the car is behind either remainding door is 50-50, so it really doesn't matter. And most will go one step further and say you should never second guess yourself so they'll stick with their original choice. Well those people are most likely (2/3 of the time) are going to driving a goat to work. The correct answer is that you should always change your pick. The odds of winning when you change your pick are 2 out of every 3 times. Anyway, I'm bored at work and your poblrem reminded me of that poblrem and I thought it would make me sound smart. Enjoy.

Andrej Oskin
December 13, 2012

I am sorry, but what you describe is not quite Bernoulli distribution. Bernoulli distribution describes probablity of having "m" times event A in "n" tries, if probability of single event is "p". And Bernoulli distribution is given with
P_n(m) = C_n^m p^m (1-p)^(n-m), where

C_n^m = n!/(m! (n-m)!)

Generalization of Bernoulli distribution would be something like this: you may encounter either event A, or B, or C with probabilities correspondingly p, q, r. (of course p+q+r = 1). Then question sounds like that: what is the probability to encounter event A - n times, event B - m times, event C - l times during n+m+l tries.

Answer is the following:
P(A:n, B:m, C:l) = (n+m+l)!/(n!m!l!) p^n * q^m * r^l

It would be an interesting question to find continous limit of this distribution. Bernoulli limit is Gauss distribution - and what will it be in this case?

December 13, 2012

Hi Andrej. Thanks for the input. I have to disagree, though. You seem to have described the binomial distribution, not the Bernoulli distribution, although one may think of the binomial distribution as one possible type of generalization of the Bernoulli distribution. In this article, I've chosen to generalize the Bernoulli distribution in a different direction, namely the number of discrete steps (still in one trial), instead of in the number of trials. Nonetheless, your last question is an interesting one :-) . Regards, Giliam

Leave a comment