Deriving Single-Element Reservoir Sampling

July 20, 2024

Introduction

Reservoir sampling is a method of selecting a random subset from a collection of a priori unknown size. Here, we consider the problem of selecting a single element subject to an arbitrary probability distribution.

In particular, let $N \in \N$ and $X$ a random variable taking values in $\set{1, 2, \dots, N}$ constructed as follows:

Let $x_1 = 1$
For each $n > 1$, let $x_n$ be $n$ with probability $p_n$, and $x_{n-1}$ otherwise
At the end, take $X = x_N$

That is, we begin by selecting the first element, and then consider each subsequent element in turn, replacing our choice for the sampled value of $X$ by the current element $n$ with probability $p_n$.

Reproducing a Probability Distribution

The above scheme gives rise to the probability distribution where the result $x_N = n$ is produced by:

Selecting $n$ for $x_n$
Keeping $n$ selected for all $x_i$ with $i > n$

Therefore,

\[P_n \defeq \mathbb{P}(X = n) = p_n \cdot \prod_{i=n+1}^N (1-p_i)\]

We want to determine how, given a fixed probability distribution $\set{P_n}$, we invert this to compute the per-step probabilities $\set{p_n}$.

Begin by observing that $P_n$ depends only on $p_i$ for $i \geq n$, and in particular, that if we define

\[\bar{P}_n \defeq \frac{P_n}{p_n} = \begin{cases} 1 & n = N \\ \prod_{i=n+1}^N (1-p_i) & \text{otherwise} \end{cases}\]

then the expression for $\bar{P}_{n-1}$ in terms of $\bar{P}_n$ is particularly simple,

\[\begin{align*} \bar{P}_{n-1} &= \prod_{i=n}^N (1-p_i) \\ &= (1-p_n) \prod_{i=n+1}^N (1-p_i) \\ &= (1-p_n) \bar{P}_n \\ &= \bar{P}_n - p_n \bar{P}_n \\ &= \bar{P}_n - P_n \end{align*}\]

Shifting indices by one, this becomes

\[\bar{P}_n = \bar{P}_{n+1} - P_{n+1}\]

which we can substitute recursively to obtain

\[\begin{align*} \bar{P}_n &= \left(\bar{P}_{n+2} - P_{n+2}\right) - P_{n+1} \\ &= \left(\bar{P}_{n+3} - P_{n+3} \right) - P_{n+2} - P_{n+1} \\ &= \dots \\ &= \bar{P}_N - P_N - P_{N-1} - \dots - P_{n+1} \\ &= 1 - \sum_{i=n+1}^N P_i \end{align*}\]

and hence

\[\frac{P_n}{p_n} = 1 - \sum_{i=n+1}^N P_i \iff p_n = \frac{P_n}{1 - \sum_{i=n+1}^N P_i}\]

In our sampling method, we want to write $p_n$ in terms of state obtained by sampling up to $n$, rather than no earlier than $n$. We can do this by recalling that, since the $\set{P_i}$ are a probability distribution, we have

\[\sum_{i} P_i = 1 \implies \sum_{i={n+1}}^N P_i = 1 - \sum_{i=1}^n P_i\]

so that

\[p_n = \frac{P_n}{\sum_{i=1}^n P_i}\]

Probability Distribution Examples

Uniform Distribution

For a uniform distribution, we take $P_n = \frac{1}{N} \; \forall n$. Then $\sum_{i=1}^n P_i = \frac{n}{N}$, and hence

\[p_n = \frac{1}{N} \cdot \left(\frac{n}{N}\right)^{-1} = \frac{1}{n}\]

Weighted Distribution

For a sequence with weights $\omega_n > 0$, let $\Omega = \sum_n \omega_n$ so that

\[P_n = \frac{\omega_n}{\Omega}\]

To form an expression for $p_n$ in terms of the weights, let $\Omega_n = \sum_{i=1}^n \omega_i$. Then

\[\sum_{i=1}^n P_n = \frac{\Omega_n}{\Omega}\]

and so

\[p_n = \frac{\omega_n}{\Omega} \cdot \left(\frac{\Omega_n}{\Omega}\right)^{-1} = \frac{\omega_n}{\Omega_n}\]

Twitter Facebook LinkedIn

Matt Rowe

Deriving Single-Element Reservoir Sampling

Introduction

Reproducing a Probability Distribution

Probability Distribution Examples

Uniform Distribution

Weighted Distribution

You May Also Enjoy

Transformations of normals under parametric scalings

Fourier Transforms

Superellipses

Special Functions: Gamma, Beta, and Zeta