The Counter-Intuitive Nature of Memoryless Distributions

Our intuitions about probability aren’t always right

Consider flipping a coin. If you have flipped the coin 100 times and gotten 100 tails in a row, what is the probability that you will get a heads on the next flip? What if you have gotten 10,000 tails in a row first? 1,000,000? Obviously, the answer is still 50%. While the odds of getting into a particular state of coin flips vary (these examples would be highly unusual to see in reality), each coin flip event is wholly independent of those that have come before.

In statistics, a probability distribution which has this property is called memoryless, in that the distribution has no memory of previous states.

The formal definition is: given a random variable $X \sim \text{dist}$ where $\text{dist}$ is a memoryless distribution, $$ \textbf{Pr}\left[X > s + t \mid X > s\right] = \textbf{Pr}\left[X > t\right] $$ That is to say, the probability that $X$ exceeds $s$ by more than $t$, given the knowledge that $X$ is already greater than $s$, is the same as the probability that $X$ is greater than $t$ alone. The knowledge that $X$ already exceeds some value is irrelevant to the probability.

Let’s give a more specific example: flipping a coin until we get a heads. Say that we have already flipped the coin $s = 100$ times with nary a heads in sight. What is the probability that we’ll flip $t = 2$ more times and still not get a heads? That’s easy enough to calculate: the probability of getting two tails in a row is $\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}$. Notice how the fact that we have a history of 100 tails in a row doesn’t even factor into the calculation? That’s because the repeated flipping of a coin follows a memoryless distribution (a geometric distribution, to be precise).
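
If you’d rather see this than take my word for it, here’s a rough simulation sketch in Python (the trial count, and the use of 5 tails as a stand-in for the 100 in the example, are arbitrary choices on my part; a run of 100 tails is far too rare to ever show up in a simulation):

```python
import random

def tails_before_heads():
    """Flip a fair coin until heads appears; return how many tails came first."""
    tails = 0
    while random.random() < 0.5:  # treat < 0.5 as tails
        tails += 1
    return tails

TRIALS = 1_000_000
samples = [tails_before_heads() for _ in range(TRIALS)]

# Unconditional: probability of at least 2 tails before the first heads.
p_two_tails = sum(x >= 2 for x in samples) / TRIALS

# Conditional: among runs that already have at least 5 tails, the probability
# of at least 2 *more* tails before the first heads.
long_runs = [x for x in samples if x >= 5]
p_two_more = sum(x >= 7 for x in long_runs) / len(long_runs)

print(f"Pr[X >= 2]          ~ {p_two_tails:.3f}")  # ~0.25
print(f"Pr[X >= 7 | X >= 5] ~ {p_two_more:.3f}")   # also ~0.25
```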

Compare this to a uniform distribution. Say that we are waiting in a line, and the amount of time that we need to wait is uniformly distributed over the interval of $0$ to $10$ minutes. This means that the average wait time will be $\frac{10 - 0}{2} = 5$ minutes. Let’s say that we have already been waiting for 2 minutes. What is the expected amount of time that we have left to wait?

A uniform distribution is not memoryless, so our expected wait time will change given this new information. The expected remaining wait after already waiting for 2 minutes is 4 minutes (not the 3 minutes you may have expected), because $\frac{10 - 2}{2} = 4$. This makes perfect, intuitive sense. As we wait in line, the expected amount of additional waiting time decreases.
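
Spelling that reasoning out: once we know that the wait $W$ is longer than 2 minutes, $W$ is uniformly distributed over the remaining interval of $2$ to $10$ minutes, so the expected time left to wait is $$ \mathbf{E}\left[W - 2 \mid W > 2\right] = \frac{10 - 2}{2} = 4 \text{ minutes} $$ The distribution very much remembers that 2 minutes have already gone by.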

Where things get very strange is when you start using a memoryless distribution to model things like wait times. One common example of this is the exponential distribution, which is, intuitively, the continuous analog of the geometric distribution that we saw above with the coin flips.

An exponential distribution is used to model a process where events occur continuously and independently at a constant rate. Specifically, it has a density function of $$ f(t) = \lambda e^{- \lambda t} $$ where $\lambda$ represents the rate.
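
It’s worth checking that this density actually satisfies the memoryless property from the definition above. Integrating the density gives the tail probability $\textbf{Pr}\left[X > t\right] = \int_t^{\infty} \lambda e^{- \lambda u} \, du = e^{- \lambda t}$, and from there, $$ \textbf{Pr}\left[X > s + t \mid X > s\right] = \frac{\textbf{Pr}\left[X > s + t\right]}{\textbf{Pr}\left[X > s\right]} = \frac{e^{- \lambda (s + t)}}{e^{- \lambda s}} = e^{- \lambda t} = \textbf{Pr}\left[X > t\right] $$ which is exactly the defining property.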

Skipping a lot of math, the average (or expected) value of a random variable following an exponential distribution is $$ \mathbf{E}[X] = \frac{1}{\lambda} $$
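
(If you want one line of that skipped math: for a non-negative random variable, the expected value is the integral of the tail probability, so $\mathbf{E}[X] = \int_0^{\infty} \textbf{Pr}\left[X > t\right] dt = \int_0^{\infty} e^{- \lambda t} \, dt = \frac{1}{\lambda}$.)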

So, let’s re-examine our wait time example above, except this time we’ll use $W \sim \text{exp}(\frac{1}{5})$ as our random variable representing the wait time. This means that the average wait time is $$ \mathbf{E}[W] = \frac{1}{\frac{1}{5}} = 5 \text{ minutes} $$

Let’s pose the same question again. Given that I have already waited for 2 minutes, what is the expected amount of time that I still must wait?

Why, 5 minutes, of course! The distribution is memoryless, so the amount of time that we have left to wait is independent of the amount of time that we have already waited. This is a very counter-intuitive result.
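
If that answer feels like a trick, here’s a rough simulation sketch that’s hard to argue with (Python’s random.expovariate takes the rate $\lambda$ directly; the trial count and the 2-minute cutoff mirror the example above):

```python
import random

RATE = 1 / 5          # lambda = 1/5, so the mean wait is 5 minutes
TRIALS = 1_000_000
ALREADY_WAITED = 2    # minutes we have already spent in line

waits = [random.expovariate(RATE) for _ in range(TRIALS)]

# Unconditional average wait.
mean_wait = sum(waits) / len(waits)

# Remaining wait, restricted to the runs where we were still waiting at 2 minutes.
remaining = [w - ALREADY_WAITED for w in waits if w > ALREADY_WAITED]
mean_remaining = sum(remaining) / len(remaining)

print(f"E[W]             ~ {mean_wait:.2f} minutes")       # ~5
print(f"E[W - 2 | W > 2] ~ {mean_remaining:.2f} minutes")  # also ~5
```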

To understand why this is the case, we need to go back to the definition. We are assuming that events occur at a constant rate, so we might imagine that the service that we are in line for finishes serving the current person within a consistent amount of time. Okay, that makes sense. However, the sneaky thing is that in this exponential model the next person to be served is selected at random from all of the people waiting. On each event completion, a random person is selected to be served next.

This means that the probability that you are selected has nothing to do with how long you have been waiting. And so the average amount of time that you need to wait is strictly a function of how many other people are waiting and how frequently a random person is picked (the effects of which are combined and modeled using $\lambda$ in our equation). As a result, your average remaining wait time is constant, irrespective of how long you’ve been waiting already. This is the same reasoning that we saw initially when considering the probability of getting a heads on a given coin flip.
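
Here’s a toy version of that selection process, just to drive the point home. Everything about the setup below is my own simplification (a fixed line of 10 people, one service completion per minute, the next person chosen uniformly at random), but it captures the mechanism described above:

```python
import random

N = 10            # people in line, held constant for simplicity
TRIALS = 200_000

def minutes_until_served():
    """One service completes per minute; each time, a random person is picked.
    Count the minutes until the person picked is you."""
    minutes = 0
    while True:
        minutes += 1
        if random.randrange(N) == 0:   # you happen to be the one chosen
            return minutes

waits = [minutes_until_served() for _ in range(TRIALS)]

mean_wait = sum(waits) / len(waits)

# Among the runs where you were still waiting after 5 minutes, how much longer
# did you wait on average?
remaining = [w - 5 for w in waits if w > 5]
mean_remaining = sum(remaining) / len(remaining)

print(f"E[wait]                 ~ {mean_wait:.1f} minutes")      # ~10
print(f"E[remaining | waited 5] ~ {mean_remaining:.1f} minutes") # also ~10
```

Both averages come out around 10 minutes: the 5 minutes you have already stood in line buy you nothing.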

Naturally, this is a very bad model for a queue of people waiting for some service, at the very least from a psychological standpoint, but it does crop up a lot in queueing theory because of how convenient it is to work with memoryless distributions (and a very helpful relationship between exponential distributions and Poisson Processes–but that’s a topic for another day).

Thus, it’s worth at least being aware of memoryless distributions and some of their seemingly odd properties. And, even more generally, it’s worth ensuring that you understand the properties of any probability distribution that you are using to model a problem–beyond simply knowing the right equations to use–because our intuitive understanding of probability often falls short, and can lead us to incorrect answers, or at least to doing a lot more work than we really need to.