Concentration Inequalities

Suppose we have a biased die, but we don’t know what the bias is. Concretely, let $p$ be the unknown probability that the die shows a particular face, say a six. A natural estimate comes from rolling the die $n$ times, letting $S_n$ be the number of times that face appears, and taking the empirical frequency $\hat{p} = S_n / n$. Since $E[S_n] = np$, our estimator $\hat{p}$ has expected value $E[\hat{p}] = \frac{1}{n} E[S_n] = p$. This tells us that if $n$ is sufficiently large, we can expect $\hat{p}$ to be close to $p$, thanks to the Law of Large Numbers.

How large does $n$ need to be? There is no absolute answer: we can never guarantee with certainty that $|\hat{p} - p| \le \epsilon$, where $\epsilon$ is some allowed error. Instead, we can only assert that $|\hat{p} - p| \le \epsilon$ with confidence $1 - \delta$, i.e. $P[|\hat{p} - p| \le \epsilon] \ge 1 - \delta$. There remains a small probability $\delta$ that the error exceeds $\epsilon$.

To put this into a formula:

$$n \ge \frac{1}{4\epsilon^2\delta}$$
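Here is a minimal Python sketch of the formula in action (the values $p = 0.3$, $\epsilon = 0.05$, $\delta = 0.1$ and the helper name are illustrative assumptions, not from the text): it draws $n \ge \frac{1}{4\epsilon^2\delta}$ samples per experiment and checks empirically that the estimate misses by more than $\epsilon$ far less than a $\delta$ fraction of the time.

```python
import random

# Illustrative sketch (values are assumptions, not from the text): check that
# n >= 1 / (4 * eps^2 * delta) samples make P[|p_hat - p| > eps] <= delta.
def estimate_p(p, n):
    """Return the empirical frequency p_hat after n independent trials."""
    return sum(1 for _ in range(n) if random.random() < p) / n

p, eps, delta = 0.3, 0.05, 0.1           # assumed example values
n = int(1 / (4 * eps**2 * delta)) + 1    # sample size suggested by the bound

trials = 2000
failures = sum(1 for _ in range(trials) if abs(estimate_p(p, n) - p) > eps)
print(f"n = {n}, empirical failure rate = {failures / trials:.4f} (target: <= {delta})")
```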

In order to prove this, let’s take a detour to Markov’s Inequality.

Markov’s Inequality

For a nonnegative random variable $X$ (i.e. $X(\omega) \ge 0$ for all $\omega \in \Omega$) with a finite mean,

$$P[X \ge c] \le \frac{E[X]}{c}$$

for any positive constant c.

Proof

Let $A$ denote the range of $X$, and consider any constant $c > 0$. Then,

$$E[X] = \sum_{a \in A} a \times P[X = a] \ge \sum_{a \ge c} a \times P[X = a] \ge \sum_{a \ge c} c \times P[X = a] = c \sum_{a \ge c} P[X = a] = c \times P[X \ge c]$$

The proof starts with the definition of expectation, uses the fact that $X$ is nonnegative to drop the terms with $a < c$, then lower-bounds each remaining $a$ by $c$; dividing the resulting inequality $E[X] \ge c \times P[X \ge c]$ by $c$ gives the desired result.
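As a quick numerical sanity check, here is a minimal Python sketch (the fair six-sided die and $c = 5$ are assumed examples, not from the text) comparing the empirical value of $P[X \ge c]$ with the bound $\frac{E[X]}{c}$:

```python
import random

# Minimal sketch: empirically compare P[X >= c] with Markov's bound E[X]/c
# for X a fair six-sided die roll (an assumed example, not from the text).
trials = 100_000
c = 5
rolls = [random.randint(1, 6) for _ in range(trials)]

empirical_prob = sum(1 for x in rolls if x >= c) / trials   # true value is 2/6
markov_bound = (sum(rolls) / trials) / c                    # roughly 3.5 / 5

print(f"P[X >= {c}] ≈ {empirical_prob:.3f}  <=  E[X]/{c} ≈ {markov_bound:.3f}")
```

The bound holds but is loose here (roughly $0.33$ versus $0.70$), which is typical when only the mean of $X$ is known.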

We can give a slicker proof using the indicator function $I\{\cdot\}$, defined as:

$$I\{\text{statement}\} = \begin{cases} 1, & \text{if the statement is true} \\ 0, & \text{if the statement is false} \end{cases}$$

Proof

Since $X$ is a nonnegative random variable and $c > 0$, we have, for all $\omega \in \Omega$,

$$X(\omega) \ge c \cdot I\{X(\omega) \ge c\},$$

since the right-hand side is $0 \le X(\omega)$ when $X(\omega) < c$ and is $c \le X(\omega)$ when $X(\omega) \ge c$. Multiplying both sides by $P[\omega]$ and summing over $\omega \in \Omega$ gives,

$$E[X] \ge c \, E[I\{X \ge c\}] = c \times P[X \ge c]$$

where the first inequality follows from the pointwise inequality above and the fact that $P[\omega] \ge 0$ for all $\omega \in \Omega$, while the last equality follows from the fact that $I\{X \ge c\}$ is an indicator random variable, so its expectation is exactly $P[X \ge c]$. Dividing by $c$ completes the proof.
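Both facts used above, the pointwise inequality $X(\omega) \ge c \cdot I\{X(\omega) \ge c\}$ and the identity $E[I\{X \ge c\}] = P[X \ge c]$, can be checked directly; here is a small Python sketch (the fair die and $c = 4$ are assumptions chosen for illustration):

```python
import random

# Illustrative sketch: the two facts used above, checked for a fair die roll X
# and c = 4. (1) X(w) >= c * I{X(w) >= c} pointwise, and (2) E[I{X >= c}]
# equals P[X >= c], since an indicator's mean is the probability it equals 1.
c = 4
outcomes = range(1, 7)

# Pointwise inequality X >= c * I{X >= c} for every outcome.
assert all(x >= c * (1 if x >= c else 0) for x in outcomes)

trials = 100_000
rolls = [random.randint(1, 6) for _ in range(trials)]
indicators = [1 if x >= c else 0 for x in rolls]

print(f"E[I{{X >= {c}}}] ≈ {sum(indicators) / trials:.3f}")   # ≈ P[X >= 4] = 0.5
print(f"P[X >= {c}]     ≈ {sum(1 for x in rolls if x >= c) / trials:.3f}")
```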

Generalized Markov’s Inequality

Let Y be an arbitrary random variable with finite mean. For any positive constants c and r,

$$P[|Y| \ge c] \le \frac{E[|Y|^r]}{c^r}$$

Proof

For $c > 0$ and $r > 0$ (if $r$ were negative, the second inequality below would not hold), we have,

$$|Y|^r \ge |Y|^r \cdot I\{|Y| \ge c\} \ge c^r \cdot I\{|Y| \ge c\}$$

Taking expectations of both sides,

$$E[|Y|^r] \ge c^r \, E[I\{|Y| \ge c\}] = c^r \, P[|Y| \ge c]$$

Divide by $c^r$ to obtain the desired result.
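To see what the extra parameter buys us, here is a Python sketch (the centered die roll $Y = X - 3.5$ and $c = 2$ are illustrative assumptions) comparing the empirical tail probability with the bounds for $r = 1$ and $r = 2$:

```python
import random

# Illustrative sketch: Generalized Markov's Inequality for Y = X - 3.5, where
# X is a fair die roll, comparing the bounds for r = 1 and r = 2 at c = 2.
trials = 100_000
c = 2
ys = [random.randint(1, 6) - 3.5 for _ in range(trials)]

empirical = sum(1 for y in ys if abs(y) >= c) / trials            # P[|Y| >= 2] = 1/3
bound_r1 = (sum(abs(y) for y in ys) / trials) / c                 # E[|Y|]   / c
bound_r2 = (sum(abs(y) ** 2 for y in ys) / trials) / c**2         # E[|Y|^2] / c^2

print(f"P[|Y| >= {c}] ≈ {empirical:.3f}, r=1 bound ≈ {bound_r1:.3f}, r=2 bound ≈ {bound_r2:.3f}")
```

Here the $r = 2$ bound is slightly tighter than the $r = 1$ bound; which exponent gives the best bound depends on the distribution of $Y$.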

Chebyshev’s Inequality

For a random variable $X$ with a finite expectation $E[X] = \mu$ and any positive constant $c$,

$$P[|X - \mu| \ge c] \le \frac{\mathrm{Var}(X)}{c^2}$$

Proof

Let $Y = (X - \mu)^2$, and note that $E[Y] = E[(X - \mu)^2] = \mathrm{Var}(X)$.

$$|X - \mu| \ge c \iff Y = (X - \mu)^2 \ge c^2$$

Therefore, $P[|X - \mu| \ge c] = P[Y \ge c^2]$. Since $Y$ is nonnegative, applying Markov’s inequality yields

$$P[|X - \mu| \ge c] = P[Y \ge c^2] \le \frac{E[Y]}{c^2} = \frac{\mathrm{Var}(X)}{c^2}$$

And here’s another proof using the generalized theorem:

Proof

Let $Y = X - \mu$. Then, since $|Y|^2 = (X - \mu)^2$, we have $E[|Y|^2] = E[(X - \mu)^2] = \mathrm{Var}(X)$. Hence,

$$P[|X - \mu| \ge c] = P[|Y| \ge c] \le \frac{E[|Y|^2]}{c^2} = \frac{\mathrm{Var}(X)}{c^2}$$

follows from applying the Generalized Markov’s Inequality to $Y = X - \mu$ with $r = 2$.
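Chebyshev’s Inequality is the tool that will eventually answer the sample-size question from the introduction. As a preview, here is a Python sketch; it assumes the standard fact that $\mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n}$, which has not been derived above, and the values $p = 0.3$, $n = 500$, $\epsilon = 0.05$ are illustrative only:

```python
import random

# Sketch (assumes Var(p_hat) = p*(1-p)/n, a standard fact not derived in the
# text above; p, n, eps are illustrative values): Chebyshev's bound on the
# estimator from the introduction, P[|p_hat - p| >= eps] <= Var(p_hat)/eps^2.
p, n, eps, trials = 0.3, 500, 0.05, 5000

def p_hat():
    """One experiment: n independent trials, return the empirical frequency."""
    return sum(1 for _ in range(n) if random.random() < p) / n

empirical = sum(1 for _ in range(trials) if abs(p_hat() - p) >= eps) / trials
chebyshev_bound = (p * (1 - p) / n) / eps**2

print(f"P[|p_hat - p| >= {eps}] ≈ {empirical:.3f}  <=  {chebyshev_bound:.3f}")
```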

Stopping to consider what Chebyshev’s Inequality means: it tells us that the probability that $X$ deviates from its mean by $c$ or more is at most $\frac{\mathrm{Var}(X)}{c^2}$. If the variance is small, the deviation probability is small. This has the following immediate consequence:

For any random variable $X$ with finite expectation $E[X] = \mu$ and finite standard deviation $\sigma = \sqrt{\mathrm{Var}(X)}$,

$$P[|X - \mu| \ge k\sigma] \le \frac{1}{k^2}$$

for any constant $k > 0$; this follows from Chebyshev’s Inequality by setting $c = k\sigma$.

This means that, for example, the probability of deviating from the mean by more than two standard deviations on either side is at most $\frac{1}{4}$, allowing the standard deviation to be a good working definition of the “spread” of a distribution.
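As a closing illustration, here is a Python sketch checking the $k = 2$ case empirically; the two distributions (a fair die roll and a rate-$1$ exponential) are assumptions chosen for illustration:

```python
import random
import statistics

# Sketch: check P[|X - mu| >= k*sigma] <= 1/k^2 empirically for k = 2 on two
# assumed example distributions (a fair die roll and a rate-1 exponential).
def tail_fraction(samples, k):
    """Fraction of samples at least k sample standard deviations from the mean."""
    mu = statistics.fmean(samples)
    sigma = statistics.pstdev(samples)
    return sum(1 for x in samples if abs(x - mu) >= k * sigma) / len(samples)

trials, k = 100_000, 2
die = [random.randint(1, 6) for _ in range(trials)]
expo = [random.expovariate(1.0) for _ in range(trials)]

print(f"die:         P[|X - mu| >= {k}*sigma] ≈ {tail_fraction(die, k):.3f}  <=  {1 / k**2}")
print(f"exponential: P[|X - mu| >= {k}*sigma] ≈ {tail_fraction(expo, k):.3f}  <=  {1 / k**2}")
```

For the die, a deviation of two standard deviations is actually impossible, so the empirical fraction is $0$; for the exponential it is about $0.05$. Both sit comfortably below the guaranteed $\frac{1}{4}$, reflecting that Chebyshev’s bound is distribution-free and often loose.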