Large deviations: Cramér vs Sanov


Posted on August 03, 2012

5 minute read

I have recently been reading up on two classical results from large deviation theory: the Cramér-Chernoff theorem and Sanov’s theorem. Both of them bound the probability that an empirical average is far away from its mean, and both of them are asymptotically tight in the exponent.

It turns out that the two are more or less equivalent, but while Sanov’s theorem has the nicest information theoretic interpretation, the Cramér-Chernoff theorem seems to introduce the fewest measure-theoretic complications. Let me explain…

The Cramér-Chernoff Theorem

Let $X, X_1, \ldots, X_n$ be independent, identically distributed random variables, and let $\mu(\lambda) = \log \mathbb{E}[e^{\lambda X}]$ be their cumulant generating function. Then many standard concentration inequalities (Hoeffding, Bernstein, Bennett) can be derived [1, Appendix A.1] from a single basic result:

$$\Pr\left(\frac{1}{n}\sum_{i=1}^n X_i \geq a\right) \leq e^{-n\{\lambda a - \mu(\lambda)\}} \qquad (\lambda > 0).$$

This result can easily be proved using Markov’s inequality and is non-trivial as long as $a > \mathbb{E}[X]$. Its ubiquity is explained by the Cramér-Chernoff theorem [2], which states that it asymptotically achieves the best possible exponent if we optimize our choice of $\lambda$:

$$\lim_{n\to\infty} -\frac{1}{n}\log \Pr\left(\frac{1}{n}\sum_{i=1}^n X_i \geq a\right) = \sup_{\lambda>0}\,\{\lambda a - \mu(\lambda)\}.$$
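As a quick numerical sanity check (my own, not from the original post), here is a small Python sketch for Bernoulli($p$) variables: it maximizes $\lambda a - \mu(\lambda)$ over a grid of $\lambda > 0$ and compares the resulting bound with a Monte Carlo estimate of the tail probability. The values of $p$, $a$ and $n$ are arbitrary choices for illustration.

```python
import numpy as np

# Illustrative setup (arbitrary choices): X ~ Bernoulli(p), threshold a > E[X] = p.
p, a, n = 0.3, 0.45, 50
rng = np.random.default_rng(0)

# Cumulant generating function mu(lambda) = log E[e^{lambda X}] for Bernoulli(p).
def mu(lam):
    return np.log(1 - p + p * np.exp(lam))

# Optimize the exponent lambda * a - mu(lambda) over a grid of lambda > 0.
lams = np.linspace(1e-3, 20, 20_000)
rate = np.max(lams * a - mu(lams))

# Chernoff upper bound e^{-n * rate} versus a Monte Carlo estimate of the tail probability.
bound = np.exp(-n * rate)
means = rng.binomial(n, p, size=200_000) / n
tail = np.mean(means >= a)

print(f"optimized exponent        : {rate:.4f}")
print(f"Chernoff bound            : {bound:.3e}")
print(f"Monte Carlo tail estimate : {tail:.3e}")
```

The bound should always sit above the Monte Carlo estimate; the gap in the exponent shrinks as $n$ grows, in line with the limit above.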

Sanov’s Theorem

The empirical distribution $P_n$ of $X_1,\ldots,X_n$ gives probability $P_n(A) = \frac{1}{n}\sum_{i=1}^n \mathbf{1}\{X_i \in A\}$ to any event $A$, which is the fraction of variables taking their value in $A$. If the distribution of the variables is $P^*$, then asymptotically we would expect $P_n$ to be close to $P^*$. So if $\mathcal{P}$ is a set of distributions that is far away from $P^*$ in some sense, the probability that $P_n \in \mathcal{P}$ should be small. This intuition is quantified by Sanov’s theorem [3], [4].

To avoid measure-theoretic complications, let us assume that the variables $X_1,\ldots,X_n$ are discrete, with a finite number of possible values. Suppose $\mathcal{P}$ is a convex set of distributions and assume that the information projection

$$Q^* = \operatorname*{arg\,min}_{P\in\mathcal{P}}\, D(P\|P^*)$$

of $P^*$ on $\mathcal{P}$ exists, where $D(P\|P^*) = \sum_x P(x)\log\frac{P(x)}{P^*(x)}$ is the Kullback-Leibler divergence. Then $D(Q^*\|P^*)$ provides an information-theoretic measure of the “distance” of $\mathcal{P}$ from $P^*$, and indeed [3]

$$\Pr(P_n \in \mathcal{P}) \leq e^{-n D(Q^*\|P^*)}.$$

Moreover, Sanov’s theorem tells us that this bound is asymptotically tight in the exponent:

$$\lim_{n\to\infty} -\frac{1}{n}\log \Pr(P_n \in \mathcal{P}) = D(Q^*\|P^*).$$
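To make this concrete, here is a small numerical illustration (again my own, with arbitrary choices): on a three-symbol alphabet, take the convex set $\mathcal{P} = \{P : P(0) \geq c\}$ with $c > P^*(0)$. For this particular constraint the information projection has a simple form: $Q^*$ puts mass $c$ on symbol 0 and spreads the remaining mass over the other symbols proportionally to $P^*$, so $D(Q^*\|P^*)$ reduces to the binary divergence between $(c, 1-c)$ and $(P^*(0), 1-P^*(0))$. The sketch compares the Sanov bound $e^{-nD(Q^*\|P^*)}$ with a Monte Carlo estimate of $\Pr(P_n \in \mathcal{P})$.

```python
import numpy as np

# Illustrative setup (arbitrary choices): three symbols {0, 1, 2} with distribution P*,
# and the convex constraint set P = {P : P(0) >= c} with c > P*(0).
p_star = np.array([0.2, 0.5, 0.3])
c, n = 0.4, 60
rng = np.random.default_rng(1)

# For this constraint, D(Q*||P*) is the binary KL divergence between (c, 1-c)
# and (P*(0), 1-P*(0)), since Q* agrees with P* conditionally on symbols != 0.
def binary_kl(q, p):
    return q * np.log(q / p) + (1 - q) * np.log((1 - q) / (1 - p))

d_proj = binary_kl(c, p_star[0])
bound = np.exp(-n * d_proj)

# Monte Carlo estimate of Pr(P_n in P) = Pr(P_n(0) >= c); the count of symbol 0
# among n draws from P* is Binomial(n, P*(0)).
counts0 = rng.binomial(n, p_star[0], size=200_000)
prob = np.mean(counts0 / n >= c)

print(f"D(Q*||P*)   = {d_proj:.4f}")
print(f"Sanov bound = {bound:.3e}")
print(f"Monte Carlo = {prob:.3e}")
```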

Csiszár [3, p. 790] has an elegant information-theoretic proof of the upper bound. He also works out sufficient measure-theoretic conditions for the theorem to hold in continuous spaces, which are quite clean, but still require considerable care to verify.

One may extend Sanov’s theorem to non-convex sets $\mathcal{P}$ by a union bound argument. For example, Cover and Thomas [4] take a union bound over all possible values for $P_n$, which they call types. By discretization arguments one may further extend the theorem to infinite spaces [5], but then things get a bit too asymptotic for my taste.
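The union bound does not hurt the exponent because the number of types grows only polynomially in $n$: there are at most $(n+1)^{|\mathcal{X}|}$ of them [4]. A tiny sketch (my own; alphabet and sample sizes chosen arbitrarily) tabulating the exact count against this bound:

```python
from itertools import product
from math import comb

# The number of types (possible empirical distributions) of n samples from an
# alphabet of size k is the number of ways to split n into k ordered counts,
# i.e. comb(n + k - 1, k - 1), which is at most (n + 1)**k.
def num_types(n, k):
    return comb(n + k - 1, k - 1)

for n, k in product([10, 50, 200], [2, 3, 4]):
    print(f"n={n:4d}, |X|={k}: types={num_types(n, k):9d}  (n+1)^|X|={(n + 1) ** k:11d}")
```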

From Sanov to Cramér-Chernoff

It turns out that the Cramér-Chernoff theorem can be obtained from Sanov’s theorem. (This is called contraction.) The trick is to observe that

$$\mathbb{E}_{X\sim P_n}[X] = \frac{1}{n}\sum_{i=1}^n X_i,$$

so if we define the convex set $\mathcal{P} = \{P \mid \mathbb{E}_{X\sim P}[X] \geq a\}$, then

$$\Pr(P_n \in \mathcal{P}) = \Pr\left(\frac{1}{n}\sum_{i=1}^n X_i \geq a\right).$$

It remains to evaluate $D(Q^*\|P^*)$ in this case, which can be done as follows:

Introduce a Lagrange multiplier to obtain:

$$\min_{P\in\mathcal{P}} D(P\|P^*) = \min_{\{P \,\mid\, \mathbb{E}_{X\sim P}[X] \geq a\}} D(P\|P^*) = \min_P \sup_{\lambda>0}\bigl\{D(P\|P^*) - \lambda(\mathbb{E}_P[X]-a)\bigr\}.$$
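(To spell out the second equality, a standard observation added here for completeness: for any fixed $P$ the inner supremum simply encodes the constraint,

$$\sup_{\lambda>0}\,\bigl\{D(P\|P^*)-\lambda(\mathbb{E}_P[X]-a)\bigr\} = \begin{cases} D(P\|P^*) & \text{if } \mathbb{E}_P[X]\geq a,\\ +\infty & \text{if } \mathbb{E}_P[X]<a,\end{cases}$$

so minimizing over all $P$ recovers the constrained minimum.)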

Use Sion’s minimax theorem to swap the order of the min and the sup:

$$\min_P \sup_{\lambda>0}\bigl\{D(P\|P^*) - \lambda(\mathbb{E}_P[X]-a)\bigr\} = \sup_{\lambda>0}\inf_P\bigl\{D(P\|P^*) - \lambda(\mathbb{E}_P[X]-a)\bigr\}.$$

Recognize the following convex duality for Kullback-Leibler divergence:

$$\sup_P\,\bigl\{\mathbb{E}_{X\sim P}[\lambda X] - D(P\|P^*)\bigr\} = \mu(\lambda),$$

where $\mu(\lambda) = \log \mathbb{E}_{P^*}[e^{\lambda X}]$ is the cumulant generating function defined above. We get:

$$\sup_{\lambda>0}\inf_P\bigl\{D(P\|P^*) - \lambda(\mathbb{E}_P[X]-a)\bigr\} = \sup_{\lambda>0}\Bigl\{\lambda a - \sup_P\bigl(\mathbb{E}_P[\lambda X] - D(P\|P^*)\bigr)\Bigr\} = \sup_{\lambda>0}\,\bigl\{\lambda a - \mu(\lambda)\bigr\}.$$
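(For completeness: the duality used in the middle step can be verified by introducing the exponentially tilted distribution $P_\lambda(x) = P^*(x)\,e^{\lambda x - \mu(\lambda)}$. A direct calculation gives

$$\mathbb{E}_{X\sim P}[\lambda X] - D(P\|P^*) = \mu(\lambda) - D(P\|P_\lambda) \leq \mu(\lambda),$$

with equality when $P = P_\lambda$, so the supremum over $P$ is exactly $\mu(\lambda)$.)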

Chaining everything together we exactly recover the Cramér-Chernoff theorem, and we see that the upper bounds have exactly the same constants.
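As a numerical sanity check on this equivalence (my own sketch, not from the post; the support points, distribution and threshold are arbitrary, and the constrained minimization uses scipy's SLSQP solver), the following compares the two variational problems for a small finite-valued $X$:

```python
import numpy as np
from scipy.optimize import minimize

# Illustrative setup (arbitrary choices): finite-valued X with support xs,
# distribution P*, and threshold a > E[X].
xs = np.array([0.0, 1.0, 2.0])
p_star = np.array([0.5, 0.3, 0.2])
a = 1.2                                      # E[X] = 0.7 under P*

def kl(p, q):
    p = np.clip(p, 1e-12, 1.0)
    return float(np.sum(p * np.log(p / q)))

# Left-hand side: min_P D(P||P*) over the simplex, subject to E_P[X] >= a.
res = minimize(
    lambda p: kl(p, p_star),
    x0=p_star,
    method="SLSQP",
    bounds=[(1e-12, 1.0)] * len(xs),
    constraints=[
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},   # probabilities sum to 1
        {"type": "ineq", "fun": lambda p: p @ xs - a},      # mean constraint E_P[X] >= a
    ],
)
lhs = res.fun

# Right-hand side: sup over lambda > 0 of lambda*a - mu(lambda), on a grid.
lams = np.linspace(1e-3, 20, 20_000)
mu = np.log(np.exp(np.outer(lams, xs)) @ p_star)            # mu(lambda) on the grid
rhs = np.max(lams * a - mu)

print(f"min_P D(P||P*) s.t. E_P[X] >= a  : {lhs:.5f}")
print(f"sup_lambda {{lambda a - mu(lambda)}} : {rhs:.5f}")
```

The two printed numbers should agree up to optimization and grid error.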

From Cramér-Chernoff to Sanov

One may also view things the other way around. The Cramér-Chernoff theorem bounds the probability that the value of the empirical mean $\frac{1}{n}\sum_{i=1}^n X_i$ lies in the set $A = [a, \infty)$. As discussed in [2, pp. 208–210], both the notion of empirical mean and the set $A$ can be generalized. In particular, one may regard the empirical distribution $P_n$ as the mean of $n$ point masses (i.e. Dirac measures) on the points $X_1,\ldots,X_n$. Van der Vaart then presents Sanov’s theorem as just one instance of such generalized Cramér-Chernoff theorems.

Conclusion

We have seen the close similarities between the Cramér-Chernoff theorem and Sanov’s theorem. For me Sanov’s theorem seems easier to interpret, but in continuous spaces one has to deal with the more complicated measure-theoretic conditions of Csiszár [3]. For technical reasons it may therefore be preferable to use the Cramér-Chernoff result.

Last Words

It turns out that the upper bound in the Cramér-Chernoff theorem does leave some slack, on the order of $1/\sqrt{n}$, which is negligible compared to the term in the exponent.

References

[1] N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games. Cambridge University Press, 2006.
[2] A. W. van der Vaart, Asymptotic Statistics. Cambridge University Press, 1998.
[3] I. Csiszár, “Sanov Property, Generalized I-Projection and a Conditional Limit Theorem,” The Annals of Probability, vol. 12, no. 3, pp. 768–793, 1984.
[4] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. John Wiley & Sons, 2006.
[5] F. den Hollander, Large Deviations, vol. 14. American Mathematical Society, 2000.
