## Which concentration bound to use

2022 Feb 18-19

I want to discuss typical use cases for the **Markov**, **Chebyshev**, **Chernoff**, and **Hoeffding** inequalities.

An inequality cheat sheet: http://www.lkozma.net/inequalities_cheat_sheet/

1. **Markov** inequality (MI): convenient, simple, works for non-negative random variables. Roughly it says "the probability of the random variable being this far away from the average (e.g., 1*average) is less than this (e.g., 1/2)". Intuitively, if the distribution has mass concentrated around the mean, then MI gives a loose bound. In other words, for distributions with thin tails, MI does not give a meaningfully tight bound. Otherwise, MI works fine.
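A quick sanity check of how loose MI can be (a sketch; the exponential distribution with mean 1 and the threshold are illustrative assumptions):

```python
import random

# Markov's inequality: for non-negative X and t > 0, P(X >= t) <= E[X] / t.
# Illustrative example: X ~ Exponential with mean 1, threshold t = 2 * mean,
# so MI predicts P(X >= t) <= 1/2, while the true tail is exp(-2) ~ 0.135.
random.seed(0)
n = 100_000
samples = [random.expovariate(1.0) for _ in range(n)]
mean = sum(samples) / n

t = 2.0 * mean
markov_bound = mean / t                       # = 1/2 by construction
empirical = sum(x >= t for x in samples) / n  # close to exp(-2) ~ 0.135

print(f"Markov bound: {markov_bound:.3f}, empirical tail: {empirical:.3f}")
```

The bound holds but overshoots the true tail by a factor of almost 4, matching the "thin tails make MI loose" intuition above.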

2. **Chebyshev** inequality (CI): simple, follows from MI, uses the variance of the original distribution. Roughly it says "the probability of the random variable being this far away from the average (e.g., 2*standard deviation) is less than this (e.g., 1/2^2)". I did some calculations similar to this post (https://math.stackexchange.com/a/1999727), and the conclusion is that Chebyshev captures the tail to some extent and is therefore tighter than MI at the tail.
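The same kind of comparison for CI (a sketch; the standard Gaussian and k = 2 are illustrative assumptions):

```python
import random
import statistics

# Chebyshev's inequality: P(|X - mu| >= k * sigma) <= 1 / k^2.
# Illustrative example: standard Gaussian samples with k = 2; the bound is
# 1/4, while the true two-sided 2-sigma tail is ~0.0455.
random.seed(1)
n = 100_000
samples = [random.gauss(0.0, 1.0) for _ in range(n)]
mu = statistics.fmean(samples)
sigma = statistics.pstdev(samples)

k = 2.0
chebyshev_bound = 1.0 / k ** 2
empirical = sum(abs(x - mu) >= k * sigma for x in samples) / n

print(f"Chebyshev bound: {chebyshev_bound:.3f}, empirical tail: {empirical:.4f}")
```

Still loose for a Gaussian, but it decays with k^2 rather than k, so it is tighter than MI further out in the tail.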

3. **Chernoff** inequality (CI2): captures the tail (gives an exponentially decaying bound), works for the sum of *independent* random variables of general distributions over [0,1] (can be generalized to **correlated** distributions), and has several forms (e.g., additive, multiplicative). It roughly says "the sample average is not far from the actual average if one chooses the sample size properly". CI2 is useful in estimation, e.g., carrying out experiments to estimate the average with high confidence.

One form of CI2 is agnostic to the hidden variable itself: it only eats the error and gives a suggested sample size. This form basically says "the mass is concentrated around the mean with radius sqrt(n), where n is the sample size".

The additive form is suitable for small error. This form basically says "the mass is concentrated around the mean with radius sqrt(average)".
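The sample-size reading of the additive form can be sketched as follows (the epsilon/delta values and the Bernoulli(0.3) experiment are illustrative assumptions):

```python
import math
import random

# Additive Chernoff-Hoeffding bound for i.i.d. X_i in [0, 1]:
#   P(|sample mean - mu| >= eps) <= 2 * exp(-2 * n * eps^2).
# Setting the right-hand side to delta and solving for n gives a suggested
# sample size that does not depend on the underlying distribution.
def suggested_sample_size(eps: float, delta: float) -> int:
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

eps, delta = 0.05, 0.01  # want |estimate - mean| < 0.05 with 99% confidence
n = suggested_sample_size(eps, delta)

# Illustrative experiment: estimate the mean of a Bernoulli(0.3).
random.seed(2)
p = 0.3
estimate = sum(random.random() < p for _ in range(n)) / n
print(f"n = {n}, estimate = {estimate:.3f} (true mean {p})")
```

Note that `suggested_sample_size` eats only the error and confidence parameters, never the distribution itself, which is the "agnostic" property described above.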

Linearity of expectation, the AM-GM inequality, Jensen's inequality, and Taylor's theorem are useful in the proofs. An important quantity measuring the similarity of two distributions, the **Kullback-Leibler divergence**, is useful here to simplify the expressions.
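For instance, the Chernoff bound for Bernoulli sums is often written with a KL-divergence exponent; a sketch (the n, p, a values are illustrative assumptions):

```python
import math

# KL divergence between Bernoulli(a) and Bernoulli(p), for 0 < a, p < 1.
def kl_bernoulli(a: float, p: float) -> float:
    return a * math.log(a / p) + (1 - a) * math.log((1 - a) / (1 - p))

# Chernoff bound in KL form: for i.i.d. Bernoulli(p) samples and a > p,
#   P(sample mean >= a) <= exp(-n * KL(a || p)).
n, p, a = 100, 0.5, 0.6
bound = math.exp(-n * kl_bernoulli(a, p))
print(f"KL(0.6 || 0.5) = {kl_bernoulli(a, p):.4f}, bound = {bound:.4f}")
```

The exponent is linear in n, which is where the exponential decay of the tail comes from.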

*Related*: **Stirling's approximation** also gives very tight bounds for sums of Bernoulli random variables. But Chernoff can accommodate random variables over [0,1] as long as the average is known. The proof uses Jensen's inequality. The intuition is roughly that when one tries to allocate mass over [0,1] so that a given mean is achieved, the sparsest / least concentrated case is to put all the mass on the two endpoints, 0 and 1, which gives the Bernoulli distribution.

4. **Hoeffding** inequality (HI): generalizes CI2 to random variables over [a_i, b_i]; it is also oblivious (agnostic) to the average. Hoeffding's lemma is useful in the proofs.
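A sketch of the general [a_i, b_i] form of HI (the uniform-on-[0, 2] setup is an illustrative assumption):

```python
import math
import random

# Hoeffding's inequality: for independent X_i in [a_i, b_i],
#   P(|sample mean - mu| >= eps) <= 2 * exp(-2 * (n*eps)^2 / sum((b_i - a_i)^2)).
def hoeffding_bound(eps: float, ranges: list) -> float:
    n = len(ranges)
    denom = sum((b - a) ** 2 for a, b in ranges)
    return 2.0 * math.exp(-2.0 * (n * eps) ** 2 / denom)

# Illustrative setup: 1000 variables uniform on [0, 2], i.e. a_i = 0, b_i = 2.
random.seed(3)
n, eps = 1000, 0.1
ranges = [(0.0, 2.0)] * n
bound = hoeffding_bound(eps, ranges)
sample_mean = sum(random.uniform(0.0, 2.0) for _ in range(n)) / n
print(f"bound = {bound:.4f}, sample mean = {sample_mean:.3f} (true mean 1.0)")
```

The bound only uses the interval widths b_i - a_i, never the averages of the X_i, which is the oblivious property noted above.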