# User:Mneykov/sandbox

## Introduction

Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.

VC theory is related to statistical learning theory and to empirical processes. Richard M. Dudley and Vladimir Vapnik himself, among others, apply VC-theory to empirical processes.

VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory):

• Theory of consistency of learning processes
• What are (necessary and sufficient) conditions for consistency of a learning process based on the empirical risk minimization principle?
• Nonasymptotic theory of the rate of convergence of learning processes
• How fast is the rate of convergence of the learning process?
• Theory of controlling the generalization ability of learning processes
• How can one control the rate of convergence (the generalization ability) of the learning process?
• Theory of constructing learning machines
• How can one construct algorithms that can control the generalization ability?

VC Theory can also be viewed as a major subbranch of statistical learning theory.In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of the VC theory. Several techniques will be introduced that are widely used in the empirical process and VC theory. The discussion is based on the book "Weak Convergence and Empirical Processes: With Applications to Statistics".

## Overview of VC theory in Empirical Processes

### Background on Empirical Processes

$f\mapsto \mathbb {P} _{n}f$ hold. Here $P$ is the underlying true distribution of the data, which is unknown in practice. In the former case the class ${\mathcal {F}}$ is called Glivenko-Cantelli, and in the latter case (under the assumption $\sup _{f\in {\mathcal {F}}}\vert f(x)-Pf\vert <\infty .\forall x$ ) the class ${\mathcal {F}}$ is called Donsker or $P$ -Donsker. Obviously, a Donsker class is Glivenko-Cantelli in probability by an application of Slutsky's theorem .

These statements are true for a single $f$ , by standard LLN, CLT arguments under regularity conditions, and the difficulty in the Empirical Processes comes in because joint statements are being made for all $f\in {\mathcal {F}}$ . Intuitively then, the set ${\mathcal {F}}$ cannot be too large, and as it turns out that the geometry of ${\mathcal {F}}$ plays a very important role.

One way of measuring how big the function set ${\mathcal {F}}$ is to use the so-called covering numbers. The covering number $N(\varepsilon ,{\mathcal {F}},||\cdot ||)$ is the minimal number of balls $\{g:||g-f||<\varepsilon \}$ needed to cover the set ${\mathcal {F}}$ (here it is obviously assumed that there is an underlying norm on ${\mathcal {F}}$ ). The entropy is the logarithm of the covering number.

Two sufficient conditions are provided below, under which it can be proved that the set ${\mathcal {F}}$ is Glivenko-Cantelli or Donsker.

$\sup _{Q}N(\varepsilon ||F||_{Q},{\mathcal {F}},L_{1}(Q))<\infty$ for every $\varepsilon >0$ .

The next condition is a version of the celebrated Dudley's theorem. If ${\mathcal {F}}$ is a class of functions such that

$\int _{0}^{\infty }\sup _{Q}{\sqrt {\log N(\varepsilon ||F||_{Q,2},{\mathcal {F}},L_{2}(Q))}}d\varepsilon <\infty$ ### Symmetrization

The majority of the arguments of how to bound the empirical process, rely on symmetrization, maximal and concentration inequalities and chaining . Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality which is discussed in the next section) it is presented here.

Consider the empirical process:

$f\mapsto (\mathbb {P} _{n}-P)f={\dfrac {1}{n}}\sum _{i=1}^{n}(f(X_{i})-Pf)$ Turns out that there is a connection between the empirical and the following symmetrized process:

$f\mapsto \mathbb {P} _{n}^{0}={\dfrac {1}{n}}\sum _{i=1}^{n}\varepsilon _{i}f(X_{i})$ The symmetrized process is a Rademacher process, conditionally on the data $X_{i}$ . Therefore it is a sub-Gaussian process by Hoeffding's inequality.

$\mathbb {E} \Phi (||\mathbb {P} _{n}-P||_{\mathcal {F}})\leq \mathbb {E} \Phi (2||\mathbb {P} _{n}^{0}||_{\mathcal {F}})$ The proof of the Symmetrization lemma relies on introducing independent copies of the original variables $X_{i}$ (sometimes referred to as a ghost sample) and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality different signs could be introduced (hence the name symmetrization) without changing the expectation. The proof can be found below because of its instructive nature.

A typical way of proving empirical CLTs, first uses symmetrization to pass the empirical process to $\mathbb {P} _{n}^{0}$ and then argue conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.

### VC Connection

It turns out that there is a fascinating connection between certain combinatorial properties of the set ${\mathcal {F}}$ and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik-Cervonenkis classes of sets - or shortly VC sets.

$\max _{x_{1},\ldots ,x_{n}}\Delta _{n}({\mathcal {C}},x_{1},\ldots ,x_{n})\leq \sum _{j=0}^{V({\mathcal {C}})-1}{n \choose j}\leq \left({\frac {ne}{V({\mathcal {C}})-1}}\right)^{V({\mathcal {C}})-1}$ Which is a polynomial number $O(n^{V({\mathcal {C}})-1})$ of subsets rather than an exponential number. Intuitively this means that a finite VC-index implies that ${\mathcal {C}}$ has an apparent simplistic structure.

A similar bound can be shown (with a different constant, same rate) for the so called VC subgraph classes. For a function $f:{\mathcal {X}}\mapsto \mathbb {R}$ the subgraph is a subset of ${\mathcal {X}}\times \mathbb {R}$ such that: $\{(x,t):t . A collection of ${\mathcal {F}}$ is called a VC subgraph class if all subgraphs form a VC-class.

$N(\varepsilon ,{\mathcal {I}}_{\mathcal {C}},L_{r}(Q))\leq KV({\mathcal {C}})(4e)^{V({\mathcal {C}})}\left({\dfrac {1}{\varepsilon }}\right)^{r(V({\mathcal {C}})-1)}$ $N(\varepsilon ||F||_{Q,2},{\mathcal {F}},L_{2}(Q))\leq C\left({\dfrac {1}{\varepsilon }}\right)^{V}$ the following is valid for the convex hull of ${\mathcal {F}}$ :

$\log N(\varepsilon ||F||_{Q,2},\operatorname {sconv} {\mathcal {F}},L_{2}(Q))\leq K\left({\dfrac {1}{\varepsilon }}\right)^{\frac {2V}{V+2}}$ The important consequence of this fact is that the power of $1/\varepsilon$ -- $2V/(V+2)$ is strictly less than 2, which is just enough so that the the entropy integral is going to converge, and therefore the class $\operatorname {sconv} {\mathcal {F}}$ is going to be $P$ -Donsker.

Finally an example of a VC-subgraph class is considered. Any finite-dimensional vector space ${\mathcal {F}}$ of measurable functions $f:{\mathcal {X}}\mapsto \mathbb {R}$ is VC-subgraph of index smaller than or equal to $\dim({\mathcal {F}})+2$ .

There are generalizations of the notion VC subgraph class, e.g. there is the notion of pseudo-dimension. The interested reader can look into.

## VC Inequality

A similar setting is considered, which is more common to machine learning. Let ${\mathcal {X}}$ is a feature space and ${\mathcal {Y}}=\{0,1\}$ . A function $f:{\mathcal {X}}\mapsto {\mathcal {Y}}$ is called a classifier. Let ${\mathcal {F}}$ be a set of classifiers. Similarly to the previous section, define the shattering coefficient $S({\mathcal {F}},n)=\max _{x_{1},\ldots ,x_{n}}|\{(f(x_{1}),\ldots ,f(x_{n})),f\in {\mathcal {F}}\}|$ . The shattering coefficient is also known as growth function. Note here that there is a 1-1 mapping between each of the functions in ${\mathcal {F}}$ and the set on which the function is 1. Therefore in terms of the previous section the shattering coefficient is precisely $\max _{x_{1},\ldots ,x_{n}}\Delta _{n}({\mathcal {C}},x_{1},\ldots ,x_{n})$ for ${\mathcal {C}}$ being the collection of all sets described above. Now for the same reasoning as before, namely using Sauer's Lemma it can be shown that $S({\mathcal {F}},n)$ is going to be polynomial in $n$ provided that the class ${\mathcal {F}}$ has a finite VC-dimension or equivalently the collection ${\mathcal {C}}$ has finite VC-index.

Let $D_{n}=\{(X_{1},Y_{1}),\ldots ,(X_{n},Y_{m})\}$ is an observed dataset. Assume that the data is generated by an unknown probability distribution $P_{XY}$ . Define $R(f)=P(f(X)\neq Y)$ to be the expected 0/1 loss. Of course since $P_{XY}$ is unknown in general, one has no access to $R(f)$ . However the empirical risk, given by:

${\hat {R}}_{n}(f)={\dfrac {1}{n}}\sum _{i=1}^{n}\mathbb {I} (f(X_{n})\neq Y_{n})$ can certainly be evaluated. Then one has the following Theorem:

Theorem (VC Inequality) For binary classification and the 0/1 loss function we have the following generalization bounds:

$P\left(\sup _{f\in {\mathcal {F}}}|{\hat {R}}_{n}(f)-R(f)|>\varepsilon \right)\leq 8S({\mathcal {F}},n)e^{-n\varepsilon ^{2}/32}$ and

$\mathbb {E} \left[\sup _{f\in {\mathcal {F}}}|{\hat {R}}_{n}(f)-R(f)|\right]\leq 2{\sqrt {\dfrac {\log S({\mathcal {F}},n)+\log 2}{n}}}$ In words the VC inequality is saying that as the sample increases, provided that ${\mathcal {F}}$ has a finite VC dimension, the empirical 0/1 risk becomes a good proxy for the expecred 0/1 risk. Note that both RHS of the two inequalities will converge to 0, provided that $S({\mathcal {F}},n)$ grows polynomially in $n$ .

The connection between this framework and the Empirical Process framework is evident. Here one is dealing with a modified empirical process $|{\hat {R}}_{n}-R|_{\mathcal {F}}$ but not surprisingly the ideas are the same. The proof of the (first part of) VC inequality, relies on symmetrization, and then argue conditionally on the data using concentration inequalities (in particular Hoeffding's inequality). The interested reader can check the book  Theorems 12.4 and 12.5.