This is a review of work by Ohad Shamir et al. on the difficulty of learning parity with gradient descent. The papers Distribution-Specific Hardness of Learning Neural Networks [1] and Failures of Gradient-Based Deep Learning [2] cover much more than the hardness of learning parity, so this is a slimmed-down review containing only the parity-difficulty part.


In SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver [3], Wang et al. design a continuous relaxation of a SAT solver. Their solver takes as input a CNF formula $S$ and a partial assignment $I$ to $S$'s variables, and outputs an assignment to the remaining variables, $\setminus I$, that maximizes the number of satisfied clauses in $S$. Because the solver is continuous, given target values for both $I$ and $\setminus I$, one can take the derivative of the difference between the solver's output on $I$ and the target outputs ($\setminus I$) with respect to $S$, and use gradient descent to find a CNF formula whose satisfiability matches the behavior in the target set.

For example, [3] generates many example input/output assignments indicating whether the number of true variables in the input is even or odd, and uses gradient descent to find a value for $S$ with this behavior, namely a CNF formula for the parity function. This is similar in spirit to how Moitti et al. use a continuous relaxation of a 2D cellular automaton to learn the discrete ruleset for Conway's Game of Life.

I think this is pretty incredible, but most interesting to me was that [3] cited [2] to invoke a theoretical difficulty in learning parity, saying "Learning parity functions from such single-bit supervision is known to pose difficulties for conventional deep learning approaches." At this point I had read neither [1] nor [2], but had read Intractability of Learning the Discrete Logarithm with Gradient-Based Methods [4] by Takhanov et al., which shows the discrete logarithm is difficult to learn via gradient descent in the same sense that [2] says parity is hard. So, a wrinkle in the literature: [3] claims to synthesize parity with gradient descent despite [4]'s claim that such a thing should be intractable.

The purpose of this essay is to smooth this out. In short: [3]'s success learning parity does not contradict [2]'s claim that it is difficult, and [2] doesn't prove learning parity is difficult in the colloquial sense. In particular, [2] considers a "family" of parity functions, each of which samples a different subset of a length-$n$ input and outputs whether the number of $1$s in the sample is even or odd. Then, [2] shows that when the target could be any of these parity functions, there is an exponential-in-$n$ run of gradient descent that converges to a value independent of the target function (i.e. gradient descent learns nothing about the target function). On the other hand, [3] learns parity with an inductive bias by weaving the input through their model. For example, on a length-$4$ input $x_1,\dots,x_4 \in [2]$, [3]'s model ($m$ below) outputs

$$m(x_4, m(x_3, m(x_2,x_1))).$$

By inspection, the only two elements of the family of parity functions this can learn are the trivial case, when the subset of bits considered is empty, and full parity. So while [2] predicts the existence of a worst-case-exponential run, practically speaking, because there are only $16$ functions from $[2]^2\rightarrow[2]$, the constants work out in [3]'s favor.

If this introduction has piqued your curiosity about how [2] analyzes parity difficulty, read on. I think it is a lovely bit of math.

Background

Let's start by considering the worst-case complexity of learning a parity function from examples. The parity functions in this piece take as input a bit string, pick a subset of the bits, then return $-1$ if the number of $1$s in the subset is odd and $1$ otherwise. Formally, the family of parity functions is

$$\mathcal{H}\coloneqq\{x\mapsto (-1)^{\langle v, x\rangle}\mid v\in [2]^n\},$$

where $\langle v, x\rangle\coloneqq\Sigma_{i=1}^nv_ix_i$ is the inner product and $n$ is the input length.
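To make the family concrete, here is a minimal Python sketch (the helper names `parity` and `sample_examples` are mine, not from [1]–[3]) that builds a member of $\mathcal{H}$ from a mask $v$ and labels uniformly random examples with it:

```python
import itertools
import random

def parity(v):
    """Return the parity function x -> (-1)^<v, x> for a mask v in {0,1}^n."""
    return lambda x: (-1) ** sum(vi * xi for vi, xi in zip(v, x))

def sample_examples(h, n, t, rng=random):
    """Draw t uniform bit strings of length n and label them with h."""
    xs = [tuple(rng.randint(0, 1) for _ in range(n)) for _ in range(t)]
    return [(x, h(x)) for x in xs]

n = 4
v = (1, 0, 1, 1)                 # the (unknown) subset of bits
h = parity(v)
print(sample_examples(h, n, t=3))

# The full family H has one member per mask, 2^n in total.
family = [parity(u) for u in itertools.product([0, 1], repeat=n)]
print(len(family))               # 16
```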

Our task is, given an unknown $h\in\mathcal{H}$ and a set of examples $\{(x^{(i)},h(x^{(i)}))\mid 0\leq i < t\}$, to determine the value of $h$. In the worst case, this takes $t>2^{n-1}$ examples. To see why, let $x^{(i)}_j$ denote the $j$th element of $x^{(i)}$ and consider the case where $\exists j\forall i,\ x^{(i)}_j=0$. In this case, letting $h=x\mapsto(-1)^{\langle v, x\rangle}$ and $v'$ be the element of $[2]^n$ satisfying $v'_k\neq v_k\leftrightarrow j=k$, for $h'=x\mapsto(-1)^{\langle v', x\rangle}$ we have

$$\begin{aligned}
\forall i,\quad h'(x^{(i)}) &= (-1)^{\Sigma_k v'_k x^{(i)}_k} \\
&= (-1)^{v'_j x^{(i)}_j+\Sigma_{k\neq j} v'_k x^{(i)}_k} \\
&= (-1)^{0+\Sigma_{k\neq j} v_k x^{(i)}_k} \\
&= (-1)^{v_j x^{(i)}_j+\Sigma_{k\neq j} v_k x^{(i)}_k} \\
&= h(x^{(i)}).
\end{aligned}$$

As $h$ and $h'$ output the same value on every example input, it is ambiguous which is the true function based on our examples. There is a set of examples of size $2^{n-1}$ satisfying $\exists j\forall i,\ x^{(i)}_j=0$, so in the worst case learning $h$ takes $t>2^{n-1}$ examples.
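A quick numeric check of this ambiguity argument (a sketch; the particular $n$, $j$, and $v$ are arbitrary choices of mine):

```python
import itertools

def parity(v):
    return lambda x: (-1) ** sum(vi * xi for vi, xi in zip(v, x))

n, j = 4, 2
v = (1, 0, 1, 1)
v_prime = tuple(vi ^ (k == j) for k, vi in enumerate(v))   # flip only bit j

h, h_prime = parity(v), parity(v_prime)

# Every input whose j-th bit is 0 gets the same label from h and h', so a
# training set drawn entirely from these inputs cannot distinguish v from v'.
ambiguous = [x for x in itertools.product([0, 1], repeat=n) if x[j] == 0]
assert all(h(x) == h_prime(x) for x in ambiguous)
print(len(ambiguous))   # 8 == 2**(n-1)
```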

Shamir's work shows that, even when the examples are drawn from a nice distribution (i.e. one without the $\exists j\forall i,\ x^{(i)}_j=0$ property), gradient descent can take an exponential (in $n$) number of steps to learn some $h\in\mathcal{H}$. In my view, the star of the show is Theorem 1, which establishes that when the gradient of the loss is similar for all elements of $\mathcal{H}$, the gradient carries little information about which element of $\mathcal{H}$ should actually be learned. This small signal is lost in the noise from finite-precision arithmetic and sampling.

Gradient Descent

In the typical machine learning setting, for some target function $h:\mathbb{R}^n\rightarrow\mathbb{R}^m$ and neural network architecture $p_w$ parameterized by weights $w\in\mathbb{R}^{|w|}$, we'd like to compute

$$\min_wF_h(w)\coloneqq\mathbb{E}_x(\frac{1}{2}\|h(x)-p_w(x)\|^2),$$

where $F_h(w)$ is the expected loss given a choice of $w$.

One approach is to use a variation of gradient descent. This starts by selecting an initial value for $w$, call it $w^{(0)}$, then proceeds to iteratively update the weights according to the formula

$$w^{(i+1)}\coloneqq w^{(i)}-\eta(\frac{\partial}{\partial w}F_h)(w^{(i)}),$$

where $\eta$ is the learning rate. Intuitively, this works because $(\frac{\partial}{\partial w}F_h)(w^{(i)})$ is an element of $\mathbb{R}^{|w|}$ pointing in the direction of steepest increase of $\mathbb{E}_x\|h(x)-p_w(x)\|^2$, i.e. the loss. By subtracting a small multiple ($\eta$) of this value from $w^{(i)}$, we move the weights in a direction that decreases the loss.

In practice, computing $\mathbb{E}_x\|h(x)-p_w(x)\|^2$, and in turn $\frac{\partial}{\partial w}F_h$, is computationally infeasible, as $x$'s distribution is unknown. As such, the standard approach is to sample $x_1,x_2,\dots,x_t$ and approximate $F_h(w)$ as

$$F_h(w) \approx \frac{1}{2t}\Sigma_i \|h(x_i)-p_w(x_i)\|^2.$$

$\frac{\partial}{\partial w}F_h$ can be approximated in turn, and gradient descent run using the approximation.
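Here is a minimal numpy sketch of this sampled-gradient loop on the parity problem. Everything in it (the tiny one-hidden-layer network, the sample sizes, and the finite-difference gradient estimate used in place of backpropagation for brevity) is my own choice for illustration, not a setup from [1]–[4]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden = 6, 8
v = rng.integers(0, 2, size=n)                    # mask of the (unknown) target parity

def h(X):                                         # h(x) = (-1)^<v, x>, vectorized over rows
    return (-1.0) ** (X @ v)

def p(w, X):                                      # tiny one-hidden-layer tanh network
    W1 = w[: hidden * n].reshape(hidden, n)
    w2 = w[hidden * n :]
    return np.tanh(X @ W1.T) @ w2

def sampled_loss(w, X):                           # (1/2t) sum_i (h(x_i) - p_w(x_i))^2
    return 0.5 * np.mean((h(X) - p(w, X)) ** 2)

def approx_grad(w, X, delta=1e-4):                # central finite-difference estimate
    g = np.zeros_like(w)
    for k in range(w.size):
        e = np.zeros_like(w)
        e[k] = delta
        g[k] = (sampled_loss(w + e, X) - sampled_loss(w - e, X)) / (2 * delta)
    return g

w = rng.normal(scale=0.1, size=hidden * n + hidden)
eta, t = 0.5, 256
for step in range(200):
    X = rng.integers(0, 2, size=(t, n))           # fresh sample x_1, ..., x_t
    w = w - eta * approx_grad(w, X)               # w^(i+1) = w^(i) - eta * (approx gradient)
print(sampled_loss(w, rng.integers(0, 2, size=(t, n))))
```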

Approximate Gradient Descent

We'll model this approximation with two definitions: approximate gradient oracles capture the error involved in computing $\frac{\partial}{\partial w}F_h$, and approximate gradient-based methods use these approximations.

An Approximate Gradient Oracle is a function $O_{F_h,\epsilon}:\mathbb{R}^{|w|}\rightarrow \mathbb{R}^{|w|}$ satisfying

$$\forall w,\ \|O_{F_h,\epsilon}(w) - \frac{\partial}{\partial w}F_h(w)\| \leq \epsilon.$$

An Approximate Gradient-Based Method is an algorithm that generates an initial guess $w^{(0)}$, then decides $w^{(i+1)}$ based on responses from an approximate gradient oracle.

$$w^{(i+1)} = f(w^{(0)}, \{O_{F_h,\epsilon}(w^{(j)}) \mid j \leq i\})$$
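To make the abstraction concrete, here is a small Python sketch (my own illustration; the names `noisy_oracle` and `gradient_descent` are not from the papers) showing that plain gradient descent driven by any oracle with error at most $\epsilon$ fits the definition:

```python
import numpy as np

def noisy_oracle(grad_F_h, epsilon, rng=np.random.default_rng(0)):
    """An approximate gradient oracle: the true gradient perturbed by at most
    epsilon in Euclidean norm."""
    def O(w):
        g = grad_F_h(w)
        noise = rng.normal(size=g.shape)
        return g + epsilon * noise / max(np.linalg.norm(noise), 1.0)
    return O

def gradient_descent(w0, oracle, eta=0.1, steps=100):
    """Plain gradient descent is one approximate gradient-based method: each
    w^(i+1) is a function of w^(0) and the oracle's responses so far."""
    w = w0
    for _ in range(steps):
        w = w - eta * oracle(w)
    return w

# Example: minimize F(w) = ||w||^2 / 2, whose exact gradient is w.
print(gradient_descent(np.ones(3), noisy_oracle(lambda w: w, epsilon=0.01)))
```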

Parity Hardness

Now, consider a family of functions $\mathcal{H}=\{h_1,h_2,\dots\}$ and the variance of the gradient at $w$ as $h$ ranges over $\mathcal{H}$.

$$\text{Var}_{h\in\mathcal{H}}(\frac{\partial}{\partial w}F_h(w))\coloneqq \mathbb{E}_h\| (\frac{\partial}{\partial w}F_h)(w) - \mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w)) \|^2$$

To show parity is difficult to learn, we'll show that when $\mathcal{H}$ is the family of parity functions, this variance is exponentially small in the length of the parity function's inputs. In turn, an adversarial approximate gradient oracle can repeatedly return $\mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w))$ instead of $(\frac{\partial}{\partial w}F_h)(w)$ while staying within its $\epsilon$ error tolerance. Because $\mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w))$ is independent of the $h\in\mathcal{H}$ being learned, an approximate gradient-based method using this adversarial oracle can converge to a value independent of the target function, $h$, unless it takes an exponentially large number of steps.

Theorem 1 (DSHLNN Theorem 10). For a family of functions $\mathcal{H}$, if

$$\forall w,\ \text{Var}_{h\in\mathcal{H}}((\frac{\partial}{\partial w}F_h)(w)) \leq \epsilon^3$$

then for any approximate gradient-based method learning $h\in\mathcal{H}$, there is a run such that the value of $w^{(\lfloor \epsilon^{-1}\rfloor)}$ is independent of $h$.

Proof. By Chebyshev's inequality and the hypothesis, we have

$$\begin{aligned}
\forall w, \quad &\mathbb{P}_h(\|(\frac{\partial}{\partial w}F_h)(w)-\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))\| > \epsilon) \\
&\leq \text{Var}_h((\frac{\partial}{\partial w}F_h)(w))/\epsilon^2 \\
&\leq \epsilon.
\end{aligned}$$

So, consider the adversarial approximate gradient oracle

$$O_{F_h,\epsilon}(w) \coloneqq \begin{cases} \mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w)) &\text{if }\|(\frac{\partial}{\partial w}F_h)(w)-\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))\| \leq \epsilon \\ (\frac{\partial}{\partial w}F_h)(w) &\text{otherwise.} \end{cases}$$

Using our Chebyshev inequality, we can bound the likelihood of $O_{F_h,\epsilon}$'s $\text{otherwise}$ case.

$$\begin{aligned}
\forall w, \quad &\mathbb{P}_h(O_{F_h,\epsilon}(w)=(\frac{\partial}{\partial w}F_h)(w)) \\
&= \mathbb{P}_h(\|(\frac{\partial}{\partial w}F_h)(w)-\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))\| > \epsilon) \\
&\leq \epsilon.
\end{aligned}$$

Because $\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))$ is independent of which $h$ is being learned, the inequality above bounds the likelihood that $O_{F_h,\epsilon}(w)$ depends on $h$.

Now, for any approximate gradient-based method learning $h\in\mathcal{H}$, $w^{(0)}$ is independent of $h$, as the oracle has not yet been queried when it is chosen. As

$$w^{(1)}=f(w^{(0)}, O_{F_h,\epsilon}(w^{(0)})),$$

evidently, $w^{(1)}$ depends on the $h$ being learned only if $O_{F_h,\epsilon}(w^{(0)})$ does, and, per the inequality above, the likelihood of this is $\leq\epsilon$. Repeating this argument, let $A^{(i)}$ be the event that $O_{F_h,\epsilon}(w^{(i)})$ depends on $h$. We have $\mathbb{P}(A^{(i)})\leq\epsilon$, so by the union bound

$$\begin{aligned}
\mathbb{P}(\bigvee_{i=1}^IA^{(i)})&\leq\Sigma_{i=1}^I\mathbb{P}(A^{(i)})\\
&\leq I\epsilon.
\end{aligned}$$

If $\mathbb{P}(\bigvee_{i=1}^IA^{(i)})<1$, then there is an $I$-step run of our gradient-based method where $w^{(I)}$ is independent of the target function, $h$. Solving for $I$ using the inequality above gives the desired bound: if $I<1/\epsilon$, then there is a run of the gradient-based method where $w^{(\lfloor \epsilon^{-1} \rfloor)}$ is independent of $h$ (I am simplifying somewhat here because, in the case we're interested in, $\epsilon^{-1}$ will not be an integer and flooring gives the strict $<$ inequality we want, but if you're feeling picky, $I=\lceil \epsilon^{-1}\rceil -1$ will do).

$\blacksquare$
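To see the construction in action, here is a toy Python instantiation of the adversarial oracle; the tiny linear "network", the choice of $\epsilon$, and all names are my own simplifications for illustration, not Shamir's construction:

```python
import itertools
import numpy as np

n = 8
masks = np.array(list(itertools.product([0, 1], repeat=n)))             # all v in {0,1}^n
X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)    # all inputs
labels = (-1.0) ** (X @ masks.T)                                        # labels[x, h] = h(x)

def grad_F(w, h_idx):
    # Gradient of F_h(w) = E_x (1/2)(<w, x> - h(x))^2 for a linear predictor p_w(x) = <w, x>.
    residual = X @ w - labels[:, h_idx]
    return (X * residual[:, None]).mean(axis=0)

def mean_grad(w):
    # The gradient averaged over every h in the family; independent of the target.
    return np.mean([grad_F(w, j) for j in range(masks.shape[0])], axis=0)

def adversarial_oracle(w, h_idx, eps):
    # Answer with the family-averaged gradient whenever doing so stays within eps.
    g, g_bar = grad_F(w, h_idx), mean_grad(w)
    return g_bar if np.linalg.norm(g - g_bar) <= eps else g

# Run gradient descent against the oracle for two different targets; for these
# targets the per-target gradients sit within eps of the mean at every w, so the
# two trajectories coincide, i.e. the iterates carry no information about which
# parity function is being learned.
eps, eta = 0.5, 0.1
for h_idx in (3, 200):
    w = np.zeros(n)
    for _ in range(20):
        w = w - eta * adversarial_oracle(w, h_idx, eps)
    print(h_idx, np.round(w, 4))
```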

Lemma 1.

$$\Sigma_{x\in[n]^d}\Pi_{i=1}^d f_i(x_i) = \Pi_{i=1}^d\Sigma_{x\in[n]}f_i(x)$$

Proof. Shamir wordlessly invokes this, but it took me several hours on an airplane and help from ChatGPT to see. By induction on $d$. When $d=2$,

$$\begin{aligned}
\Sigma_{x\in[n]^2}\Pi_{i=1}^2 f_i(x_i) &= \Sigma_{x\in[n]^2}f_1(x_1)f_2(x_2) \\
&= \Sigma_{x_1\in[n]}\Sigma_{x_2\in[n]}f_1(x_1)f_2(x_2) \\
&= \Sigma_{x_1\in[n]}f_1(x_1)(\Sigma_{x_2\in[n]}f_2(x_2)) \\
&= (\Sigma_{x_1\in[n]}f_1(x_1))(\Sigma_{x_2\in[n]}f_2(x_2)) \\
&= \Pi_{i=1}^2\Sigma_{x\in[n]}f_i(x).
\end{aligned}$$

For the inductive step,

$$\begin{aligned}
\Sigma_{x\in[n]^d}\Pi_{i=1}^df_i(x_i) &= \Sigma_{x_1\in[n]}\Sigma_{x_2\in[n]}\dots\Sigma_{x_d\in[n]}\Pi_{i=1}^df_i(x_i) \\
&= \Sigma_{x_d\in[n]}(\underbrace{\Sigma_{x_1\in[n]}\dots\Sigma_{x_{d-1}\in[n]}\Pi_{i=1}^{d-1}f_i(x_i)}_{\text{inductive hypothesis}})f_d(x_d) \\
&= \Sigma_{x_d\in[n]}(\Pi_{i=1}^{d-1}\Sigma_{x\in[n]}f_i(x))f_d(x_d) \\
&= (\Pi_{i=1}^{d-1}\Sigma_{x\in[n]}f_i(x))(\Sigma_{x_d\in[n]}f_d(x_d)) \\
&= \Pi_{i=1}^d\Sigma_{x\in[n]}f_i(x).
\end{aligned}$$

$\blacksquare$
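A quick numeric sanity check of Lemma 1 in Python (a sketch; the particular $n$, $d$, and randomly chosen $f_i$ are mine):

```python
import itertools
import math
import random

random.seed(0)
n, d = 3, 4
# d arbitrary functions f_i : [n] -> R, represented as lookup tables.
f = [[random.uniform(-1, 1) for _ in range(n)] for _ in range(d)]

lhs = sum(math.prod(f[i][x[i]] for i in range(d))
          for x in itertools.product(range(n), repeat=d))
rhs = math.prod(sum(f[i][x] for x in range(n)) for i in range(d))
assert math.isclose(lhs, rhs, abs_tol=1e-9)
print(lhs, rhs)
```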

Lemma 2. For the family of parity functions, $\mathcal{H}$, and any $h,h'\in\mathcal{H}$,

$$\mathbb{E}_x(h(x)h'(x))=\begin{cases} 1 & (h=h') \\ 0 & (h\neq h')\end{cases}$$

Proof.

$$\begin{aligned}
\mathbb{E}_x(h(x)h'(x)) &= \mathbb{E}_x((-1)^{\langle v,x\rangle}(-1)^{\langle v',x\rangle}) \\
&= \mathbb{E}_x((-1)^{\langle v+v',x\rangle}) \\
&= \mathbb{E}_x(\Pi_{i=1}^n (-1)^{(v_i+v'_i)x_i}) \\
&= \frac{1}{2^n}\Sigma_{x\in[2]^n}\Pi_{i=1}^n (-1)^{(v_i+v'_i)x_i} \\
&= \frac{1}{2^n}\Pi_{i=1}^n\Sigma_{x\in[2]}(-1)^{(v_i+v'_i)x} \quad (\text{Lemma }1) \\
&= \frac{1}{2^n}\Pi_{i=1}^n2\frac{1}{2}\Sigma_{x\in[2]}(-1)^{(v_i+v'_i)x} \\
&= \Pi_{i=1}^n\mathbb{E}_{x_i}((-1)^{(v_i+v'_i)x_i}) \\
&= \Pi_{i=1}^n((-1)^{0\cdot(v_i+v'_i)} + (-1)^{1\cdot(v_i+v'_i)})/2 \\
&= \Pi_{i=1}^n(1 + (-1)^{v_i+v'_i})/2.
\end{aligned}$$

If $h=h'$, every $v_i+v'_i$ is even, so each factor is $1$ and the product is $1$; if $h\neq h'$, some $v_i+v'_i$ is odd, making that factor $0$ and the whole product $0$.

$\blacksquare$
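Lemma 2 says the parity functions are orthonormal under $\mathbb{E}_x(\cdot\,\cdot)$; a brute-force numpy check (a sketch with a small $n$ of my choosing):

```python
import itertools
import numpy as np

n = 5
X = np.array(list(itertools.product([0, 1], repeat=n)))   # all 2^n inputs
V = np.array(list(itertools.product([0, 1], repeat=n)))   # all 2^n masks
H = (-1.0) ** (X @ V.T)                                    # H[x, h] = h(x)

# E_x[h(x) h'(x)] for every pair (h, h'): should be the identity matrix.
gram = (H.T @ H) / len(X)
assert np.allclose(gram, np.eye(len(V)))
print(gram.shape)   # (32, 32)
```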

Lemma 3. For any $n$, $m$, and $f:\mathbb{R}^n\rightarrow\mathbb{R}^m$,

$$\forall a\in\mathbb{R}^m,\ \mathbb{E}_x\|f(x)-\mathbb{E}_x(f(x))\|^2 \leq \mathbb{E}_x\|f(x)-a\|^2.$$

Proof. Notice $g(a)\coloneqq\mathbb{E}_x\|f(x)-a\|^2$ is (strictly) convex, so the value of $a$ satisfying $(\frac{\partial}{\partial a}g)(a)=0$ is a global minimum. Computing this derivative and solving gives $a=\mathbb{E}_x(f(x))$, as desired. This corresponds to the intuitive idea that the mean minimizes the expected squared distance to all other values.
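For completeness, the derivative computation is one line:

$$(\frac{\partial}{\partial a}g)(a) = \mathbb{E}_x(2(a-f(x))) = 2(a-\mathbb{E}_x(f(x))),$$

which is zero exactly when $a=\mathbb{E}_x(f(x))$.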

$\blacksquare$

Lemma 4 (Bessel's inequality). For a Hilbert space $H$ and finite set $\{e_i\mid i\in I\}$ satisfying

$$\langle e_i,e_j\rangle=\begin{cases}1 &(i=j)\\0 &(i\neq j)\end{cases},$$

i.e. $\{e_i\mid i\in I\}$ is an orthonormal family, we have

$$\forall x\in H,\ \Sigma_{i\in I}\|\langle x, e_i\rangle\|^2\leq\|x\|^2.$$

Proof. Intuitively, this says that projecting $x$ onto an orthonormal family won't increase its size. Let $a_i$ be $\langle x, e_i\rangle$ and $y$ be $\Sigma_{i\in I}a_i e_i$, i.e. $y$ is the projection of $x$ onto the span of $\{e_i\mid i\in I\}$, and $a_i$ is the component of $x$ lying on the $i$th element of the family. Now, let the residual $r$ be $x-y$ and note $r$ is perpendicular to each $e_i$:

$$\begin{aligned}
\langle r, e_i\rangle &= \langle x, e_i\rangle - \langle y, e_i\rangle \\
&= a_i - \Sigma_{j\in I}a_j\langle e_j, e_i\rangle \\
&= a_i - a_i \\
&= 0.
\end{aligned}$$

In turn, $r$ is perpendicular to $y$,

$$\langle r, y\rangle = \Sigma_{i\in I}a_i\langle r, e_i\rangle = 0.$$

So, because $x=y+r$ and $y$ and $r$ are perpendicular, by Pythagoras',

$$\begin{aligned}
\|x\|^2&=\|y+r\|^2 \\
&=\|y\|^2+\|r\|^2 \\
&\geq \|y\|^2 \\
&= \langle \Sigma_{i\in I} a_ie_i, \Sigma_{i\in I} a_ie_i\rangle \\
&= \Sigma_{i\in I}a_i^2 \\
&= \Sigma_{i\in I}\|\langle x, e_i\rangle\|^2.
\end{aligned}$$

$\blacksquare$
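A quick numeric check of Bessel's inequality (a sketch; the orthonormal family here comes from a QR decomposition, my choice, not anything in the text):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 6, 3                                    # ambient dimension, family size
E, _ = np.linalg.qr(rng.normal(size=(m, k)))   # columns of E are an orthonormal family

x = rng.normal(size=m)
coeffs = E.T @ x                               # a_i = <x, e_i>
assert np.sum(coeffs ** 2) <= np.dot(x, x) + 1e-12
print(np.sum(coeffs ** 2), np.dot(x, x))
```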

Theorem 2 (FGBDL Theorem 1). Let $\mathcal{H}$ be the family of parity functions and let $p_w$ and $F_h$ be a neural network and MSE loss function, as before. If $p_w$ satisfies

$$\forall w,\ \mathbb{E}_x\|(\frac{\partial}{\partial w}p_w)(x)\|^2\leq G(w)^2$$

for some scalar $G(w)$, then

$$\text{Var}_h((\frac{\partial}{\partial w}F_h)(w))\leq\frac{G(w)^2}{|\mathcal{H}|}.$$

Proof. First, invoking Lemma 3 with $a=\mathbb{E}_{x}(p_w(x)(\frac{\partial}{\partial w}p_w)(x))$, we have

$$\begin{aligned}
&\text{Var}_h((\frac{\partial}{\partial w}F_h)(w)) \\
&= \mathbb{E}_h\| (\frac{\partial}{\partial w}F_h)(w) - \mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w)) \|^2 \\
&\leq \mathbb{E}_h\| (\frac{\partial}{\partial w}F_h)(w) - \mathbb{E}_{x}(p_w(x)(\frac{\partial}{\partial w}p_w)(x)) \|^2 \\
&= \mathbb{E}_h\|\mathbb{E}_x((p_w(x)-h(x))(\frac{\partial}{\partial w}p_w)(x)) - \mathbb{E}_{x}(p_w(x)(\frac{\partial}{\partial w}p_w)(x))\|^2 \\
&= \mathbb{E}_h\|\mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)(x))\|^2.
\end{aligned}$$

Next, note

$$\begin{aligned}
\mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)(x))&=\frac{1}{2^n}\Sigma_{x\in[2]^n}h(x)(\frac{\partial}{\partial w}p_w)(x) \\
&= \frac{1}{2^n}\langle h, \frac{\partial}{\partial w}p_w\rangle.
\end{aligned}$$

So, using Lemma 2 to invoke Lemma 4 we get the desired bound.

$$\begin{aligned}
&\mathbb{E}_h\|\mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)(x))\|^2 \\
&= \mathbb{E}_h\Sigma_{i=1}^{|w|}\mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)_i(x))^2 \\
&= \Sigma_{i=1}^{|w|}\mathbb{E}_h((\frac{1}{2^n}\langle h, (\frac{\partial}{\partial w}p_w)_i\rangle)^2) \\
&= \Sigma_{i=1}^{|w|}(\frac{1}{2^n})^2\frac{1}{|\mathcal{H}|}\Sigma_{h\in\mathcal{H}}\langle h, (\frac{\partial}{\partial w}p_w)_i\rangle^2 \\
&\leq \Sigma_{i=1}^{|w|}\frac{1}{2^n}\frac{1}{|\mathcal{H}|}\|(\frac{\partial}{\partial w}p_w)_i\|^2 \quad \text{(Bessel's inequality, with orthonormal family }\{h/2^{n/2}\}\text{ per Lemma 2)} \\
&= \Sigma_{i=1}^{|w|}\frac{1}{|\mathcal{H}|}\mathbb{E}_x\|(\frac{\partial}{\partial w}p_w)_i(x)\|^2 \\
&= \frac{1}{|\mathcal{H}|}\mathbb{E}_x\|(\frac{\partial}{\partial w}p_w)(x)\|^2 \\
&\leq \frac{G(w)^2}{|\mathcal{H}|}.
\end{aligned}$$

$\blacksquare$
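Theorem 2's bound is easy to check numerically for a toy model. The sketch below uses a linear predictor $p_w(x)=\langle w,x\rangle$ (my simplification, so the gradients can be computed exactly over all inputs) and compares the variance of the gradient across the parity family with $G(w)^2/|\mathcal{H}|$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n = 8
X = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)   # all inputs
V = np.array(list(itertools.product([0, 1], repeat=n)))                # all masks
labels = (-1.0) ** (X @ V.T)                                           # labels[x, h] = h(x)

w = rng.normal(size=n)

# For p_w(x) = <w, x> we have dp_w/dw = x, so the gradient of
# F_h(w) = E_x (1/2)(p_w(x) - h(x))^2 is E_x[x (<w, x> - h(x))].
residuals = X @ w                                          # (2^n,)
grads = (X.T @ (residuals[:, None] - labels)) / len(X)     # grads[:, h] = grad of F_h at w

mean_grad = grads.mean(axis=1, keepdims=True)
variance = np.mean(np.sum((grads - mean_grad) ** 2, axis=0))   # Var_h of the gradient

G_sq = np.mean(np.sum(X ** 2, axis=1))                     # G(w)^2 = E_x ||dp_w/dw||^2
print(variance, G_sq / V.shape[0])                         # variance <= G(w)^2 / |H|
```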

Theorem 2 bounds the variance, across the family of parity functions, of the gradient; since $|\mathcal{H}|=2^n$, the bound is exponentially small in $n$, so together with Theorem 1 we get our goal result on the hardness of learning parity. For another cool use of Theorem 1 see Intractability of Learning the Discrete Logarithm with Gradient-Based Methods by Takhanov et al. [4].