This is a review of work by Ohad Shamir et al. on the difficulty of learning parity with gradient descent, namely the papers Distribution-Specific Hardness of Learning Neural Networks (DSHLNN) and Failures of Gradient-Based Deep Learning (FGBDL).

Together, DSHLNN and FGBDL cover much more than the hardness of learning parity, so the purpose of this piece is twofold.

  1. To give a slimmed-down version of the papers containing only the parity result.
  2. To give a version without a page limit, with all the steps explicit.

Lemma 1, Lemma 3, and Lemma 4 are also left as exercises for the reader, so I've filled those in. Any illusion of intelligence found here should be credited to the authors.

Background

Let's start by considering the worst-case complexity of learning a parity function from examples. The parity functions in this piece take as input a bit string, pick a subset of the bits, then return $-1$ if the number of $1$s in the subset is odd and $1$ otherwise. Formally, the family of parity functions is

$$\mathcal{H}\coloneqq\{x\mapsto (-1)^{\langle v, x\rangle}\mid v\in [2]^n\},$$

where $\langle v, x\rangle\coloneqq\Sigma_{i=1}^n v_i x_i$ is the inner product and $n$ is the input length.

Our task is, given an unknown $h\in\mathcal{H}$ and a set of examples $\{(x^{(i)},h(x^{(i)}))\mid 0\leq i < t\}$, to determine which function $h$ is. In the worst case, this takes $t>2^{n-1}$ examples. To see why, let $x^{(i)}_j$ denote the $j$th bit of $x^{(i)}$ and consider the case where $\exists j\,\forall i,\ x^{(i)}_j=0$. Letting $h=x\mapsto(-1)^{\langle v, x\rangle}$ and $v'$ be the element of $[2]^n$ satisfying $v'_k\neq v_k\leftrightarrow j=k$, for $h'=x\mapsto(-1)^{\langle v', x\rangle}$ we have

$$\begin{aligned} \forall i,\quad h'(x^{(i)}) &= (-1)^{\Sigma_k v'_kx^{(i)}_k} \\ &= (-1)^{v'_jx^{(i)}_j+\Sigma_{k\neq j} v'_kx^{(i)}_k} \\ &= (-1)^{0+\Sigma_{k\neq j} v_kx^{(i)}_k} \\ &= (-1)^{v_jx^{(i)}_j+\Sigma_{k\neq j} v_kx^{(i)}_k} \\ &= h(x^{(i)}). \end{aligned}$$

As $h$ and $h'$ output the same value on every example input, it is ambiguous which is the true function based on our examples. There is a set of $2^{n-1}$ examples satisfying $\exists j\,\forall i,\ x^{(i)}_j=0$, so in the worst case learning $h$ takes $t>2^{n-1}$ examples.
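To make the ambiguity concrete, here is a small Python check (my own sketch, not from the papers): it builds two parity functions that differ only in one bit $j$ and verifies they agree on every example whose $j$th bit is $0$.

```python
# Two parity functions that differ only in bit j agree on every example whose
# j-th bit is 0, so those examples cannot distinguish them.
import itertools

n = 4
j = 2  # the bit that is always 0 in our examples

def parity(v):
    """Return the parity function x -> (-1)^<v, x>."""
    return lambda x: (-1) ** sum(vi * xi for vi, xi in zip(v, x))

v = (1, 0, 1, 0)
v_prime = tuple(vk ^ (k == j) for k, vk in enumerate(v))  # flip only bit j

h, h_prime = parity(v), parity(v_prime)
examples = [x for x in itertools.product([0, 1], repeat=n) if x[j] == 0]  # 2^(n-1) inputs

assert all(h(x) == h_prime(x) for x in examples)  # indistinguishable on these examples
```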

Shamir's work shows that, even when the examples are drawn from a nice distribution (i.e. one without the $\exists j\,\forall i,\ x^{(i)}_j=0$ property), gradient descent can take an exponential (in $n$) number of steps to learn some $h\in\mathcal{H}$. In my view, the star of the show is Theorem 1, which establishes that when the gradient of the loss is similar for all elements of $\mathcal{H}$, the gradient carries little information about which element of $\mathcal{H}$ should actually be learned. This small ``signal'' is then lost in noise from finite-precision arithmetic and sampling.

Gradient Descent

In the typical machine learning setting, for some target function $h:\mathbb{R}^n\rightarrow\mathbb{R}^m$ and neural network architecture $p_w$ parameterized by weights $w\in\mathbb{R}^{|w|}$, we'd like to compute

$$\min_w F_h(w)\coloneqq\mathbb{E}_x(\frac{1}{2}\|h(x)-p_w(x)\|^2),$$

where $F_h(w)$ is the expected loss given a choice of $w$.

One approach is to use a variation of gradient descent. This starts by selecting an initial value for $w$, call it $w^{(0)}$, then proceeds to iteratively update the weights according to the formula

$$w^{(i+1)}\coloneqq w^{(i)}-\eta(\frac{\partial}{\partial w}F_h)(w^{(i)}),$$

where $\eta$ is the learning rate. Intuitively, this works because $(\frac{\partial}{\partial w}F_h)(w^{(i)})$ is an element of $\mathbb{R}^{|w|}$ pointing in the direction of steepest increase of $\mathbb{E}_x\|h(x)-p_w(x)\|^2$, i.e. the loss. By subtracting this value from $w^{(i)}$ we move the weights in a direction that decreases the loss.

In practice, computing $\mathbb{E}_x\|h(x)-p_w(x)\|^2$, and in turn $\frac{\partial}{\partial w}F_h$, is computationally infeasible, as $x$'s distribution is unknown. As such, the standard approach is to sample $x_1,x_2,\dots,x_t$ and approximate $F_h(w)$ as

$$F_h(w) \approx \frac{1}{t}\Sigma_{i=1}^t\frac{1}{2}\|h(x_i)-p_w(x_i)\|^2.$$

$\frac{\partial}{\partial w}F_h$ can be approximated in turn, and gradient descent run using the approximation.
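As a concrete (if toy) illustration of this loop, here is a short Python sketch of gradient descent on the sampled loss. The linear model and linear target are my own stand-ins, not the papers' setup; the point is only the update rule and the sampled gradient estimate.

```python
# A minimal sketch: gradient descent on the sampled approximation of F_h,
# using a linear "network" p_w(x) = <w, x> and a linear target for illustration.
import numpy as np

rng = np.random.default_rng(0)
n, t, eta, steps = 5, 200, 0.1, 100

true_w = rng.normal(size=n)
h = lambda x: x @ true_w                     # target function (a stand-in, not a parity)
pw = lambda w, x: x @ w                      # "network": a linear model for illustration

xs = rng.normal(size=(t, n))                 # sampled inputs x_1, ..., x_t
w = np.zeros(n)                              # w^(0)
for _ in range(steps):
    residual = pw(w, xs) - h(xs)             # p_w(x_i) - h(x_i)
    grad = (residual[:, None] * xs).mean(0)  # sampled estimate of dF_h/dw
    w = w - eta * grad                       # w^(i+1) = w^(i) - eta * grad

print(np.linalg.norm(w - true_w))            # should be small after convergence
```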

Approximate Gradient Descent

We'll model this approximation with two definitions: approximate gradient oracles capture the error involved in computing $\frac{\partial}{\partial w}F_h$, and approximate gradient-based methods use these approximate gradients.

An Approximate Gradient Oracle is a function $O_{F_h,\epsilon}:\mathbb{R}^{|w|}\rightarrow \mathbb{R}^{|w|}$ satisfying

$$\forall w,\quad \|O_{F_h,\epsilon}(w) - \frac{\partial}{\partial w}F_h(w)\| \leq \epsilon.$$

An Approximate Gradient-Based Method is an algorithm that generates an initial guess $w^{(0)}$, then decides $w^{(i+1)}$ based on responses from an approximate gradient oracle:

$$w^{(i+1)} = f(w^{(0)}, \{O_{F_h,\epsilon}(w^{(j)}) \mid j \leq i\}).$$
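The two definitions translate naturally into code. The sketch below is my own framing (the names and the noise model are illustrative assumptions): an oracle that perturbs the true gradient by at most $\epsilon$, and a method whose iterates depend only on $w^{(0)}$ and the oracle's responses.

```python
# A sketch of the two definitions: an epsilon-approximate gradient oracle and a
# gradient-based method that only ever sees the oracle's responses.
import numpy as np

def make_oracle(grad_Fh, epsilon, rng):
    """Approximate gradient oracle: returns grad F_h(w) plus noise of norm <= epsilon."""
    def oracle(w):
        noise = rng.normal(size=w.shape)
        noise *= epsilon * rng.random() / np.linalg.norm(noise)
        return grad_Fh(w) + noise
    return oracle

def approximate_gradient_method(oracle, w0, eta, steps):
    """Plain gradient descent, but driven entirely by the oracle's responses."""
    w = w0
    for _ in range(steps):
        w = w - eta * oracle(w)  # w^(i+1) depends only on w^(0) and oracle outputs
    return w
```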

Parity Hardness

Now, consider a family of functions $\mathcal{H}=\{h_1,h_2,\dots\}$ and the variance of the gradient at $w$ with respect to the choice of $h\in\mathcal{H}$:

$$\text{Var}_{h\in\mathcal{H}}(\frac{\partial}{\partial w}F_h(w))\coloneqq \mathbb{E}_h\| (\frac{\partial}{\partial w}F_h)(w) - \mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w)) \|^2.$$

To show parity is difficult to learn, we'll show that when $\mathcal{H}$ is the family of parity functions, this variance is exponentially small w.r.t. the length of the parity function's inputs. In turn, an adversarial approximate gradient oracle can repeatedly return $\mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w))$ instead of $(\frac{\partial}{\partial w}F_h)(w)$ while staying within its $\epsilon$ error tolerance. Because $\mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w))$ is independent of the $h\in\mathcal{H}$ being learned, an approximate gradient-based method using this adversarial oracle can converge to a value independent of the target function, $h$, unless it takes an exponentially large number of steps.

Theorem 1 (DSHLNN Theorem 10). For some family of functions $\mathcal{H}$, if

$$\forall w,\quad \text{Var}_{h\in\mathcal{H}}((\frac{\partial}{\partial w}F_h)(w)) \leq \epsilon^3,$$

then for any approximate gradient-based method learning $h\in\mathcal{H}$, there is a run such that the value of $w^{(\lfloor \epsilon^{-1}\rfloor)}$ is independent of $h$.

Proof. By Chebyshev's inequality and the hypothesis, we have

$$\begin{aligned} \forall w, \quad &\mathbb{P}_h(\|(\frac{\partial}{\partial w}F_h)(w)-\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))\| > \epsilon) \\ &\leq \text{Var}_h((\frac{\partial}{\partial w}F_h)(w))/\epsilon^2 \\ &\leq \epsilon. \end{aligned}$$

So, consider the adversarial approximate gradient oracle

$$O_{F_h,\epsilon}(w) \coloneqq \begin{cases} \mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w)) &\text{if }\|(\frac{\partial}{\partial w}F_h)(w)-\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))\| \leq \epsilon \\ (\frac{\partial}{\partial w}F_h)(w) &\text{otherwise.} \end{cases}$$

Using our Chebyshev inequality, we can bound the likelihood of $O_{F_h,\epsilon}$'s $\text{otherwise}$ case.

$$\begin{aligned} \forall w, \quad &\mathbb{P}_h(O_{F_h,\epsilon}(w)=(\frac{\partial}{\partial w}F_h)(w)) \\ &= \mathbb{P}_h(\|(\frac{\partial}{\partial w}F_h)(w)-\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))\| > \epsilon) \\ &\leq \epsilon. \end{aligned}$$

Because $\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))$ is independent of what $h$ is being learned, the inequality above bounds the likelihood $O_{F_h,\epsilon}(w)$ is dependent on $h$.

Now, for any approximate gradient-based method learning $h\in\mathcal{H}$, $w^{(0)}$ is independent of $h$, as nothing has been queried from the gradient oracle when it is chosen. As

$$w^{(1)}=f(w^{(0)}, O_{F_h,\epsilon}(w^{(0)})),$$

evidently, $w^{(1)}$ depends on the $h$ being learned only if $O_{F_h,\epsilon}(w^{(0)})$ does, and, per the inequality above, the likelihood of this is $\leq\epsilon$. Repeating this argument, let $A^{(i)}$ be the event that $O_{F_h,\epsilon}(w^{(i)})$ depends on $h$. We have $\mathbb{P}(A^{(i)})\leq\epsilon$, so by the union bound

$$\begin{aligned} \mathbb{P}(\bigvee_{i=1}^IA^{(i)})&\leq\Sigma_{i=1}^I\mathbb{P}(A^{(i)})\\ &\leq I\epsilon. \end{aligned}$$

If $\mathbb{P}(\bigvee_{i=1}^IA^{(i)})<1$, then there is an $I$-step run of our gradient-based method where $w^{(I)}$ is independent of the target function, $h$. Solving for $I$ using the inequality above gives the desired bound: if $I<1/\epsilon$, then there is a run of the gradient-based method where $w^{(\lfloor \epsilon^{-1} \rfloor)}$ is independent of $h$. (I am simplifying somewhat here: for the case we're interested in, $\epsilon^{-1}$ will not be an integer, and flooring gives the strict $<$ inequality we want, but if you're feeling picky, $I=\lceil \epsilon^{-1}\rceil -1$ will do.)

\blacksquare
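The adversarial oracle from the proof is simple enough to write down directly. The sketch below is my own, with `grad_Fh` and `mean_grad` as assumed callables computing $(\frac{\partial}{\partial w}F_h)(w)$ and $\mathbb{E}_h((\frac{\partial}{\partial w}F_h)(w))$.

```python
# The oracle from Theorem 1's proof: return the h-independent mean gradient
# whenever doing so stays within the epsilon error tolerance.
import numpy as np

def adversarial_oracle(grad_Fh, mean_grad, epsilon):
    def oracle(w):
        g, m = grad_Fh(w), mean_grad(w)
        return m if np.linalg.norm(g - m) <= epsilon else g
    return oracle
```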

Lemma 1.

$$\Sigma_{x\in[n]^d}\Pi_{i=1}^d f_i(x_i) = \Pi_{i=1}^d\Sigma_{x\in[n]}f_i(x).$$

Proof. Shamir wordlessly invokes this, but it took me several hours on an airplane and help from ChatGPT to see. By induction on $d$. When $d=2$,

$$\begin{aligned} \Sigma_{x\in[n]^2}\Pi_{i=1}^2 f_i(x_i) &= \Sigma_{x\in[n]^2}f_1(x_1)f_2(x_2) \\ &= \Sigma_{x_1\in[n]}\Sigma_{x_2\in[n]}f_1(x_1)f_2(x_2) \\ &= \Sigma_{x_1\in[n]}f_1(x_1)(\Sigma_{x_2\in[n]}f_2(x_2)) \\ &= (\Sigma_{x_1\in[n]}f_1(x_1))(\Sigma_{x_2\in[n]}f_2(x_2)) \\ &= \Pi_{i=1}^2\Sigma_{x\in[n]}f_i(x). \end{aligned}$$

For the inductive step,

$$\begin{aligned} \Sigma_{x\in[n]^d}\Pi_{i=1}^df_i(x_i) &= \Sigma_{x_1\in[n]}\Sigma_{x_2\in[n]}\dots\Sigma_{x_d\in[n]}\Pi_{i=1}^df_i(x_i) \\ &= \Sigma_{x_d\in[n]}(\underbrace{\Sigma_{x_1\in[n]}\dots\Sigma_{x_{d-1}\in[n]}\Pi_{i=1}^{d-1}f_i(x_i)}_{\substack{\text{inductive hypothesis}}})f_d(x_d) \\ &= \Sigma_{x_d\in[n]}(\Pi_{i=1}^{d-1}\Sigma_{x\in[n]}f_i(x))f_d(x_d) \\ &= (\Pi_{i=1}^{d-1}\Sigma_{x\in[n]}f_i(x))(\Sigma_{x_d\in[n]}f_d(x_d)) \\ &= \Pi_{i=1}^d\Sigma_{x\in[n]}f_i(x). \end{aligned}$$

\blacksquare
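A quick numerical sanity check of Lemma 1 (my own, with random lookup tables standing in for the $f_i$):

```python
# Verify the sum-product exchange of Lemma 1 for random functions f_i on [n].
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 4
f = rng.normal(size=(d, n))  # f[i][x] plays the role of f_i(x)

lhs = sum(np.prod([f[i][x[i]] for i in range(d)])
          for x in itertools.product(range(n), repeat=d))
rhs = np.prod([f[i].sum() for i in range(d)])
assert np.isclose(lhs, rhs)
```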

Lemma 2. For the family of parity functions, $\mathcal{H}$, any $h,h'\in\mathcal{H}$, and $x$ drawn uniformly from $[2]^n$,

$$\mathbb{E}_x(h(x)h'(x))=\begin{cases} 1 & (h=h') \\ 0 & (h\neq h'). \end{cases}$$

Proof.

$$\begin{aligned} \mathbb{E}_x(h(x)h'(x)) &= \mathbb{E}_x((-1)^{\langle v,x\rangle}(-1)^{\langle v',x\rangle}) \\ &= \mathbb{E}_x((-1)^{\langle v+v',x\rangle}) \\ &= \mathbb{E}_x(\Pi_{i=1}^n (-1)^{(v_i+v'_i)x_i}) \\ &= \frac{1}{2^n}\Sigma_{x\in[2]^n}\Pi_{i=1}^n (-1)^{(v_i+v'_i)x_i} \\ &= \frac{1}{2^n}\Pi_{i=1}^n\Sigma_{x\in[2]}(-1)^{(v_i+v'_i)x} \quad (\text{Lemma }1) \\ &= \frac{1}{2^n}\Pi_{i=1}^n2\frac{1}{2}\Sigma_{x\in[2]}(-1)^{(v_i+v'_i)x} \\ &= \Pi_{i=1}^n\mathbb{E}_{x_i}((-1)^{(v_i+v'_i)x_i}) \\ &= \Pi_{i=1}^n((-1)^{0\cdot(v_i+v'_i)} + (-1)^{1\cdot(v_i+v'_i)})/2 \\ &= \Pi_{i=1}^n(1 + (-1)^{v_i+v'_i})/2. \end{aligned}$$

When $h=h'$, every $v_i+v'_i$ is even, so each factor is $1$ and the product is $1$. When $h\neq h'$, there is some $i$ with $v_i\neq v'_i$, making $v_i+v'_i$ odd, so that factor is $(1-1)/2=0$ and the product is $0$.

\blacksquare
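Lemma 2 is also easy to confirm by brute force for small $n$ (my own check):

```python
# Brute-force check of the orthogonality of parity functions over all pairs.
import itertools

n = 4
def parity(v):
    return lambda x: (-1) ** sum(vi * xi for vi, xi in zip(v, x))

vs = list(itertools.product([0, 1], repeat=n))
xs = list(itertools.product([0, 1], repeat=n))
for v in vs:
    for v2 in vs:
        mean = sum(parity(v)(x) * parity(v2)(x) for x in xs) / len(xs)
        assert mean == (1 if v == v2 else 0)
```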

Lemma 3. For any $n$, $m$, and $f:\mathbb{R}^n\rightarrow\mathbb{R}^m$,

$$\forall a\in\mathbb{R}^m,\quad \mathbb{E}_x\|f(x)-\mathbb{E}_x(f(x))\|^2 \leq \mathbb{E}_x\|f(x)-a\|^2.$$

Proof. Notice $g(a)\coloneqq\mathbb{E}_x\|f(x)-a\|^2$ is (strictly) convex, so the value of $a$ satisfying $(\frac{\partial}{\partial a}g)(a)=0$ is a global minimum. Computing this derivative and solving gives $a=\mathbb{E}_x(f(x))$, as desired. This corresponds to the intuitive idea that the mean minimizes the expected squared distance to the other values.

\blacksquare
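A quick numerical illustration of Lemma 3 (my own), checking that the sample mean beats a handful of other constants:

```python
# Among all constants a, the mean of f(x) minimizes E||f(x) - a||^2.
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=3.0, scale=2.0, size=(10_000, 2))  # stand-in for f(x) over random x

mean = samples.mean(axis=0)
loss = lambda a: np.mean(np.sum((samples - a) ** 2, axis=1))

for a in [mean + rng.normal(size=2) for _ in range(5)]:
    assert loss(mean) <= loss(a)
```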

Lemma 4 (Bessel's inequality). For a Hilbert space $H$ and a finite set $\{e_i\mid i\in I\}$ satisfying

$$\langle e_i,e_j\rangle=\begin{cases}1 &(i=j)\\0 &(i\neq j),\end{cases}$$

i.e. $\{e_i\mid i\in I\}$ is an orthonormal family, we have

$$\forall x\in H,\quad \Sigma_{i\in I}\|\langle x, e_i\rangle\|^2\leq\|x\|^2.$$

Proof. Intuitively, this says that projecting $x$ onto an orthonormal family won't increase its size. Let $a_i$ be $\langle x, e_i\rangle$ and $y$ be $\Sigma_{i\in I}a_i e_i$, i.e. $y$ is the projection of $x$ onto the span of $\{e_i\mid i\in I\}$, and $a_i$ is the component of $x$ lying along the $i$th element of the family. Now, let the residual $r$ be $x-y$ and note $r$ is perpendicular to each $e_i$:

$$\begin{aligned} \langle r, e_i\rangle &= \langle x, e_i\rangle - \langle y, e_i\rangle \\ &= a_i - \Sigma_{j\in I}a_j\langle e_j, e_i\rangle \\ &= a_i - a_i \\ &= 0. \end{aligned}$$

In turn, $r$ is perpendicular to $y$:

$$\langle r, y\rangle = \Sigma_{i\in I}a_i\langle r, e_i\rangle = 0.$$

So, because $x=y+r$ and $y$ and $r$ are perpendicular, by Pythagoras',

$$\begin{aligned} \|x\|^2&=\|y+r\|^2 \\ &=\|y\|^2+\|r\|^2 \\ &\geq \|y\|^2 \\ &= \langle \Sigma_{i\in I} a_ie_i, \Sigma_{i\in I} a_ie_i\rangle \\ &= \Sigma_{i\in I}a_i^2 \\ &= \Sigma_{i\in I}\|\langle x, e_i\rangle\|^2. \end{aligned}$$

\blacksquare
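And a numerical check of Bessel's inequality (again my own), using the QR decomposition of a random matrix to get an orthonormal family:

```python
# Check Bessel's inequality with a random orthonormal family in R^m.
import numpy as np

rng = np.random.default_rng(0)
m, k = 8, 3                                   # ambient dimension, size of the family
q, _ = np.linalg.qr(rng.normal(size=(m, k)))  # columns of q are orthonormal

x = rng.normal(size=m)
assert sum(np.dot(x, q[:, i]) ** 2 for i in range(k)) <= np.dot(x, x) + 1e-12
```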

Theorem 2 (FGBDL Theorem 1). Let $\mathcal{H}$ be the family of parity functions and let $p_w$ and $F_h$ be a neural network and MSE loss function, as before. If $p_w$ satisfies

$$\forall w,\quad \mathbb{E}_x\|(\frac{\partial}{\partial w}p_w)(x)\|^2\leq G(w)^2$$

for some scalar $G(w)$, then

$$\text{Var}_h((\frac{\partial}{\partial w}F_h)(w))\leq\frac{G(w)^2}{|\mathcal{H}|}.$$

Proof. First, invoking Lemma 3 (with $h$ playing the role of its $x$) with $a=\mathbb{E}_{x}(p_w(x)(\frac{\partial}{\partial w}p_w)(x))$, we have

$$\begin{aligned} &\text{Var}_h((\frac{\partial}{\partial w}F_h)(w)) \\ &= \mathbb{E}_h\| (\frac{\partial}{\partial w}F_h)(w) - \mathbb{E}_{h'}((\frac{\partial}{\partial w}F_{h'})(w)) \|^2 \\ &\leq \mathbb{E}_h\| (\frac{\partial}{\partial w}F_h)(w) - \mathbb{E}_{x}(p_w(x)(\frac{\partial}{\partial w}p_w)(x)) \|^2 \\ &= \mathbb{E}_h\|\mathbb{E}_x((p_w(x)-h(x))(\frac{\partial}{\partial w}p_w)(x)) - \mathbb{E}_{x}(p_w(x)(\frac{\partial}{\partial w}p_w)(x))\|^2 \\ &= \mathbb{E}_h\|\mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)(x))\|^2. \end{aligned}$$

Next, note

$$\begin{aligned} \mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)(x))&=\frac{1}{2^n}\Sigma_{x\in[2]^n}h(x)(\frac{\partial}{\partial w}p_w)(x) \\ &= \frac{1}{2^n}\langle h, \frac{\partial}{\partial w}p_w\rangle. \end{aligned}$$

So, using Lemma 2 (which shows the parity functions, scaled by $\frac{1}{\sqrt{2^n}}$, form an orthonormal family) to invoke Lemma 4, we get the desired bound.

$$\begin{aligned} &\mathbb{E}_h\|\mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)(x))\|^2 \\ &= \mathbb{E}_h\Sigma_{i=1}^{|w|}\mathbb{E}_x(h(x)(\frac{\partial}{\partial w}p_w)_i(x))^2 \\ &= \Sigma_{i=1}^{|w|}\mathbb{E}_h((\frac{1}{2^n}\langle h, (\frac{\partial}{\partial w}p_w)_i\rangle)^2) \\ &= \Sigma_{i=1}^{|w|}\frac{1}{2^{2n}}\frac{1}{|\mathcal{H}|}\Sigma_{h\in\mathcal{H}}\langle h, (\frac{\partial}{\partial w}p_w)_i\rangle^2 \\ &\leq \Sigma_{i=1}^{|w|}\frac{1}{2^n}\frac{1}{|\mathcal{H}|}\|(\frac{\partial}{\partial w}p_w)_i\|^2 \quad \text{(Bessel's inequality)} \\ &= \Sigma_{i=1}^{|w|}\frac{1}{|\mathcal{H}|}\mathbb{E}_x((\frac{\partial}{\partial w}p_w)_i(x)^2) \\ &= \frac{1}{|\mathcal{H}|}\mathbb{E}_x\|(\frac{\partial}{\partial w}p_w)(x)\|^2 \\ &\leq \frac{G(w)^2}{|\mathcal{H}|}. \end{aligned}$$

\blacksquare
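To see Theorem 2's bound in action, here is an experiment of my own (the one-hidden-layer tanh network and all sizes are arbitrary choices): for a fixed random $w$, it computes $(\frac{\partial}{\partial w}F_h)(w)$ exactly for every parity function $h$ and compares the empirical variance with $G(w)^2/|\mathcal{H}|$, taking $G(w)^2=\mathbb{E}_x\|(\frac{\partial}{\partial w}p_w)(x)\|^2$.

```python
# Empirically compare Var_h(dF_h/dw) with G(w)^2 / |H| for a small tanh network.
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, hidden = 6, 8
xs = np.array(list(itertools.product([0, 1], repeat=n)), dtype=float)  # all 2^n inputs
vs = list(itertools.product([0, 1], repeat=n))                         # all 2^n parity indices
hs = np.array([[(-1.0) ** np.dot(v, x) for x in xs] for v in vs])      # h(x) for every h, x

W1 = rng.normal(size=(hidden, n))
b1 = rng.normal(size=hidden)
w2 = rng.normal(size=hidden)

def forward_and_grad(x):
    """Return p_w(x) and the gradient of p_w(x) w.r.t. (W1, b1, w2), flattened."""
    z = np.tanh(W1 @ x + b1)
    dz = w2 * (1 - z ** 2)                       # d p_w / d (W1 x + b1)
    grad = np.concatenate([(dz[:, None] * x[None, :]).ravel(), dz, z])
    return w2 @ z, grad

outs, grads = zip(*(forward_and_grad(x) for x in xs))
outs, grads = np.array(outs), np.array(grads)    # shapes (2^n,), (2^n, |w|)

# dF_h/dw = E_x[(p_w(x) - h(x)) dp_w(x)/dw], computed exactly over all x
grad_F = np.array([((outs - h)[:, None] * grads).mean(0) for h in hs])
variance = np.mean(np.sum((grad_F - grad_F.mean(0)) ** 2, axis=1))
bound = np.mean(np.sum(grads ** 2, axis=1)) / len(vs)  # G(w)^2 / |H|
print(variance, bound)                                 # variance should be <= bound
```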

Theorem 2 bounds the variance of the gradient over the family of parity functions, and since $|\mathcal{H}|=2^n$ that bound is exponentially small in $n$, so together with Theorem 1 we get our goal result on the hardness of learning parity. For another cool use of Theorem 1, see Intractability of Learning the Discrete Logarithm with Gradient-Based Methods by Takhanov et al.

Conclusion

This is a subtle result. For example, consider the family of functions,

$$\mathcal{G}\coloneqq\{x\mapsto x+10^{-(100+k)}\mid 0\leq k<2^n\}.$$

As these differ only by a tiny additive constant, the variance of their gradients trivially satisfies the hypothesis of Theorem 1. In turn, in the worst case gradient descent might converge to a value independent of the target function unless we take an exponential number of steps! But this actually isn't so bad in practice, because the functions are so similar that following $\mathbb{E}_{g\in\mathcal{G}}((\frac{\partial}{\partial w}F_g)(w))$ will likely yield a good approximation of any of them.
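In fact, the members of $\mathcal{G}$ are so similar that their gradients are identical at machine precision, as this tiny check (my own, with a one-parameter linear model as a stand-in for $p_w$) shows:

```python
# The per-g losses differ only by an offset of order 1e-100, so their gradients
# are numerically indistinguishable in float64.
import numpy as np

xs = np.linspace(-1.0, 1.0, 100)
w = 0.3                                     # p_w(x) = w * x, a one-parameter model

def grad_F(offset):                         # d/dw of E_x[(w*x - (x + offset))^2 / 2]
    return np.mean((w * xs - (xs + offset)) * xs)

print(grad_F(1e-100), grad_F(1e-130))       # identical in float64
```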

In turn, saying Theorem 1 and Theorem 2 give a result about the ``hardness'' of learning parity requires subtle assumptions. Really the result is: unless we take an exponential number of steps, gradient descent might converge to a value independent of the target element of $\mathcal{H}$. For parity, this is a reasonable notion of hardness, because, unlike the elements of $\mathcal{G}$, parity functions are quite different from one another.

Second, this result only applies when our neural network might converge to any element of $\mathcal{H}$. To illustrate this, in SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver by Wang et al., they learn parity with an inductive bias by weaving the input through their model. For example, on a length-$4$ input, their model's output is

$$p_w(x_4, p_w(x_3, p_w(x_2,x_1))).$$

By inspection, the only two elements of $\mathcal{H}$ this can learn are the trivial case when the subset of bits considered is empty, and full parity. In fact, there are only $16$ functions from $[2]^2\rightarrow[2]$, so in this case parity can easily be learned by brute force.
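To spell out why the chained architecture computes full parity, here is a sketch of my own (using bits in $\{0,1\}$ and XOR in place of the $\pm 1$ convention and the learned $p_w$):

```python
# Chaining a single 2-bit function along the input computes full parity
# when that function is XOR-like.
from functools import reduce

def chain(pw, x):
    """Compute pw(x_n, pw(x_{n-1}, ... pw(x_2, x_1)))."""
    return reduce(lambda acc, xi: pw(xi, acc), x[1:], x[0])

xor = lambda a, b: a ^ b                       # the 2-bit function the model needs to learn
x = [1, 0, 1, 1]
assert chain(xor, x) == sum(x) % 2             # full parity of the input
```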