Together, DSHLNN and FGBDL cover much more than the hardness of
learning parity, so the purpose of this piece is twofold:
1. To give a slimmed-down version of the papers containing only the
parity result.
2. To give a version without a page limit, with all the steps
explicit.
Lemma 1, Lemma 3, and Lemma 4 are also left as exercises
for the reader, so I've filled those in. Any illusion of intelligence
found here should be credited to the authors.
Background
Let's start by considering the worst-case complexity of learning a parity
function from examples. The parity functions in this piece take as
input a bit string, pick a subset of the bits, then return −1 if the
number of 1s in the subset is odd and 1 otherwise. Formally, the
family of parity functions is

H := {x ↦ (−1)^{⟨v,x⟩} | v ∈ [2]^n},

where ⟨v,x⟩ := Σ_{i=1}^n v_i x_i is the inner
product and n is the input length.
Our task is, given an unknown h ∈ H and a set of examples
{(x^{(i)}, h(x^{(i)})) | 0 ≤ i < t}, to determine the value of
h. In the worst case, this takes t > 2^{n−1} examples. To see why,
letting x_j^{(i)} denote the jth element of x^{(i)}, consider
the case where ∃j ∀i, x_j^{(i)} = 0. In this case,
letting h = x ↦ (−1)^{⟨v,x⟩} and v′ be the element
of [2]^n satisfying v′_k = v_k ↔ j ≠ k (i.e. v with its jth bit flipped), for
h′ = x ↦ (−1)^{⟨v′,x⟩} we have

∀i, h′(x^{(i)}) = (−1)^{⟨v,x^{(i)}⟩ + (v′_j − v_j)x_j^{(i)}} = (−1)^{⟨v,x^{(i)}⟩} = h(x^{(i)}).

As h and h′ output the same value on every example input, it is
ambiguous which is the true function based on our examples. There is a
set of examples of size 2^{n−1} satisfying ∃j ∀i, x_j^{(i)} = 0, so in the worst case learning h takes t > 2^{n−1}
examples.
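To make the ambiguity concrete, here's a quick sketch (Python and NumPy are my choices; the papers include no code) checking that flipping bit j of v changes nothing when every example has x_j = 0:

```python
import numpy as np

n, j = 4, 2
rng = np.random.default_rng(0)

def parity(v, x):
    # (-1)^<v, x>, the parity of the bits of x selected by v
    return (-1) ** int(np.dot(v, x) % 2)

v = rng.integers(0, 2, size=n)
v_prime = v.copy()
v_prime[j] ^= 1  # flip bit j

# Every example with x_j = 0; there are 2^(n-1) of them.
examples = [np.array(x) for x in np.ndindex(*(2,) * n) if x[j] == 0]
assert all(parity(v, x) == parity(v_prime, x) for x in examples)
print(f"{len(examples)} examples, and v, v' agree on all of them")
```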
Shamir's work shows that, even when the examples are drawn from a nice
distribution (i.e. one without the ∃j ∀i, x_j^{(i)} = 0
property), gradient descent can take an exponential (in n) number of
steps to learn some h ∈ H. In my view, the star of the
show is Theorem 1, which establishes that when the gradient of the loss
is similar for all elements of H, the gradient carries
little information about which element of H should actually
be learned. This small ``signal'' is then lost in noise from finite
precision arithmetic and sampling.
Gradient Descent
In the typical machine learning setting, for some target function
h: R^n → R^m and neural network
architecture p_w parameterized by weights w ∈ R^{|w|},
we'd like to compute

min_w F_h(w) := E_x((1/2)∥h(x) − p_w(x)∥²),

where F_h(w) is the expected loss given a choice of w.
One approach is to use a variation of gradient descent. This starts
by selecting an initial value for w, call it w^{(0)}, then
proceeds to iteratively update the weights according to the formula

w^{(i+1)} := w^{(i)} − η(∂F_h/∂w)(w^{(i)}),

where η is the learning rate. Intuitively, this works because
(∂F_h/∂w)(w^{(i)}) is an element of
R^{|w|} pointing in the direction of steepest increase of
E_x∥h(x) − p_w(x)∥², i.e. the loss. By subtracting this value from
w^{(i)} we move the weights in a direction that decreases the loss.
In practice, computing E_x∥h(x) − p_w(x)∥², and in turn
∂F_h/∂w, is computationally infeasible, as
x's distribution is unknown. As such, the standard approach is to
sample x_1, x_2, …, x_t and approximate F_h(w) as

F_h(w) ≈ (1/t) Σ_{i=1}^t (1/2)∥h(x_i) − p_w(x_i)∥².

∂F_h/∂w can be approximated in turn, and
gradient descent run using the approximation.
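As a concrete illustration, here's a minimal sketch of this sampled approximation, assuming a toy linear model p_w(x) = ⟨w, x⟩ and a target of my own choosing (everything here is illustrative, not from the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
n, t, eta = 8, 256, 0.1
w = rng.normal(size=n)            # w^(0)
h = lambda xs: xs @ np.ones(n)    # some known target, for demonstration

for _ in range(100):
    xs = rng.normal(size=(t, n))  # sample x_1, ..., x_t
    residual = xs @ w - h(xs)     # p_w(x_i) - h(x_i)
    grad = residual @ xs / t      # approximates (dF_h/dw)(w)
    w -= eta * grad               # the gradient descent update

print(np.round(w, 2))  # drifts toward the all-ones vector
```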
Approximate Gradient Descent
We'll model this approximation with two definitions: approximate gradient oracles capture the error involved in computing
∂F_h/∂w, and approximate gradient-based methods use these approximations.
An Approximate Gradient Oracle is a function

O_{F_h,ϵ}: R^{|w|} → R^{|w|}

satisfying

∀w, ∥O_{F_h,ϵ}(w) − (∂F_h/∂w)(w)∥ ≤ ϵ.
An Approximate Gradient-Based Method is an algorithm that
generates an initial guess w^{(0)}, then decides w^{(i+1)}
based on responses from an approximate gradient oracle:

w^{(i+1)} = f(w^{(0)}, {O_{F_h,ϵ}(w^{(j)}) | j < i+1}).
Parity Hardness
Now, consider a family of functions H = {h_1, h_2, …}
and the variance, over h ∈ H, of the gradient at w:

Var_{h∈H}((∂F_h/∂w)(w)) := E_h(∥(∂F_h/∂w)(w) − E_{h′}((∂F_{h′}/∂w)(w))∥²).
To show parity is difficult to learn, we'll show that when H
is the family of parity functions, this variance is exponentially
small w.r.t. the length of the parity functions' inputs. In turn, an
adversarial approximate gradient oracle can repeatedly return
E_{h′}((∂F_{h′}/∂w)(w)) instead of
(∂F_h/∂w)(w) while staying within its
ϵ error tolerance. Because
E_{h′}((∂F_{h′}/∂w)(w)) is
independent of the h ∈ H being learned, an approximate gradient-based method using this adversarial oracle can converge to
a value independent of the target function, h, unless it takes an
exponentially large number of steps.
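A sketch of this adversarial oracle (the names and structure are mine; the papers define it abstractly):

```python
import numpy as np

def adversarial_oracle(true_grad, mean_grad, eps):
    # true_grad: (dF_h/dw)(w) for the h actually being learned
    # mean_grad: E_{h'}((dF_{h'}/dw)(w)), which is independent of h
    if np.linalg.norm(mean_grad - true_grad) <= eps:
        return mean_grad  # a legal answer that reveals nothing about h
    return true_grad      # forced to leak information about h
```

When the variance over h is tiny, the first branch is taken for almost every h, so almost every query comes back h-independent.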
Theorem 1 (DSHLNN Theorem 10). For some family of functions H, if

∀w, Var_{h∈H}((∂F_h/∂w)(w)) ≤ ϵ³

then for any approximate gradient-based method learning
h ∈ H, there is a run such that the value of
w^{(⌊ϵ^{−1}⌋)} is independent of h.
Proof. By Chebyshev's inequality and the hypothesis, we have

P_h(∥(∂F_h/∂w)(w) − E_h((∂F_h/∂w)(w))∥ ≥ ϵ) ≤ ϵ³/ϵ² = ϵ.

Because E_h((∂F_h/∂w)(w)) is
independent of what h is being learned, and the oracle may answer with it
whenever it is within ϵ of the true gradient, the inequality above
bounds the likelihood that O_{F_h,ϵ}(w) is dependent on
h.
Now, for any approximate gradient-based method learning
h ∈ H, w^{(0)} is independent of h, as nothing has
been sampled from the gradient when it is chosen. As

w^{(1)} = f(w^{(0)}, O_{F_h,ϵ}(w^{(0)})),

evidently, w^{(1)} is dependent on the h being learned only if
O_{F_h,ϵ}(w^{(0)}) is, and, per the inequality above, the
likelihood of this is ≤ ϵ. Repeating this argument, let
A^{(i)} be the event that O_{F_h,ϵ}(w^{(i)}) is dependent on
h. We have P(A^{(i)}) ≤ ϵ, so by the union bound

P(⋁_{i=1}^I A^{(i)}) ≤ Σ_{i=1}^I P(A^{(i)}) ≤ Iϵ.
If P(⋁_{i=1}^I A^{(i)}) < 1, then there is an I-step
run of our gradient-based method where w^{(I)} is independent of the
target function, h. Solving for I using the inequality above gives
the desired bound: if I < 1/ϵ, then there is a run of the
gradient-based method where w^{(⌊ϵ^{−1}⌋)} is
independent of h (I am simplifying somewhat here because for the
case we're interested in, ϵ^{−1} will not be an integer and
flooring gives the strict < inequality we want, but if you're
feeling picky I = ⌈ϵ^{−1}⌉ − 1 will do).
■
Lemma 1.

Σ_{x∈[n]^d} Π_{i=1}^d f_i(x_i) = Π_{i=1}^d Σ_{x∈[n]} f_i(x)

Proof. Shamir wordlessly invokes this, but it took me several hours
on an airplane and help from ChatGPT to see. By induction on d. When
d = 2,

Σ_{x∈[n]²} f_1(x_1)f_2(x_2) = Σ_{x_1∈[n]} f_1(x_1) Σ_{x_2∈[n]} f_2(x_2) = Π_{i=1}^2 Σ_{x∈[n]} f_i(x),

since f_1(x_1) is constant with respect to the inner sum and can be
factored out. For the inductive step, split each x ∈ [n]^{d+1} into
(x′, x_{d+1}) with x′ ∈ [n]^d, factor out
Σ_{x_{d+1}∈[n]} f_{d+1}(x_{d+1}) in the same way, and apply the
inductive hypothesis to the remaining sum over [n]^d.
■
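A quick numeric sanity check of Lemma 1 (my own; the choices of n, d, and f_i are arbitrary):

```python
import itertools
import math

n, d = 3, 4
fs = [lambda x, i=i: (x + 1) ** (i + 1) for i in range(d)]  # arbitrary f_1, ..., f_d

# Sum of products over [n]^d ...
lhs = sum(math.prod(fs[i](x[i]) for i in range(d))
          for x in itertools.product(range(n), repeat=d))
# ... equals the product of per-coordinate sums.
rhs = math.prod(sum(f(x) for x in range(n)) for f in fs)
assert lhs == rhs
```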
Lemma 3. For any function f, the constant a minimizing
E_x∥f(x) − a∥² is E_x(f(x)).

Proof. Notice g(a) := E_x∥f(x) − a∥² is (strictly)
convex, so the value of a satisfying (∂g/∂a)(a) = 0 is a global minimum. Computing this derivative and solving
gives a = E_x(f(x)), as desired. This corresponds to the
intuitive idea that the mean minimizes the expected difference between
it and all other values.
■
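A numeric check of Lemma 3 (the distribution and the candidate values of a are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(size=100_000)  # stand-in for f(x) under x's distribution

def expected_loss(a):
    return np.mean((samples - a) ** 2)

mean = samples.mean()
for a in (0.0, 0.5, mean, 2.0):
    print(f"a = {a:.3f}: E||f(x) - a||^2 = {expected_loss(a):.3f}")
# the a = mean row comes out smallest
```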
Lemma 4 (Bessel's inequality). For a Hilbert space H and finite set {e_i | i ∈ I} satisfying

⟨e_i, e_j⟩ = 1 if i = j, and 0 otherwise,

i.e. {e_i | i ∈ I} is an orthonormal family, we have

∀x ∈ H, Σ_{i∈I} ∥⟨x, e_i⟩∥² ≤ ∥x∥².

Proof. Intuitively, this says that projecting x onto an
orthonormal basis won't increase its size. Let a_i be ⟨x, e_i⟩ and y be Σ_{i∈I} a_i e_i, i.e. y is the
projection of x onto the basis formed by {e_i | i ∈ I}, and
a_i is the component of x lying on the ith element of the
basis. Now, let the residual r be x − y and note r is
perpendicular to each e_i:

⟨r, e_i⟩ = ⟨x, e_i⟩ − Σ_{j∈I} a_j⟨e_j, e_i⟩ = a_i − a_i = 0.

In particular, r is perpendicular to y, so by the Pythagorean theorem

∥x∥² = ∥y∥² + ∥r∥² ≥ ∥y∥² = Σ_{i∈I} ∥a_i∥²,

where the last equality again uses orthonormality.
■
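A numeric check of Bessel's inequality on a random orthonormal family (the dimensions are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
dim, k = 10, 4
# random orthonormal family of k vectors via QR decomposition
q, _ = np.linalg.qr(rng.normal(size=(dim, k)))
x = rng.normal(size=dim)

coeffs = q.T @ x  # a_i = <x, e_i> for each column e_i of q
assert np.sum(coeffs ** 2) <= np.dot(x, x) + 1e-12
print(np.sum(coeffs ** 2), "<=", np.dot(x, x))
```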
Theorem 2 (FGBDL Theorem 1).
Let H be the family of parity functions and let p_w
and F_h be a neural network and MSE loss function, as before. If
p_w satisfies

∀w, E_x∥(∂p_w/∂w)(x)∥² ≤ G(w)²

for some scalar G(w), then

Var_h((∂F_h/∂w)(w)) ≤ G(w)²/|H|.
Proof. First, invoking Lemma 3 with
a = E_x(p_w(x)(∂p_w/∂w)(x)), we have

Var_h((∂F_h/∂w)(w)) ≤ E_h(∥(∂F_h/∂w)(w) − a∥²) = E_h(∥E_x(h(x)(∂p_w/∂w)(x))∥²),

since (∂F_h/∂w)(w) = E_x((p_w(x) − h(x))(∂p_w/∂w)(x)) and the
p_w term is exactly a. Now, with x uniform, the parity functions form an
orthonormal family with respect to the inner product
⟨f, g⟩ := E_x(f(x)g(x)), and each coordinate of
E_x(h(x)(∂p_w/∂w)(x)) is the coefficient ⟨(∂p_w/∂w)_k, h⟩.
Applying Lemma 4 in each coordinate and summing over coordinates,

Σ_{h∈H} ∥E_x(h(x)(∂p_w/∂w)(x))∥² ≤ E_x∥(∂p_w/∂w)(x)∥² ≤ G(w)².

Dividing by |H| to turn the sum over h into an expectation gives the
claimed bound.
■
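To see Theorem 2 at work, here's a small experiment of my own: for a toy model p_w(x) = tanh(⟨w, x⟩) with uniform inputs, the variance of the gradient over all 2^n parities shrinks rapidly as n grows.

```python
import itertools
import numpy as np

def grad_variance(n, rng):
    xs = np.array(list(itertools.product((0, 1), repeat=n)), dtype=float)
    w = rng.normal(size=n)
    z = np.tanh(xs @ w)                 # p_w(x) at every x
    dz = (1 - z ** 2)[:, None] * xs     # (dp_w/dw)(x) at every x
    vs = xs                             # every v in [2]^n
    hs = (-1.0) ** (xs @ vs.T % 2)      # hs[x, v] = h_v(x)
    # (dF_h/dw)(w) = E_x[(p_w(x) - h(x)) (dp_w/dw)(x)], one row per parity h
    grads = ((z[:, None] - hs)[..., None] * dz[:, None, :]).mean(axis=0)
    return grads.var(axis=0).sum()      # total variance over h

rng = np.random.default_rng(0)
for n in (4, 6, 8, 10):
    print(n, grad_variance(n, rng))     # decays roughly like G(w)^2 / 2^n
```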
This is a subtle result. For example, consider the family of
functions

G := {x ↦ x + 10^{−(100+k)} | 0 ≤ k < 2^n}.

As these differ only by a minuscule additive constant, the variance of their
gradients trivially satisfies the hypothesis of Theorem 1. In turn, in the worst case
gradient descent might converge to a value independent of the target
function unless we take an exponential number of steps! But this
actually isn't so bad in practice, because the functions are so
similar that following

E_{g∈G}((∂F_g/∂w)(w))

will likely yield a good approximation.
In turn, saying Theorem 1 and Theorem 2 give a result about
the ``hardness'' of learning parity requires subtle
assumptions. Really the result is: unless we take an exponential
number of steps, gradient descent might converge to a value
independent of the target element of H. For parity,
reading this as hardness is probably reasonable, because, unlike the
elements of G, parity functions are quite different from one another.
By inspection, the only two elements of H this can learn
are the trivial case when the subset of bits considered is empty, and
full parity. In fact, there are only 16 functions from
[2]² → [2], so in this case parity can easily be learned by
brute force.
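For n = 2 that brute force fits in a few lines (my sketch; the secret v is chosen arbitrarily):

```python
import itertools

n = 2
xs = list(itertools.product((0, 1), repeat=n))
parity = lambda v, x: (-1) ** (sum(vi * xi for vi, xi in zip(v, x)) % 2)

secret = (1, 1)  # full parity
examples = [(x, parity(secret, x)) for x in xs]

# check every candidate v against the examples
consistent = [v for v in itertools.product((0, 1), repeat=n)
              if all(parity(v, x) == y for x, y in examples)]
print(consistent)  # [(1, 1)]
```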