What is the shape of knowledge?

In the second chapter of Sapiens, Yuval Noah Harari describes our species' Cognitive Revolution. Sometime between 70,000 and 30,000 years ago, there was an inflection point in Homo sapiens brains. They learned to gossip and to speak about fictitious things. "As far as we know, only Sapiens can talk about entire kinds of entities they have never seen, touched, or smelled" (Harari). The timing of the Cognitive Revolution corresponded with Homo sapiens leaving Africa, the invention of religion, and the extinction of all other human species.

The Cognitive Revolution suggests there is nonlinearity in human-like intelligence. Perhaps a mutation in Homo sapiens brains, a small increase in the number of neurons, and a light turned on.

Language Models are Few-Shot Learners Figure 3.10, Tom B. Brown et al.

Large Language Models have tantalizingly similar behavior. Consider the plots above and below, showing GPT-3's accuracy on a select group of tasks as a function of the number of parameters in the model. There is a distinct either-you-get-it-or-you-don't quality across different numbers of parameters.

Emergent Abilities of Large Language Models Figure 11, Jason Wei et al. Modified to show only GPT-3.

What is the nature of the tasks that have these inflection points? If we could find a mathematical description of these tasks, might the Cognitive Revolution suggest that it also describes human-like intelligence?

Harari's claim seems unfalsifiable, and drawing parallels between biological and artificial neural networks is a tenuous business. But what a fascinating concept.

Consider an artificial neural network that has two inputs and two outputs. In the middle is a single hidden layer, three parameters wide. The model takes a value and outputs whether it belongs to group one or group two. I like to think of this geometrically. As an input, the model takes an (x, y) point, which it projects into 3D space, (x, y, z), and then projects back onto (x, y).

If the resulting x is greater than y, the input is assigned to group one, and vice versa for y greater than x. We can interpret this as dividing all the output points with the line y = x, and classifying their inputs according to which side of the line they fall on.

After moving through the model, if the output (x, y) point is in the shaded region, the input point is classified as group one. Output points outside the shaded region correspond to group two.

If you're familiar with neural networks, this is how a classification is derived from the output of softmax.
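To make this concrete, here is a minimal sketch of such a 2 → 3 → 2 network in Python. It is not the author's linked code: the weights are random, and the ReLU activation is an assumption. It only shows how the group-one/group-two decision is read off the two outputs.

```python
import numpy as np

# Made-up weights for a 2 -> 3 -> 2 network (illustration only).
W1 = np.random.randn(2, 3)   # projects (x, y) into 3D
W2 = np.random.randn(3, 2)   # projects (x, y, z) back onto 2D

def classify(point):
    hidden = np.maximum(point @ W1, 0)   # project to (x, y, z), apply assumed ReLU
    out = hidden @ W2                    # project back onto (x, y)
    # If the output x exceeds the output y, the point lands on the "group one"
    # side of the line y = x; otherwise group two. This is the same decision
    # softmax + argmax would make, since softmax preserves the ordering of its inputs.
    return "group one" if out[0] > out[1] else "group two"

print(classify(np.array([0.2, -1.0])))
```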

Let's look at an example.

Consider how an artificial neural network might classify points as belonging to one of the nested circles above. To begin, no line can be drawn to separate the circles, so the first layer projects to three dimensions, (x, y, z).

Next, the second and third three-dimensional layers fold the circles in half.

The fourth layer folds again.

The final layer projects back onto (x, y) coordinates, according to the rules x \leftarrow x + z and y \leftarrow y + z.

The result is separable by a line. y values above the line can be classified as corresponding to the outer circle, and values below to the inner one. By virtue of having three parameters (three dimensions) per internal layer, the model can separate the circles. If there were two parameters per internal layer, we wouldn't be able to project to 3D, and classification would be impossible.
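A hedged sketch of this experiment might look like the following. The use of Keras, scikit-learn's make_circles data, and the training details are all assumptions; only the layer widths follow the description above.

```python
import tensorflow as tf
from sklearn.datasets import make_circles

# Nested circles: one label per circle.
X, y = make_circles(n_samples=2000, noise=0.03, factor=0.5)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(2,)),
    tf.keras.layers.Dense(3, activation="relu"),    # project (x, y) into 3D
    tf.keras.layers.Dense(3, activation="relu"),    # fold
    tf.keras.layers.Dense(3, activation="relu"),    # fold
    tf.keras.layers.Dense(3, activation="relu"),    # fold again
    tf.keras.layers.Dense(2, activation="softmax"), # back onto (x, y); the linear part
                                                    # can realize a map like x <- x+z, y <- y+z
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=300, verbose=0)
print(model.evaluate(X, y, verbose=0))  # with width-3 layers accuracy can approach 1.0;
                                        # the text argues width-2 layers cannot separate the circles
```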

The classification problem of choosing between two nested circles has an inflection point at three parameters per internal layer. If the internal layers are two parameters wide, perfect classification is impossible. At three parameters, perfect classification is possible. Is this the same Cognitive Revolution Yuval Noah Harari describes in Sapiens?

If so, what describes the shape of data in classification problems which causes an inflection point as you increase the number of parameters in the model? Maybe: circular, nested. Call this description \bigcirc. Large Language Models have inflection points and are trained on vast amounts of human knowledge. Does \bigcirc describe the shape of knowledge?

Appendix I - Circle Classification

Consider the problem of classifying points as inside or outside a circle of radius one centered at the origin. Does this have an inflection point?

I wrote a model (code) with ten hidden layers using ReLU, and an output layer using sigmoid. I trained fifteen models, varying the number of parameters per hidden layer from 1 to 15. For training, I used a random sample of three thousand (x, y) points where -1.5 \leq x \leq 1.5 and -1.5 \leq y \leq 1.5. I tested the accuracy of each model on an additional thousand random points.
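The following is a rough reconstruction of that setup, following the numbers above (ten ReLU hidden layers, widths one through fifteen, three thousand training points, one thousand test points). The optimizer, epoch count, and other details are assumptions; the linked code is the authoritative version.

```python
import numpy as np
import tensorflow as tf

def sample(n):
    """Random (x, y) points in [-1.5, 1.5]^2, labeled 1 if inside the unit circle."""
    pts = np.random.uniform(-1.5, 1.5, size=(n, 2))
    labels = (pts[:, 0] ** 2 + pts[:, 1] ** 2 <= 1.0).astype("float32")
    return pts, labels

X_train, y_train = sample(3000)
X_test, y_test = sample(1000)

for width in range(1, 16):
    layers = [tf.keras.Input(shape=(2,))]
    layers += [tf.keras.layers.Dense(width, activation="relu") for _ in range(10)]
    layers += [tf.keras.layers.Dense(1, activation="sigmoid")]
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=100, verbose=0)
    _, accuracy = model.evaluate(X_test, y_test, verbose=0)
    print(width, round(accuracy, 3))
```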

The following plot shows each model's accuracy as a function of the number of parameters per hidden layer.

There is an inflection point between six and eight parameters per hidden layer. What eight-dimensional representation does the model discover, which propels it to 95% accuracy? Why does a six-dimensional representation perform so poorly?

The following plots show one hundred points from the test data, and their resulting positions after being transformed and output by the model (and before applying softmax, for aesthetic reasons). This is equivalent to the final plot in the nested circle separation problem. As before, the model's goal is to divide points inside and outside the circle by a line.
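Continuing the reconstruction above, one way to obtain such plots (an assumption, not necessarily what the linked code does) is to cut the model at its penultimate layer and scatter the transformed test points:

```python
import matplotlib.pyplot as plt

# Sub-model that stops just before the output layer, reusing the trained weights.
penultimate = tf.keras.Model(inputs=model.inputs,
                             outputs=model.layers[-2].output)
transformed = penultimate.predict(X_test[:100], verbose=0)

# Plot the first two coordinates of the transformed points (for widths >= 2),
# colored by whether the original point lies inside the unit circle.
plt.scatter(transformed[:, 0], transformed[:, 1], c=y_test[:100])
plt.show()
```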

The six-parameter model collapses all points to a single one, classified as outside the circle.

Once seven parameters are available, the output begins to look much less random. The model finds some signal in the noise.

At thirteen parameters per layer, the model is refined. Inside and outside the circle are well separated in the transformed data.

These plots illustrate how a real model takes a classification problem, and turns it into the problem of separating transformed points by a line.

Appendix II - Shape

Large language models operate on thousands of dimensions where shape loses its colloquial meaning. What can we say about shape in higher dimensions?

These points have a symmetric shape: there is a plane (x = 0) on either side of which the points are mirrored. In higher dimensions, we might use the word hyperplane.

These points have a circular shape: all points are equidistant from (0, 0). In higher dimensions, we might specify equidistant in terms of Euclidean distance.

These points have a hole in their shape. In higher dimensions, we might look to Betti numbers, as in Topology of Deep Neural Networks by Gregory Naitzat et al.

As queries (points) move through the layers of large language models, what shapes do they draw? Might \bigcirc describe them?

Appendix III - Counting Circles

Consider a more complex classification problem, counting the number of circles in an image. Does this have an inflection point?

To test, I generated (code) sixty thousand training images, like the ones above, and ten thousand test images, containing between one and six non-overlapping dots placed randomly. Unlike our earlier binary-choice problems, this one involves selecting between six possible classifications (dot counts).
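A hedged sketch of the data generation follows; the linked code is authoritative, and the image size, dot radius, and placement logic here are assumptions.

```python
import numpy as np

SIZE, RADIUS = 28, 3  # assumed image size and dot radius

def make_image(n_dots, rng):
    """Draw n_dots non-overlapping filled dots at random positions."""
    img = np.zeros((SIZE, SIZE), dtype="float32")
    centers = []
    while len(centers) < n_dots:
        c = rng.integers(RADIUS, SIZE - RADIUS, size=2)
        if all(np.linalg.norm(c - d) > 2 * RADIUS for d in centers):  # reject overlaps
            centers.append(c)
    ys, xs = np.ogrid[:SIZE, :SIZE]
    for cy, cx in centers:
        img[(ys - cy) ** 2 + (xs - cx) ** 2 <= RADIUS ** 2] = 1.0
    return img

rng = np.random.default_rng(0)
counts = rng.integers(1, 7, size=60000)                    # one to six dots per image
images = np.stack([make_image(n, rng) for n in counts])
labels = counts - 1                                        # class 0 means one dot

test_counts = rng.integers(1, 7, size=10000)
test_images = np.stack([make_image(n, rng) for n in test_counts])
test_labels = test_counts - 1
```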

Once again, each model (code) uses ReLU for its hidden layers and applies softmax to the output. The output of the model is a six-dimensional point, where each component corresponds to the probability that the image contains that number of dots (the first probability corresponds to one dot, the sixth to six dots).
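A sketch of one such model, again assuming Keras and continuing from the data sketch above; flattening the image into pixels and the choice of optimizer are assumptions.

```python
import tensorflow as tf

def build_model(n_hidden, width):
    """Flattened pixels -> n_hidden ReLU layers of the given width -> softmax over six counts."""
    layers = [tf.keras.Input(shape=(SIZE, SIZE)), tf.keras.layers.Flatten()]
    layers += [tf.keras.layers.Dense(width, activation="relu") for _ in range(n_hidden)]
    layers += [tf.keras.layers.Dense(6, activation="softmax")]
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```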

I measured the accuracy of the model as a function of the number of hidden layers, and the width of those hidden layers. I tested with 0, 2, 4, and 6 hidden layers; and widths of 32, 64, 128, and 256. The maximum classification accuracy on the test data was 85%, achieved by the model with 6 hidden layers of width 256. The experiment's results are plotted above.
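The sweep itself can be written as a loop over the depths and widths listed above (epoch count and batch size are assumptions):

```python
for n_hidden in (0, 2, 4, 6):
    for width in (32, 64, 128, 256):
        model = build_model(n_hidden, width)
        model.fit(images, labels, epochs=20, batch_size=128, verbose=0)
        _, accuracy = model.evaluate(test_images, test_labels, verbose=0)
        print(f"layers={n_hidden} width={width} accuracy={accuracy:.2f}")
```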

What is it about the shape of this problem that means there is no inflection point? There's a deep mystery here.


Thanks to Anna, Janita, Michael, Moses, and Nick for many insightful conversations.