Gradient surfing — the hidden role of regularization
This is a link post, originally posted at https://www.lesswrong.com/posts/Fjoy5SxgBmxfy7FNB/gradient-surfing-the-hidden-role-of-regularization.
I'm running a series of experiments that involve some variation of: (1) perturb a weight initialization; (2) train the perturbed and baseline models in parallel; and (3) track how the perturbation grows or shrinks over time.
Naively, if we're interested in a perturbation analysis of the choice of weight initialization, we prepare some baseline initialization, $w_0$, and then apply i.i.d. Gaussian noise, $\epsilon_i \sim \mathcal{N}(0, \sigma_\epsilon^2)$, to each of its elements, $w_{0,i}$. (If we want, we can let this vary layer-by-layer and let it depend on, for example, the norm of the layer it's being applied to.)
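In NumPy, the naive scheme looks like this (a sketch; the layer sizes, noise scale, and variable names are illustrative, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Baseline Kaiming-style initialization for one layer (sizes are illustrative).
fan_in, fan_out = 256, 128
w_base = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

# Naive perturbation: i.i.d. Gaussian noise added to every element.
sigma_eps = 0.01
w_pert = w_base + rng.normal(0.0, sigma_eps, size=w_base.shape)

# Size of the perturbation in the vectorized weight space.
eps_norm = np.linalg.norm(w_pert - w_base)
```

Note that `eps_norm` concentrates around $\sigma_\epsilon \sqrt{D}$ for a layer with $D$ parameters, so the perturbation size is effectively set by the noise scale.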
The problem with this strategy is that the perturbed weights are, in general, no longer sampled from the same distribution as the baseline weights.
There is nothing wrong with this per se, but it introduces a possible confounder (the thickness of the distribution). This is especially relevant if we're interested specifically in how behavior changes with the size of a perturbation. As responsible experimentalists, we don't like confounders.
Fortunately, there's an easy way to "clean up" Kaiming initialization to make it better suited to this perturbative analysis.
Consider a matrix, $W^{(l)}$, representing the weights of a particular layer, with shape $(n_\text{out}, n_\text{in})$. $n_\text{in}$ is also called the fan-in of the layer, and $n_\text{out}$ the fan-out. For ease of presentation, we'll ignore the bias, though the following reasoning applies equally well to the bias.
We're interested in the vectorized form of this matrix, $w^{(l)} \in \mathbb{R}^{D_l}$, where $D_l = n_\text{in} \times n_\text{out}$.
In Kaiming initialization, we sample the components, $w^{(l)}_i$, of this vector i.i.d. from a normal distribution with mean 0 and variance $\sigma_w^2$ (where $\sigma_w^2 = 2/n_\text{in}$).
Geometrically, this is equivalent to sampling from a hyperspherical shell, with radius $\sigma_w \sqrt{D_l}$ and (fuzzy) thickness, $\sigma_w/\sqrt{2}$. (Ok, so technically, because the radius can vary from layer to layer, it's a hyperellipsoidal shell.)
This follows from some straightforward algebra (dropping the superscript $l$ for simplicity):

$$\mathbb{E}\big[\|w\|^2\big] = \sum_{i=1}^{D} \mathbb{E}\big[w_i^2\big] = D\,\sigma_w^2,$$

and

$$\mathrm{Var}\big[\|w\|^2\big] = \sum_{i=1}^{D} \mathrm{Var}\big[w_i^2\big] = 2D\,\sigma_w^4,$$

so the standard deviation of the norm itself is approximately $\mathrm{Var}\big[\|w\|^2\big]^{1/2} \big/ \big(2\,\mathbb{E}[\|w\|]\big) = \sigma_w/\sqrt{2}$.
So the thickness as a fraction of the radius is

$$\frac{\sigma_w/\sqrt{2}}{\sigma_w\sqrt{D}} = \frac{1}{\sqrt{2D}} = \frac{1}{\sqrt{2\,n_\text{in}\,n_\text{out}}},$$

where the last equality follows from $D = n_\text{in} n_\text{out}$ (the Kaiming choice of $\sigma_w$ cancels out of this ratio).
This means that for suitably wide networks ($D_l \to \infty$), the relative thickness of this shell goes to $0$.
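A quick numerical sanity check of this concentration (a sketch; the relative spread of the norm should scale like $1/\sqrt{2D}$, and the sample counts are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def shell_stats(D, sigma_w, n_samples=2000):
    """Empirical radius and relative thickness of the initialization 'shell'."""
    w = rng.normal(0.0, sigma_w, size=(n_samples, D))
    norms = np.linalg.norm(w, axis=1)
    return norms.mean(), norms.std() / norms.mean()

for D in (10, 1000, 100_000):
    radius, rel_thickness = shell_stats(D, sigma_w=1.0)
    # The relative thickness should track 1 / sqrt(2 D) as D grows.
    print(D, radius, rel_thickness, 1 / np.sqrt(2 * D))
```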
This suggests an alternative initialization strategy. What if we immediately take the limit $D_l \to \infty$ and sample directly from the boundary of a hypersphere with radius $\sigma_w\sqrt{D_l}$, i.e., modify the shell thickness to be exactly $0$?
This can easily be done by sampling each component from a normal distribution with mean 0 and unit variance and then normalizing the resulting vector to have length $\sigma_w\sqrt{D_l}$ (this is known as the Muller method).
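A minimal NumPy sketch of this zero-thickness initialization (the function and variable names are mine):

```python
import numpy as np

def sample_hypersphere(D, radius, rng):
    """Muller method: draw a standard Gaussian vector, rescale to the target norm."""
    v = rng.normal(size=D)
    return radius * v / np.linalg.norm(v)

rng = np.random.default_rng(0)
fan_in, fan_out = 256, 128
D = fan_in * fan_out
radius = np.sqrt(2.0 / fan_in) * np.sqrt(D)  # sigma_w * sqrt(D), the Kaiming radius
w0 = sample_hypersphere(D, radius, rng)      # lies exactly on the hypersphere
```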
The modification we made to Kaiming initialization was to sample directly from the boundary of a hypersphere, rather than from a hyperspherical shell. This is a more natural choice when conducting a perturbation analysis, because it makes it easier to ensure that the perturbed weights are sampled from the same distribution as the baseline weights.
Geometrically, the intersection of a hypersphere of radius $r_w$ with a hypersphere of radius $\epsilon$ that is centered at some point on the boundary of the first hypersphere, is a lower-dimensional hypersphere of a modified radius $\epsilon\sqrt{1 - (\epsilon/2r_w)^2}$. If we sample uniformly from this lower-dimensional hypersphere, then the resulting points will follow the same distribution over the original hypersphere.
This suggests a procedure to sample from the intersection of the weight initialization hypersphere and the perturbation hypersphere.
First, we sample from a hypersphere of dimension $D-2$ and radius $r_w\sin\theta$ (using the same technique we used to sample the baseline weights). From a bit of trigonometry (see figure below), we know that the radius of this intersection hypersphere will be $r_w\sin\theta$, where $\theta = \arccos\!\big(1 - \epsilon^2/(2r_w^2)\big)$.
Next, we rotate the vector so it is orthogonal to the baseline vector $w_0$. This is done with a Householder reflection, $H$, that maps the current normal vector, $\hat{e}$, onto $\hat{w}_0$:

$$H = I - 2vv^\top,$$

where

$$v = \frac{\hat{e} - \hat{w}_0}{\|\hat{e} - \hat{w}_0\|},$$

and $\hat{w}_0$ is the unit vector in the direction of the baseline weights.
Implementation note: For the sake of tractability, we directly apply the reflection via

$$Hx = x - 2v\,(v^\top x),$$

which avoids ever forming the $D \times D$ matrix $H$.
Finally, we translate the rotated intersection sphere along the baseline vector, so that its boundary goes through the intersection of the two hyperspheres. From the figure above, we find that the translation has magnitude $r_w\cos\theta$.
Because both the intersection sphere and the baseline vector are sampled uniformly, the resulting perturbed vector has the same distribution as the baseline vector, when restricted to the intersection sphere.
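Putting the three steps together (sample the intersection sphere, Householder-rotate, translate), here is a NumPy sketch. The function name and the choice of the last standard basis vector as the sphere's normal are mine; the geometry is as described above:

```python
import numpy as np

def perturb_on_sphere(w0, eps, rng):
    """Sample a perturbation of exact size eps that stays on the init hypersphere."""
    D = w0.size
    R = np.linalg.norm(w0)
    w_hat = w0 / R

    # Geometry of the intersection: cos(theta) = 1 - eps^2 / (2 R^2).
    cos_t = 1.0 - eps**2 / (2.0 * R**2)
    sin_t = np.sqrt(max(0.0, 1.0 - cos_t**2))

    # Sample the (D-2)-sphere of radius R*sin(theta), embedded orthogonal to e_D.
    y = np.zeros(D)
    y[:-1] = rng.normal(size=D - 1)
    y[:-1] *= R * sin_t / np.linalg.norm(y[:-1])

    # Householder reflection mapping e_D onto w_hat, applied directly as
    # H x = x - 2 v (v . x), so the D x D matrix is never formed.
    e = np.zeros(D)
    e[-1] = 1.0
    v = e - w_hat
    v_norm = np.linalg.norm(v)
    if v_norm > 1e-12:          # if w_hat == e_D, the rotation is the identity
        v /= v_norm
        y = y - 2.0 * v * (v @ y)

    # Translate along the baseline direction by R*cos(theta).
    return y + R * cos_t * w_hat

rng = np.random.default_rng(0)
w0 = rng.normal(size=1000)
w0 *= 10.0 / np.linalg.norm(w0)            # baseline on a sphere of radius 10
w1 = perturb_on_sphere(w0, eps=0.5, rng=rng)
```

By construction, `w1` has exactly the baseline norm and sits exactly at distance `eps` from `w0`, which is the whole point of the exercise.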
Neural networks begin their lives on a hypersphere.
As it turns out, they live out the remainder of their lives on hyperspheres as well. Take the norm of the vectorized weights, $\|w(t)\|$, and plot it as a function of the number of training steps. Across many different choices of weight initialization, this norm will grow (or shrink, depending on weight decay) remarkably consistently.
In the following figure, I've plotted this behavior for 10 independent initializations across 70 epochs of training a simple MNIST classifier. The same results hold more generally for a wide range of hyperparameters and architectures.
The norm during training, divided by the norm upon weight initialization. The parameter ϵ controls how far apart (as a fraction of the initial weight norm) the different weight initializations are.
All this was making me think that I disregarded Brownian motion as a model of SGD dynamics a little too quickly. Yes, on lattices/planes, increasing the number of dimensions makes no difference to the overall behavior (beyond decreasing the variance in how displacement grows with time). But maybe that was just the wrong model. Maybe Brownian motion on hyperspheres is considerably different and is actually the thing I should have been looking at.
We'd like to calculate the expected displacement (=distance from our starting point) as a function of time. We're interested in both distance on the sphere and distance in the embedding space.
Before we tackle this in full, let's try to understand Brownian motion on hyperspheres in a few simplifying limits.
Unlike the Euclidean case, where having more than one dimension means you'll never return to the origin, living on a finite sphere means we'll visit every point infinitely many times. The stationary distribution is the uniform distribution, $p(x) = 1/A_{n-1}(r)$ for $x \in S^{n-1}$, where $A_{n-1}(r)$ is the surface area of the sphere.
So our expectations end up being relatively simple integrals over the sphere.
The distance from the pole at the bottom to any other point on the surface depends only on the angle $\theta$ between the line through the origin and the pole and the line through the origin and the other point,

$$d(\theta) = \sqrt{2r^2(1 - \cos\theta)} = 2r\sin(\theta/2).$$
This symmetry means we only have to consider a single integral over "slices" from the bottom of the sphere to the top. Each slice has as much probability associated to it as that slice has surface area, times the measure associated to the angle ($d\theta$). This surface area varies with the angle from the center, as does the distance of that point from the origin. Put it together and you get some nasty integrals.
From this point on, I believe I made a mistake somewhere (the analysis for the $n \to \infty$ limit is correct).
If we let $A_{n-1}(r)$ denote the surface area of $S^{n-1}$ with radius $r$, we can fill in the formula from Wikipedia:

$$A_{n-1}(r) = \frac{2\pi^{n/2}}{\Gamma\left(\frac{n}{2}\right)}\, r^{n-1}.$$

Filling in $A_{n-2}(r\sin\theta)$, we get an expression for the surface area of each slice as a function of $\theta$:

$$A_{n-2}(r\sin\theta) = \frac{2\pi^{(n-1)/2}}{\Gamma\left(\frac{n-1}{2}\right)}\, (r\sin\theta)^{n-2},$$

so the measure associated to each slice is

$$p(\theta)\,d\theta = \frac{A_{n-2}(r\sin\theta)}{\int_0^\pi A_{n-2}(r\sin\theta')\,d\theta'}\,d\theta = \frac{\Gamma\left(\frac{n}{2}\right)}{\sqrt{\pi}\,\Gamma\left(\frac{n-1}{2}\right)}\,\sin^{n-2}\theta\;d\theta.$$
We can simplify further: as $n \to \infty$, the $\sin^{n-2}\theta$ factor will tend towards a "comb function" that is zero everywhere except at $\theta = \frac{\pi}{2} + k\pi$ for integer $k$, where it is equal to $1$ if $n$ is even, and $(-1)^k$ if $n$ is odd.
Within the above integration limits, $[0, \pi]$, this means all of the probability becomes concentrated at the middle, where $\theta = \frac{\pi}{2}$.
So in the infinite-dimensional limit, the expected distance from the origin is the distance between a point on the middle (the "equator") and one of its poles, which is $\sqrt{2}\,r$.
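Both the finite-$n$ integrals and the $\sqrt{2}$ limit are easy to check by Monte Carlo (a sketch; for the ordinary sphere $S^2$, i.e. $n=3$ and $r=1$, the integral above evaluates exactly to $4/3$):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_distance_from_pole(n, n_samples=20_000):
    """Mean embedding-space distance from a pole, for uniform samples on the unit S^{n-1}."""
    x = rng.normal(size=(n_samples, n))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # Muller method again
    pole = np.zeros(n)
    pole[0] = 1.0
    return np.linalg.norm(x - pole, axis=1).mean()

means = {n: mean_distance_from_pole(n) for n in (3, 30, 3000)}
# means[3] ≈ 4/3 (the exact S^2 value); means[3000] ≈ sqrt(2) ≈ 1.414.
```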
An important lesson I've learned here is: "try more keyword variations in Google Scholar and Elicit before going off and solving these nasty integrals yourself because you will probably make a mistake somewhere and someone else has probably already done the thing you're trying to do."
Anyway, here's the thing where someone else has already done the thing I was trying to do. [1] gives us the following form for the expected cosine of the angular displacement with time:

$$\mathbb{E}[\cos\theta(t)] = e^{-(n-1)Ct}.$$

A little trig (recall $d(\theta)^2 = 2r^2(1-\cos\theta)$) tells us that the expected (root-mean-square) displacement in the embedding space is

$$\sqrt{\mathbb{E}\big[d(t)^2\big]} = r\sqrt{2\left(1 - e^{-(n-1)Ct}\right)},$$
which gives us a set of nice curves depending on the diffusion constant $C$:
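Taking $\mathbb{E}[\cos\theta(t)] = e^{-(n-1)Ct}$ on the unit $S^{n-1}$ (this exact convention for $C$ is my assumption; rescaling $C$ only rescales time), the displacement curve can be evaluated directly:

```python
import numpy as np

# Assumed convention: E[cos θ(t)] = exp(-(n - 1) C t) on the unit sphere S^{n-1}.
def rms_displacement(t, n=100, C=0.01, r=1.0):
    """RMS embedding-space displacement, r * sqrt(2 (1 - E[cos θ(t)]))."""
    return r * np.sqrt(2.0 * (1.0 - np.exp(-(n - 1) * C * t)))

t = np.linspace(0.0, 10.0, 1001)
d = rms_displacement(t)
# Early times grow like sqrt(2 (n-1) C t); late times asymptote at sqrt(2) * r.
```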
The major difference with Brownian motion on the plane is that the expected displacement asymptotes at the limit we argued for above ($\sqrt{2}\,r$).
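This asymptote is easy to reproduce with a crude random-walk simulation (a sketch: Gaussian steps projected onto the tangent space, then renormalized back onto the sphere; all parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def brownian_on_sphere(n=50, n_walkers=2000, n_steps=400, step=0.05):
    """Mean displacement from the start for a crude Brownian walk on the unit S^{n-1}."""
    x = np.zeros((n_walkers, n))
    x[:, 0] = 1.0                                     # all walkers start at one pole
    start = x.copy()
    disps = []
    for _ in range(n_steps):
        g = rng.normal(scale=step, size=x.shape)
        g -= (g * x).sum(axis=1, keepdims=True) * x   # project step onto tangent space
        x = x + g
        x /= np.linalg.norm(x, axis=1, keepdims=True) # retract back onto the sphere
        disps.append(np.linalg.norm(x - start, axis=1).mean())
    return np.array(disps)

d = brownian_on_sphere()
# d rises quickly at first, then flattens out near sqrt(2).
```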
When I had initially looked at curves like the following (this one's from a [numerical simulation](https://github.com/jqhoogland/experiminis/blob/main/brownian_hyperspheres.ipynb), not the formula), I thought I had struck gold.
I've been looking for a model that predicts square-root-like growth at the very start, followed by a long period of nearly linear growth, followed by asymptotic behavior. Eyeballing this, it looked like my search for the long plateau of linear growth had come to an end.
Of course, this was entirely wishful thinking. Human eyes aren't great discriminators of lines. That "straight line" is actually pretty much parabolic. As I should have known (from the Euclidean nature of manifolds), the initial behavior is incredibly well modeled by a square root.
As you get further away from the nearly flat regime, the concavity only increases. This is not the curve I was looking for.
So my search continues.