
Gradients and Gradient-Descent

Another mathematical concept I want to get into is that of the “gradient”. This is just a specific case of the spatial derivative, which I covered earlier. The reason I’m covering gradients specifically is that they’re used a lot in optimization algorithms, and I also plan to reference them occasionally when describing how people and economies work.

Consider a “scalar function of multiple dimensions”, i.e. a function f(x,y, …) which outputs a number for each coordinate (x,y, …). The typical mathematical symbol for a gradient is an inverted triangle, or “nabla”, ∇. You see this a lot in college- and graduate-level equations, since they’re useful in describing how heat diffuses through an object, in fluid mechanics, in quantum physics, or in a bunch of other things. It’s often pronounced “grad” or “del”, and in one spatial dimension, it’s just the usual derivative, d/dx.
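
Written out, the gradient is simply the vector of partial derivatives, one component per coordinate (this is the standard multivariable-calculus definition, nothing specific to this post):

```latex
\nabla f(x, y, \dots) = \left( \frac{\partial f}{\partial x},\ \frac{\partial f}{\partial y},\ \dots \right)
```

Each component answers the question: how fast does f change if you nudge only that one coordinate?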
To get an intuitive handle on the gradient, imagine walking around a hill. The gradient (of height, with respect to horizontal position) always points horizontally, in the direction of steepest ascent. No matter how complex the terrain gets, the local gradient always points in the direction of steepest ascent, and the magnitude of the vector describes how steep the hill is. Steep hills have large gradients.

Things often move in response to a gradient: For example, the scalar function might be the density of steam near a hot shower. As time passes, the steam will travel away from the point of highest concentration (i.e. in the direction opposite the gradient), diffusing throughout the available space.
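
This intuition is essentially Fick's first law of diffusion (stated here for context; it is a standard physics result, not something derived in this post): the flux of diffusing material is proportional to the negative of the concentration gradient,

```latex
\mathbf{J} = -D\,\nabla\rho
```

where ρ is the steam concentration and D is the diffusion coefficient. The minus sign is the whole point: flow runs downhill, from high concentration to low.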

In one dimension, the gradient points in the direction of the red arrows, depending whether you’re starting left or right of the minimum. For the gradient-descent method, imagine a ball that rolls down into the minimum.

Just as dimensions can cover non-spatial concepts, so can gradients. For example, people and social networks can define the “space”: memes and language diffuse through social networks from the groups where they’re generated and popularized into ancillary or satellite groups. “Trickle-down” economics assumes that money will flow similarly: enrich some wealthy people, and more money will diffuse from them into the pockets of those who have little. The Iron Law of Oligarchy expresses the opposite idea, that money and power will naturally concentrate in a few people’s hands.

“Gradient descent” is a common method used in computerized optimization and curve-fitting. The idea is that you construct a space where the scalar function (the analogue of the hill’s altitude) measures the error. The “spatial” dimensions for the gradient are formed by the available parameters, i.e. whatever variables you plan to use to summarize the data as a whole:

  1. Start off by choosing some set of parameters - randomly, through intuition, or based on a common solution. This defines your initial point in parameter space.

  2. Calculate the error at that point. This is often the root-mean-square error. Conceptually, you’re saying “If this set of parameters gives the real state of the system, how far off are my measurements?”

  3. Measure the gradient at that point. That is, see how tiny changes in each of the parameters will affect the overall error.

  4. Adjust the parameters in the direction opposite the gradient; move in the direction that decreases the error.

  5. Repeat steps 2 through 4 until the gradient goes to zero (or becomes negligibly small).

To go back to the hill analogy: pick a point on the hill to start, then walk downhill until you reach the bottom of a valley.
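
To make those steps concrete, here is a minimal sketch in Python. Everything in it (the straight-line model, the fake data, the step size) is an illustrative assumption, not a reference to any particular library:

```python
import numpy as np

# Illustrative example: fit a straight line y = a*x + b to noisy data by
# gradient descent. The parameter space is (a, b); the scalar "height"
# being minimized is the mean-squared error.

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.5 * x + 1.0 + rng.normal(scale=1.0, size=x.size)  # fake data, "true" a=2.5, b=1.0

def error(a, b):
    # Step 2: how far off are the predictions for this choice of parameters?
    return np.mean((a * x + b - y) ** 2)

def gradient(a, b):
    # Step 3: partial derivatives of the error with respect to a and b.
    residual = a * x + b - y
    return np.array([2 * np.mean(residual * x), 2 * np.mean(residual)])

params = np.array([0.0, 0.0])   # Step 1: an arbitrary starting point in parameter space
step_size = 0.01

for _ in range(10_000):         # Steps 4 and 5: walk downhill until the gradient (nearly) vanishes
    grad = gradient(*params)
    if np.linalg.norm(grad) < 1e-6:
        break
    params -= step_size * grad

print("fitted (a, b):", params)        # should land near (2.5, 1.0)
print("final error:", error(*params))
```

The loop is literally steps 2 through 5: measure the gradient of the error with respect to the parameters, step in the opposite direction, and stop once the gradient is essentially zero.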

Illustration of common challenges for gradient-descent in one dimension. As the scalar function (the blue curve) becomes more complex, these problems become more common. Higher-dimensional functions are more complex and thus have more such problems, but most of the issues are basically those presented here. The arrows point in the direction of the gradient, opposite the direction of gradient-descent motion. Their length shows the magnitude of the gradient.

Gradient-descent algorithms face two major problems, both illustrated in the figure at the left:

  • The function may have multiple minima of differing depths. The first minimum found is not necessarily the best one. (The deepest valley is called the “global minimum.” Others are “local minima.”)

  • Sometimes large regions will have only a shallow gradient, yielding slow motion toward the minimum over many steps.

Gradient-descent algorithms generally try to reach the global minimum as quickly as possible, so that the optimum fit can be found without a large investment of memory and CPU time. Variant algorithms try various techniques to map the space as they go along and to vary the step size. They may have multiple start positions or occasionally move in random directions to more accurately model the full space of options. In all cases, gradient-descent methods can benefit from users’ knowledge about the overall shape of the space and solutions to previous problems.
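
As a sketch of the “multiple start positions” idea, the snippet below (again, the quartic curve and every name in it are made up for illustration) runs plain gradient descent from several random starting points and keeps whichever minimum turns out to be deepest:

```python
import numpy as np

# One common remedy for local minima: restart gradient descent from several
# random points and keep the best result. The curve below has two valleys
# of different depths.

def f(x):
    return x**4 - 3 * x**2 + x       # global minimum near x ≈ -1.3, local minimum near x ≈ 1.1

def df(x):
    return 4 * x**3 - 6 * x + 1      # its derivative (the one-dimensional "gradient")

def descend(x, step=0.01, tol=1e-8, max_iter=10_000):
    """Plain gradient descent from a single starting point."""
    for _ in range(max_iter):
        g = df(x)
        if abs(g) < tol:
            break
        x -= step * g
    return x

rng = np.random.default_rng(1)
starts = rng.uniform(-2.5, 2.5, size=10)       # several random start positions
candidates = [descend(x0) for x0 in starts]
best = min(candidates, key=f)                  # keep the deepest valley found

print("minima found:", sorted(set(round(c, 3) for c in candidates)))
print("best (global) minimum near x =", round(best, 3))
```

Descents that start on the wrong side of the central hump fall into the shallow valley; only the restarts give the algorithm a chance to find the deeper one.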
