Many aspects of modern applied research rely on a crucial algorithm called gradient descent. This is a procedure generally used for finding the largest or smallest values of a particular mathematical function—a process known as optimizing the function. It can be used to calculate anything from the most profitable way to manufacture a product to the best way to assign shifts to workers.
Yet despite this widespread usefulness, researchers have never fully understood which situations the algorithm struggles with most. Now, new work explains it, establishing that gradient descent, at heart, tackles a fundamentally difficult computational problem. The new result places limits on the type of performance researchers can expect from the technique in particular applications.
“There is a kind of worst-case hardness to it that is worth knowing about,” said Paul Goldberg of the University of Oxford, coauthor of the work along with John Fearnley and Rahul Savani of the University of Liverpool and Alexandros Hollender of Oxford. The result received a Best Paper Award in June at the annual Symposium on Theory of Computing.
You can imagine a function as a landscape, where the elevation of the land is equal to the value of the function (the “profit”) at that particular spot. Gradient descent searches for the function’s local minimum by looking for the direction of steepest ascent at a given location and searching downhill away from it. The slope of the landscape is called the gradient, hence the name gradient descent.
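To make the downhill walk concrete, here is a minimal Python sketch. It is an illustration under stated assumptions rather than anything taken from the research: the bowl-shaped landscape `bowl`, its gradient `bowl_grad`, the starting point and the step size are all made up for the example.

```python
def gradient_descent(grad, start, step_size=0.1, max_iters=1000, tol=1e-8):
    """Repeatedly step against the gradient until the slope is nearly flat."""
    point = list(start)
    for _ in range(max_iters):
        g = grad(point)
        # Stop once the landscape is (almost) flat: a local minimum candidate.
        if sum(gi * gi for gi in g) ** 0.5 < tol:
            break
        # The gradient points uphill, so move a small step the opposite way.
        point = [pi - step_size * gi for pi, gi in zip(point, g)]
    return point


# Hypothetical landscape: a smooth bowl whose lowest point is at (1, -2).
def bowl(p):
    x, y = p
    return (x - 1) ** 2 + 2 * (y + 2) ** 2


def bowl_grad(p):
    x, y = p
    return [2 * (x - 1), 4 * (y + 2)]


lowest = gradient_descent(bowl_grad, start=[5.0, 5.0])
```

On a simple bowl like this the walk settles quickly at the bottom. The landscapes the new result concerns are far less forgiving.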
Gradient descent is an essential tool of modern applied research, but there are many common problems for which it does not work well. Before this research, however, there was no comprehensive understanding of exactly what makes gradient descent struggle and when. Another area of computer science, known as computational complexity theory, helped to answer those questions.
“A lot of the work in gradient descent was not talking with complexity theory,” said Costis Daskalakis of the Massachusetts Institute of Technology.
Computational complexity is the study of the resources, often computation time, required to solve or verify the solutions to different computing problems. Researchers sort problems into different classes, with all problems in the same class sharing some fundamental computational characteristics.
To take an example—one that’s relevant to the new paper—imagine a town where there are more people than houses and everyone lives in a house. You’re given a phone book with the names and addresses of everyone in town, and you’re asked to find two people who live in the same house. You know you can find an answer, because there are more people than houses, but it may take some looking (especially if they don’t share a last name).
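The phone book search can be sketched in a few lines of Python. This is only an illustration: the names and addresses are invented, and the sketch assumes the whole phone book is small enough to read entry by entry, which is exactly what cannot be taken for granted when the list is astronomically long.

```python
def find_housemates(phone_book):
    """Return two names that share an address, or None if no address repeats."""
    first_resident = {}  # address -> first name seen at that address
    for name, address in phone_book:
        if address in first_resident:
            return first_resident[address], name
        first_resident[address] = name
    return None


# Hypothetical town: three people, two houses, so a shared house must exist.
phone_book = [
    ("Ada", "1 Elm St"),
    ("Ben", "2 Oak St"),
    ("Cy", "1 Elm St"),
]
housemates = find_housemates(phone_book)
```

Checking a proposed answer is quick: look up the two names and compare their addresses. Finding the pair is the part that may take some looking.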
This question belongs to a complexity class called TFNP, short for “total function nondeterministic polynomial.” It is the collection of all computational problems that are guaranteed to have solutions and whose solutions can be checked for correctness quickly. The researchers focused on the intersection of two subsets of problems within TFNP.
The first subset is called PLS (polynomial local search). This is a collection of problems that involve finding the minimum or maximum value of a function in a particular region. These problems are guaranteed to have answers that can be found through relatively straightforward reasoning.
One problem that falls into the PLS category is the task of planning a route that allows you to visit some fixed number of cities with the shortest travel distance possible given that you can only ever change the trip by switching the order of any pair of consecutive cities in the tour. It’s easy to calculate the length of any proposed route and, with a limit on the ways you can tweak the itinerary, it’s easy to see which changes shorten the trip. You’re guaranteed to eventually find a route you can’t improve with an acceptable move—a local minimum.
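The route-tweaking procedure can be sketched as a short Python routine. It is a sketch under assumptions: the cities are points on a plane, the trip returns to its starting city, and the four-city example is invented for illustration.

```python
import math


def tour_length(tour, coords):
    """Total length of a round trip visiting the cities in the given order."""
    n = len(tour)
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % n]]) for i in range(n)
    )


def local_search(tour, coords):
    """Swap consecutive cities while doing so shortens the trip."""
    tour = list(tour)
    best = tour_length(tour, coords)
    improved = True
    while improved:
        improved = False
        for i in range(len(tour) - 1):
            # Try the only allowed move: exchange two neighboring stops.
            tour[i], tour[i + 1] = tour[i + 1], tour[i]
            length = tour_length(tour, coords)
            if length < best - 1e-12:
                best = length
                improved = True
            else:
                # Not shorter, so undo the swap.
                tour[i], tour[i + 1] = tour[i + 1], tour[i]
    return tour, best


# Hypothetical cities at the corners of a unit square.
coords = [(0, 0), (1, 0), (1, 1), (0, 1)]
route, length = local_search([0, 2, 1, 3], coords)
```

Every accepted swap makes the trip strictly shorter, and there are only finitely many orderings, so the loop must stop. The route it stops at is a local minimum with respect to the allowed move, which is not necessarily the shortest route overall.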
The second subset of problems is PPAD (polynomial parity arguments on directed graphs). These problems have solutions whose existence follows from a more complicated result called Brouwer’s fixed point theorem. The theorem says that any continuous function mapping a suitable region (such as a filled-in disk or ball) into itself is guaranteed to leave at least one point unchanged, known as a fixed point. This is true in daily life. If you stir a glass of water, the theorem guarantees that at least one particle of water must end up in the same place it started from.
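In one dimension the theorem is easy to see in action. The Python sketch below is an illustration only: it assumes a continuous function that maps the interval from 0 to 1 into itself, uses cosine as a made-up example, and homes in on a fixed point by repeatedly halving the interval. This shortcut is specific to one dimension and says nothing about how hard the search is in general.

```python
import math


def find_fixed_point(f, lo=0.0, hi=1.0, tol=1e-10):
    """Approximate a point x in [lo, hi] with f(x) = x by bisection.

    Assumes f is continuous and maps [lo, hi] into itself, so that
    f(lo) >= lo and f(hi) <= hi hold at the start.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) - mid >= 0:
            lo = mid  # f still lands at or above mid: fixed point is to the right
        else:
            hi = mid  # f lands below mid: fixed point is to the left
    return (lo + hi) / 2


# cos maps [0, 1] into itself, so a fixed point must exist there.
x = find_fixed_point(math.cos)
```

The point it returns is one that cosine leaves (almost exactly) where it was, the one-dimensional counterpart of the water particle that ends up where it started.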