Top Optimisation Methods In Machine Learning (2024)

“All the impressive achievements of deep learning amount to just curve fitting.”

Judea Pearl

Machine learning in its most reduced form is sometimes described as glorified curve fitting, and in a way that is true. Machine learning models are built around convergence: fitting a model to data. Whether this approach will lead to AGI is still debatable, but for now deep neural networks are the best tools we have, and they rely on optimisation methods to reach their targets.

Fundamental optimisation methods are typically categorised into first-order, high-order and derivative-free methods. The ones encountered most often fall into the first-order category, such as gradient descent and its variants.


According to an extensive survey of optimisation methods by Shiliang Sun et al., here are a few of the methods one will frequently encounter in their ML journey:

Gradient Descent

The gradient descent method is the most popular optimisation method. The idea is to update the variables iteratively in the direction opposite to the gradients of the objective function. With every update, the method moves the model closer to the optimal value of the objective function, to which it gradually converges.
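To make the idea concrete, here is a minimal gradient descent loop in Python on a simple quadratic objective; the function, learning rate and iteration count are illustrative choices rather than anything prescribed by the survey.

```python
import numpy as np

def gradient_descent(grad, x0, lr=0.1, n_iters=100):
    """Iteratively step against the gradient of the objective."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lr * grad(x)   # move in the opposite direction of the gradient
    return x

# Example: minimise f(x) = (x - 3)^2, whose gradient is 2 * (x - 3)
grad_f = lambda x: 2.0 * (x - 3.0)
print(gradient_descent(grad_f, x0=[0.0]))  # converges towards [3.]
```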


Stochastic Gradient Descent

Stochastic gradient descent (SGD) was proposed to address the computational cost of each iteration on large-scale data. Its update rule is θ_{t+1} = θ_t − η ∇f_i(θ_t), where f_i is the loss on a single randomly chosen sample i and η is the learning rate.

In neural networks, the gradients of the loss with respect to the parameters are computed by back-propagation; the parameters are then adjusted iteratively, based on those gradients, to reduce the loss function.

In this method, one randomly chosen sample is used to update the parameters (theta) per iteration, instead of directly calculating the exact value of the full gradient. The stochastic gradient is an unbiased estimate of the real gradient. This reduces the update time when dealing with large numbers of samples and removes a certain amount of computational redundancy. Read more here.
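A minimal sketch of this per-sample update, applied to least-squares linear regression on synthetic data; the data, learning rate and epoch count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x + 1 + noise
X = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * X[:, 0] + 1.0 + 0.1 * rng.standard_normal(200)

w, b, lr = 0.0, 0.0, 0.1
for epoch in range(20):
    for i in rng.permutation(len(X)):      # one random sample per update
        pred = w * X[i, 0] + b
        err = pred - y[i]
        w -= lr * err * X[i, 0]            # stochastic gradient of the squared error w.r.t. w
        b -= lr * err                      # ... and w.r.t. b
print(w, b)  # approaches (2.0, 1.0)
```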

Adaptive Learning Rate Method

The learning rate is one of the key hyperparameters that undergo optimisation. If the learning rate is too high, the updates can overshoot and the model may miss subtler structure in the data; if it is too low, training converges very slowly. The learning rate has a great influence on SGD, and setting the right value can be challenging. Adaptive methods were proposed to do this tuning automatically.

The adaptive variants of SGD are widely used in DNNs. Methods like AdaDelta, RMSProp and Adam use exponential moving averages of past (squared) gradients to provide effective updates and simplify the calculation.

  • AdaGrad: parameters with large accumulated gradients get a low learning rate, and vice versa.
  • RMSProp: modifies AdaGrad by replacing the accumulated sum with an exponential average, which addresses its monotonically decreasing learning rate.
  • Adam: similar to RMSProp but with momentum (it also keeps a moving average of the gradient itself).
  • Alternating Direction Method of Multipliers (ADMM): another alternative to stochastic gradient descent (SGD).

The difference between gradient descent and AdaGrad is that the learning rate is no longer fixed: it is scaled using all the historical gradients accumulated up to the latest iteration. Read more here.
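As a sketch, the AdaGrad update divides each parameter's step by the square root of its accumulated squared gradients; the objective and constants below are illustrative.

```python
import numpy as np

def adagrad(grad, x0, lr=0.5, n_iters=200, eps=1e-8):
    x = np.asarray(x0, dtype=float)
    g_sq_sum = np.zeros_like(x)                  # running sum of squared gradients
    for _ in range(n_iters):
        g = grad(x)
        g_sq_sum += g ** 2
        x -= lr * g / (np.sqrt(g_sq_sum) + eps)  # per-parameter, shrinking step size
    return x

# Minimise a poorly scaled quadratic f(x) = x0^2 + 100*x1^2
grad_f = lambda x: np.array([2.0 * x[0], 200.0 * x[1]])
print(adagrad(grad_f, x0=[1.0, 1.0]))  # both coordinates shrink towards 0
```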

Conjugate Gradient Method

The conjugate gradient (CG) approach is used for solving large-scale linear systems of equations and nonlinear optimisation problems. First-order methods converge slowly, whereas second-order methods are resource-heavy. Conjugate gradient optimisation is an intermediate algorithm: it needs only first-order information, yet it approaches the convergence speed of high-order methods.
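For the linear case, a textbook-style conjugate gradient solver for a symmetric positive-definite system Ax = b can be sketched as follows; this is a minimal illustration, not the survey's notation.

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-8, max_iters=None):
    """Solve A x = b for symmetric positive-definite A."""
    n = len(b)
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                  # residual
    p = r.copy()                   # initial search direction
    rs_old = r @ r
    for _ in range(max_iters or n):
        Ap = A @ p
        alpha = rs_old / (p @ Ap)  # step length along p
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p   # new direction, conjugate to the previous ones
        rs_old = rs_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))    # matches np.linalg.solve(A, b)
```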

Know more about gradient methods here.

Derivative-Free Optimisation

Not every optimisation problem can be approached through gradients, because the derivative of the objective function may not exist or may not be easy to calculate. This is where derivative-free optimisation comes into the picture. It relies on heuristic algorithms that reuse choices that have already worked well, rather than deriving solutions systematically. Classical simulated annealing, genetic algorithms and particle swarm optimisation are a few such examples.
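As a small illustration, here is a bare-bones simulated-annealing-style search that only evaluates the objective and never touches a derivative; the cooling schedule, proposal noise and test function are arbitrary choices.

```python
import numpy as np

def simulated_annealing(f, x0, n_iters=5000, step=0.5, t0=1.0, seed=0):
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    best_x, best_f = x.copy(), fx
    for k in range(n_iters):
        temp = t0 / (1 + k)                              # cooling schedule
        cand = x + step * rng.standard_normal(x.shape)   # random proposal
        f_cand = f(cand)
        # Always accept downhill moves; accept uphill moves with a temperature-dependent probability
        if f_cand < fx or rng.random() < np.exp(-(f_cand - fx) / max(temp, 1e-12)):
            x, fx = cand, f_cand
            if fx < best_f:
                best_x, best_f = x.copy(), fx
    return best_x, best_f

# Multi-modal test function with many local minima (Rastrigin-like)
f = lambda x: float(np.sum(x**2) + 10 * np.sum(1 - np.cos(2 * np.pi * x)))
print(simulated_annealing(f, x0=[3.0, -2.5]))
```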

Zeroth Order Optimisation

Zeroth Order optimisation was introduced to address the shortcomings of derivative-free optimisation, which finds it difficult to scale to large problems and suffers from a lack of convergence-rate analysis.

Zeroth Order advantages include:

  • Ease of implementation with only a small modification of commonly-used gradient-based algorithms
  • Computationally efficient approximations to derivatives when they are difficult to compute (as sketched below)
  • Comparable convergence rates to first-order algorithms.
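A coordinate-wise finite-difference estimate is the simplest zeroth-order ingredient: it builds an approximate gradient purely from function evaluations and plugs it into an otherwise standard gradient-descent loop. The objective and step sizes below are illustrative.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4):
    """Estimate the gradient of f at x using forward finite differences."""
    g = np.zeros_like(x)
    fx = f(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = mu
        g[i] = (f(x + e) - fx) / mu    # one extra function evaluation per coordinate
    return g

def zo_gradient_descent(f, x0, lr=0.1, n_iters=200):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x -= lr * zo_gradient(f, x)    # gradient-style update with a derivative-free estimate
    return x

f = lambda x: float((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)
print(zo_gradient_descent(f, x0=[5.0, 5.0]))  # approaches [1, -2]
```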

For Meta Learning

Learned meta-optimisers are popular within the meta-learning regime. The purpose of meta-learning is to achieve fast learning, which in turn makes gradient descent more accurate. The optimisation process itself can be regarded as a learning problem: the goal is to learn to predict the gradient update rather than to apply a fixed gradient descent algorithm. Because the gradient update in back-propagation resembles the cell-state update in an LSTM, LSTMs are often used as the meta-optimiser.

Model-agnostic meta-learning (MAML) is another method; it learns model parameters that can be adapted with gradient descent for tasks including classification, regression and reinforcement learning. The basic idea is to train on multiple tasks at the same time, combine the gradient directions from the different tasks, and thereby learn a common base model.
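The toy sketch below shows only the structure of the inner and outer loops, using the simpler first-order MAML approximation on a family of quadratic tasks; the task family, step sizes and first-order shortcut are illustrative assumptions, not the full MAML algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task_grad():
    """Each task is a quadratic loss 0.5*||w - centre||^2 with a task-specific optimum."""
    centre = rng.normal(loc=[2.0, -1.0], scale=0.5)
    return lambda w: w - centre               # gradient of the task loss

inner_lr, outer_lr = 0.4, 0.1
w = np.zeros(2)                               # meta-learned initialisation

for step in range(500):
    meta_grad = np.zeros_like(w)
    for _ in range(4):                        # a small batch of tasks per meta-update
        grad = sample_task_grad()
        w_adapted = w - inner_lr * grad(w)    # inner loop: one adaptation step per task
        meta_grad += grad(w_adapted)          # first-order MAML: gradient at the adapted parameters
    w -= outer_lr * meta_grad / 4             # outer loop: update the shared initialisation
print(w)   # drifts towards the centre of the task distribution, roughly [2, -1]
```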

These are a few of the most frequently used optimisation methods. Apart from these, there are other methods, which can be found in this study.

That said, the problem of optimisation for deep learning doesn't end here, because not every problem is a convex optimisation problem. Non-convex optimisation remains one of the major difficulties in the field.

One approach is to transform the non-convex problem into a convex optimisation problem and then use convex optimisation methods. The other is to use a special optimisation method such as projected gradient descent, alternating minimisation, the expectation-maximisation algorithm, or stochastic optimisation and its variants.
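For instance, projected gradient descent follows each gradient step with a projection back onto the feasible set. A minimal sketch with a box constraint (the objective and constraint are illustrative):

```python
import numpy as np

def projected_gradient_descent(grad, project, x0, lr=0.1, n_iters=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = project(x - lr * grad(x))   # gradient step, then projection onto the feasible set
    return x

# Minimise f(x) = ||x - c||^2 subject to 0 <= x <= 1 (box constraint)
c = np.array([2.0, -0.5])
grad_f = lambda x: 2.0 * (x - c)
project_box = lambda x: np.clip(x, 0.0, 1.0)
print(projected_gradient_descent(grad_f, project_box, x0=[0.5, 0.5]))  # -> [1.0, 0.0]
```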
