Discussion

The dynamics of a Neural Network is usually framed in the context of optimizing a convex or non-convex non linear problem. This involves the minimization/maximization of an objective function. The formulation of the objective function is a bit arbitrary but it is typically the squared error between the actual and estimated values:

∑x[q̂ (x)–q(x)]2 ∑x[q^(x)–q(x)]2 The solution to the optimization problem is typically a simple gradient descent approach. What is surprising here is that Deep Learning systems are highly successful despite such a simple approach. One would have thought that gradient descent would be all too often stuck often many local minima one would expect in a non-convex space. However, the intuition of low dimensions does not convey to higher dimensions, where local minima are actually saddle points and a simple gradient descent can escape given enough patience!

However, without a overarching theory or framework, a lot of the techniques employed in Deep Learning (i.e. SGD, Dropout, Normalization, hyper-parameter search etc) all seem to be arbitrary techniques (see: http://arxiv.org/abs/1206.5533 ).

In Information Theory there is the Kullback-Leibler Divergence DKL(p||q)=∑xp(x)log(p(x)q(x))DKL(p||q)=∑xp(x)log(p(x)q(x)) which is a measure of the difference between two probability distributions. (Note: Shannon’s Entropy is a special case of the KL divergence where q is constant). If one takes a distribution and its infinitesimal difference, one arrives as the following equation:

DKL(pθ||qθ+δθ)=gijΔθiΔθj+Oδθ3 DKL(pθ||qθ+δθ)=gijΔθiΔθj+Oδθ3 where gijgij is the Fisher Information Matrix (FIM):

gij=–∑xPθ(x)∂∂θi∂∂θjlogPθ(x) gij=–∑xPθ(x)∂∂θi∂∂θjlogPθ(x) The Cramér–Rao lower bound is an estimate of the lower bound of the variance of an estimator. It is related to the FIM I(θ)I(θ) in scalar form:

Var(θ̂ )>=1I(θ) Var(θ^)>=1I(θ) So the above equation that the FIM has an effect on minimizing the variance between estimated and actual values.

There exists a formulation by Sun-Ichi Amari in a field called Information Geometry that casts the FIM as a metric. Amari shows in his paper “Natural Gradient works Efficiently in Learning“, and speculates that natural gradient may more effectively navigate out of plateaus than conventional stochastic gradient descent. The FIM Information Geometry shares some similarity with Einstein’s General Theory of Relativity in that the dynamics of a system follows a non-Euclidean space. So rather than observing the curvature of light as a consequence of gravity, one would find a curvature of information in the presence of knowledge.

Although the Information Geometry theory is extremely elegant, the general difficulty with the FIM is that is is expensive to calculate. However recent developments (all in 2015) have shown various approaches to calculating an approximation that leads to very encouraging results.

Parallel training of DNNs with Natural Gradient and Parameter Averaging from the folks developing the Speech Recognizer Kaldi have developed a stochastic gradient technique that employs an approximation of the FIM. Their technique not only improves over standard SGD, but allows for parallelization.

Youshua Bengio and his team at the University of Montreal have a paper Topmoumoute online natural gradient algorithm TONGA have developed a low-rank approximation of FIM with an implementation that beats stochastic gradient in speed and generalization.

Finally Google’s DeepMind team have published a paper “Natural Neural Networks“. In this paper they describe a technique that reparameterizes the neural network layers so that the FIM is effectively the identity matrix. It is a novel technique that has similarities to the Batch Normalization that was previously proposed.

We still are in an early stage for a theory of Deep Learning using Information Geometry, however recent developments seem show the promise of employing a more abstract theoretical approach. Abstract mathematics like Information Geometry should not be dismissed as impractical to implement but rather used as a guide towards building better algorithms and analytic techniques.

References http://arxiv.org/pdf/1507.00210v1.pdf Natural Neural Networks

New insights and perspectives on the natural gradient method

Fisher information and natural gradient provided deep insights and powerful tools to artificial neural networks. However related analysis becomes more and more difficult as the learner's structure turns large and complex. This paper makes a preliminary step towards a new direction. We extract a local component of a large neuron system, and defines its relative Fisher information metric that describes accurately this small component, and is invariant to the other parts of the system. This concept is important because the geometry structure is much simplified and it can be easily applied to guide the learning of neural networks. We provide an analysis on a list of commonly used components, and demonstrate how to use this concept to further improve optimization.

Sub-Sampled Newton Methods I: Globally Convergent Algorithms

http://openreview.net/pdf?id=B186cP9gx SINGULARITY OF THE HESSIAN IN DEEP LEARNING

We look at the eigenvalues of the Hessian of a loss function before and after training. The eigenvalue distribution is seen to be composed of two parts, the bulk which is concentrated around zero, and the edges which are scattered away from zero. We present empirical evidence for the bulk indicating how overparametrized the system is, and for the edges indicating the complexity of the input data.

http://openreview.net/pdf?id=By1snw5gl L-SR1: A SECOND ORDER OPTIMIZATION METHOD FOR DEEP LEARNING

We describe L-SR1, a new second order method to train deep neural networks. Second order methods hold great promise for distributed training of deep networks. Unfortunately, they have not proven practical. Two significant barriers to their success are inappropriate handling of saddle points, and poor conditioning of the Hessian.

https://jimmylba.github.io/papers/nsync.pdf DISTRIBUTED SECOND-ORDER OPTIMIZATION USING KRONECKER-FACTORED APPROXIMATIONS

The recently proposed K-FAC method (Martens and Grosse, 2015) uses a stronger and more sophisticated curvature approximation, and has been shown to make much more per-iteration progress than SGD, while only introducing a modest overhead. In this paper, we develop a version of K-FAC that distributes the computation of gradients and additional quantities required by K-FAC across multiple machines, thereby taking advantage of the method’s superior scaling to large mini-batches and mitigating its additional overheads.

https://arxiv.org/abs/1412.1193v8 New insights and perspectives on the natural gradient method

https://arxiv.org/abs/1703.00209v2 Online Natural Gradient as a Kalman Filter

http://www.yann-ollivier.org/rech/publs/gradnn.pdf Riemannian metrics for neural networks I: Feedforward networks

https://arxiv.org/abs/1205.1828v1 The Natural Gradient by Analogy to Signal Whitening, and Recipes and Tricks for its Use

The natural gradient allows for more efficient gradient descent by removing dependencies and biases inherent in a function's parameterization. Several papers present the topic thoroughly and precisely. It remains a very difficult idea to get your head around however. The intent of this note is to provide simple intuition for the natural gradient and its use. We review how an ill conditioned parameter space can undermine learning, introduce the natural gradient by analogy to the more widely understood concept of signal whitening, and present tricks and specific prescriptions for applying the natural gradient to learning problems.

https://arxiv.org/pdf/1712.08449v1.pdf True Asymptotic Natural Gradient Optimization

We introduce a simple algorithm, True Asymptotic Natural Gradient Optimization (TANGO), that converges to a true natural gradient descent in the limit of small learning rates, without explicit Fisher matrix estimation. For quadratic models the algorithm is also an instance of averaged stochastic gradient, where the parameter is a moving average of a “fast”, constant-rate gradient descent. TANGO appears as a particular de-linearization of averaged SGD, and is sometimes quite different on non-quadratic models. This further connects averaged SGD and natural gradient, both of which are arguably optimal asymptotically. In large dimension, small learning rates will be required to approximate the natural gradient well. Still, this shows it is possible to get arbitrarily close to exact natural gradient descent with a lightweight algorithm.

https://arxiv.org/abs/1802.05098 DiCE: The Infinitely Differentiable Monte-Carlo Estimator

To address all these shortcomings in a unified way, we introduce DiCE, which provides a single objective that can be differentiated repeatedly, generating correct gradient estimators of any order in SCGs. Unlike SL, DiCE relies on automatic differentiation for performing the requisite graph manipulations.

https://arxiv.org/abs/1808.10340 A Coordinate-Free Construction of Scalable Natural Gradient

We explicitly construct a Riemannian metric under which the natural gradient matches the K-FAC update; invariance to affine transformations of the activations follows immediately. We extend our framework to analyze the invariance properties of K-FAC applied to convolutional networks and recurrent neural networks, as well as metrics other than the usual Fisher metric.