# Random Orthogonal Initialization

““Disorder increases with time because we measure time in the direction in which disorder increases. You can’t have a safer bet than that!” - Stephen Hawking

Intent

Initializing the network using random orthogonal weights improves learning rates.

Motivation

How do we improve convergence rate for training a network from scratch?

Structure

<Diagram>

Discussion

Random initialization of the model parameters (i.e. weight matrices) seem to have an noticeable effect in the speed of learning convergence. The effect can be traced back to the J-L lemma which provides a formula for the minimum number of random projects that will preserve locality between similar observed input vectors. In other words, randomness has a higher probability of leading to a model that does not destroy locality than an initialization that is uniform.

Are the learned weights truly random?

Known Uses

Random initialization of a ANN's weights is a best practice procedure.

Related Patterns

<Diagram>

Relationship to Canonical Patterns

• Random Matrix properties reveals emergent structure that may be importance to stability and trainability.
• Structured Factorization show the importance of group theory behavior to stability.
• Geometry shows the eigenvalue spectrum of random matrices having universal characteristics.
• Hierarchical Abstraction studies reveals the importance of initialization to ensure stability in a deep network.

References

Now the data is ready. However, before you are beginning to train the network, you have to initialize its parameters.

All Zero Initialization

In the ideal situation, with proper data normalization it is reasonable to assume that approximately half of the weights will be positive and half of them will be negative. A reasonable-sounding idea then might be to set all the initial weights to zero, which you expect to be the “best guess” in expectation. But, this turns out to be a mistake, because if every neuron in the network computes the same output, then they will also all compute the same gradients during back-propagation and undergo the exact same parameter updates. In other words, there is no source of asymmetry between neurons if their weights are initialized to be the same.

Initialization with Small Random Numbers

Thus, you still want the weights to be very close to zero, but not identically zero. In this way, you can random these neurons to small numbers which are very close to zero, and it is treated as symmetry breaking. The idea is that the neurons are all random and unique in the beginning, so they will compute distinct updates and integrate themselves as diverse parts of the full network. The implementation for weights might simply look like weights ~ 0.001 × N(0,1), where N(0,1) is a zero mean, unit standard deviation gaussian. It is also possible to use small numbers drawn from a uniform distribution, but this seems to have relatively little impact on the final performance in practice.

Calibrating the Variances

One problem with the above suggestion is that the distribution of the outputs from a randomly initialized neuron has a variance that grows with the number of inputs. It turns out that you can normalize the variance of each neuron's output to 1 by scaling its weight vector by the square root of its fan-in (i.e., its number of inputs), which is as follows:

w = np.random.randn(n) / sqrt(n) # calibrating the variances with 1/sqrt(n)

where “randn” is the aforementioned Gaussian and “n” is the number of its inputs. This ensures that all neurons in the network initially have approximately the same output distribution and empirically improves the rate of convergence. The detailed derivations can be found from Page. 18 to 23 of the slides. Please note that, in the derivations, it does not consider the influence of ReLU neurons.

Current Recommendation

As aforementioned, the previous initialization by calibrating the variances of neurons is without considering ReLUs. A more recent paper on this topic by He et al.  derives an initialization specifically for ReLUs, reaching the conclusion that the variance of neurons in the network should be 2.0/n as:

w = np.random.randn(n) * sqrt(2.0/n) # current recommendation

which is the current recommendation for use in practice, as discussed in .

http://arxiv.org/abs/1502.01852 Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

http://arxiv.org/pdf/1602.05897v1.pdf Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity

http://arxiv.org/pdf/1504.08291v4.pdf Deep Neural Networks with Random Gaussian Weights: A Universal Classification Strategy?

Two important properties of a classification machinery are: (i) the system preserves the important information of the input data; (ii) the training examples convey information for unseen data; and (iii) the system is able to treat differently points from different classes. In this work we show that these fundamental properties are inherited by the architecture of deep neural networks. We formally prove that these networks with random Gaussian weights perform a distance-preserving embedding of the data, with a special treatment for in-class and out-of-class data. Similar points at the input of the network are likely to have the same

The theoretical analysis of deep networks here presented exploits tools used in the compressed sensing and dictionary learning literature, thereby making a formal connection between these important topics. The derived results allow drawing conclusions on the metric learning properties of the network and their relation to its structure; and provide bounds on the required size of the training set such that the training examples would represent faithfully the unseen data. The results are validated with state-of-the-art trained networks.

Our work implies that it is possible to view DNN as a stagewise metric learning process, suggesting that it might be possible to replace the currently used layers with other metric learning algorithms, opening a new venue for semi-supervised DNN.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

We introduce a mathematical condition for faithful backpropagation of error signals, namely dynamical isometry, and show, surprisingly that random scaled Gaussian initializations cannot achieve this condition despite their norm-preserving nature, while greedy pre-training and random orthogonal initialization can, thereby achieving depth independent learning times.

http://arxiv.org/pdf/1602.07868v3.pdf Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

http://arxiv.org/pdf/1602.05897v1.pdf Toward Deeper Understanding of Neural Networks: The Power of Initialization and a Dual View on Expressivity

https://staff.fnwi.uva.nl/m.welling/wp-content/uploads/papers/DeepBNN.pdf Structured and Efficient Variational Deep Learning with Matrix Gaussian Posteriors

We introduce a variational Bayesian neural network where the parameters are governed via a probability distribution on random matrices.

https://arxiv.org/abs/1511.06856 Data-dependent Initializations of Convolutional Neural Networks

http://arxiv.org/pdf/1606.05336v1.pdf On the expressive power of deep neural networks

Random networks We consider an average case which arises naturally in the practical usage of neural network architectures, the behavior of networks after random initialization. The expressivity of these random networks is largely unexplored, though a connection between random networks and compositional kernels is developed in Daniely et al. . The class of random networks enables tractable theoretical and experimental analysis of interrelated aspects of expressivity: trajectory length, neuron transitions, network activation patterns, and number of dichotomies (number of labellings on inputs). We find that all of these properties grow exponentially with depth, but only polynomially with width.

http://jmlr.org/proceedings/papers/v48/henaff16.pdf Recurrent Orthogonal Networks and Long-Memory Tasks

In this work, we analyzed two standard synthetic long-term memory problems and provided explicit RNN solutions for them. We found that the (fixed length T) copy problem can be solved using an RNN with a transition matrix that is a T + S root of the identity matrix I, and whose eigenvalues are well distributed on the unit circle, and we remarked that random orthogonal matrices almost satisfy this description.

From 2sigma:

In this paper, the authors construct explicit solutions based on orthogonal weight matrices for the copy and addition benchmark tasks. Orthogonal matrices avoid the vanishing and exploding gradients problem in the same way as unitary matrices, but they have real-valued entries instead of complex-valued entries. The authors show that their hand-designed networks work well when applied to the task for which they are designed, but produce poor results when applied to other tasks. These experiments illustrate the difficulty of designing general networks that perform well on a range of tasks.

http://arxiv.org/abs/1511.03771 Improving performance of recurrent neural network with relu nonlinearity

http://jmlr.org/proceedings/papers/v9/glorot10a/glorot10a.pdf Understanding the difficulty of training deep feedforward neural networks

http://arxiv.org/abs/1606.04801v2 A Powerful Generative Model Using Random Weights for the Deep Image Representation

we try to synthesize natural textures using the random weight VGG. With automatic normalization to scale the squared correlation loss for different activation layers.

replacing the weights with purely random values from a Gaussian distribution N(0, σ). The standard deviation, σ, is set to a small number like 0.015 in the experiments.

The main difference is that we use random weights instead of trained weights, and we apply weighting factors determined by a pre-process to normalize the different impact scales of different activation layers on the input layer. Another change is that we apply a greedy approach to build a “stacked” random weight network using the inversion technique to stabilize the visualization quality.

http://arxiv.org/abs/1607.02488v1 Generalizing and Improving Weight Initialization

We can therefore generalize Xavier initialization.

http://cmp.felk.cvut.cz/~mishkdmy/papers/mishkin-iclr2016.pdf All You Need is a Good Init

Layer-sequential unit-variance (LSUV) initialization – a simple method for weight initialization for deep net learning – is proposed. The method consists of the two steps. First, pre-initialize weights of each convolution or inner-product layer with orthonormal matrices. Second, proceed from the first to the final layer, normalizing the variance of the output of each layer to be equal to one.

http://openreview.net/pdf?id=H1W1UN9gg DEEP INFORMATION PROPAGATION

We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. As a corollary to this, we argue that in networks at the edge of chaos, one of these depth scales diverges. Thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and we show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.

http://openreview.net/pdf?id=HkuVu3ige ON ORTHOGONALITY AND LEARNING RECURRENT NETWORKS WITH LONG TERM DEPENDENCIES

Orthogonal matrices preserve gradient norm during backpropagation and can therefore be a desirable property; however, we find that hard constraints on orthogonality can negatively affect the speed of convergence and model performance. This paper explores the issues of optimization convergence, speed and gradient stability using a variety of different methods for encouraging or enforcing orthogonality. In particular we propose a weight matrix factorization and parameterization strategy through which we can bound the gain of hidden and therein to hidden transitions induced on backpropagated gradients.

Our experiments indicate that moving away from hard constraints on norm preservation through orthogonality can help optimization and model performance. However, we observe consistently that relaxing regularization which encourages the spectral norms of weight matrices to be close to one, or allowing bounds on the spectral norms of weight matrices to be too wide leads to unstable optimization.

http://openreview.net/pdf?id=ry_4vpixl Rotation Plane Doubly Orthogonal Recurrent Neural Networks

Recurrent Neural Networks (RNNs) applied to long sequences suffer from the well known vanishing and exploding gradients problem. The recently proposed Unitary Evolution Recurrent Neural Network (uRNN) alleviates the exploding gradient problem and can learn very long dependencies, but its nonlinearities make it still affected by the vanishing gradient problem and so learning can break down for extremely long dependencies. We propose a new RNN transition architecture where the hidden state is updated multiplicatively by a time invariant orthogonal transformation followed by an input modulated orthogonal transformation. There are no additive interactions and so our architecture exactly preserves forward hid-den state activation norm and backwards gradient norm for all time steps, and is provably not affected by vanishing or exploding gradients. We propose using the rotation plane parameterization to represent the orthogonal matrices. We validate our model on a simplified memory copy task and see that our model can learn dependencies as long as 5,000 timesteps.

https://arxiv.org/abs/1611.01232 Deep Information Propagation

We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. As a corollary to this, we argue that in networks at the edge of chaos, one of these depth scales diverges. Thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and we show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.

https://openreview.net/pdf?id=S1HcOI5le OMG: ORTHOGONAL METHOD OF GROUPING WITH APPLICATION OF K-SHOT LEARNING

we proposed a generalizable k-shot learning framework that can be used on any pre-trained network, by grouping network parameters to produce a low-dimensional representation of the parameter space. The grouping of the parameters is based on an orthogonal decomposition of the parameter space. To avoid overfitting, groups of parameters will be updated together during the k-shot training process. Furthermore, this framework can be integrated with any existing popular deep neural networks such as VGG, GoogleNet, ResNet, without any changes in the original network structure or any sacrifices in performance.

We present an intriguing discovery related to Random Fourier Features: replacing multiplication by a random Gaussian matrix with multiplication by a properly scaled random orthogonal matrix significantly decreases kernel approximation error. We call this technique Orthogonal Random Features (ORF), and provide theoretical and empirical justification for its effectiveness. Motivated by the discovery, we further propose Structured Orthogonal Random Features (SORF), which uses a class of structured discrete orthogonal matrices to speed up the computation. The method reduces the time cost from (d2) to (dlogd), where d is the data dimensionality, with almost no compromise in kernel approximation quality compared to ORF. Experiments on several datasets verify the effectiveness of ORF and SORF over the existing methods. We also provide discussions on using the same type of discrete orthogonal structure for a broader range of kernels and applications.

https://arxiv.org/pdf/1611.06310.pdf LOCAL MINIMA IN TRAINING OF DEEP NETWORKS

The overwhelming amount of empirical evidence points towards learning being well behaved in practice. We argue that the only way to reconcile these different observations is if the well-behaved learning dynamics are local and conditioned on data structure, initialization and maybe other architectural choices. One can imagine a continuum from the very specific, where every detail of the setup is important to attain good learning dynamics, to the generic, where learning is globally well behaved regardless of dataset or initialization. We believe that a crucial and important step forward in the theoretical study of the neural networks can be made by identifying where exactly this class of models falls on this continuum. In particular, what are the most generic set of constraints that need to be respected in order to attain the good behaviour. Our results focus on constructing counterexamples which result in a bad learning dynamics. While this does not directly lead to sufficient conditions for well-behaved systems, we hope that by carving out the space of possible conditions we move forward towards our goal.

https://arxiv.org/abs/1609.01037 Distribution-specific hardness of learning neural networks

To formally explain the practical success of neural network learning, it seems that one would need to employ a careful combination of assumptions on both the input distribution and the target function. To prove our results, we develop some tools which may be of independent interest, such as extension of Fourier-based techniques for proving hardness in the statistical queries framework \cite{blum1994weakly}, from the Boolean cube to Euclidean space.

https://arxiv.org/pdf/1702.06295.pdf Convolution Aware Initialization

Majority of current initialization schemes do not take fully into account the intrinsic structure of the convolution operator. This paper introduces a new type of initialization built around the duality of the Fourier transform and the convolution operator. With Convolution Aware Initialization we noticed not only higher accuracy and lower loss, but faster convergence in general. We achieve new state of the art on the CIFAR10 dataset, and achieve close to state of the art on various other tasks.

https://arxiv.org/abs/1703.00864v1 The Unreasonable Effectiveness of Random Orthogonal Embeddings

We present a general class of embeddings based on structured random matrices with orthogonal rows which can be applied in many machine learning applications including dimensionality reduction, kernel approximation and locality-sensitive hashing. We show that this class yields improvements over previous state-of-the-art methods either in computational efficiency (while providing similar accuracy) or in accuracy, or both. In particular, we propose the \textit{Orthogonal Johnson-Lindenstrauss Transform} (OJLT) which is as fast as earlier methods yet provably outperforms them in terms of accuracy, leading to a `free lunch' improvement over previous dimensionality reduction mechanisms. We introduce matrices with complex entries that further improve accuracy. Other applications include estimators for certain pointwise nonlinear Gaussian kernels, and speed improvements for approximate nearest-neighbor search in massive datasets with high-dimensional feature vectors.

https://arxiv.org/abs/1703.01827v1 All You Need is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks with Orthonormality and Modulation

Deep neural network is difficult to train and this predicament becomes worse as the depth increases. The essence of this problem exists in the magnitude of backpropagated errors that will result in gradient vanishing or exploding phenomenon. We show that a variant of regularizer which utilizes orthonormality among different filter banks can alleviate this problem. Moreover, we design a backward error modulation mechanism based on the quasi-isometry assumption between two consecutive parametric layers. Equipped with these two ingredients, we propose several novel optimization solutions that can be utilized for training a specific-structured (repetitively triple modules of Conv-BNReLU) extremely deep convolutional neural network (CNN) WITHOUT any shortcuts/ identity mappings from scratch. Experiments show that our proposed solutions can achieve 4% improvement for a 44-layer plain network and almost 50% improvement for a 110-layer plain network on the CIFAR-10 dataset. Moreover, we can successfully train plain CNNs to match the performance of the residual counterparts. Besides, we propose new principles for designing network structure from the insights evoked by orthonormality. Combined with residual structure, we achieve comparative performance on the ImageNet dataset.

https://arxiv.org/abs/1702.08591v1 The Shattered Gradients Problem: If resnets are the answer, then what is the question?

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. The problem has largely been overcome through the introduction of carefully constructed initializations and batch normalization. Nevertheless, architectures incorporating skip-connections such as resnets perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new “looks linear” (LL) initialization that prevents shattering. Preliminary experiments show the new initialization allows to train very deep networks without the addition of skip-connections.

https://arxiv.org/pdf/1611.06013v3.pdf Improving training of deep neural networks via Singular Value Bounding

e. Our research is inspired by theoretical and empirical results that use orthogonal matrices to initialize networks, but we are interested in investigating how orthogonal weight matrices perform when network training converges. To this end, we propose to constrain the solutions of weight matrices in the orthogonal feasible set during the whole process of network training, and achieve this by a simple yet effective method called Singular Value Bounding (SVB). In SVB, all singular values of each weight matrix are simply bounded in a narrow band around the value of 1. Based on the same motivation, we also propose Bounded Batch Normalization (BBN), which improves Batch Normalization by removing its potential risk of ill-conditioned layer transform.

https://arxiv.org/pdf/1602.06662v2.pdf Recurrent Orthogonal Networks and Long-Memory Tasks

We found that the (fixed length T) copy problem can be solved using an RNN with a transition matrix that is a T + S root of the identity matrix I, and whose eigenvalues are well distributed on the unit circle, and we remarked that random orthogonal matrices almost satisfy this description.

https://arxiv.org/abs/1703.08961v1 Scaling the Scattering Transform: Deep Hybrid Networks

We use the scattering network as a generic and fixed initialization of the first layers of a supervised hybrid deep network. We show that early layers do not necessarily need to be learned, providing the best results to-date with pre-defined representations while being competitive with Deep CNNs.

https://arxiv.org/abs/1704.08863 On weight initialization in deep neural networks

https://arxiv.org/pdf/1706.10239v1.pdf Towards Understanding Generalization of Deep Learning: Perspective of Loss Landscapes

We show that it is the characteristics the landscape of the loss function that explains the good generalization capability. For the landscape of loss function for deep networks, the volume of basin of attraction of good minima dominates over that of poor minima, which guarantees optimization methods with random initialization to converge to good minima.

https://arxiv.org/abs/1802.09979v1 The Emergence of Spectral Universality in Deep Networks

Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude.

Our results provide a principled framework for the initialization of weights and the choice of nonlinearities in order to produce well-conditioned Jacobians and fast learning. Intriguingly, we find novel universality classes of deep spectra that remain well-conditioned as the depth goes to infinity, as well as theoretical conditions for their existence.

https://arxiv.org/abs/1804.03758 Universal Successor Representations for Transfer Reinforcement Learning

The objective of transfer reinforcement learning is to generalize from a set of previous tasks to unseen new tasks. In this work, we focus on the transfer scenario where the dynamics among tasks are the same, but their goals differ. Although general value function (Sutton et al., 2011) has been shown to be useful for knowledge transfer, learning a universal value function can be challenging in practice. To attack this, we propose (1) to use universal successor representations (USR) to represent the transferable knowledge and (2) a USR approximator (USRA) that can be trained by interacting with the environment. Our experiments show that USR can be effectively applied to new tasks, and the agent initialized by the trained USRA can achieve the goal considerably faster than random initialization.

https://arxiv.org/abs/1806.10909 ResNet with one-neuron hidden layers is a Universal Approximator

https://arxiv.org/pdf/1809.08836.pdf Dense neural networks as sparse graphs and the lightning initialization