Name Weight Sharing aka Tied Weights

Intent

Sharing weights across parts of a model reduces the number of free parameters and can improve generalization.

Problem

How can we train an ensemble of many models while reducing memory costs?

Structure

<Diagram>

Solution

Rationale

This pattern is related to Ensemble. Convolutional networks also employ it, reusing the same filter weights at every position of the input.

Known Uses

DropOut

Convolution Network

Residual Network

RNN

Hashing Trick https://arxiv.org/pdf/1504.04788.pdf
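
The hashing trick shares weights by mapping every position of a large "virtual" weight matrix onto a small pool of real parameters through a hash function. The following is a minimal sketch of that idea only, not the paper's implementation: the helper names `bucket_index` and `hashed_linear`, the use of `hashlib.md5` as the hash, and the toy values are all assumptions made for illustration.

```python
import hashlib

def bucket_index(i, j, num_buckets):
    """Deterministically map virtual position (i, j) to a shared bucket."""
    digest = hashlib.md5(f"{i},{j}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def hashed_linear(x, buckets, n_out):
    """Compute y = V x, where V[i][j] = buckets[bucket_index(i, j, k)]."""
    k = len(buckets)
    return [
        sum(buckets[bucket_index(i, j, k)] * x[j] for j in range(len(x)))
        for i in range(n_out)
    ]

buckets = [0.1, -0.2, 0.3, 0.05]        # only 4 real parameters (toy values)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]      # 6 inputs
y = hashed_linear(x, buckets, n_out=8)  # behaves like an 8 x 6 layer
virtual_weights = 8 * len(x)            # 48 virtual weights backed by 4 values
```

Because the hash is deterministic, the virtual matrix never has to be stored; only the bucket values are trained and kept in memory.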

References

Drop Out - http://arxiv.org/pdf/1207.0580v1.pdf

“the justification for sharing weights is the translational invariance ('having the same weights independent of the absolute position within the image') – the reduced number of parameters and greater ease of training is a nice consequence” - Andre Holzner
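
The point of the quote can be illustrated with a one-dimensional convolution: a single small kernel is reused at every position, so a pattern is detected in the same way wherever it occurs. This is a minimal sketch with made-up values; the function name `conv1d_valid` is an assumption of the example, not something from the quoted source.

```python
def conv1d_valid(signal, kernel):
    """Slide one shared kernel over every position of the signal."""
    k = len(kernel)
    return [
        sum(kernel[t] * signal[pos + t] for t in range(k))
        for pos in range(len(signal) - k + 1)
    ]

kernel = [1.0, 0.0, -1.0]                      # 3 shared weights
signal = [0.0, 0.0, 1.0, 2.0, 0.0, 0.0, 0.0]
shifted = [0.0] + signal[:-1]                  # same pattern, one step later

out = conv1d_valid(signal, kernel)
out_shifted = conv1d_valid(shifted, kernel)

# A fully connected layer with the same input and output sizes would need
# one weight per (input, output) pair instead of 3 shared weights.
dense_params = len(signal) * len(out)
```

Shifting the input by one step shifts the response by one step (`out_shifted[1:]` matches `out[:-1]`), while the layer uses 3 weights where an unshared layer of the same shape would use 35.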

Dropout: Hinton et al. [7] show that dropping out individual neurons during training leads to a network which is equivalent to averaging over an ensemble of exponentially many networks. Similar in spirit, stochastic depth [9] trains an ensemble of networks by dropping out entire layers during training. These two strategies are “ensembles by training” because the ensemble arises only as a result of the special training strategy. However, we show that residual networks are “ensembles by construction” as a natural result of the structure of the architecture.
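
The link between dropout and weight sharing is that every sub-network obtained by dropping units reuses the same weights. The toy sketch below enumerates all sub-networks of a three-unit hidden layer and averages their outputs. It assumes a single hidden layer with a linear output and a keep probability of 0.5; under those assumptions the ensemble mean equals the full network's output scaled by 0.5 exactly, whereas in deeper networks this scaling is only an approximation. All names and values are illustrative.

```python
from itertools import product

def hidden(x, w_in):
    """ReLU hidden layer; w_in[j] holds the input weights of hidden unit j."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]

def output(h, w_out, mask):
    """Linear output of the sub-network that keeps units where mask is 1."""
    return sum(m * v * hj for m, v, hj in zip(mask, w_out, h))

w_in = [[0.5, -1.0], [1.0, 1.0], [-0.5, 2.0]]  # shared by every sub-network
w_out = [1.0, -2.0, 0.5]
x = [1.0, 2.0]

h = hidden(x, w_in)
masks = list(product([0, 1], repeat=len(w_out)))  # 2**3 = 8 sub-networks
ensemble_mean = sum(output(h, w_out, m) for m in masks) / len(masks)
scaled_full = 0.5 * output(h, w_out, (1, 1, 1))
```

Eight networks are averaged here, yet only one set of weights is stored.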

The main advantage of shared weights is that you can substantially lower the degrees of freedom of your problem. Take the simplest case: a tied autoencoder, where the input weights are $W_x \in \mathbb{R}^d$ and the output weights are $W_x^T$. You have halved the number of parameters of your model, from $2d$ to $d$. You can see some visualizations here: link. Similar results would be obtained in a ConvNet.

This way you can get the following results:

fewer parameters to optimize,

which means faster convergence to some minimum,

at the expense of making your model less flexible. Interestingly, this reduced flexibility often acts as a regularizer and helps avoid overfitting, because the weights are shared with other neurons.
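
The tied autoencoder described above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the model is linear, biases are omitted, and the helper names and the 2 x 3 matrix are invented for the example.

```python
def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def transpose(matrix):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]

def tied_autoencoder(x, W):
    """Encode with W and decode with its transpose; no separate decoder matrix."""
    code = matvec(W, x)                  # hidden code, len(W) values
    recon = matvec(transpose(W), code)   # reconstruction, len(x) values
    return code, recon

W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]                    # 2 x 3 encoder: 6 parameters
x = [3.0, 4.0, 5.0]
code, recon = tied_autoencoder(x, W)

tied_params = len(W) * len(W[0])         # decoder adds nothing new
untied_params = 2 * tied_params          # a separate decoder would double it
```

With this `W` the code keeps the first two inputs and the reconstruction loses the third, which shows the reduced flexibility: the decoder cannot be adjusted independently of the encoder.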

https://arxiv.org/pdf/1608.05859v3.pdf Using the Output Embedding to Improve Language Models

We also offer a new method of regularizing the output embedding. Our methods lead to a significant reduction in perplexity, as we are able to show on a variety of neural network language models. Finally, we show that weight tying can reduce the size of neural translation models to less than half of their original size without harming their performance.
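
Weight tying in a language model means that the matrix used to look up input word vectors is reused to score output words. The sketch below only illustrates that reuse and the resulting embedding parameter count. The three-word vocabulary, the values, and the shortcut of using the input vector directly as the hidden state are assumptions of the example, and the paper's regularization method is not shown.

```python
def logits_from_hidden(hidden_state, embedding):
    """Score every vocabulary word by a dot product with its embedding row."""
    return [sum(e * h for e, h in zip(row, hidden_state)) for row in embedding]

vocab = ["the", "cat", "sat"]
embedding = [[0.1, 0.3],     # one row per word, shared by input and output
             [0.7, -0.2],
             [-0.4, 0.5]]

word_id = vocab.index("cat")
input_vector = embedding[word_id]   # input side: row lookup
hidden_state = input_vector         # stand-in for the recurrent layers
scores = logits_from_hidden(hidden_state, embedding)  # output side: same matrix

tied_params = len(vocab) * len(embedding[0])
untied_params = 2 * tied_params     # separate input and output embeddings
```

In real models the vocabulary is large, so the embedding matrices make up a big share of the parameters, which is why tying them has a noticeable effect on model size.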

https://arxiv.org/pdf/1702.08782v1.pdf ShaResNet: reducing residual network parameter number by sharing weights

We share the weights of convolutional layers between residual blocks operating at the same spatial scale. The signal flows multiple times in the same convolutional layer. The resulting architecture, called ShaResNet, contains block-specific layers and shared layers.
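
The structure described in this abstract, blocks that each own one layer and reuse another, can be sketched as follows. Small linear maps stand in for convolutions, and the choice of which layer inside the block is shared is an assumption of this sketch rather than a detail taken from the paper.

```python
def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def residual_block(x, specific, shared):
    """y = x + shared(relu(specific(x))); `shared` is reused by every block."""
    inner = linear(shared, relu(linear(specific, x)))
    return [a + b for a, b in zip(x, inner)]

shared = [[0.5, 0.0], [0.0, 0.5]]          # one layer for the whole stage
specifics = [[[1.0, 0.0], [0.0, 1.0]],     # one layer per block
             [[0.0, 1.0], [1.0, 0.0]],
             [[1.0, 1.0], [0.0, 1.0]]]

x = [1.0, 2.0]
for specific in specifics:
    x = residual_block(x, specific, shared)  # signal passes `shared` 3 times

def count(matrix):
    return sum(len(row) for row in matrix)

shared_params = count(shared) + sum(count(s) for s in specifics)
unshared_params = 2 * sum(count(s) for s in specifics)
```

In this toy stage of three blocks the shared variant stores 16 weights instead of 24; the saving grows with the number of blocks per stage.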

https://arxiv.org/pdf/1703.00848.pdf Unsupervised Image-to-Image Translation Networks

The proposed framework can learn the translation function without any corresponding images in two domains. We enable this learning capability by combining a weight-sharing constraint and an adversarial training objective.

We model each image domain using a VAE and a GAN. Through an adversarial training objective, an image fidelity function is implicitly defined for each domain. The adversarial training objective interacts with a weight-sharing constraint to generate corresponding images in two domains, while the variational autoencoders relate translated images with input images in the respective domains.

Based on the intuition that a pair of corresponding images in different domains should share a same high-level image representation, we enforce several weight sharing constraints. The connection weights of the last few layers (high-level layers) in E1 and E2 are tied, the connection weights of the first few layers (high-level layers) in G1 and G2 are tied, and the connection weights of the last few layers (high-level layers) in D1 and D2 are tied.

In our paper, we use HyperNetworks to explore a middle ground - to enforce a relaxed version of weight-tying. A HyperNetwork is just a small network that generates the weights of a much larger network, like the weights of a deep ResNet, effectively parameterizing the weights of each layer of the ResNet. We can use hypernetwork to explore the tradeoff between the model’s expressivity versus how much we tie the weights of a deep ConvNet. It is kind of like applying compression to an image, and being able to adjust how much compression we want to use, except here, the images are the weights of a deep ConvNet.
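
One way to picture this relaxed weight tying is a shared generator that turns a small per-layer embedding into that layer's full weight matrix. The sketch below assumes the simplest possible generator, a single linear projection, with made-up sizes and values; the name `generate_weights` is hypothetical and the quoted work may use a different generator.

```python
def generate_weights(z, projection, n_out, n_in):
    """Map a layer embedding z to an n_out x n_in weight matrix."""
    flat = [sum(p * zi for p, zi in zip(row, z)) for row in projection]
    return [flat[r * n_in:(r + 1) * n_in] for r in range(n_out)]

n_out, n_in, z_dim, num_layers = 4, 4, 2, 6

# Shared generator: one row per generated weight (toy values).
projection = [[0.01 * (k + 1), -0.02 * (k + 1)] for k in range(n_out * n_in)]
# One small embedding per layer of the main network (toy values).
layer_embeddings = [[float(layer), 1.0] for layer in range(num_layers)]

layers = [generate_weights(z, projection, n_out, n_in)
          for z in layer_embeddings]

hyper_params = n_out * n_in * z_dim + num_layers * z_dim  # generator + embeddings
direct_params = num_layers * n_out * n_in                 # untied layers
```

The embedding size `z_dim` plays the role of the compression setting: identical embeddings would give fully tied layers, while larger embeddings let the layers differ more at the cost of more parameters.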

https://arxiv.org/abs/1702.02535v3 Exploiting Domain Knowledge via Grouped Weight Sharing with Application to Text Categorization

Prior work on weight sharing in neural networks has considered it largely as a means of model compression. In contrast, we treat weight sharing as a flexible mechanism for incorporating prior knowledge into neural models. We show that this approach consistently yields improved performance on classification tasks compared to baseline strategies that do not exploit weight sharing.

https://arxiv.org/abs/1705.08142v1 Sluice networks: Learning what to share between loosely related tasks

To overcome this, we introduce Sluice Networks, a general framework for multi-task learning where trainable parameters control the amount of sharing – including which parts of the models to share. Our framework goes beyond and generalizes over previous proposals in enabling hard or soft sharing of all combinations of subspaces, layers, and skip connections. We perform experiments on three task pairs from natural language processing, and across seven different domains, using data from OntoNotes 5.0, and achieve up to 15% average error reductions over common approaches to multi-task learning.

https://arxiv.org/pdf/1705.10494v1.pdf Joint auto-encoders: a flexible multi-task learning framework

We develop a framework for learning multiple tasks simultaneously, based on sharing features that are common to all tasks, achieved through the use of a modular deep feedforward neural network consisting of shared branches, dealing with the common features of all tasks, and private branches, learning the specific unique aspects of each task. Once an appropriate weight sharing architecture has been established, learning takes place through standard algorithms for feedforward networks, e.g., stochastic gradient descent and its variations. The method deals with domain adaptation and multi-task learning in a unified fashion, and can easily deal with data arising from different types of sources.

https://arxiv.org/abs/1702.08389 Equivariance Through Parameter-Sharing

https://arxiv.org/abs/1802.03268 Efficient Neural Architecture Search via Parameters Sharing

We propose Efficient Neural Architecture Search (ENAS), a fast and inexpensive approach for automatic model design. In ENAS, a controller learns to discover neural network architectures by searching for an optimal subgraph within a large computational graph. The controller is trained with policy gradient to select a subgraph that maximizes the expected reward on the validation set. Meanwhile the model corresponding to the selected subgraph is trained to minimize a canonical cross entropy loss. Thanks to parameter sharing between child models, ENAS is fast: it delivers strong empirical performances using much fewer GPU-hours than all existing automatic model design approaches, and notably, 1000x less expensive than standard Neural Architecture Search. On the Penn Treebank dataset, ENAS discovers a novel architecture that achieves a test perplexity of 55.8, establishing a new state-of-the-art among all methods without post-training processing. On the CIFAR-10 dataset, ENAS designs novel architectures that achieve a test error of 2.89%, which is on par with NASNet (Zoph et al., 2018), whose test error is 2.65%.

The main contribution of this work is to improve the efficiency of NAS by forcing all child models to share weights. The idea has apparent complications, as different child models might utilize their weights differently, but was encouraged by previous work on transfer learning and multitask learning, which have established that parameters learned for a particular model on a particular task can be used for other models on other tasks, with little to no modifications.
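
The sharing scheme can be pictured as a single pool of parameters, indexed by position in the large graph and by operation, from which every child model reads and to which it writes. The sketch below is a toy illustration only: the operation names, the random sampler standing in for the policy-gradient controller, and the constant "gradient" update are all assumptions of the example.

```python
import random

OPS = ["conv3", "conv5", "pool"]   # hypothetical candidate operations
NUM_NODES = 4

# One parameter per (node, op) pair, shared by every child model.
shared = {(n, op): 0.0 for n in range(NUM_NODES) for op in OPS}

def sample_child(rng):
    """Stand-in for the controller: pick one op per node at random."""
    return [rng.choice(OPS) for _ in range(NUM_NODES)]

def train_step(child, lr=0.1):
    """Toy update: nudge only the shared parameters this child uses."""
    for node, op in enumerate(child):
        shared[(node, op)] += lr

rng = random.Random(0)
children = [sample_child(rng) for _ in range(5)]
for child in children:
    train_step(child)

total_updates = round(sum(shared.values()) / 0.1)
```

No child starts from scratch: a child that selects an operation already used by an earlier child inherits the parameter that child trained, which is where the savings in training cost come from.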

https://arxiv.org/abs/1710.09767v1 Meta Learning Shared Hierarchies

We develop a metalearning approach for learning hierarchically structured policies, improving sample efficiency on unseen tasks through the use of shared primitives—policies that are executed for large numbers of timesteps.