**Name** Layer Sharing

**Intent** Improve accuracy by using implicit ensembles that span network layers.

**Motivation** How can we share weights across several layers?

**Structure**

<Diagram>

**Discussion**

**Known Uses**

**Related Patterns**

Batch Normalization

<Diagram>

**References**

http://arxiv.org/abs/1603.09382v2  Deep Networks with Stochastic Depth

Stochastic depth is a training procedure that enables the seemingly contradictory setup of training short networks and using deep networks at test time. We start with very deep networks but, during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function.
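
The drop-and-bypass rule above is easy to sketch. Below is a minimal numpy illustration, not the paper's exact ResNet architecture: the toy residual transformation, the tensor shapes, and the linearly decaying survival probabilities are assumptions made for the example.

<code python>
import numpy as np

def residual_branch(x, W):
    """Toy residual transformation; a real block would be conv + batch norm + ReLU."""
    return np.maximum(0, x @ W)

def stochastic_depth_forward(x, weights, survival_probs, training=True):
    """During training each residual branch survives with probability p_l;
    a dropped branch is bypassed by the identity, so the block outputs x unchanged.
    At test time every branch is kept and scaled by its survival probability."""
    for W, p in zip(weights, survival_probs):
        if training:
            if np.random.rand() < p:
                x = x + residual_branch(x, W)
            # else: identity bypass -- this layer is skipped for the mini-batch
        else:
            x = x + p * residual_branch(x, W)
    return x

# Example: 10 blocks whose survival probability decays linearly from 1.0 to 0.5
weights = [0.1 * np.random.randn(16, 16) for _ in range(10)]
survival_probs = np.linspace(1.0, 0.5, 10)
out = stochastic_depth_forward(np.random.randn(4, 16), weights, survival_probs)
</code>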

http://arxiv.org/pdf/1605.06465v1.pdf  Swapout: Learning an ensemble of deep architectures

When viewed as a regularization method, swapout not only inhibits co-adaptation of units in a layer, similar to dropout, but also across network layers. We conjecture that swapout achieves strong regularization by implicitly tying the parameters across layers. When viewed as an ensemble training method, it samples a much richer set of architectures than existing methods such as dropout or stochastic depth.

{{:wiki:swapout.png?nolink|}}
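
The paper expresses the swapout layer output as y = Θ1 ⊙ x + Θ2 ⊙ F(x), with independent per-unit Bernoulli masks Θ1 and Θ2. A minimal numpy sketch of that training-time rule follows; the retain probabilities and the stand-in transformation F(x) are illustrative, and inference would additionally need the deterministic or stochastic averaging discussed in the paper.

<code python>
import numpy as np

def swapout(x, fx, p1=0.8, p2=0.8):
    """Training-time swapout: y = theta1 * x + theta2 * F(x) with independent
    per-unit Bernoulli masks.  Each unit thereby samples one of four behaviours:
    dropped (0), skip (x), feed-forward (F(x)), or residual (x + F(x))."""
    theta1 = (np.random.rand(*x.shape) < p1).astype(x.dtype)
    theta2 = (np.random.rand(*x.shape) < p2).astype(x.dtype)
    return theta1 * x + theta2 * fx

x = np.random.randn(4, 16)
fx = np.maximum(0, x @ (0.1 * np.random.randn(16, 16)))  # stand-in for the layer transformation F(x)
y = swapout(x, fx)
</code>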

https://arxiv.org/abs/1604.03640  Bridging the Gaps Between Residual Learning, Recurrent Neural Networks and Visual Cortex

We begin with the observation that a shallow RNN is exactly equivalent to a very deep ResNet with weight sharing among the layers.

{{https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-01/1076c2ca761d741f034fd516182e994dc3212891/14-Figure11-1.png}}

A ResNet can be reformulated into a recurrent form that is almost identical to a conventional RNN.
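
The equivalence is easiest to see in code: reusing one residual block at every depth is the same computation as unrolling a recurrent cell for that many steps. A small numpy sketch under assumed simplifications (a ReLU residual update and no per-step external input):

<code python>
import numpy as np

def shared_residual_step(h, W):
    """One residual update h <- h + relu(h W); the same W is reused at every layer."""
    return h + np.maximum(0, h @ W)

def resnet_with_layer_sharing(x, W, depth):
    """A deep ResNet whose residual blocks all share the weight matrix W.
    Unrolling this loop is exactly a recurrent cell h_t = h_{t-1} + relu(h_{t-1} W)
    run for `depth` time steps."""
    h = x
    for _ in range(depth):
        h = shared_residual_step(h, W)
    return h

W = 0.05 * np.random.randn(16, 16)
h = resnet_with_layer_sharing(np.random.randn(4, 16), W, depth=30)
</code>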

http://arxiv.org/abs/1603.03116v2  Low-rank passthrough neural networks

We observe that these architectures, hereby characterized as Passthrough Networks, in addition to the mitigation of the vanishing gradient problem, enable the decoupling of the network state size from the number of parameters of the network, a possibility that is exploited in some recent works but not thoroughly explored.

{{https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-01/4953a80791ec513d97ab0519134d50f2aee10039/5-Figure2-1.png?600x600}}
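
The decoupling comes from parametrising the state-to-state transformations with low-rank factors, so the parameter count grows with the rank rather than with the square of the state size. The sketch below uses a highway-style passthrough update as a stand-in; the gating form, the rank, and the initialisation are assumptions, not the paper's exact LSTM/GRU parametrisations.

<code python>
import numpy as np

def low_rank(n, k, scale=0.1):
    """Factorized n x n matrix W = L @ R of rank k: 2*n*k parameters instead of n*n,
    so the state size n is decoupled from the parameter count."""
    return scale * np.random.randn(n, k), scale * np.random.randn(k, n)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def passthrough_step(h, gate_params, cand_params):
    """Highway-style passthrough update: h_next = t * candidate + (1 - t) * h.
    The (1 - t) * h term is the passthrough path that mitigates vanishing gradients."""
    Lt, Rt = gate_params
    Lc, Rc = cand_params
    t = sigmoid(h @ Lt @ Rt)
    candidate = np.tanh(h @ Lc @ Rc)
    return t * candidate + (1.0 - t) * h

n, k = 256, 16   # large state, small rank
h = passthrough_step(np.random.randn(4, n), low_rank(n, k), low_rank(n, k))
</code>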

http://arxiv.org/pdf/1605.07648v1.pdf  FractalNet: Ultra-Deep Neural Networks without Residuals

{{https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-01/0d0101e65e52ae0cec38bcd13c6a9d631979c577/1-Figure1-1.png?600x600}}

{{https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-01/0d0101e65e52ae0cec38bcd13c6a9d631979c577/4-Figure2-1.png?600x600}}

Drop-path: a fractal network block still functions with some connections between layers disabled, provided some path from input to output remains available. Drop-path guarantees at least one such path while sampling a subnetwork with many other paths disabled. During training, presenting a different active subnetwork to each mini-batch prevents co-adaptation of parallel paths. A global sampling strategy returns a single column as the subnetwork; alternating it with local sampling encourages the development of individual columns as performant stand-alone subnetworks.
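
A minimal numpy sketch of a fractal join under drop-path. The mean join and the local-versus-global sampling follow the paper; the drop probability and the way a global column index is mapped to a join input here are simplifications for illustration.

<code python>
import numpy as np

def drop_path_join(paths, drop_prob=0.15, global_col=None):
    """Join the outputs of parallel paths entering a fractal join.
    Local sampling: drop each incoming path independently, but always keep at least one.
    Global sampling: keep only the path of the chosen column, yielding a single-column subnetwork.
    The join itself is the elementwise mean of the surviving paths."""
    if global_col is not None:
        kept = [paths[global_col % len(paths)]]
    else:
        keep = [np.random.rand() >= drop_prob for _ in paths]
        if not any(keep):                          # guarantee at least one active path
            keep[np.random.randint(len(paths))] = True
        kept = [p for p, k in zip(paths, keep) if k]
    return sum(kept) / len(kept)

# Joining two parallel paths of equal shape
a, b = np.random.randn(4, 16), np.random.randn(4, 16)
y_local = drop_path_join([a, b])                   # local sampling, resampled per mini-batch
y_global = drop_path_join([a, b], global_col=0)    # global sampling returns a single column
</code>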

https://www.quora.com/How-significant-is-the-FractalNet-paper

A related paper that I think is more significant is this one: [1603.09382] Deep Networks with Stochastic Depth

The stochastic depth paper has broader significance because it shows that deep models that don’t fall under the paradigm of “deep neural networks learn a different representation at every layer” can work really well. This gives a lot of evidence in favor of the “deep neural networks learn a multi-step program” paradigm. Previously, people were aware of both interpretations; most deep learning models could be described by both paradigms equally well, but I feel the representation-learning interpretation was more popular (considering that one of the main deep learning conferences is called the International Conference on Learning Representations). The stochastic depth paper shows that you can have a single representation that gets updated by many steps of a program, and that this works really well. It also shows that just letting this program run longer is helpful, even if it wasn’t trained to run that long. This suggests that the relatively neglected multi-step-program interpretation may have been the more important one all along.

{{https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-01/a38168015a783fecc5830260a7eb5b9e3e945ee2/4-Figure2-1.png?600x600}}

https://arxiv.org/abs/1606.01305  Zoneout: Regularizing RNNs by Randomly Preserving Hidden Activations

We propose zoneout, a novel method for regularizing RNNs. At each timestep, zoneout stochastically forces some hidden units to maintain their previous values. Like dropout, zoneout uses random noise to train a pseudo-ensemble, improving generalization.
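
Zoneout's per-unit rule can be written as h_t = d_t ⊙ h_{t-1} + (1 − d_t) ⊙ h̃_t, where d_t is a Bernoulli mask and h̃_t is the ordinary RNN update. A minimal numpy sketch, with the zoneout probability and the toy tanh transition chosen only for illustration:

<code python>
import numpy as np

def zoneout(h_prev, h_new, zoneout_prob=0.15, training=True):
    """Per-unit zoneout: with probability `zoneout_prob` a hidden unit keeps its
    previous value, otherwise it takes the newly computed value.  At test time
    the mask is replaced by its expectation."""
    if training:
        mask = (np.random.rand(*h_prev.shape) < zoneout_prob).astype(h_prev.dtype)
        return mask * h_prev + (1.0 - mask) * h_new
    return zoneout_prob * h_prev + (1.0 - zoneout_prob) * h_new

h_prev = np.random.randn(4, 32)
h_new = np.tanh(h_prev @ (0.1 * np.random.randn(32, 32)))  # stand-in for the RNN transition
h_t = zoneout(h_prev, h_new)
</code>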

https://arxiv.org/abs/1609.01704  Hierarchical Multiscale Recurrent Neural Networks

In this paper, we propose a novel multiscale approach, called the hierarchical multiscale recurrent neural networks, which can capture the latent hierarchical structure in the sequence by encoding the temporal dependencies with different timescales using a novel update mechanism. We show some evidence that our proposed multiscale architecture can discover underlying hierarchical structure in the sequences without using explicit boundary information.

{{https://ai2-s2-public.s3.amazonaws.com/figures/2016-11-01/6bc2aa55575279f961c30468bc0b3777d906b23c/3-Figure1-1.png}}
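
As a rough caricature of the multiscale update idea (not the paper's actual COPY/UPDATE/FLUSH mechanism, which also learns the boundary detector end-to-end): a fast layer updates at every timestep, while a slow layer copies its state and only updates when a boundary indicator fires.

<code python>
import numpy as np

def two_scale_step(x_t, h_fast, h_slow, Wx, Wf, Ws, boundary):
    """One timestep of a toy two-scale RNN.  The fast layer always updates;
    the slow layer copies its previous state unless `boundary` signals the end
    of a segment, in which case it updates from the fast layer's summary."""
    h_fast = np.tanh(x_t @ Wx + h_fast @ Wf)
    if boundary:
        h_slow = np.tanh(h_fast @ Ws + h_slow)   # update only at segment boundaries
    return h_fast, h_slow

d = 32
Wx, Wf, Ws = (0.1 * np.random.randn(d, d) for _ in range(3))
h_fast, h_slow = np.zeros((1, d)), np.zeros((1, d))
for boundary in [False, False, True, False, True]:   # boundaries given here; learned in the paper
    x_t = np.random.randn(1, d)
    h_fast, h_slow = two_scale_step(x_t, h_fast, h_slow, Wx, Wf, Ws, boundary)
</code>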