Edit: https://docs.google.com/a/codeaudit.com/document/d/1uE-umU70oYCxK1YJjBt_Ls_ZGcLYZnLNKw2Y00jNICg/edit?usp=sharing

Attentive Model

Aliases Attention

This identifies the pattern and should be representative of the concept that it describes. The name should be a noun that should be easily usable within a sentence. We would like the pattern to be easily referenceable in conversation between practitioners.


Training a network to focus on a subset of data leads to more accurate recognition while ignoring irrelevant features.


How do we train a network to focus on a subset of input data?


This section provides alternative descriptions of the pattern in the form of an illustration or alternative formal expression. By looking at the sketch a reader may quickly understand the essence of the pattern. Discussion

This is the main section of the pattern that goes in greater detail to explain the pattern. We leverage a vocabulary that we describe in the theory section of this book. We don’t go into intense detail into providing proofs but rather reference the sources of the proofs. How the motivation is addressed is expounded upon in this section. We also include additional questions that may be interesting topics for future research.

Known Uses

models such as sequence to sequence (seq2seq) with attention, memory networks and pointer networks

Related Patterns In this section we describe in a diagram how this pattern is conceptually related to other patterns. The relationships may be as precise or may be fuzzy, so we provide further explanation into the nature of the relationship. We also describe other patterns may not be conceptually related but work well in combination with this pattern.

Relationship to Canonical Patterns

Relationship to other Patterns

Further Reading

We provide here some additional external material that will help in exploring this pattern in more detail.


To aid in reading, we include sources that are referenced in the text in the pattern.

http://arxiv.org/abs/1601.06823v1 Survey on the attention based RNN model and its applications in computer vision

http://arxiv.org/pdf/1409.3215v3.pdf Sequence to Sequence Learning with Neural Networks

http://arxiv.org/abs/1406.1078 Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation


http://cs224d.stanford.edu/reports/HoHuang.pdf Attend and Hop


https://arxiv.org/abs/1502.04623 DRAW: A Recurrent Neural Network For Image Generation



https://arxiv.org/abs/1409.0473 Neural Machine Translation by Jointly Learning to Align and Translate

https://arxiv.org/abs/1610.06258 Using Fast Weights to Attend to the Recent Past

https://arxiv.org/abs/1610.08613v1 Can Active Memory Replace Attention?

We propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings most benefits and where attention can be a better choice.

https://arxiv.org/abs/1509.06664 Reasoning about Entailment with Neural Attention

In this paper, we propose a neural model that reads two sentences to determine entailment using long short-term memory units. We extend this model with a word-by-word neural attention mechanism that encourages reasoning over entailments of pairs of words and phrases. Furthermore, we present a qualitative analysis of attention weights produced by this model, demonstrating such reasoning capabilities. On a large entailment dataset this model outperforms the previous best neural model and a classifier with engineered features by a substantial margin. It is the first generic end-to-end differentiable system that achieves state-of-the-art accuracy on a textual entailment dataset.

http://jmlr.org/proceedings/papers/v48/santoro16.html Meta-Learning with Memory-Augmented Neural Networks

We demonstrate the ability of a memory-augmented neural network to rapidly assimilate new data, and leverage this data to make accurate predictions after only a few samples. We also introduce a new method for accessing an external memory that focuses on memory content, unlike previous methods that additionally use memory location-based focusing mechanisms.

https://arxiv.org/pdf/1606.01933v1.pdf A Decomposable Attention Model for Natural Language Inference

We propose a simple neural architecture for natural language inference. Our approach uses attention to decompose the problem into subproblems that can be solved separately, thus making it trivially parallelizable. On the Stanford Natural Language Inference (SNLI) dataset, we obtain state-of-the-art results with almost an order of magnitude fewer parameters than previous work and without relying on any word-order information. Adding intra-sentence attention that takes a minimum amount of order into account yields further improvements.

Pictoral overview of the approach, showing the Attend (left), Compare (center) and Aggregate (right) steps.

https://arxiv.org/pdf/1606.01933v1.pdf Hierarchical Attention Networks for Document Classification

We propose a hierarchical attention network for document classification. Our model has two distinctive characteristics: (i) it has a hierarchical structure that mirrors the hierarchical structure of documents; (ii) it has two levels of attention mechanisms applied at the wordand sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation. Experiments conducted on six large scale text classification tasks demonstrate that the proposed architecture outperform previous methods by a substantial margin. Visualization of the attention layers illustrates that the model selects qualitatively informative words and sentences.

https://arxiv.org/pdf/1612.01887.pdf Knowing When to Look: Adaptive Attention via A Visual Sentinel for Image Captioning

Attention-based neural encoder-decoder frameworks have been widely adopted for image captioning. Most methods force visual attention to be active for every generated word. However, the decoder likely requires little to no visual information from the image to predict non-visual words such as “the” and “of”. Other words that may seem visual can often be predicted reliably just from the language model e.g., “sign” after “behind a red stop” or “phone” following “talking on a cell”. In this paper, we propose a novel adaptive attention model with a visual sentinel. At each time step, our model decides whether to attend to the image (and if so, to which regions) or to the visual sentinel. The model decides whether to attend to the image and where, in order to extract meaningful information for sequential word generation. We test our method on the COCO image captioning 2015 challenge dataset and Flickr30K. Our approach sets the new state-of-the-art by a significant margin.

https://arxiv.org/abs/1606.02245 Iterative Alternating Neural Attention for Machine Reading

We propose a novel neural attention architecture to tackle machine comprehension tasks, such as answering Cloze-style queries with respect to a document. Unlike previous models, we do not collapse the query into a single vector, instead we deploy an iterative alternating attention mechanism that allows a fine-grained exploration of both the query and the document. Our model outperforms state-of-the-art baselines in standard machine comprehension benchmarks such as CNN news articles and the Children's Book Test (CBT) dataset.

https://arxiv.org/pdf/1610.08613v2.pdf Can Active Memory Replace Attention?

So far, however, active memory has not improved over attention for most natural language processing tasks, in particular for machine translation. We analyze this shortcoming in this paper and propose an extended model of active memory that matches existing attention models on neural machine translation and generalizes better to longer sentences. We investigate this model and explain why previous active memory models did not succeed. Finally, we discuss when active memory brings most benefits and where attention can be a better choice.

https://arxiv.org/abs/1706.03762 Attention Is All You Need

The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train.

https://arxiv.org/pdf/1704.06904.pdf Residual Attention Network for Image Classification

In this work, we propose “Residual Attention Network”, a convolutional neural network using attention mechanism which can incorporate with state-of-art feed forward network architecture in an end-to-end training fashion. Our Residual Attention Network is built by stacking Attention Modules which generate attention-aware features. The attention-aware features from different modules change adaptively as layers going deeper. Inside each Attention Module, bottom-up top-down feedforward structure is used to unfold the feedforward and feedback attention process into a single feedforward process. Importantly, we propose attention residual learning to train very deep Residual Attention Networks which can be easily scaled up to hundreds of layers.


https://arxiv.org/abs/1702.04521 Frustratingly Short Attention Spans in Neural Language Modeling

In this paper, we propose a neural language model with a key-value attention mechanism that outputs separate representations for the key and value of a differentiable memory, as well as for encoding the next-word distribution. This model outperforms existing memory-augmented neural language models on two corpora. Yet, we found that our method mainly utilizes a memory of the five most recent output representations. This led to the unexpected main finding that a much simpler model based only on the concatenation of recent output representations from previous time steps is on par with more sophisticated memory-augmented neural language models.

https://arxiv.org/pdf/1709.04696.pdf DiSAN: Directional Self-Attention Network for RNN/CNN-free Language Understanding

We propose a novel attention mechanism in which the attention between elements from input sequence(s) is directional and multidimensional, i.e., feature-wise. A light-weight neural net, “Directional Self-Attention Network (DiSAN)”, is then proposed to learn sentence embedding, based solely on the proposed attention without any RNN/CNN structure. DiSAN is only composed of a directional self-attention block with temporal order encoded, followed by a multi-dimensional attention that compresses the sequence into a vector representation.


https://arxiv.org/abs/1712.05652v1 Pre-training Attention Mechanisms

https://arxiv.org/abs/1801.10296 Reinforced Self-Attention Network: a Hybrid of Hard and Soft Attention for Sequence Modeling

https://arxiv.org/abs/1802.05751v2 Image Transformer

In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood. By restricting the self-attention mechanism to attend to local neighborhoods we significantly increase the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. We propose another extension of self-attention allowing it to efficiently take advantage of the two-dimensional nature of images. While conceptually simple, our generative models significantly outperform the current state of the art in image generation on ImageNet, improving the best published negative log-likelihood on ImageNet from 3.83 to 3.77

https://arxiv.org/abs/1803.03067 Compositional Attention Networks for Machine Reasoning

The model approaches problems by decomposing them into a series of attention-based reasoning steps, each performed by a novel recurrent Memory, Attention, and Composition (MAC) cell that maintains a separation between control and memory

https://www.cs.cmu.edu/~diyiy/docs/naacl16.pdf Hierarchical Attention Networks for Document Classification

Our model has two distinctive characteristics: (i) it has a hierarchical structure that mirrors the hierarchical structure of documents; (ii) it has two levels of attention mechanisms applied at the wordand sentence-level, enabling it to attend differentially to more and less important content when constructing the document representation.


https://arxiv.org/abs/1804.06870 Object Ordering with Bidirectional Matchings for Visual Reasoning

In this paper, we propose a novel end-to-end neural model for the NLVR task, where we first use joint bidirectional attention to build a two-way conditioning between the visual information and the language phrases. Next, we use an RL-based pointer network to sort and process the varying number of unordered objects (so as to match the order of the statement phrases) in each of the three images and then pool over the three decisions. Our model achieves strong improvements (of 4-6% absolute) over the state-of-the-art on both the structured representation and raw image versions of the dataset.


https://arxiv.org/abs/1804.09849v2 The Best of Both Worlds: Combining Recent Advances in Neural Machine Translation

In this paper, we tease apart the new architectures and their accompanying techniques in two ways. First, we identify several key modeling and training techniques, and apply them to the RNN architecture, yielding a new RNMT+ model that outperforms all of the three fundamental architectures on the benchmark WMT'14 English to French and English to German tasks.

https://arxiv.org/abs/1805.04174 Joint Embedding of Words and Labels for Text Classification

Word embeddings are effective intermediate representations for capturing semantic regularities between words, when learning the representations of text sequences. We propose to view text classification as a label-word joint embedding problem: each label is embedded in the same space with the word vectors. We introduce an attention framework that measures the compatibility of embeddings between text sequences and labels. The attention is learned on a training set of labeled samples to ensure that, given a text sequence, the relevant words are weighted higher than the irrelevant ones. Our method maintains the interpretability of word embeddings, and enjoys a built-in ability to leverage alternative sources of information, in addition to input text sequences. Extensive results on the several large text datasets show that the proposed framework outperforms the state-of-the-art methods by a large margin, in terms of both accuracy and speed.

https://arxiv.org/abs/1805.09786v1 Hyperbolic Attention Networks

We extend this line of work by imposing hyperbolic geometry on the activations of neural networks. This allows us to exploit hyperbolic geometry to reason about embeddings produced by deep networks. We achieve this by re-expressing the ubiquitous mechanism of soft attention in terms of operations defined for hyperboloid and Klein models. Our method shows improvements in terms of generalization on neural machine translation, learning on graphs and visual question answering tasks while keeping the neural representations compact.

https://arxiv.org/abs/1806.01830 Relational Deep Reinforcement Learning

https://www.mitpressjournals.org/doi/abs/10.1162/isal_a_00095 Being curious about the answers to questions: novelty search with learned attention

https://arxiv.org/abs/1802.04712v4 Attention-based Deep Multiple Instance Learning

Multiple instance learning (MIL) is a variation of supervised learning where a single class label is assigned to a bag of instances. In this paper, we state the MIL problem as learning the Bernoulli distribution of the bag label where the bag label probability is fully parameterized by neural networks. Furthermore, we propose a neural network-based permutation-invariant aggregation operator that corresponds to the attention mechanism. Notably, an application of the proposed attention-based operator provides insight into the contribution of each instance to the bag label. We show empirically that our approach achieves comparable performance to the best MIL methods on benchmark MIL datasets and it outperforms other methods on a MNIST-based MIL dataset and two real-life histopathology datasets without sacrificing interpretability.

https://arxiv.org/abs/1808.00300 Learning Visual Question Answering by Bootstrapping Hard Attention

we introduce a new approach for hard attention and find it achieves very competitive performance on a recently-released visual question answering datasets, equalling and in some cases surpassing similar soft attention architectures while entirely ignoring some features. Even though the hard attention mechanism is thought to be non-differentiable, we found that the feature magnitudes correlate with semantic relevance, and provide a useful signal for our mechanism's attentional selection criterion. Because hard attention selects important features of the input information, it can also be more efficient than analogous soft attention mechanisms.

https://arxiv.org/abs/1808.03728v1 Hierarchical Attention: What Really Counts in Various NLP Tasks

https://arxiv.org/abs/1807.03819 Universal Transformers

nstead of recurring over the individual symbols of sequences like RNNs, the Universal Transformer repeatedly revises its representations of all symbols in the sequence with each recurrent step. In order to combine information from different parts of a sequence, it employs a self-attention mechanism in every recurrent step. Assuming sufficient memory, its recurrence makes the Universal Transformer computationally universal. We further employ an adaptive computation time (ACT) mechanism to allow the model to dynamically adjust the number of times the representation of each position in a sequence is revised. Beyond saving computation, we show that ACT can improve the accuracy of the model. Our experiments show that on various algorithmic tasks and a diverse set of large-scale language understanding tasks the Universal Transformer generalizes significantly better and outperforms both a vanilla Transformer and an LSTM in machine translation, and achieves a new state of the art on the bAbI linguistic reasoning task and the challenging LAMBADA language modeling task.

https://arxiv.org/abs/1808.03867 Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

Current state-of-the-art machine translation systems are based on encoder-decoder architectures, that first encode the input sequence, and then generate an output sequence based on the input encoding. Both are interfaced with an attention mechanism that recombines a fixed encoding of the source tokens based on the decoder state. We propose an alternative approach which instead relies on a single 2D convolutional neural network across both sequences. Each layer of our network re-codes source tokens on the basis of the output sequence produced so far. Attention-like properties are therefore pervasive throughout the network. Our model yields excellent results, outperforming state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.

https://arxiv.org/abs/1808.05578 LARNN: Linear Attention Recurrent Neural Network

The Linear Attention Recurrent Neural Network (LARNN) is a recurrent attention module derived from the Long Short-Term Memory (LSTM) cell and ideas from the consciousness Recurrent Neural Network (RNN). Yes, it LARNNs. The LARNN uses attention on its past cell state values for a limited window size k. The formulas are also derived from the Batch Normalized LSTM (BN-LSTM) cell and the Transformer Network for its Multi-Head Attention Mechanism. The Multi-Head Attention Mechanism is used inside the cell such that it can query its own k past values with the attention window. https://github.com/guillaume-chevalier/Linear-Attention-Recurrent-Neural-Network

http://petar-v.com/GAT/ Graph Attention Networks

https://arxiv.org/abs/1808.04444 Character-Level Language Modeling with Deeper Self-Attention

. In this paper, we show that a deep (64-layer) transformer model with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks- 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.

https://arxiv.org/abs/1808.08946v2 Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures

Our experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.

https://arxiv.org/abs/1808.03867 Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

https://arxiv.org/abs/1809.11087 Learning to Remember, Forget and Ignore using Attention Control in Memory

Applying knowledge gained from psychological studies, we designed a new model called Differentiable Working Memory (DWM) in order to specifically emulate human working memory. As it shows the same functional characteristics as working memory, it robustly learns psychology inspired tasks and converges faster than comparable state-of-the-art models. Moreover, the DWM model successfully generalizes to sequences two orders of magnitude longer than the ones used in training. Our in-depth analysis shows that the behavior of DWM is interpretable and that it learns to have fine control over memory, allowing it to retain, ignore or forget information based on its relevance.

https://openreview.net/forum?id=rJxHsjRqFQ Hyperbolic Attention Networks

By only changing the geometry of embedding of object representations, we can use the embedding space more efficiently without increasing the number of parameters of the model. Mainly as the number of objects grows exponentially for any semantic distance from the query, hyperbolic geometry –as opposed to Euclidean geometry– can encode those objects without having any interference. Our method shows improvements in generalization on neural machine translation on WMT'14 (English to German), learning on graphs (both on synthetic and real-world graph tasks) and visual question answering (CLEVR) tasks while keeping the neural representations compact.