https://arxiv.org/abs/1612.06915 AIVAT: A New Variance Reduction Technique for Agent Evaluation in Imperfect Information Games

In this paper, we introduce AIVAT, a low-variance, provably unbiased value-assessment tool that uses an arbitrary heuristic estimate of state value, as well as the explicit strategy of a subset of the agents. Unlike existing techniques, which reduce the variance from chance events or only consider game-ending actions, AIVAT reduces the variance both from choices by nature and from players with a known strategy. The resulting estimator in no-limit poker can reduce the number of hands needed to draw statistical conclusions by more than a factor of 10.
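A minimal sketch of the underlying idea, a zero-mean control-variate correction added at every chance node and at every decision node of a player whose strategy is known (all helper names here are hypothetical; the paper's full estimator also uses counterfactual values to handle imperfect information):

    def aivat_style_estimate(steps, terminal_utility, value_fn, known_actors):
        """Baseline-corrected value estimate for one sampled game (sketch only).

        steps: records with fields actor, action_probs, next_states, sampled_next_state,
               where action_probs/next_states enumerate the actor's legal actions.
        value_fn: any heuristic state-value estimate; a better estimate means lower
                  variance, but unbiasedness does not depend on its quality.
        known_actors: 'chance' plus the players whose strategies are known.
        """
        correction = 0.0
        for step in steps:
            if step.actor in known_actors:
                expected = sum(p * value_fn(s)
                               for p, s in zip(step.action_probs, step.next_states))
                sampled = value_fn(step.sampled_next_state)
                correction += expected - sampled   # zero mean, so the estimate stays unbiased
        return terminal_utility + correction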

https://arxiv.org/abs/1612.09596 Counterfactual Prediction with Deep Instrumental Variables Networks

This paper provides a recipe for combining ML algorithms to solve for causal effects in the presence of instrumental variables – sources of treatment randomization that are conditionally independent from the response. We show that a flexible IV specification resolves into two prediction tasks that can be solved with deep neural nets: a first-stage network for treatment prediction and a second-stage network whose loss function involves integration over the conditional treatment distribution. This Deep IV framework imposes some specific structure on the stochastic gradient descent routine used for training, but it is general enough that we can take advantage of off-the-shelf ML capabilities and avoid extensive algorithm customization. We outline how to obtain out-of-sample causal validation in order to avoid over-fitting.
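A rough sketch of the second-stage loss, with `sample_treatment` standing in for draws from the first-stage treatment network and `outcome_net` for the second-stage response network (both hypothetical names; note the paper points out that an unbiased gradient requires two independent treatment samples per example, which this sketch omits):

    import numpy as np

    def second_stage_loss(x, z, y, sample_treatment, outcome_net, n_samples=10):
        """Squared error between the observed response and the integral of the
        outcome network over the first-stage treatment distribution,
        approximated here by Monte Carlo sampling (sketch only)."""
        losses = []
        for xi, zi, yi in zip(x, z, y):
            draws = [sample_treatment(xi, zi) for _ in range(n_samples)]
            integral = np.mean([outcome_net(t, xi) for t in draws])
            losses.append((yi - integral) ** 2)
        return float(np.mean(losses))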

https://arxiv.org/pdf/1701.01724v1.pdf DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker

Artificial intelligence has seen a number of breakthroughs in recent years, with games often serving as significant milestones. A common feature of games with these successes is that they involve information symmetry among the players, where all players have identical information. This property of perfect information, though, is far more common in games than in real-world problems. Poker is the quintessential game of imperfect information, and it has been a longstanding challenge problem in artificial intelligence. In this paper we introduce DeepStack, a new algorithm for imperfect information settings such as poker. It combines recursive reasoning to handle information asymmetry, decomposition to focus computation on the relevant decision, and a form of intuition about arbitrary poker situations that is automatically learned from self-play games using deep learning. In a study involving dozens of participants and 44,000 hands of poker, DeepStack becomes the first computer program to beat professional poker players in heads-up no-limit Texas hold’em. Furthermore, we show this approach dramatically reduces worst-case exploitability compared to the abstraction paradigm that has been favored for over a decade.

https://arxiv.org/pdf/1605.03661v2.pdf Learning Representations for Counterfactual Inference

https://arxiv.org/abs/1701.01724v2 DeepStack: Expert-Level Artificial Intelligence in No-Limit Poker

https://www.cs.cmu.edu/~sandholm/dynamicThresholding.aaai17.pdf

Regret minimization is widely used in determining strategies for imperfect-information games and in online learning. In large games, computing the regrets associated with a single iteration can be slow. For this reason, pruning – in which parts of the decision tree are not traversed in every iteration – has emerged as an essential method for speeding up iterations in large games. The ability to prune is a primary reason why the Counterfactual Regret Minimization (CFR) algorithm using regret matching has emerged as the most popular iterative algorithm for imperfect-information games, despite its relatively poor convergence bound. In this paper, we introduce dynamic thresholding, in which a threshold is set at every iteration such that any action in the decision tree with probability below the threshold is set to zero probability. This enables pruning for the first time in a wide range of algorithms. We prove that dynamic thresholding can be applied to Hedge while increasing its convergence bound by only a constant factor in terms of number of iterations. Experiments demonstrate a substantial improvement in performance for Hedge as well as the excessive gap technique.
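The mechanism itself is simple to sketch; the schedule below is illustrative, whereas the paper derives the specific threshold that keeps Hedge's convergence bound within a constant factor:

    import numpy as np

    def threshold_strategy(probs, iteration, c=0.1):
        """Zero out actions whose probability falls below a per-iteration
        threshold so their subtrees can be skipped, then renormalize (sketch)."""
        probs = np.asarray(probs, dtype=float)
        threshold = c / np.sqrt(iteration)
        pruned = np.where(probs < threshold, 0.0, probs)
        if pruned.sum() == 0.0:              # keep at least the most likely action
            pruned[np.argmax(probs)] = 1.0
        return pruned / pruned.sum()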

https://www.cs.cmu.edu/~sganzfri/Thesis.pdf Computing Strong Game-Theoretic Strategies and Exploiting Suboptimal Opponents in Large Games

https://arxiv.org/pdf/1011.0686.pdf A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning

In this paper, we propose a new iterative algorithm, which trains a stationary deterministic policy, that can be seen as a no-regret algorithm in an online learning setting. We show that any such no-regret algorithm, combined with additional reduction assumptions, must find a policy with good performance under the distribution of observations it induces in such sequential settings. We demonstrate that this new approach outperforms previous approaches on two challenging imitation learning problems and a benchmark sequence labeling problem.
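The iterative procedure described here (DAgger) reduces to a short loop: roll out the current policy, ask the expert to label the states actually visited, aggregate the data, and retrain. A minimal sketch with hypothetical `env`, `expert`, and `fit` interfaces:

    def dagger(env, expert, fit, n_iters=10, rollouts_per_iter=20):
        """Dataset-aggregation sketch: training data always comes from the state
        distribution induced by the current policy, labelled by the expert."""
        dataset = []
        policy = expert                          # first pass follows the expert
        for _ in range(n_iters):
            for _ in range(rollouts_per_iter):
                state, done = env.reset(), False
                while not done:
                    dataset.append((state, expert(state)))   # expert label on visited state
                    state, done = env.step(policy(state))    # but act with the current policy
            policy = fit(dataset)                # any no-regret / supervised learner
        return policy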

https://github.com/clinicalml/cfrnet

cfrnet implements counterfactual regression (CFR) by learning balanced representations, as developed by Johansson, Shalit & Sontag (2016) and Shalit, Johansson & Sontag (2016). It is written in Python using TensorFlow and NumPy.

https://arxiv.org/abs/1702.06676v2 Counterfactual Control for Free from Generative Models

We introduce a method by which a generative model learning the joint distribution between actions and future states can be used to automatically infer a control scheme for any desired reward function, which may be altered on the fly without retraining the model. In this method, the problem of action selection is reduced to one of gradient descent on the latent space of the generative model, with the model itself providing the means of evaluating outcomes and finding the gradient, much like how the reward network in Deep Q-Networks (DQN) provides gradient information for the action generator. Unlike DQN or Actor-Critic, which are conditional models for a specific reward, using a generative model of the full joint distribution permits the reward to be changed on the fly. In addition, the generated futures can be inspected to gain insight into what the network 'thinks' will happen, and into what went wrong when the outcomes deviate from prediction. https://github.com/arayabrain/GenerativeControl
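A sketch of the action-selection procedure as described: hold the model fixed and ascend the reward in latent space. The finite-difference gradient below just keeps the sketch framework-agnostic; in practice the gradient is obtained by backpropagating through the generative model.

    import numpy as np

    def plan_in_latent_space(decode, reward, z_dim, steps=100, lr=0.05, eps=1e-3):
        """decode(z) -> (action, predicted_future) is the learned joint model;
        reward(predicted_future) -> float can be swapped without retraining."""
        z = np.zeros(z_dim)
        for _ in range(steps):
            base = reward(decode(z)[1])
            grad = np.zeros(z_dim)
            for i in range(z_dim):                         # finite-difference gradient
                dz = np.zeros(z_dim)
                dz[i] = eps
                grad[i] = (reward(decode(z + dz)[1]) - base) / eps
            z += lr * grad                                 # ascend the chosen reward
        action, predicted_future = decode(z)
        return action, predicted_future                    # the future can be inspected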

https://arxiv.org/abs/1704.03073 Data-efficient Deep Reinforcement Learning for Dexterous Manipulation

We introduce two extensions to the Deep Deterministic Policy Gradient algorithm (DDPG), a model-free Q-learning based method, which make it significantly more data-efficient and scalable. Our results show that by making extensive use of off-policy data and replay, it is possible to find control policies that robustly grasp objects and stack them. Further, our results hint that it may soon be feasible to train successful stacking policies by collecting interactions on real robots.

https://arxiv.org/abs/1704.00773v1 A comparative study of counterfactual estimators

We provide a comparative study of several widely used off-policy estimators (Empirical Average, Basic Importance Sampling and Normalized Importance Sampling), detailing the different regimes where they are individually suboptimal. We then exhibit properties optimal estimators should possess. In the case where examples have been gathered using multiple policies, we show that fused estimators dominate basic ones but can still be improved.
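For reference, the three basic estimators compared in the paper, in a hedged numpy sketch (rewards and the two sets of action probabilities are per logged example):

    import numpy as np

    def empirical_average(rewards):
        """Ignores the logging policy; only valid when it matches the target policy."""
        return float(np.mean(rewards))

    def basic_importance_sampling(rewards, target_probs, logging_probs):
        """Unbiased, but the variance grows as the two policies diverge."""
        w = np.asarray(target_probs) / np.asarray(logging_probs)
        return float(np.mean(w * np.asarray(rewards)))

    def normalized_importance_sampling(rewards, target_probs, logging_probs):
        """Self-normalized weights: a small finite-sample bias in exchange for lower variance."""
        w = np.asarray(target_probs) / np.asarray(logging_probs)
        return float(np.sum(w * np.asarray(rewards)) / np.sum(w))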

https://arxiv.org/abs/1705.02955v1 Safe and Nested Subgame Solving for Imperfect-Information Games

Unlike perfect-information games, imperfect-information games cannot be solved by decomposing the game into subgames that are solved independently. Instead, all decisions must consider the strategy of the game as a whole, and more computationally intensive algorithms are used. While it is not possible to solve an imperfect-information game exactly through decomposition, it is possible to approximate solutions, or improve existing strategies, by solving disjoint subgames. This process is referred to as subgame solving. We introduce subgame solving techniques that outperform prior methods both in theory and practice. We also show how to adapt them, and past subgame solving techniques, to respond to opponent actions that are outside the original action abstraction; this significantly outperforms the prior state-of-the-art approach, action translation. Finally, we show that subgame solving can be repeated as the game progresses down the tree, leading to lower exploitability. Subgame solving is a key component of Libratus, the first AI to defeat top humans in heads-up no-limit Texas hold'em poker.

We introduced a subgame solving technique for imperfect-information games that has stronger theoretical guarantees and better practical performance than prior subgame-solving methods. We presented results on exploitability of both safe and unsafe subgame solving techniques. We also introduced a method for nested subgame solving in response to the opponent’s off-tree actions, and demonstrated that this leads to dramatically better performance than the usual approach of action translation. This is, to our knowledge, the first time that exploitability of subgame solving techniques has been measured in large games. Finally, we demonstrated the effectiveness of these techniques in practice against top human professionals in the game of heads-up no-limit Texas hold’em poker, the main benchmark challenge for AI in imperfect-information games. In the 2017 Brains vs. AI competition, our AI Libratus became the first AI to reach the milestone of defeating top humans in heads-up no-limit Texas hold’em.
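A rough sketch of the nested re-solving idea for off-tree actions (all interfaces hypothetical): rather than translating the opponent's action to the nearest abstract action, construct a subgame rooted after the actual action and solve it, constrained by the blueprint's values so the result cannot become more exploitable (the "safe" part).

    def respond(state, opponent_action, blueprint, solve_subgame):
        """Nested subgame solving sketch for an off-tree opponent action."""
        if opponent_action in blueprint.actions(state):
            # on-tree: just follow the precomputed blueprint strategy
            return blueprint.action(state.after(opponent_action))
        # off-tree: re-solve a fresh subgame rooted after the real action,
        # using blueprint values as the safety constraint
        subgame = state.after(opponent_action)
        refined = solve_subgame(subgame, blueprint)
        return refined.action(subgame)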

https://arxiv.org/pdf/1705.08926v1.pdf Counterfactual Multi-Agent Policy Gradients

There is a great need for new reinforcement learning methods that can efficiently learn decentralised policies for cooperative multi-agent systems. To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents’ policies.

https://arxiv.org/abs/1605.03661v2 Learning Representations for Counterfactual Inference

We propose a new algorithmic framework for counterfactual inference which brings together ideas from domain adaptation and representation learning. In addition to a theoretical justification, we perform an empirical comparison with previous approaches to causal inference from observational data. Our deep learning algorithm significantly outperforms the previous state-of-the-art.
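The objective combines a factual prediction loss with a penalty on the distance between the treated and control representation distributions. The mean-difference penalty below is a simple stand-in for the discrepancy/IPM terms used in the papers (all function names are hypothetical):

    import numpy as np

    def balanced_representation_loss(phi, predict, x, t, y, alpha=1.0):
        """Factual loss plus a representation-balance penalty (sketch only)."""
        t = np.asarray(t)
        reps = np.array([phi(xi) for xi in x])
        preds = np.array([predict(r, ti) for r, ti in zip(reps, t)])
        factual = np.mean((preds - np.asarray(y)) ** 2)
        treated, control = reps[t == 1], reps[t == 0]
        imbalance = np.sum((treated.mean(axis=0) - control.mean(axis=0)) ** 2)
        return factual + alpha * imbalance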

https://arxiv.org/abs/1608.06580 Communication complexity of approximate Nash equilibria

https://arxiv.org/abs/1708.03658 iTrace: An Implicit Trust Inference Method for Trust-aware Collaborative Filtering

http://www.cs.cmu.edu/~noamb/papers/17-NIPS-Safe.pdf Safe and Nested Subgame Solving for Imperfect-Information Games

http://web.cs.ucla.edu/~kaoru/theoretical-impediments.pdf Theoretical Impediments to Machine Learning

Current machine learning systems operate, almost exclusively, in a purely statistical mode, which puts severe theoretical limits on their performance. We consider the feasibility of leveraging counterfactual reasoning in machine learning tasks and identify areas where such reasoning could lead to major breakthroughs in machine learning applications.

http://www.cs.ox.ac.uk/publications/publication11394-abstract.html Counterfactual Multi-Agent Policy Gradients

To this end, we propose a new multi-agent actor-critic method called counterfactual multi-agent (COMA) policy gradients. COMA uses a centralised critic to estimate the Q-function and decentralised actors to optimise the agents' policies. In addition, to address the challenges of multi-agent credit assignment, it uses a counterfactual baseline that marginalises out a single agent's action, while keeping the other agents' actions fixed. COMA also uses a critic representation that allows the counterfactual baseline to be computed efficiently in a single forward pass. We evaluate COMA in the testbed of StarCraft unit micromanagement, using a decentralised variant with significant partial observability. COMA significantly improves average performance over other multi-agent actor-critic methods in this setting, and the best performing agents are competitive with state-of-the-art centralised controllers that get access to the full state.
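The counterfactual baseline itself is a one-liner once the critic has produced Q-values for every one of this agent's actions with the other agents' actions held fixed; a hedged sketch:

    import numpy as np

    def coma_advantage(q_values, policy_probs, taken_action):
        """Advantage of the taken action over the expectation obtained by
        marginalising out only this agent's own action (sketch)."""
        baseline = float(np.dot(policy_probs, q_values))   # E_{u ~ pi_a} Q(s, (u, u_-a))
        return q_values[taken_action] - baseline           # per-agent credit assignment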

https://arxiv.org/pdf/1803.10760.pdf Unsupervised Predictive Memory in a Goal-Directed Agent

We develop a model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling. MERLIN facilitates the solution of tasks in 3D virtual reality environments for which partial observability is severe and memories must be maintained over long durations. Our model demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences.

We propose MERLIN, an integrated AI agent architecture that acts in partially observed virtual reality environments and stores information in memory based on different principles from existing end-to-end AI systems: it learns to process high-dimensional sensory streams, compress and store them, and recall events with less dependence on task reward. We bring together ingredients from external memory systems, reinforcement learning, and state estimation (inference) models and combine them into a unified system using inspiration from three ideas originating in psychology and neuroscience: predictive sensory coding, the hippocampal representation theory of Gluck and Myers, and the temporal context model and successor representation.

https://www.arxiv-vanity.com/papers/1803.08460/ Towards Universal Representation for Unseen Action Recognition

This paper proposes a pipeline using a large-scale training source to achieve a Universal Representation (UR) that can generalise to a more realistic Cross-Dataset UAR (CD-UAR) scenario. We first address UAR as a Generalised Multiple-Instance Learning (GMIL) problem and discover ‘building-blocks’ from the large-scale ActivityNet dataset using distribution kernels. Essential visual and semantic components are preserved in a shared space to achieve the UR that can efficiently generalise to new datasets. Predicted UR exemplars can be improved by a simple semantic adaptation, and an unseen action can then be directly recognised using the UR at test time.

https://arxiv.org/pdf/1804.09401.pdf Generative Temporal Models with Spatial Memory for Partially Observed Environments

In this work we introduce a novel action-conditioned generative model of such challenging environments. The model features a non-parametric spatial memory system in which we store learned, disentangled representations of the environment. Low-dimensional spatial updates are computed using a state-space model that makes use of knowledge of the prior dynamics of the moving agent, and high-dimensional visual observations are modelled with a Variational Auto-Encoder.

To our knowledge, this is the first published work that builds a generative model for agents walking in an environment which, thanks to the separation of the dynamics and visual information in the DND memory, can coherently generate for hundreds of time steps in a scalable way.

https://arxiv.org/pdf/1805.08195.pdf Depth-Limited Solving for Imperfect-Information Games

This paper introduces a principled way to conduct depth-limited solving in imperfect-information games by allowing the opponent to choose among a number of strategies for the remainder of the game at the depth limit. Each one of these strategies results in a different set of values for leaf nodes. This forces an agent to be robust to the different strategies an opponent may employ. We demonstrate the effectiveness of this approach by building a master-level heads-up no-limit Texas hold’em poker AI that defeats two prior top agents using only a 4-core CPU and 16 GB of memory. Developing such a powerful agent would have previously required a supercomputer.
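The key mechanic, scoring a depth-limit leaf by letting the opponent pick its best continuation strategy, can be sketched as follows (interfaces hypothetical):

    def leaf_value(leaf, continuation_strategies, evaluate):
        """Value of a depth-limit leaf from the searching agent's perspective:
        the opponent chooses among the candidate continuation strategies, so
        the agent scores the leaf pessimistically over that set (sketch)."""
        return min(evaluate(leaf, sigma) for sigma in continuation_strategies)

Because the opponent is given this choice at every leaf, any strategy computed against these values must be robust to every candidate continuation, which is what restores usable leaf values in the imperfect-information setting.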

https://arxiv.org/abs/1803.10760 Unsupervised Predictive Memory in a Goal-Directed Agent

An obvious requirement for handling partially observed tasks is access to extensive memory, but we show memory is not enough; it is critical that the right information be stored in the right format. We develop a model, the Memory, RL, and Inference Network (MERLIN), in which memory formation is guided by a process of predictive modeling. MERLIN facilitates the solution of tasks in 3D virtual reality environments for which partial observability is severe and memories must be maintained over long durations. Our model demonstrates a single learning agent architecture that can solve canonical behavioural tasks in psychology and neurobiology without strong simplifying assumptions about the dimensionality of sensory input or the duration of experiences.

https://arxiv.org/abs/1808.10442 Application of Self-Play Reinforcement Learning to a Four-Player Game of Imperfect Information

We introduce a new virtual environment for simulating a card game known as “Big 2”. This is a four-player game of imperfect information with a relatively complicated action space (players may play combinations of 1, 2, 3, 4, or 5 cards from an initial starting hand of 13 cards). As such, it poses a challenge for many current reinforcement learning methods. We then use the recently proposed “Proximal Policy Optimization” algorithm to train a deep neural network to play the game, purely learning via self-play, and find that it is able to reach a level which outperforms amateur human players after only a relatively short amount of training time and without needing to search a tree of future game states.
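A minimal sketch of the self-play data collection described, with all four seats controlled by the same current policy (the `env` and `policy` interfaces are hypothetical; the collected trajectories would then feed the PPO update):

    def self_play_episode(env, policy):
        """Play one four-player episode with a single shared policy and return
        per-seat trajectories for the policy-gradient update (sketch)."""
        trajectories = {seat: [] for seat in range(4)}
        obs, done = env.reset(), False
        while not done:
            seat = env.current_player()
            observation = obs[seat]
            action, logprob = policy.act(observation)
            obs, rewards, done = env.step(action)
            trajectories[seat].append((observation, action, logprob, rewards[seat]))
        return trajectories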

https://arxiv.org/abs/1809.04040 Solving Imperfect-Information Games via Discounted Regret Minimization

https://arxiv.org/abs/1809.07893v1 Solving Large Extensive-Form Games with Strategy Constraints

In this work we introduce a generalized form of Counterfactual Regret Minimization that provably finds optimal strategies under any feasible set of convex constraints.

https://arxiv.org/abs/1805.08195 Depth-Limited Solving for Imperfect-Information Games

Noam Brown, Tuomas Sandholm, Brandon Amos (submitted 21 May 2018). A fundamental challenge in imperfect-information games is that states do not have well-defined values. As a result, depth-limited search algorithms used in single-agent settings and perfect-information games do not apply. This paper introduces a principled way to conduct depth-limited solving in imperfect-information games by allowing the opponent to choose among a number of strategies for the remainder of the game at the depth limit. Each one of these strategies results in a different set of values for leaf nodes. This forces an agent to be robust to the different strategies an opponent may employ. We demonstrate the effectiveness of this approach by building a master-level heads-up no-limit Texas hold'em poker AI that defeats two prior top agents using only a 4-core CPU and 16 GB of memory. Developing such a powerful agent would have previously required a supercomputer.

https://arxiv.org/abs/1807.10299v1 Variational Option Discovery Algorithms

First: we highlight a tight connection between variational option discovery methods and variational autoencoders, and introduce Variational Autoencoding Learning of Options by Reinforcement (VALOR), a new method derived from the connection. In VALOR, the policy encodes contexts from a noise distribution into trajectories, and the decoder recovers the contexts from the complete trajectories. Second: we propose a curriculum learning approach where the number of contexts seen by the agent increases whenever the agent's performance is strong enough (as measured by the decoder) on the current set of contexts. We show that this simple trick stabilizes training for VALOR and prior variational option discovery methods, allowing a single agent to learn many more modes of behavior than it could with a fixed context distribution. Finally, we investigate other topics related to variational option discovery, including fundamental limitations of the general approach and the applicability of learned options to downstream tasks.
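The curriculum trick in the second contribution reduces to a small rule; the specific threshold and growth values below are illustrative, not the paper's:

    def update_context_curriculum(n_contexts, decoder_accuracy,
                                  threshold=0.9, growth=1.5, max_contexts=64):
        """Enlarge the set of contexts only once the decoder can reliably
        recover the current contexts from trajectories (sketch)."""
        if decoder_accuracy >= threshold:
            n_contexts = min(int(n_contexts * growth) + 1, max_contexts)
        return n_contexts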