Deep Learning Datasets

MNIST - Handwritten digits

Google House Numbers - House numbers from Street View

CIFAR-10 and CIFAR-100


Flickr Data - 100 Million Yahoo dataset

UC Irvine Machine Learning Repository

The AR Face Database - Contains over 4,000 color images corresponding to 126 people's faces (70 men and 56 women). Frontal views with variations in facial expressions, illumination, and occlusions. (Formats: RAW (RGB 24-bit))

Yale Face Database - 165 images (15 individuals) with different lighting, expression, and occlusion configurations.

DeepMind QA Corpus - Textual QA corpus from CNN and DailyMail. More than 300K documents in total. See the accompanying paper for reference.

Microsoft Coco Dataset


Google Open Images

The Stanford Question Answering Dataset

Open Subtitles


LSUN - One million labeled images for each of 10 scene categories and 20 object categories.

Yelp Dataset Challenge - 2.7M reviews and 649K tips by 687K users for 86K businesses.

A Large Dataset of Object Scans - More than ten thousand 3D scans of real objects.

Ubuntu Dialog Corpus


Yahoo Datasets

MovieLens - 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users.

AWS Public Datasets

YouTube-8M - 8 million YouTube video IDs and associated labels from a diverse vocabulary of 4,800 visual entities.


Allen Institute Datasets

Unreal Integration

MetaMind Wiki Text

Google Trends Dataset

NewsQA

Making the V in VQA Matter: Elevating the Role of Image Understanding in Visual Question Answering

Procedural Generation of Videos to Train Deep Action Recognition Networks

Microsoft has released a massive database of 100,000 question and answer pairs written by humans to help AI researchers train their machines to extract information better from websites and respond more naturally to questions asked by users.

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning

Frames - Meant to encourage research toward conversational agents that can support decision-making in complex settings, in this case booking a vacation including flights and a hotel. More than just searching a database, we believe the next generation of conversational agents will need to help users explore a database, compare items, and reach a decision.

SelQA - A new selection-based question answering dataset.

SceneNet RGB-D: 5M Photorealistic Images of Synthetic Indoor Trajectories with Ground Truth

Microsoft Speech Language Translation (MSLT) Corpus

SUNCG: A Large 3D Model Repository for Indoor Scenes

MoleculeNet: A Benchmark for Molecular Machine Learning

HolStep: A Machine Learning Dataset for Higher-Order Logic Theorem Proving

Resources

How much data is needed to train a medical image deep learning system to achieve necessary high accuracy?

Bringing Structure into Summaries: Crowdsourcing a Benchmark Corpus of Concept Maps

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

We publicly release a new large-scale dataset, called SearchQA, for machine comprehension, or question-answering. Unlike recently released datasets, such as DeepMind CNN/DailyMail and SQuAD, the proposed SearchQA was constructed to reflect a full pipeline of general question-answering. That is, we do not start from an existing article and generate a question-answer pair; instead, we start from an existing question-answer pair, crawled from J! Archive, and augment it with text snippets retrieved by Google. Following this approach, we built SearchQA, which consists of more than 140k question-answer pairs, with each pair having 49.6 snippets on average. Each question-answer-context tuple of SearchQA comes with additional metadata such as the snippet's URL, which we believe will be a valuable resource for future research. We conduct human evaluation and test two baseline methods on SearchQA, one based on simple word selection and the other on deep learning. We show that there is a meaningful gap between human and machine performance. This suggests that the proposed dataset could well serve as a benchmark for question-answering.

A Large Self-Annotated Corpus for Sarcasm

The NarrativeQA Reading Comprehension Challenge

HappyDB: A Corpus of 100,000 Crowdsourced Happy Moments

Building a Web-Scale Dependency-Parsed Corpus from CommonCrawl

We present DepCC, the largest-to-date linguistically analyzed corpus in English, including 365 million documents composed of 252 billion tokens and 7.5 billion named entity occurrences in 14.3 billion sentences from a web-scale crawl of the Common Crawl project.

A Large-scale Attribute Dataset for Zero-shot Learning

Logical Entailment Dataset

CoNaLa: The Code/Natural Language Challenge

InteriorNet: Mega-scale Multi-sensor Photo-realistic Indoor Scenes Dataset

Optical Illusions Images Dataset

A Dataset and Architecture for Visual Reasoning with a Working Memory

The Decompositional Semantics Initiative - Rapid, simple, commonsensical annotations of meaning

MS MARCO (Microsoft Machine Reading Comprehension) - A large-scale dataset focused on machine reading comprehension, question answering, and passage ranking.

An Atlas of Machine Commonsense for If-Then Reasoning

Standard Benchmark Datasets of Annotated Semantic Relationships