VLQA (Visuo-Linguistic Question Answering)

A Dataset for Joint Reasoning over Visuo-Linguistic Context

What is VLQA?

VLQA is a dataset for joint reasoning over visuo-linguistic context. It consists of 9K image-passage-question-answer items with detailed annotations, meticulously crafted through combined automated and manual efforts. Questions in VLQA are designed to require both visual and textual information, i.e., ignoring either of them would make the question unanswerable.

Solving this dataset requires an AI model that can (i) understand diverse kinds of images, from simple daily-life scenes and standard charts to complex diagrams, (ii) understand complex texts and relate them to the given visual information, and (iii) perform a variety of reasoning tasks and derive inferences.

VLQA Paper

For more details about VLQA dataset creation, annotations, and dataset analysis, please refer to the supplementary material in the above file.

Browse Examples

Download Dataset

Baseline Models

Note (as of September 2022): All of our experiments were run during the early days of transformers. Many of the baselines we implemented are now part of HuggingFace and might be more convenient to use from there. Check out here.

Leaderboard Submission

If you would like your model to be part of our leaderboard, create a prediction.csv file containing two columns, 'qid' and 'pred_answer', covering all test set instances. Then send the prediction.csv file to ssampa17@asu.edu with a brief model description.
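As a rough illustration of the file layout described above, the following Python sketch writes a CSV with the two required column names, 'qid' and 'pred_answer'. Everything else here is an assumption made for illustration: the helper name `write_predictions`, the example question ids, and the example answer values are hypothetical and are not taken from the dataset or from any official submission script.

```python
import csv
import io


def write_predictions(predictions, out_file):
    """Write one row per test instance under the header 'qid', 'pred_answer'.

    `predictions` maps a question id to the model's predicted answer.
    Rows are sorted by qid so the output is deterministic.
    """
    writer = csv.writer(out_file)
    writer.writerow(["qid", "pred_answer"])
    for qid, answer in sorted(predictions.items()):
        writer.writerow([qid, answer])


# Hypothetical ids and answers, used only to show the layout.
example_predictions = {"q2": "1", "q1": "0"}

# Write to an in-memory buffer here; a real submission would write to a file.
buffer = io.StringIO()
write_predictions(example_predictions, buffer)
print(buffer.getvalue())
```

To produce the actual file, the same helper can be called with a handle opened as `open("prediction.csv", "w", newline="")`, passing predictions for every test set instance.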

Distribution and Usage

VLQA is curated from multiple online resources (books, encyclopedias, web crawls, existing datasets, standardized tests, etc.). We provide web references to all such resources used for the images, passages, and question-answer pairs in our dataset (originally curated content might be altered on a case-by-case basis to better fit the purpose of the dataset).

The creation of VLQA is purely research oriented, and so are its distribution and future usage. VLQA is an ongoing effort and we expect the dataset to evolve. If you find our dataset or model helpful, please cite our paper :-)


title={Visuo-Linguistic Question Answering (VLQA) Challenge},
author={Shailaja Sampat and Yezhou Yang and Chitta Baral},

Shailaja Sampat, Yezhou Yang and Chitta Baral
School of Computing, Informatics, and Decision Systems Engineering (CIDSE)
Arizona State University

We are thankful to the National Science Foundation (NSF) for supporting this research under grant IIS-1816039.

Webpage template inspired by SQuAD and RecipeQA leaderboards.
Icon template adapted from Freepik.