VLQA is a dataset for joint reasoning over visuo-linguistic context. It consists of 9K image-passage-question-answer items with detailed annotations, meticulously crafted through combined automated and manual efforts. Questions in VLQA are designed to require both visual and textual information, i.e., ignoring either modality would make the question unanswerable.
Solving this dataset requires an AI model that can (i) understand diverse kinds of images, from simple daily-life scenes and standard charts to complex diagrams, (ii) understand complex texts and relate them to the given visual information, and (iii) perform a variety of reasoning tasks and derive inferences.
For more details about VLQA dataset creation, annotation, and analysis, please refer to the supplementary material in the above file.
Note (as of September 2022): All of our experimentation was done during the early days of transformers. Many baselines we implemented are now part of HuggingFace and might be more convenient to use. Check out here.
If you would like your model to be part of our leaderboard, create a prediction.csv file containing two columns, 'qid' and 'pred_answer', for all test set instances. Then send the prediction.csv file to ssampa17@asu.edu along with a brief model description.
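The submission file described above can be produced with a few lines of Python. This is a minimal sketch: the `predictions` mapping is a hypothetical placeholder for your model's outputs, and the qid values shown are illustrative, not actual VLQA identifiers.

```python
import csv

# Hypothetical predictions: qid -> pred_answer.
# Replace with your model's outputs for every test set instance.
predictions = {
    "test_0001": "A",
    "test_0002": "C",
}

# Write the two-column CSV expected for leaderboard submission.
with open("prediction.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["qid", "pred_answer"])
    for qid, answer in predictions.items():
        writer.writerow([qid, answer])
```

Make sure the file covers all test instances before emailing it.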
VLQA is curated from multiple online resources (books, encyclopedias, web crawls, existing datasets, standardized tests, etc.). We provide web references to all such resources used for the images, passages, and question-answer pairs in our dataset (originally curated content may be altered on a case-by-case basis to better fit the purpose of the dataset).
The creation of VLQA is purely research oriented, and so are its distribution and future usage. VLQA is an ongoing effort, and we expect the dataset to evolve. If you find our dataset or model helpful, please cite our paper :-)
@misc{sampat2020visuo-linguistic,
    title={Visuo-Linguistic Question Answering (VLQA) Challenge},
    author={Shailaja Sampat and Yezhou Yang and Chitta Baral},
    year={2020},
    eprint={2005.00330},
    archivePrefix={arXiv},
    primaryClass={cs.CV}
}