VLQA is a dataset for joint reasoning over visuo-linguistic context. It consists of 9K image-passage-question-answer items with detailed annotations, meticulously crafted through a combination of automated and manual effort. Questions in VLQA are designed to require both visual and textual information, i.e., ignoring either of them makes the question unanswerable. Each dataset item is a 4-tuple <I, P, Q, A>:
- I: one or more images serving as the visual context
- P: a short text passage serving as the linguistic context
- Q: a question requiring joint reasoning over the given visuo-linguistic context
- A: answer choices, exactly one of which is correct
Below are a few examples from our dataset.
A link to download the VLQA dataset is available here.
We provide the following annotations for each sample in the VLQA dataset. For a detailed explanation of each entry, please refer to the Supplementary Material of the paper.
{
"question_id": 1,
"images_path": ["./images/train/1.png","./images/train/2.png"],
"multiple_images" : True,
"passage": "This is a sample text passage.",
"question": "What can be correctly ?",
"answer_choices": ["choice0", "choice1", "choice2", "choice3"],
"correct_answer_choice": 0,
"image_type": "Templated",
"image_subtype": "Pie",
"answer_type": "4way_text",
"multistep_inference": True,
"reasoning_type": ["Deductive","Math"],
"external_knowledge_required": True,
"external_knowledge_type": "Commonsense",
"external_knowledge_text": "This is external knowledge required.",
"ocr_tokens": ["text","tokens","inside","image"],
"image_source": "http://www.image/obtained/from/url/xyz",
"passage_source": "wikipedia",
"difficulty_level": "easy",
"split": "train"
}
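The annotation schema above can be consumed directly with standard JSON tooling. Below is a minimal sketch of loading items and filtering them by their annotated fields; the inline sample record mirrors the schema, but the actual file names and on-disk layout of the release are assumptions, not part of this page.

```python
import json

# A sample annotation record following the schema above, embedded here as a
# JSON string for a self-contained example. In practice you would read a
# split file (e.g. a hypothetical "vlqa_train.json") with json.load().
sample_json = """
[
  {
    "question_id": 1,
    "images_path": ["./images/train/1.png", "./images/train/2.png"],
    "multiple_images": true,
    "passage": "This is a sample text passage.",
    "question": "This is a sample question.",
    "answer_choices": ["choice0", "choice1", "choice2", "choice3"],
    "correct_answer_choice": 0,
    "reasoning_type": ["Deductive", "Math"],
    "external_knowledge_required": true,
    "difficulty_level": "easy",
    "split": "train"
  }
]
"""

items = json.loads(sample_json)

# Filter items by an annotated field, e.g. those tagged with math reasoning.
math_items = [it for it in items if "Math" in it.get("reasoning_type", [])]

# The correct answer string is recovered by indexing the answer choices
# with "correct_answer_choice".
def correct_answer(item):
    return item["answer_choices"][item["correct_answer_choice"]]

print(correct_answer(items[0]))  # prints "choice0"
```

Since `correct_answer_choice` is an index rather than a string, shuffling `answer_choices` for evaluation requires updating the index accordingly.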