VLQA is a dataset for joint reasoning over visuo-linguistic context. It consists of 9K image-passage-question-answer items with detailed annotations, meticulously crafted through a combination of automated and manual effort. Questions in VLQA are designed to require both visual and textual information, i.e., ignoring either of them makes the question unanswerable. Each dataset item is a 4-tuple <I, P, Q, A>:
- I: one or more images serving as the visual context
- P: a short text passage serving as the linguistic context
- Q: a question requiring joint reasoning over the given visuo-linguistic context
- A: answer choices, exactly one of which is correct
Below are a few examples from our dataset.
A link to download the VLQA dataset is available here.
We provide the following annotations for each sample in the VLQA dataset. For a detailed explanation of each entry, please refer to the Supplementary Material of the paper.
{
"question_id": 1,
"images_path": ["./images/train/1.png","./images/train/2.png"],
"multiple_images" : True,
"passage": "This is a sample text passage.",
"question": "What can be correctly ?",
"answer_choices": ["choice0", "choice1", "choice2", "choice3"],
"correct_answer_choice": 0,
"image_type": "Templated",
"image_subtype": "Pie",
"answer_type": "4way_text",
"multistep_inference": True,
"reasoning_type": ["Deductive","Math"],
"external_knowledge_required": True,
"external_knowledge_type": "Commonsense",
"external_knowledge_text": "This is external knowledge required.",
"ocr_tokens": ["text","tokens","inside","image"],
"image_source": "http://www.image/obtained/from/url/xyz",
"passage_source": "wikipedia",
"difficulty_level": "easy",
"split": "train"
}
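The annotation schema above can be consumed directly with standard JSON tooling. Below is a minimal sketch of loading items and filtering them by their annotated fields; the inline sample record mirrors the schema, but the actual file names and on-disk layout of the release are assumptions, not part of this page.

```python
import json

# A sample annotation record following the schema above, embedded here as a
# JSON string for a self-contained example. In practice you would read a
# split file (e.g. a hypothetical "vlqa_train.json") with json.load().
sample_json = """
[
  {
    "question_id": 1,
    "images_path": ["./images/train/1.png", "./images/train/2.png"],
    "multiple_images": true,
    "passage": "This is a sample text passage.",
    "question": "This is a sample question.",
    "answer_choices": ["choice0", "choice1", "choice2", "choice3"],
    "correct_answer_choice": 0,
    "reasoning_type": ["Deductive", "Math"],
    "external_knowledge_required": true,
    "difficulty_level": "easy",
    "split": "train"
  }
]
"""

items = json.loads(sample_json)

# Filter items by an annotated field, e.g. those tagged with math reasoning.
math_items = [it for it in items if "Math" in it.get("reasoning_type", [])]

# The correct answer string is recovered by indexing the answer choices
# with "correct_answer_choice".
def correct_answer(item):
    return item["answer_choices"][item["correct_answer_choice"]]

print(correct_answer(items[0]))  # prints "choice0"
```

Since `correct_answer_choice` is an index rather than a string, shuffling `answer_choices` for evaluation requires updating the index accordingly.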