FIQ: Fundamental Question Generation with the Integration of Question Embeddings for Video Question Answering

Ju-Young Oh, Ho-Joong Kim, and Seong-Whan Lee

J.-Y. Oh, H.-J. Kim, and S.-W. Lee are with the Department of Artificial Intelligence, Korea University, Anam-dong, Seongbuk-ku, Seoul 02841, Korea. {juyoungoh, hojoong_kim, sw.lee}@korea.ac.kr

This research was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) (No. RS-2019-II190079, Artificial Intelligence Graduate School Program (Korea University), and No. RS-2024-00457882, AI Research Hub Project).
Abstract

Video question answering (VQA) is a multimodal task that requires the interpretation of a video to answer a given question. Existing VQA methods primarily utilize question and answer (Q&A) pairs to learn the spatio-temporal characteristics of video content. However, these annotations are typically event-centric, which is insufficient for capturing the broader context of each video. The absence of essential details such as object types, spatial layouts, and descriptive attributes restricts the model to learning only a fragmented scene representation. This issue limits the model’s capacity for generalization and higher-level reasoning. In this paper, we propose fundamental question generation with the integration of question embeddings for video question answering (FIQ), a novel approach designed to strengthen the reasoning ability of the model by enhancing its fundamental understanding of videos. FIQ generates Q&A pairs from descriptions extracted from videos, enriching the training data with fundamental scene information. The generated Q&A pairs enable the model to understand the primary context, leading to enhanced generalizability and reasoning ability. Furthermore, we incorporate a VQ-CAlign module that fuses task-specific question embeddings with visual features, ensuring that essential domain-specific details are preserved and improving adaptability to downstream tasks. Experiments on SUTD-TrafficQA demonstrate that FIQ achieves state-of-the-art performance compared to existing baseline methods.

Index Terms:
Spatio-temporal information, Video question answering, Multimodal

I INTRODUCTION

Video question answering (VQA) is a multimodal task [1] that combines computer vision and natural language processing. It requires a model to answer questions based on an understanding of the dynamic events in a video. VQA has attracted significant attention due to its importance and its broad applications in various fields, including education, health care, and surveillance systems [2]. Despite the substantial advances of existing works in recent years and the wide applicability of the task, the alignment of natural language and visual features still remains a challenge. Recent studies have demonstrated significant progress in this area, with various works [3, 4, 5] achieving notable results by aligning the two modalities.

Figure 1: Existing datasets focus only on event-centric information in a video, not on fundamental information such as the shape, color, and direction of objects.

Existing VQA methods employ CLIP-based encoders to leverage the image-text alignment capability acquired from pretraining on large-scale data. Although video-specialized encoders [6, 7, 8, 9] exist, visual-text alignment requires both a pretrained visual encoder and a pretrained text encoder, and CLIP provides both. FrozenBiLM [10] introduces a lightweight module that connects the frozen image encoder from CLIP with a frozen bidirectional language model for effective multimodal reasoning through masked language modeling. ViLA [11] proposes QFormer-Distiller, a module that enhances the alignment of both modalities by teaching the Q-Former from BLIP [12]. While CLIP offers strong cross-modal features, it is pretrained on static images and thus relies heavily on textual annotations to supply spatio-temporal context. However, current VQA datasets usually provide event-centric textual annotations that frequently omit fundamental scene attributes such as object identity, shape, or color. Although event-centric annotations already provide semantic cues, they offer only partial scene representations, thereby limiting a model to a fragmentary understanding of each scene.

Fig. 1 shows an example in which a VQA model trained exclusively on event-centric annotations focuses on partial scenes. The model attends to the scenes where the collision occurs, while the preceding and subsequent frames are largely ignored. This behavior occurs because the event-centric annotations alone provide insufficient information to answer the corresponding questions. Such reliance on event-centric scenes restricts the model’s ability to generalize and to perform higher-level reasoning, both of which require a broader understanding of the video context [13, 14].

As shown in Fig. 1, the model recognizes that an accident occurs but fails to identify which vehicles are involved or what happened before and after the accident. Such a partial understanding of the video is not enough for the model to establish causal or temporal relations between frames, leading to weaker generalization and limited higher-level reasoning. To alleviate this issue, enriching the annotations with fundamental attributes such as the shape, color, and direction of objects is important for enhancing the reasoning ability of the model. These attributes are crucial for enabling the model to track objects and their properties over time, thereby facilitating a broader understanding of the video content rather than limiting comprehension to specific event-related frames.

We propose fundamental question generation with the integration of question embeddings for video question answering (FIQ), which integrates general Q&A pairs into the original dataset to enhance fundamental video understanding. Additionally, we incorporate question embeddings as task-specific information through a VQ-CAlign module. Our approach adds Q&A pairs focused on fundamental aspects of the video, such as object types and shapes in specific scenes, to the original dataset, enabling the model to build the fundamental understanding required for more advanced reasoning. Furthermore, we introduce the VQ-CAlign module to integrate task-specific information using question embeddings. This module prevents the model from losing the task-specific information needed to interpret the intent of the task after the integration of general information.

Our contributions are summarized as follows:

  • We propose FIQ, which generates Q&A pairs to provide a fundamental understanding of the visual information, enhancing the generalizability of the model.

  • We construct the VQ-CAlign module, which incorporates question embeddings to reflect task-specific information.

  • Our method achieves state-of-the-art results on the SUTD-TrafficQA dataset compared to other baseline models.

II RELATED WORKS

II-A Video Question Answering

VQA is the task of interpreting the semantic information of a video to generate an answer to a given query. Two approaches have been designed to achieve a lower computational cost for the VQA task: adapter-based methods and methods that utilize text-based representations. Adapter-based methods [15, 16, 17] reduce the computational cost of VQA tasks, enabling large language models (LLMs) to be adapted to downstream tasks without finetuning the entire model. Tem-Adapter [18] presents a novel alignment method that utilizes auto-regression to integrate semantic and temporal information from the video domain to the image domain. While various works adapt textual information to enhance the understanding of spatial and temporal features in video, achieving a competitive understanding of video data solely through textual representations for VQA is uncommon. Vamos [19] is a text-based video understanding framework that achieves superior performance without relying on visual features by generating task-agnostic text, emphasizing that using only textual data can lead to enhanced performance. Similarly, ColPro [20] integrates three distinct task-specific prompts to prevent catastrophic forgetting during training. Despite their different objectives, both methods demonstrate remarkable performance without relying on visual embeddings. Our approach aligns with these two methods by leveraging textual prompts to enhance the interpretability of video data, reflecting its spatial and temporal properties. Furthermore, we employ both a language model (LM) and an LLM to generate task-agnostic Q&A pairs that provide an overview of the fundamental components of the video.

II-B General Question Generation

Question generation (QG) is the task of automatically generating questions, aiming to expand semantic diversity and extract insights beyond visually explicit content [21, 22]. There are two directions in QG: generating task-specific questions and generating general questions. QG methods that provide task-specific information achieve state-of-the-art results across various settings, since the question is one of the most task-specific inputs for guiding a model on how to interpret the data. Prophet [23] and SGSH [24] propose knowledge base question generation (KBQG) methods that generate natural language questions using external knowledge beyond the given images. While QG is helpful for supplying task-specific information, it can also serve as an effective data augmentation method that provides more general questions about, for example, colors, objects, and their counts. VQ²A [25] supports robust multilingual capability and integrates multiple models to produce diverse Q&A pairs, primarily using the T5 model. Additionally, an all-in-one QAG model [26] highlights the potential of textual captions to enrich visual question answering datasets by incorporating details not explicitly depicted in the visual representations. Building on these approaches, we employ LMs to generate contextually rich and temporally informed questions, and we utilize both an LM and an LLM to demonstrate a more adaptable approach.

Figure 2: Overall architecture of FIQ. It consists of four pivotal sub-processes. Q&A pairs containing general information about the video are first generated using language models such as T5 [27] and GPT [28]. The frozen text encoder takes these generated Q&A pairs together with the original dataset as input, and the resulting question embeddings and answer candidate embeddings are passed to the Trans-Decoder and VQ-CAlign. The frozen image encoder takes the video as input, and the extracted visual features are passed to VQ-CAlign together with the question embeddings. Both modalities are merged and passed to the Ans-Decoder, which fuses visual and textual information to align the temporal information.

III METHOD

FIQ consists of four main processes: Fundamental question generation, textual representation refinement, integration of question embeddings, and visual representation alignment. Fig. 2 shows the overall process of FIQ. The subsequent subsections will provide a detailed explanation of each process.

III-A Preliminaries

The objective of the multi-choice VQA task is to identify the best answer $a_{final}$ from the options presented, given the question $x_q$ and the visual feature $x_{vis}$. A score is calculated for each answer candidate $x_c$, and the candidate with the highest matching score becomes the final answer of the VQA task. The final predicted answer $\hat{a}_{final}$ is derived as follows:

$\hat{a}_{final} = \operatorname{argmax}(x_c \mid x_{vis}, x_q)$.    (1)
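To make this formulation concrete, the following is a minimal sketch of the candidate-selection step; the fused video-question representation and the cosine scoring are placeholders here, since the actual scoring function is defined in Sec. III-C.

```python
import torch.nn.functional as F

def predict_answer(video_question_feat, candidate_feats):
    """Pick the answer candidate most similar to a fused video-question
    representation (illustrative only; the real scoring is in Sec. III-C).

    video_question_feat: (D,) fused representation of the video and the question
    candidate_feats:     (C, D) one embedding per answer option
    """
    scores = F.cosine_similarity(candidate_feats, video_question_feat.unsqueeze(0), dim=-1)
    return scores.argmax().item()  # index of the highest-scoring option, as in Eq. (1)
```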

III-B Fundamental Question Generation

We employ VideoChat2 [29] to generate descriptions that cover both low-level features, such as colors and objects, and high-level features, such as motions and temporal order, from the video. From the extracted descriptions, we filter out repetitive numbers that are not relevant to the given video. We then employ LMs such as T5 and GPT-4o-mini to generate Q&A pairs from the filtered descriptions. Following VQ²A [25], we prompt the model to follow three steps: candidate answer extraction, question generation, and answer validation.

III-B1 Candidate Answer Extraction

For candidate answer extraction, we guide the LM to extract candidate answers, including noun phrases, named entities, short open-class word sequences, the boolean literals yes/no, and object counts, including zero when no count is mentioned.

III-B2 Question Generation

The LM generates questions by rewriting the source sentence into an interrogative form for each candidate answer. To encourage question diversity, we instruct the LM to cover various question types, such as “How many...”, “Where are...”, and “Is there...”, based on the given candidate answers. All questions and answers are generated with fewer than 77 tokens to stay within the maximum input length accepted by the pretrained CLIP text encoder.
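A possible sketch of this three-step prompting is shown below, assuming the official openai Python client; the exact prompt wording and the "question || answer" output convention are illustrative assumptions, not the prompt used in our experiments.

```python
from openai import OpenAI  # assumes the official openai client is installed and configured

client = OpenAI()

PROMPT = """You are given a video description.
1. Extract candidate answers: noun phrases, named entities, short open-class
   word sequences, yes/no, and object counts (including zero).
2. For each candidate answer, rewrite the source sentence as a question
   (e.g. "How many ...", "Where are ...", "Is there ...").
3. Keep every question and answer under 77 tokens.
Return one "question || answer" pair per line.

Description:
{description}
"""

def generate_qa_pairs(description: str) -> list[tuple[str, str]]:
    """Prompt an LLM to turn a video description into fundamental Q&A pairs (sketch)."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": PROMPT.format(description=description)}],
    )
    pairs = []
    for line in resp.choices[0].message.content.splitlines():
        if "||" in line:
            q, a = line.split("||", 1)
            pairs.append((q.strip(), a.strip()))
    return pairs
```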

Table I: Performance comparison with state-of-the-art methods on SUTD-TrafficQA. (H) and (H̄) denote prompt learning with and without adapter heads, respectively, and (A) denotes appending prompts. B, F, R, C, I, and A denote the six task types defined in Sec. IV-A2, and Avg denotes the average accuracy over all six tasks.
Methods | B | F | R | C | I | A | Avg
Unsupervised CLIP [30] | 25.6 | 20.1 | 34.0 | 30.8 | 22.8 | 28.8 | 26.5
CLIP [30] + Template | 31.8 | 36.0 | 29.9 | 71.8 | 22.1 | 33.4 | 32.3
Totally finetuning | 39.8 | 35.1 | 46.6 | 45.6 | 37.2 | 40.5 | 40.3
Partially finetuning | 41.6 | 37.8 | 44.6 | 50.0 | 33.1 | 41.7 | 41.7
LoRA [31] | 38.7 | 38.7 | 36.7 | 37.9 | 34.5 | 38.1 | 38.3
CLIP-Adapter [15] | 35.8 | 32.0 | 35.4 | 42.3 | 33.1 | 32.1 | 34.8
Multi-layer Adapter [15] | 30.5 | 26.6 | 26.5 | 38.5 | 28.3 | 25.8 | 29.1
Prompt learning (H) [32] | 42.4 | 32.4 | 45.2 | 55.5 | 40.7 | 43.6 | 42.9
Prompt learning (H̄) [32] | 40.3 | 33.2 | 41.0 | 46.5 | 34.9 | 38.4 | 39.7
Prompt learning (A) [33] | 41.7 | 31.5 | 40.1 | 48.4 | 33.1 | 41.4 | 41.1
Tem-Adapter [18] | 45.5 | 37.2 | 45.8 | 54.5 | 35.1 | 48.3 | 46.0
FIQ | 46.9 | 43.5 | 52.5 | 54.0 | 39.8 | 51.8 | 48.4

III-B3 Answer Validation

To validate the generated Q&A pairs, we use a token-level F1 score [34] to check whether the candidate answer matches the original sentence. When the score is lower than 0.54, we discard the sample to ensure correctness. Each generated Q&A pair consists of one question and a single answer. Since SUTD-TrafficQA uses a multi-choice format, every question must provide multiple answer options to ensure seamless integration with the original dataset. We take the answer generated for the target video ID as the positive option and sample the negative options from different video IDs. To preserve the randomness of the negative options, we randomly select three distinct video IDs and randomly pick one answer from each to serve as a negative option. We integrate these generated Q&A pairs into the original dataset for training.
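The validation and option-construction procedure can be sketched as follows; the token-level F1 computation is standard and the 0.54 threshold follows the text, while the helper names and the data layout (a dict mapping each video ID to its generated answers) are our assumptions.

```python
import random
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a candidate answer and the source sentence."""
    pred, ref = prediction.lower().split(), reference.lower().split()
    common = Counter(pred) & Counter(ref)          # shared tokens with min counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def build_multi_choice(question, positive_answer, qa_by_video, target_vid, f1, threshold=0.54):
    """Keep a generated pair only if its F1 passes the threshold, then sample
    negative options from three other, randomly chosen video IDs (sketch)."""
    if f1 < threshold:
        return None  # discard low-confidence pairs
    other_vids = random.sample([v for v in qa_by_video if v != target_vid], 3)
    negatives = [random.choice(qa_by_video[v]) for v in other_vids]
    options = [positive_answer] + negatives
    random.shuffle(options)
    return {"question": question, "options": options, "answer": positive_answer}
```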

III-C FIQ

III-C1 Textual Representation Refinement

We employ a frozen text encoder and a Trans-Decoder to extract meaningful information from the textual input. We use a frozen CLIP [30] text encoder to independently extract textual embeddings from the questions and from the answer candidates, which consist of four options. In parallel, we obtain visual representations from a frozen image encoder, denoted as $x_{vis} \in \mathbb{R}^{N \times D}$, where $N$ is the number of video frames and $D$ is the feature dimension of the CLIP image encoder. We also extract answer candidate embeddings, denoted as $x_c \in \mathbb{R}^{T \times D}$, where $T$ is the sequence length of the textual data. These two inputs, $x_{vis}$ and $x_c$, are then passed to the Trans-Decoder to produce refined answer candidate embeddings, denoted as $x_{ctd} \in \mathbb{R}^{T \times D}$.
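A compact PyTorch sketch of this refinement step is given below, treating the Trans-Decoder as a standard transformer decoder in which the candidate embeddings act as queries over the visual memory; the number of layers and other hyperparameters are our assumptions.

```python
import torch
import torch.nn as nn

D = 512  # CLIP ViT-B/16 feature dimension used in this work

class TransDecoder(nn.Module):
    """Refines answer-candidate embeddings by attending over frozen visual features (sketch)."""
    def __init__(self, dim=D, heads=16, layers=2):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=layers)

    def forward(self, x_c, x_vis):
        # x_c:   (B, T, D) answer-candidate embeddings from the frozen text encoder
        # x_vis: (B, N, D) frame features from the frozen image encoder
        return self.decoder(tgt=x_c, memory=x_vis)   # x_ctd: (B, T, D)

# Usage with random stand-ins for the frozen CLIP outputs
x_c, x_vis = torch.randn(4, 77, D), torch.randn(4, 128, D)
x_ctd = TransDecoder()(x_c, x_vis)   # refined candidate embeddings
```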

III-C2 Integration of Question Embeddings

We apply learnable positional embeddings to capture the dynamic information of the video:

$x_{vpe} = x_{vis} + e_{pos}$,    (2)

where $e_{pos} \in \mathbb{R}^{N \times D}$ is the learnable positional embedding and $x_{vpe} \in \mathbb{R}^{N \times D}$ is the visual feature enriched with positional information.

Although the general Q&A pairs provide a fundamental understanding of the video content, providing task-specific information remains important for performing the downstream task. We therefore introduce the VQ-CAlign module to fuse the question embeddings with the visual embeddings. VQ-CAlign consists of three main components: self-attention, cross-attention, and a feedforward network. It takes $x_{vpe}$ and the question embeddings $x_q \in \mathbb{R}^{T \times D}$ as inputs and is expressed as follows:

$x_{fused} = \text{VQ-CAlign}(x_{vpe}, x_q)$.    (3)

Inside VQ-CAlign, the self-attention takes $x_{vpe}$ as input and uses it as the query, key, and value. It captures the internal relationships and contextual information within $x_{vpe}$ and produces processed visual embeddings $x_{self} \in \mathbb{R}^{N \times D}$. Next, $x_{self}$ and $x_q$ are passed to the cross-attention module, where $x_{self}$ serves as the query and $x_q$ serves as the key and value. The cross-attention connects $x_{vis}$ and $x_q$, allowing the model to focus on the visual positions relevant to the question embeddings, and produces $x_{ca} \in \mathbb{R}^{N \times D}$. Finally, $x_{ca}$ passes through the feedforward network, and $x_{fused} \in \mathbb{R}^{N \times D}$ is generated as the final output.

$x_{fused}$ is then added to $x_{ctd}$ to fully reflect the textual information for the VQA task and to better integrate the task-specific information. This process is expressed as follows:

$x_{mix} = x_{fused} + x_{ctd}$,    (4)

where $x_{mix} \in \mathbb{R}^{N \times D}$ is the fused output of $x_{fused}$ and $x_{ctd}$.
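A minimal PyTorch sketch of the VQ-CAlign path in Eqs. (2)-(4) is given below; the single-block depth, feedforward width, and exact layer configuration are our assumptions.

```python
import torch
import torch.nn as nn

class VQCAlign(nn.Module):
    """Self-attention over visual features, cross-attention to question
    embeddings, and a feedforward network, as in Eq. (3) (sketch)."""
    def __init__(self, dim=512, heads=16, n_frames=128):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, n_frames, dim))           # e_pos in Eq. (2)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x_vis, x_q):
        # x_vis: (B, N, D) visual features; x_q: (B, T, D) question embeddings
        x_vpe = x_vis + self.pos                                          # Eq. (2)
        x_self, _ = self.self_attn(x_vpe, x_vpe, x_vpe)                   # query = key = value
        x_ca, _ = self.cross_attn(x_self, x_q, x_q)                       # visual queries attend to the question
        return self.ffn(x_ca)                                             # x_fused, Eq. (3)

# x_mix = VQCAlign()(x_vis, x_q) + x_ctd   # Eq. (4), assuming matching sequence lengths,
#                                          # with x_ctd produced by the Trans-Decoder
```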

III-C3 Visual Representation Alignment

Subsequently, we adopt the Ans-Decoder to generate the future state by leveraging the historical information. Formally, it is described as:

$x_{dec} = \text{Ans-Decoder}(x_{mix}, x_{vis}, x_{vis})$,    (5)

where $x_{dec} \in \mathbb{R}^{N \times D}$ is the processed output of the Ans-Decoder. Finally, the cosine similarity between $x_{dec}^{\top}$ and $x_{ctd}$ is computed to obtain $x_{logit} \in \mathbb{R}^{N}$, whose highest-scoring entry corresponds to the answer that best matches the question.
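A hedged sketch of this scoring step is shown below; mean pooling before the cosine similarity and the per-option candidate layout are our assumptions, since the text does not specify them.

```python
import torch
import torch.nn.functional as F

def candidate_logits(x_dec, x_ctd_per_option):
    """Score each answer option by cosine similarity with the decoded visual state (sketch).

    x_dec:             (B, N, D) output of the Ans-Decoder
    x_ctd_per_option:  (B, C, T, D) refined embeddings for C answer options
    returns:           (B, C) matching scores; argmax gives the predicted answer
    """
    v = F.normalize(x_dec.mean(dim=1), dim=-1)                 # (B, D) pooled visual state
    t = F.normalize(x_ctd_per_option.mean(dim=2), dim=-1)      # (B, C, D) pooled options
    return torch.einsum("bd,bcd->bc", v, t)                    # cosine similarity per option
```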

III-D Training and Inference

III-D1 Training

For the final loss, we use the sum of a Hinge loss and an MSE loss. The overall loss function is described as follows:

$L_{final} = \gamma L_{hinge}(x^{-}_{logit}, x^{+}_{logit}) + L_{mse}(x_{dec}, x_{vis})$,    (6)

where $\gamma$ is a ratio that balances the two loss terms, $x^{+}_{logit}$ is the logit of the ground-truth answer, and $x^{-}_{logit}$ denotes the logits of the negative answers. The Hinge loss takes $x^{-}_{logit}$ and $x^{+}_{logit}$ as inputs, and the MSE loss takes $x_{dec}$ and $x_{vis}$ as inputs.
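A sketch of Eq. (6) is given below, using a standard margin-ranking form for the Hinge term; the margin value and the exact logit shapes are our assumptions.

```python
import torch.nn.functional as F

def fiq_loss(pos_logit, neg_logits, x_dec, x_vis, gamma=1.0, margin=1.0):
    """gamma * Hinge(neg, pos) + MSE(x_dec, x_vis), following Eq. (6) (sketch).

    pos_logit:  (B,)   score of the ground-truth option
    neg_logits: (B, K) scores of the remaining options
    """
    hinge = F.relu(margin - pos_logit.unsqueeze(1) + neg_logits).mean()  # rank positives above negatives
    mse = F.mse_loss(x_dec, x_vis)                                       # keep decoded states close to the visual features
    return gamma * hinge + mse
```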

III-D2 Inference

At inference, we use only the original dataset, without the additionally generated data. We first extract the visual and textual embeddings and align them as described in Sec. III-C1, Sec. III-C2, and Sec. III-C3. Lastly, we calculate $x_{logit}$ for the final prediction.

IV EXPERIMENTS

IV-A Setup

Table II: Ablation studies on SUTD-TrafficQA, adding the VQ-CAlign module and the data generated by T5 and GPT. Avg denotes the average accuracy over all six tasks.
Methods | B | F | R | C | I | A | Avg
Tem-Adapter [18] | 45.5 | 37.2 | 45.8 | 54.5 | 35.1 | 48.3 | 46.0
VQ-CAlign | 44.8 | 46.1 | 47.1 | 51.3 | 33.7 | 50.1 | 46.3
VQ-CAlign + T5 [27] | 46.1 | 47.0 | 52.1 | 58.3 | 35.8 | 50.9 | 47.8
VQ-CAlign + GPT [28] | 46.9 | 43.5 | 52.5 | 54.0 | 39.8 | 51.8 | 48.4

IV-A1 Hyperparameters

During preprocessing, we use CLIP [30] with a ViT-B/16 backbone and set the visual feature dimension to 512. We extract 8 clips of 16 consecutive frames from each video, i.e., 128 frames in total. For training, we set the batch size to 32 and train for 37 epochs. Additionally, we set the exponential moving average rate to 0.9999 and apply a cosine decay schedule with a rate of 2. We apply a learnable embedding with a dropout rate of 0.2 and a maximum sequence length of 128. For the attention modules, we set the number of heads to 16.
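A small sketch of this clip sampling is shown below; the uniform placement of clip start positions is our assumption.

```python
import numpy as np

def sample_clip_frames(num_video_frames, n_clips=8, clip_len=16):
    """Sample n_clips windows of clip_len consecutive frames (8 x 16 = 128 total)."""
    starts = np.linspace(0, max(num_video_frames - clip_len, 0), n_clips).astype(int)
    return np.concatenate([np.arange(s, s + clip_len) for s in starts])  # (128,) frame indices

idx = sample_clip_frames(300)  # indices of the 128 frames fed to the CLIP image encoder
```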

IV-A2 Dataset

We conduct experiments on the SUTD-TrafficQA dataset, which consists of 10,080 videos and 62,535 human-annotated Q&A pairs. SUTD-TrafficQA focuses on traffic scenarios, requiring an understanding of specific traffic events and the causal relations among them. It therefore evaluates the event understanding and causal inference abilities of the model. SUTD-TrafficQA comprises six types of reasoning tasks related to traffic scenarios.

Basic Understanding (B). This task requires the model to interpret traffic scenarios at a basic level, covering feature-query, event-query, event classification, and counting.

Event Forecasting (F). This task evaluates the forecasting ability of the model given the current scene. A forecasting question is posed, and the model reasons about future events in the given video.

Reverse Reasoning (R). This task involves inferring the events that happened prior to the beginning of a video segment.

Introspection (I). This task tests whether the model can provide preventive advice related to the accident that occurs in the given video.

Attribution (A). This task evaluates the model’s ability to reason about the causes of traffic events by selecting the underlying factors from the given answer candidates.

Counterfactual Inference (C). This task differs in objective from the previously introduced tasks, as it requires reasoning about hypothetical scenarios not shown in the video. The model must reason about imaginary events based on the conditions given in the question.

IV-B Main Results

Our goal is to generate Q&A pairs that contain fundamental information from the video to enhance the reasoning ability of the model. Although SUTD-TrafficQA already contains Q&A pairs related to basic information, we add more of them using LMs. Integrating these Q&A pairs improves overall performance. Table I shows the comparison with state-of-the-art methods: our method achieves outstanding performance on SUTD-TrafficQA, improving on five of the six tasks. In particular, it shows notable improvement on the F, R, I, and A tasks. All four tasks require understanding events that actually happened in the video, i.e., factual inference within the information given in the video.

These results show that our generated Q&A pairs provide the fundamental information necessary for the reasoning tasks F, R, I, and A. Even though Q&A pairs related to basic attributes already exist in the original data, they were not sufficient to strengthen the reasoning ability required for these tasks. The results also show that, although we prompt the LM to generate fundamental questions, the resulting Q&A pairs inherently contain spatio-temporal information. This stems from the extracted descriptions reflecting the spatio-temporal context of the video, as discussed in Sec. III-B. Consequently, the extracted video descriptions enable the generated Q&A pairs to supply fundamental information about the temporal dynamics that are crucial for reasoning tasks.

In contrast, the accuracy on C remains almost the same because of its distinct objective. C requires hypothetical reasoning beyond the given context of the video rather than an understanding of the spatio-temporal information it contains. Due to this inherent difference in reasoning requirements, task C shows a different accuracy trend.

Figure 3: Comparison between different LM-based Q&A generation (T5, GPT) methods on SUTD-TrafficQA.

IV-C Ablation Studies

To demonstrate the effectiveness of our model, we conduct an ablation study on each key component. Table II shows the performance improvements obtained by adding each module. We introduce the VQ-CAlign module to integrate question embeddings as task-specific features. Compared to Tem-Adapter [18], our baseline model, we observe a meaningful improvement in accuracy from the VQ-CAlign module. Beyond this architectural modification, we add Q&A pairs containing fundamental information to the dataset. To generate these fundamental Q&A pairs, we employ both T5 and GPT for comparative analysis. Although the Q&A pairs generated by T5 improve performance, their practical application in real-life scenarios is limited given the wide availability of pretrained LLMs. The GPT-generated Q&A pairs achieve 48.4% accuracy, the best performance, highlighting the effectiveness of an LLM in capturing the primary attributes of video data compared to an LM. Fig. 3 illustrates the overall accuracy improvements achieved by adding each module and the generated Q&A pairs. As shown in Fig. 3, all three settings of FIQ converge around epoch 20, demonstrating fast convergence.

V CONCLUSION

In this paper, we propose FIQ, a framework that enhances video reasoning by introducing a fundamental Q&A pair generation method and the VQ-CAlign module. We generate fundamental Q&A pairs that complement event-centric textual annotations by leveraging LMs, improving the model’s reasoning ability and generalizability. Additionally, the VQ-CAlign module integrates task-specific knowledge through question embeddings to better support downstream VQA tasks. Experimental results show that our approach significantly improves accuracy on reasoning-related tasks, demonstrating that integrating general knowledge of the video enhances the reasoning ability of the model compared to existing methods. In future work, we plan to construct a new dataset that reflects question information within the dataset as answer candidates.

References

  • [1] H.-J. Kim, J.-H. Hong, H. Kong, and S.-W. Lee, “TE-TAD: Towards full end-to-end temporal action detection via time-aligned coordinate expression,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 18 837–18 846.
  • [2] D.-G. Lee, H.-I. Suk, S.-K. Park, and S.-W. Lee, “Motion influence map for unusual human activity detection and localization in crowded scenes,” IEEE Trans. Circuits Syst. Video Technol., vol. 25, no. 10, pp. 1612–1623, 2015.
  • [3] L. Zong, J. Wan, X. Zhang, X. Liu, W. Liang, and B. Xu, “Video-context aligned transformer for video question answering,” in Proc. AAAI Conf. Artif. Intell. (AAAI), vol. 38, no. 17, 2024, pp. 19 795–19 803.
  • [4] J. Li, D. Li, S. Savarese, and S. Hoi, “BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in Int. Conf. Mach. Learn. (ICML), vol. 202, 2023, pp. 19 730–19 742.
  • [5] Y. Lee, H.-J. Kim, and S.-W. Lee, “Text-infused attention and foreground-aware modeling for zero-shot temporal action detection,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 37, 2024, p. 9864–9884.
  • [6] B. Lin et al., “Video-LLaVA: Learning united visual representation by alignment before projection,” in Conf. Empir. Methods Nat. Lang. Process. (EMNLP), 2024, pp. 5971–5984.
  • [7] H. Akbari et al., “VATT: Transformers for multimodal self-supervised learning from raw video, audio and text,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2021.
  • [8] C. Feichtenhofer, H. Fan, J. Malik, and K. He, “SlowFast networks for video recognition,” in Proc. Int. Conf. Comput. Vis. (ICCV), 2019, pp. 6201–6210.
  • [9] Z. Tong, Y. Song, J. Wang, and L. Wang, “VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2022.
  • [10] A. Yang, A. Miech, J. Sivic, I. Laptev, and C. Schmid, “Zero-shot video question answering via frozen bidirectional language models,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 35, 2022, pp. 124–141.
  • [11] X. Wang et al., “ViLA: Efficient video-language alignment for video question answering,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, p. 186–204.
  • [12] J. Li, D. Li, C. Xiong, and S. Hoi, “BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in Int. Conf. Mach. Learn. (ICML), vol. 162, 2022, pp. 12 888–12 900.
  • [13] S.-W. Lee and H.-H. Song, “A new recurrent neural-network architecture for visual pattern recognition,” IEEE Trans. Neural Networks., vol. 8, no. 2, pp. 331–340, 1997.
  • [14] S.-W. Lee and S.-Y. Kim, “Integrated segmentation and recognition of handwritten numerals with cascade neural network,” IEEE Trans. Syst. Man Cybern., vol. 29, no. 2, pp. 285–290, 1999.
  • [15] P. Gao et al., “CLIP-Adapter: Better vision-language models with feature adapters,” in Int. J. Comput. Vis. (IJCV), vol. 132, 2023, p. 581–595.
  • [16] S. T. Wasim, M. Naseer, S. Khan, F. S. Khan, and M. Shah, “Vita-CLIP: Video and text adaptive clip via multimodal prompting,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 23 034–23 044.
  • [17] X. Li, D. Lian, Z. Lu, J. Bai, Z. Chen, and X. Wang, “GraphAdapter: Tuning vision-language models with dual knowledge graph,” in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 36, 2023, pp. 13 448–13 466.
  • [18] G. Chen et al., “Tem-adapter: Adapting image-text pretraining for video question answer,” in Proc. Int. Conf. Comput. Vis. (ICCV), 2023, pp. 13 899–13 909.
  • [19] S. Wang, Q. Zhao, M. Q. Do, N. Agarwal, K. Lee, and C. Sun, “Vamos: Versatile action models for video understanding,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2024, pp. 142–160.
  • [20] C. Cai et al., “Empowering large language model for continual video question answering with collaborative prompting,” in Conf. Empir. Methods Nat. Lang. Process. (EMNLP), 2024, pp. 3921–3932.
  • [21] Y.-K. Lim, S.-H. Choi, and S.-W. Lee, “Text extraction in mpeg compressed video for content-based indexing,” in Proc. Int. Conf. Pattern Recognit., vol. 4, 2000, pp. 409–412.
  • [22] S.-W. Lee, J. H. Kim, and F. C. Groen, “Translation-, rotation-and scale-invariant recognition of hand-drawn symbols in schematic diagrams,” Int. J. Pattern Recognit. Artif. Intell., vol. 4, no. 01, pp. 1–25, 1990.
  • [23] Z. Shao, Z. Yu, M. Wang, and J. Yu, “Prompting large language models with answer heuristics for knowledge-based visual question answering,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 14 974–14 983.
  • [24] S. Guo, L. Liao, J. Zhang, Y. Wang, C. Li, and H. Chen, “SGSH: Stimulate large language models with skeleton heuristics for knowledge base question generation,” in Conf. North Am. Chapt. Assoc. Comput. Linguist. (NAACL), 2024, pp. 4613–4625.
  • [25] S. Changpinyo, D. Kukliansky, I. Szpektor, X. Chen, N. Ding, and R. Soricut, “All you may need for vqa are image captions,” in Conf. North Am. Chapt. Assoc. Comput. Linguist. (NAACL), 2022, pp. 1947–1963.
  • [26] A. Ushio, F. Alva-Manchego, and J. Camacho-Collados, “A practical toolkit for multilingual question and answer generation,” in Assoc. Comput. Linguist. (ACL), vol. 3, 2023, pp. 86–94.
  • [27] C. Raffel et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” J. Mach. Learn. Res., vol. 21, no. 140, pp. 1–67, 2020.
  • [28] J. Achiam et al., “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2023.
  • [29] K. Li et al., “MVBench: A comprehensive multi-modal video understanding benchmark,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024, pp. 22 195–22 206.
  • [30] A. Radford et al., “Learning transferable visual models from natural language supervision,” in Int. Conf. Mach. Learn. (ICML), vol. 139, 2021, pp. 8748–8763.
  • [31] E. J. Hu et al., “LoRA: Low-rank adaptation of large language models.” in Int. Conf. Learn. Represent. (ICLR), vol. 1, 2022, p. 3.
  • [32] K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” in Int. J. Comput. Vis. (IJCV), vol. 130, 2022, pp. 2337–2348.
  • [33] M. Jia et al., “Visual prompt tuning,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 709–727.
  • [34] A. Wang, K. Cho, and M. Lewis, “Asking and answering questions to evaluate the factual consistency of summaries,” in Assoc. Comput. Linguist. (ACL), 2020, pp. 5008–5020.