Multi-Stage Image-Language Cross-Generative Fusion Network for Video-Based Referring Expression Comprehension
IEEE Transactions on Image Processing (IF 10.6), Pub Date: 2024-05-02, DOI: 10.1109/tip.2024.3394260
Yujia Zhang, Qianzhong Li, Yi Pan, Xiaoguang Zhao, Min Tan

Video-based referring expression comprehension is a challenging task that requires locating the referred object in every frame of a given video. Many existing approaches treat this task as an object-tracking problem, so their performance relies heavily on the quality of the tracking templates; moreover, when there is not enough annotated data to assist template selection, tracking may fail. Other approaches are based on object detection, but they often use only a single frame adjacent to the key frame for feature learning, which limits their ability to model relationships across frames. In addition, how to better fuse features from multiple frames with the referring expression to localize the referent effectively remains an open problem. To address these issues, we propose a novel approach called the Multi-Stage Image-Language Cross-Generative Fusion Network (MILCGF-Net), which is built on one-stage object detection. Our approach includes a Frame Dense Feature Aggregation module for dense feature learning across temporally adjacent frames. Additionally, we propose an Image-Language Cross-Generative Fusion module as the main body of multi-stage learning: it generates cross-modal features by computing the similarity between the video and the expression, and then refines and fuses the generated features. To further enhance the model's cross-modal feature generation capability, we introduce a consistency loss that constrains the image-language and language-image similarity matrices during feature generation. We evaluate the proposed approach on three public datasets, and comprehensive experimental results demonstrate its effectiveness.
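The cross-generative fusion described in the abstract hinges on a pair of similarity matrices between visual and linguistic tokens, with a consistency term tying the two together. The following is a minimal PyTorch-style sketch of that idea only; the tensor shapes, function names, and the MSE form of the consistency constraint are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch of image-language cross-generative fusion with a
# consistency constraint. All names and shapes here are hypothetical.
import torch
import torch.nn.functional as F


def cross_generative_fusion(img_feats, lang_feats):
    """Generate cross-modal features from inter-modality similarity.

    img_feats:  (B, N, D) visual tokens from the aggregated frames
    lang_feats: (B, L, D) word-level embeddings of the referring expression
    """
    # Image-to-language and language-to-image similarity matrices.
    sim_il = torch.matmul(F.normalize(img_feats, dim=-1),
                          F.normalize(lang_feats, dim=-1).transpose(1, 2))  # (B, N, L)
    sim_li = sim_il.transpose(1, 2)                                         # (B, L, N)

    # Cross-generated features: each modality re-expressed through the other.
    lang_to_img = torch.matmul(F.softmax(sim_il, dim=-1), lang_feats)  # (B, N, D)
    img_to_lang = torch.matmul(F.softmax(sim_li, dim=-1), img_feats)   # (B, L, D)

    # Consistency loss: the normalized image-language and language-image
    # similarity matrices should agree up to a transpose.
    consistency = F.mse_loss(F.softmax(sim_il, dim=-1),
                             F.softmax(sim_li, dim=-1).transpose(1, 2))
    return lang_to_img, img_to_lang, consistency


if __name__ == "__main__":
    B, N, L, D = 2, 49, 12, 256          # hypothetical batch / token sizes
    vis = torch.randn(B, N, D)
    txt = torch.randn(B, L, D)
    v2l, l2v, loss = cross_generative_fusion(vis, txt)
    print(v2l.shape, l2v.shape, loss.item())
```

The generated features would then be refined and fused over multiple stages before the one-stage detection head; that refinement is omitted here since the abstract does not specify its form.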
