Attribute-Guided Cross-Modal Interaction and Enhancement for Audio-Visual Matching
IEEE Transactions on Information Forensics and Security (IF 6.8), Pub Date: 2024-04-15, DOI: 10.1109/tifs.2024.3388949
Jiaxiang Wang, Aihua Zheng, Yan Yan, Ran He, Jin Tang
Audio-visual matching is an essential task that measures the correlation between audio clips and visual images. However, current methods rely solely on the joint embedding of global features from audio clips and face image pairs to learn semantic correlations. This approach overlooks the importance of high-confidence correlations and discrepancies of local subtle features, which are crucial for cross-modal matching. To address this issue, we propose a novel Attribute-guided Cross-modal Interaction and Enhancement Network (ACIENet), which employs multiple attributes to explore the associations of different key local subtle features. The ACIENet contains two novel modules: the Attribute-guided Interaction (AGI) module and the Attribute-guided Enhancement (AGE) module. The AGI module employs global feature alignment similarity to guide cross-modal local feature interactions, which enhances cross-modal association features for the same identity and expands cross-modal distinctive features for different identities. Additionally, the interactive features and original features are fused to ensure intra-class discriminability and inter-class correspondence. The AGE module captures subtle attribute-related features by using an attribute-driven network, thereby enhancing discrimination at the attribute level. Specifically, it strengthens the combined attribute-related features of gender and nationality. To prevent interference between multiple attribute features, we design a multi-attribute learning network as a parallel framework. Experiments conducted on a public benchmark dataset demonstrate the efficacy of the ACIENet method in different scenarios. Code and models are available at https://github.com/w1018979952/ACIENet .
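The abstract does not give implementation details, so the following is only a minimal, hypothetical PyTorch-style sketch of the core AGI idea it describes: cross-modal local feature interaction gated by the global alignment similarity of an audio/face pair, with the interactive features fused back into the originals. The class name, tensor shapes, and gating scheme are assumptions for illustration; the authors' actual implementation is at the GitHub link above.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttributeGuidedInteraction(nn.Module):
    # Hypothetical sketch of the AGI idea: cross-modal local interaction
    # (cross-attention) scaled by global alignment similarity, then fused
    # with the original local features.
    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, audio_local, face_local, audio_global, face_global):
        # audio_local: (B, Na, D) local audio features; face_local: (B, Nf, D)
        # audio_global / face_global: (B, D) pooled global embeddings.
        # The global alignment similarity gates how strongly the cross-modal
        # interaction contributes (assumed gating, not confirmed by the paper).
        sim = F.cosine_similarity(audio_global, face_global, dim=-1)   # (B,)
        gate = sim.clamp(min=0).unsqueeze(-1).unsqueeze(-1)            # (B, 1, 1)

        # Audio queries attend over face keys/values (one direction shown).
        q = self.q_proj(audio_local)
        k = self.k_proj(face_local)
        v = self.v_proj(face_local)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        interacted = gate * (attn @ v)                                 # (B, Na, D)

        # Fuse interactive and original features, as the abstract states, to keep
        # intra-class discriminability while injecting cross-modal correspondence.
        return self.fuse(torch.cat([audio_local, interacted], dim=-1))

The symmetric direction (face queries attending over audio keys/values) and the attribute-driven AGE branches for gender and nationality would follow the same pattern under this sketch's assumptions.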

Updated: 2024-04-15