HiCMAE: Hierarchical Contrastive Masked Autoencoder for self-supervised Audio-Visual Emotion Recognition
Information Fusion (IF 18.6) Pub Date: 2024-03-26, DOI: 10.1016/j.inffus.2024.102382
Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao

Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-aware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data-scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose the Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior art in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike these methods, which focus exclusively on top-layer representations while neglecting explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. Firstly, it incorporates hierarchical skip connections between the encoder and decoder to encourage intermediate layers to learn more meaningful representations and bolster masked audio-visual reconstruction. Secondly, hierarchical cross-modal contrastive learning is also exerted on intermediate representations to progressively narrow the audio-visual modality gap and facilitate subsequent cross-modal fusion. Finally, during downstream fine-tuning, HiCMAE employs hierarchical feature fusion to comprehensively integrate multi-level features from different layers. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, indicating that HiCMAE is a powerful audio-visual emotion representation learner. Codes and models are publicly available.
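To make the hierarchical cross-modal contrastive component concrete, below is a minimal PyTorch sketch of applying a contrastive objective to intermediate audio and visual encoder representations, layer by layer. All names, the mean-pooling, the per-layer projection heads, and the choice of a symmetric InfoNCE loss are illustrative assumptions on our part, not the authors' exact implementation; consult the paper for the actual design.

```python
# Hedged sketch: layer-wise (hierarchical) cross-modal contrastive loss.
# Assumed setup, not the authors' code: intermediate features from selected
# encoder layers are pooled, projected, and aligned with symmetric InfoNCE.
import torch
import torch.nn.functional as F

def info_nce(a, v, temperature=0.07):
    """Symmetric InfoNCE between pooled audio and visual embeddings.

    a, v: (batch, dim) L2-normalized embeddings; matching indices form
    positive pairs, all other in-batch pairs act as negatives.
    """
    logits = a @ v.t() / temperature                  # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def hierarchical_contrastive_loss(audio_feats, visual_feats, projections):
    """Exert the contrastive objective at several intermediate layers.

    audio_feats / visual_feats: lists of (batch, tokens, dim) tensors,
    one per selected encoder layer; projections: per-layer (proj_a, proj_v)
    linear heads (hypothetical names).
    """
    total = 0.0
    for a_tok, v_tok, (proj_a, proj_v) in zip(audio_feats, visual_feats, projections):
        a = F.normalize(proj_a(a_tok.mean(dim=1)), dim=-1)  # pool tokens, project
        v = F.normalize(proj_v(v_tok.mean(dim=1)), dim=-1)
        total = total + info_nce(a, v)
    return total / len(audio_feats)                   # average across layers
```

Under this reading, supervising every selected layer (rather than only the top one) is what progressively narrows the audio-visual modality gap before the final cross-modal fusion.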
