Global–Local Information Soft-Alignment for Cross-Modal Remote-Sensing Image–Text Retrieval
IEEE Transactions on Geoscience and Remote Sensing (IF 8.2), Pub Date: 2024-05-14, DOI: 10.1109/tgrs.2024.3401031
Gang Hu, Zaidao Wen, Yafei Lv, Jianting Zhang, Qian Wu

Cross-modal remote-sensing image–text retrieval (CMRSITR) is a challenging task that aims to retrieve target remote-sensing (RS) images based on textual descriptions. However, the modal gap between texts and RS images poses a significant challenge. RS images comprise multiple targets and complex backgrounds, necessitating the mining of both global and local information (GaLR) for effective CMRSITR. Existing approaches primarily focus on local image features while disregarding the local features of the text and their correspondence. These methods typically fuse global and local image features and align them with global text features. However, they struggle to eliminate the influence of cluttered backgrounds and may overlook crucial targets. To address these limitations, we propose a novel framework for CMRSITR based on a transformer architecture, which leverages global–local information soft alignment (GLISA) to enhance retrieval performance. Our framework incorporates a global image extraction module, which captures the global semantic features of image–text pairs and effectively represents the relationships among multiple targets in RS images. In addition, we introduce an adaptive local information extraction (ALIE) module that adaptively mines discriminative local clues from both RS images and texts, aligning the corresponding fine-grained information. To mitigate semantic ambiguities during the alignment of local features, we design a local information soft-alignment (LISA) module. In comparative evaluations using two public CMRSITR datasets, our proposed method achieves state-of-the-art results, surpassing not only traditional cross-modal retrieval methods by a substantial margin but also other contrastive language-image pretraining (CLIP)-based methods.
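To make the soft-alignment idea concrete, the sketch below computes a retrieval score that combines a global image–text similarity with a softly aligned local similarity, in which each text token attends over image patch features instead of being hard-matched to a single patch. This is a minimal illustration only: the encoder interfaces, feature dimensions, temperature, and the equal global/local weighting are assumptions, and it does not reproduce the paper's GLISA, ALIE, or LISA modules, which the abstract does not specify in detail.

```python
# Minimal sketch of a global-local soft-alignment score between a remote-sensing
# image and a caption, assuming CLIP-style encoders that expose one global
# embedding plus a sequence of local (patch / token) features per modality.
import torch
import torch.nn.functional as F


def soft_align_score(img_local, txt_local, img_global, txt_global, tau=0.07):
    """Combine a global cosine similarity with a softly aligned local similarity.

    img_local:  (P, D) local image (patch) features
    txt_local:  (T, D) local text (token) features
    img_global: (D,)   global image embedding
    txt_global: (D,)   global text embedding
    """
    # Global term: cosine similarity of the two global embeddings.
    s_global = F.cosine_similarity(img_global, txt_global, dim=0)

    # Local term: each text token attends softly over all image patches,
    # rather than committing to one hard patch-token match.
    img_n = F.normalize(img_local, dim=-1)      # (P, D)
    txt_n = F.normalize(txt_local, dim=-1)      # (T, D)
    sim = txt_n @ img_n.t()                     # (T, P) token-patch cosines
    attn = F.softmax(sim / tau, dim=-1)         # soft assignment over patches
    aligned = attn @ img_n                      # (T, D) attended image feature per token
    s_local = F.cosine_similarity(aligned, txt_n, dim=-1).mean()

    # Equal weighting is an assumption; the balance could also be learned.
    return 0.5 * s_global + 0.5 * s_local


if __name__ == "__main__":
    D = 512
    score = soft_align_score(
        torch.randn(49, D), torch.randn(12, D), torch.randn(D), torch.randn(D)
    )
    print(float(score))
```

In practice such a score would be computed for every image–text pair in a batch and trained with a contrastive objective, so that matching pairs receive higher scores than mismatched ones.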

Updated: 2024-05-14