Visual language integration: A survey and open challenges
Computer Science Review (IF 12.9), Pub Date: 2023-03-02, DOI: 10.1016/j.cosrev.2023.100548
Sang-Min Park , Young-Gab Kim

The recent development of deep learning has brought artificial intelligence (AI) models into wide use across many domains. AI performs well on definite-purpose tasks such as image recognition and text classification; recognition accuracy on such single tasks now surpasses feature engineering, enabling work that was previously infeasible. In addition, with the development of generation technology (e.g., GPT-3), AI models show stable performance on both recognition and generation tasks. However, few studies have examined how to integrate these models efficiently to achieve comprehensive human interaction. Each model grows in size as its performance improves, and consequently requires more computing power and more complicated designs to train than before. This increases the complexity of each model and demands more paired data, making model integration difficult. This study provides a survey of visual language integration, taking a hierarchical approach to review recent trends in AI models as interaction components across research communities. We also compare the strengths of existing AI models and integration approaches and the limitations they face, and we discuss current related issues and the research needed for visual language integration. More specifically, we identify four aspects of visual language integration models: multimodal learning, multi-task learning, end-to-end learning, and embodiment for embodied visual language interaction. Finally, we discuss some current open issues and challenges and conclude our survey by giving possible future directions.
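The multimodal-learning aspect named above can be illustrated with a minimal sketch: two modality "encoders" map an image and a caption into a shared embedding space, and cosine similarity scores how well an image-text pair matches (CLIP-style late fusion). All encoders and vectors here are toy stand-ins invented for illustration, not models or data from the survey.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def encode_image(pixels):
    # Toy stand-in for an image backbone: pools raw features
    # into a fixed 3-dimensional "embedding"
    return [sum(pixels) / len(pixels), max(pixels), min(pixels)]

def encode_text(tokens):
    # Toy stand-in for a text encoder: averages hand-made word
    # vectors living in the same 3-dimensional space
    vocab = {"cat": [0.9, 0.8, 0.1], "dog": [0.1, 0.9, 0.8]}
    vecs = [vocab[t] for t in tokens if t in vocab]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

img = encode_image([0.2, 0.9, 0.1, 0.8])  # toy "cat" image features
txt_cat = encode_text(["cat"])
txt_dog = encode_text(["dog"])

# The better-matching caption should receive the higher score
print(cosine(img, txt_cat) > cosine(img, txt_dog))  # → True
```

Real visual-language models replace these toy encoders with large pretrained backbones trained jointly on paired image-text data, which is exactly the paired-data requirement the abstract identifies as a barrier to integration.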




Updated: 2023-03-02