Active Code Learning: Benchmarking Sample-Efficient Training of Code Models
IEEE Transactions on Software Engineering ( IF 7.4 ) Pub Date : 2024-03-13 , DOI: 10.1109/tse.2024.3376964
Qiang Hu, Yuejun Guo, Xiaofei Xie, Maxime Cordy, Lei Ma, Mike Papadakis, Yves Le Traon
The costly human effort required to prepare training data for machine learning (ML) models hinders their practical development and use in software engineering (ML4Code), especially for teams with limited budgets. Efficiently training code models with less human effort has therefore become a pressing problem. Active learning addresses this issue by allowing developers to train a model on a reduced amount of labeled data while still reaching the desired performance; it has been well studied in the computer vision and natural language processing domains. Unfortunately, no existing work explores the effectiveness of active learning for code models. In this paper, we bridge this gap by building the first benchmark for this critical problem - active code learning. Specifically, we collect 11 acquisition functions (the functions used for data selection in active learning) from existing work and adapt them to code-related tasks. We then conduct an empirical study to check whether these acquisition functions maintain their performance on code data. The results demonstrate that the choice of features strongly affects active learning, and that using output vectors to select data is the best option. For the code summarization task, however, active code learning is ineffective, producing models that fall more than 29.64% short of the expected performance. Furthermore, we explore future directions of active code learning through an exploratory study: we propose replacing distance calculation methods with evaluation metrics, and we find a correlation between these evaluation-based distance methods and the performance of code models.
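To make the abstract's core idea concrete, the sketch below illustrates one common family of acquisition functions: uncertainty sampling over a model's output vectors, where predictive entropy scores each unlabeled sample and only the most uncertain ones are sent for labeling. This is a generic, minimal illustration of the technique, not the paper's benchmark code; the toy probability matrix and the labeling budget are invented for the example.

```python
import numpy as np

def entropy_acquisition(probs):
    """Score each unlabeled sample by the predictive entropy of the
    model's output vector; higher entropy means more uncertainty."""
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

# Toy softmax output vectors for 4 unlabeled code samples (hypothetical).
probs = np.array([
    [0.98, 0.01, 0.01],  # confident prediction
    [0.34, 0.33, 0.33],  # nearly uniform: very uncertain
    [0.70, 0.20, 0.10],
    [0.50, 0.45, 0.05],
])

budget = 2  # label only the top-2 most uncertain samples
scores = entropy_acquisition(probs)
selected = np.argsort(scores)[::-1][:budget]
print(selected.tolist())  # indices chosen for labeling: [1, 3]
```

In an active-learning loop, the selected samples would be labeled, added to the training set, and the model retrained before the next round of selection.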
