OBMI: oversampling borderline minority instances by a two-stage Tomek link-finding procedure for class imbalance problem
Complex & Intelligent Systems (IF 5.8), Pub Date: 2024-04-08, DOI: 10.1007/s40747-024-01399-y
Qiangkui Leng, Jiamei Guo, Jiaqing Tao, Xiangfu Meng, Changzhong Wang

Mitigating the impact of class-imbalanced datasets on classifiers remains a challenge for the machine learning community. Conventional classifiers perform poorly on such data because they are biased toward the majority class. Among existing solutions, the synthetic minority oversampling technique (SMOTE) has shown great potential; it aims to improve the dataset rather than the classifier. However, SMOTE oversamples every minority instance equally, which leaves room for improvement. Based on the consensus that instances far from the borderline contribute less to classification, this paper proposes a refined method for oversampling borderline minority instances (OBMI) using a two-stage Tomek link-finding procedure. In the oversampling stage, pairs of instances from different classes that are nearest to each other are first found to form Tomek links. The minority instances in these Tomek links are then extracted as base instances. Finally, new minority instances are generated, each linearly interpolated between a base instance and one of its minority neighbors. To address the overlap caused by oversampling, the cleaning stage employs Tomek links again to remove borderline instances from both classes. OBMI is compared with ten baseline methods on 17 benchmark datasets. The results show that it performs better on most of the selected datasets in terms of F1-score and G-mean. Statistical analysis also indicates that it achieves a better Friedman ranking.
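
The two stages described in the abstract can be sketched in a few lines of NumPy. This is only an illustrative reading of the abstract, not the authors' implementation. The function names `tomek_links` and `obmi_sketch`, the neighbor count `k`, the number of synthetic instances `n_new`, the uniform choice of base instance and neighbor, and the single cleaning pass are all assumptions, since the abstract does not specify them. A Tomek link is taken here in its usual sense: two instances of different classes that are each other's nearest neighbor.

```python
import numpy as np

def tomek_links(X, y):
    """Index pairs (i, j), i < j, that are mutual nearest neighbours with different labels."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # an instance is not its own neighbour
    nn = d.argmin(axis=1)                # nearest neighbour of every instance
    return [(i, int(j)) for i, j in enumerate(nn)
            if nn[j] == i and y[i] != y[j] and i < j]

def obmi_sketch(X, y, minority=1, k=5, n_new=None, seed=0):
    """Hypothetical two-stage procedure: oversample near Tomek links, then clean."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Oversampling stage: minority endpoints of Tomek links are the base instances.
    base = sorted({i for pair in tomek_links(X, y) for i in pair if y[i] == minority})
    X_min = X[y == minority]
    if n_new is None:                    # assumption: balance the two classes
        n_new = int((y != minority).sum() - (y == minority).sum())

    synth = []
    if base and len(X_min) > 1 and n_new > 0:
        for _ in range(n_new):
            b = X[rng.choice(base)]
            dist = np.linalg.norm(X_min - b, axis=1)
            neigh = np.argsort(dist)[1:k + 1]      # skip the base itself (distance 0)
            nb = X_min[rng.choice(neigh)]
            synth.append(b + rng.random() * (nb - b))  # linear interpolation
    if synth:
        X = np.vstack([X, np.array(synth)])
        y = np.concatenate([y, np.full(len(synth), minority, dtype=y.dtype)])

    # Cleaning stage: drop both endpoints of every Tomek link (one pass, by assumption).
    drop = sorted({i for pair in tomek_links(X, y) for i in pair})
    keep = np.setdiff1d(np.arange(len(X)), drop)
    return X[keep], y[keep]

# Toy usage: 40 majority and 8 minority instances in two dimensions.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(1.5, 1.0, (8, 2))])
y_demo = np.array([0] * 40 + [1] * 8)
X_res, y_res = obmi_sketch(X_demo, y_demo, minority=1, k=3)
```

The pairwise distance matrix keeps the sketch short but costs quadratic memory, so a nearest-neighbor index would be a more sensible choice for larger datasets.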




Updated: 2024-04-08