Multi-purpose RNA language modelling with motif-aware pretraining and type-guided fine-tuning
Nature Machine Intelligence (IF 23.8) · Pub Date: 2024-05-13 · DOI: 10.1038/s42256-024-00836-4
Ning Wang, Jiang Bian, Yuchen Li, Xuhong Li, Shahid Mumtaz, Linghe Kong, Haoyi Xiong

Pretrained language models have shown promise in analysing nucleotide sequences, yet a versatile model that excels across diverse tasks with a single set of pretrained weights remains elusive. Here we introduce RNAErnie, an RNA-focused pretrained model built on the transformer architecture that employs two simple yet effective strategies. First, RNAErnie enhances pretraining by incorporating RNA motifs as biological priors, introducing motif-level random masking alongside masked language modelling at the base and subsequence levels. It also tokenizes RNA types (for example, miRNA, lncRNA) as stop words and appends them to sequences during pretraining. Second, for out-of-distribution tasks involving RNA sequences not seen during pretraining, RNAErnie applies a type-guided fine-tuning strategy that first predicts possible RNA types from an RNA sequence and then appends the predicted type to the tail of the sequence to refine its feature embedding in a post hoc way. Our extensive evaluation across seven datasets and five tasks demonstrates the superiority of RNAErnie in both supervised and unsupervised learning. It surpasses baselines with up to 1.8% higher accuracy in classification, 2.2% higher accuracy in interaction prediction and a 3.3% improvement in F1 score in structure prediction, showcasing its robustness and adaptability from a unified pretrained foundation.
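As a rough illustration of the two strategies, the sketch below mimics the three masking granularities (base, subsequence and motif level) and the type-token appending described above. This is a minimal, hypothetical Python sketch: the function names (mask_base_level, mask_motif_level, type_guided_embed and so on), the motif list and the stand-in classifier are illustrative assumptions, not taken from the RNAErnie code or paper.

import random

MASK = "[MASK]"

def mask_base_level(tokens, p=0.15):
    """Standard MLM: mask individual nucleotides independently."""
    return [MASK if random.random() < p else t for t in tokens]

def mask_subsequence_level(tokens, span=6, p=0.15):
    """Mask contiguous spans of nucleotides (k-mer-style masking)."""
    out = list(tokens)
    i = 0
    while i < len(out):
        if random.random() < p:
            for j in range(i, min(i + span, len(out))):
                out[j] = MASK
            i += span
        else:
            i += 1
    return out

def mask_motif_level(tokens, motifs, p=0.5):
    """Mask whole occurrences of known motifs, injecting biological priors."""
    seq = "".join(tokens)
    out = list(tokens)
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            if random.random() < p:
                for j in range(start, start + len(motif)):
                    out[j] = MASK
            start = seq.find(motif, start + 1)
    return out

def append_type_token(tokens, rna_type):
    """Append the coarse RNA type (e.g. miRNA) to the sequence tail."""
    return tokens + [f"[{rna_type}]"]

def type_guided_embed(tokens, type_classifier, embed):
    """Type-guided fine-tuning, post hoc: predict the RNA type from the
    raw sequence, append it, then embed the augmented sequence."""
    predicted = type_classifier(tokens)  # e.g. returns "miRNA"
    return embed(append_type_token(tokens, predicted))

if __name__ == "__main__":
    random.seed(0)
    seq = list("AUGGCUACGUAGCUAGCGAUCGUACGAU")
    print(mask_base_level(seq))
    print(mask_subsequence_level(seq))
    print(mask_motif_level(seq, motifs=["GCUA"]))  # "GCUA" is an arbitrary example motif
    # Stand-in classifier and embedder, placeholders for the model forward pass.
    fake_classifier = lambda toks: "miRNA"
    fake_embed = lambda toks: toks
    print(type_guided_embed(seq, fake_classifier, fake_embed))

In the actual model the masked sequences would feed a transformer masked-language-modelling objective, and type_guided_embed would call the pretrained encoder; the stubs here only show the data flow.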



Updated: 2024-05-14