A 5′ UTR language model for decoding untranslated regions of mRNA and function predictions
Nature Machine Intelligence (IF 23.8). Pub Date: 2024-04-05. DOI: 10.1038/s42256-024-00823-9
Yanyi Chu, Dan Yu, Yupeng Li, Kaixuan Huang, Yue Shen, Le Cong, Jason Zhang, Mengdi Wang

The 5′ untranslated region (UTR), a regulatory region at the beginning of a messenger RNA (mRNA) molecule, plays a crucial role in regulating the translation process and affects the protein expression level. Language models have showcased their effectiveness in decoding the functions of protein and genome sequences. Here, we introduce a language model for the 5′ UTR, which we refer to as the UTR-LM. The UTR-LM is pretrained on endogenous 5′ UTRs from multiple species and is further augmented with supervised information, including secondary structure and minimum free energy. We fine-tuned the UTR-LM on a variety of downstream tasks. The model outperformed the best-known benchmark by up to 5% for predicting mean ribosome loading, and by up to 8% for predicting translation efficiency and mRNA expression level. The model was also applied to identifying unannotated internal ribosome entry sites within the untranslated region and improved the area under the precision–recall curve from 0.37 to 0.52 compared with the best baseline. Further, we designed a library of 211 new 5′ UTRs with high predicted translation efficiency and evaluated them via a wet-laboratory assay. Experimental results confirmed that our top designs achieved a 32.5% increase in protein production level relative to well-established 5′ UTRs optimized for therapeutics.
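
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of the general approach: a nucleotide-level transformer encoder pretrained with masked-token prediction plus auxiliary heads for secondary structure and minimum free energy (MFE), then fine-tuned with a regression head for a downstream scalar property such as mean ribosome loading (MRL). This is an illustrative assumption, not the authors' released implementation; all names, dimensions and loss weights (UTREncoder, UTRPretrainModel, UTRRegressor, d_model=128, the 0.1 auxiliary weights) are hypothetical.

```python
# Illustrative sketch of a UTR language model workflow (not the authors' code).
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "<mask>": 1, "A": 2, "C": 3, "G": 4, "U": 5}

class UTREncoder(nn.Module):
    """Transformer encoder over 5' UTR nucleotide tokens."""
    def __init__(self, d_model=128, nhead=4, num_layers=4, max_len=256):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB), d_model, padding_idx=0)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, tokens):
        # tokens: (batch, seq_len) integer ids
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.embed(tokens) + self.pos(positions)[None, :, :]
        return self.encoder(h, src_key_padding_mask=tokens.eq(0))  # (batch, seq, d_model)

class UTRPretrainModel(nn.Module):
    """Masked-nucleotide prediction plus auxiliary structure/MFE supervision."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        d = encoder.embed.embedding_dim
        self.mlm_head = nn.Linear(d, len(VOCAB))  # predict masked nucleotides
        self.ss_head = nn.Linear(d, 3)            # per-base structure label: '(', ')', '.'
        self.mfe_head = nn.Linear(d, 1)           # sequence-level MFE regression

    def forward(self, tokens):
        h = self.encoder(tokens)
        mfe = self.mfe_head(h.mean(dim=1)).squeeze(-1)
        return self.mlm_head(h), self.ss_head(h), mfe

class UTRRegressor(nn.Module):
    """Fine-tuning head for a downstream scalar property (e.g. MRL or TE)."""
    def __init__(self, encoder):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Sequential(nn.Linear(encoder.embed.embedding_dim, 64),
                                  nn.ReLU(), nn.Linear(64, 1))

    def forward(self, tokens):
        h = self.encoder(tokens).mean(dim=1)      # mean-pool over positions
        return self.head(h).squeeze(-1)

# Example: combined pretraining loss on a toy batch (labels and weights are illustrative).
encoder = UTREncoder()
pretrain = UTRPretrainModel(encoder)
tokens = torch.randint(2, 6, (8, 50))             # 8 random 50-nt "UTRs"
mlm_logits, ss_logits, mfe_pred = pretrain(tokens)
loss = (nn.functional.cross_entropy(mlm_logits.transpose(1, 2), tokens)
        + 0.1 * nn.functional.cross_entropy(ss_logits.transpose(1, 2),
                                            torch.randint(0, 3, (8, 50)))
        + 0.1 * nn.functional.mse_loss(mfe_pred, torch.randn(8)))
loss.backward()
```

In this sketch the same encoder instance is shared between stages, mirroring the pretrain-then-fine-tune workflow in the abstract: fine-tuning would wrap the pretrained encoder in UTRRegressor and train it with a mean-squared-error loss against measured values such as mean ribosome loading or translation efficiency.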




Updated: 2024-04-05