Assessment of a Large Language Model’s Responses to Questions and Cases About Glaucoma and Retina Management
JAMA Ophthalmology (IF 8.1). Pub Date: 2024-02-22. DOI: 10.1001/jamaophthalmol.2023.6917
Andy S. Huang, Kyle Hirabayashi, Laura Barna, Deep Parikh, Louis R. Pasquale

Importance
Large language models (LLMs) are revolutionizing medical diagnosis and treatment, offering unprecedented accuracy and ease surpassing conventional search engines. Their integration into medical assistance programs will become pivotal for ophthalmologists as an adjunct for practicing evidence-based medicine. Therefore, the diagnostic and treatment accuracy of LLM-generated responses compared with fellowship-trained ophthalmologists can help assess their accuracy and validate their potential utility in ophthalmic subspecialties.

Objective
To compare the diagnostic accuracy and comprehensiveness of responses from an LLM chatbot with those of fellowship-trained glaucoma and retina specialists on ophthalmological questions and real patient case management.

Design, Setting, and Participants
This comparative cross-sectional study recruited 15 participants aged 31 to 67 years, including 12 attending physicians and 3 senior trainees, from eye clinics affiliated with the Department of Ophthalmology at Icahn School of Medicine at Mount Sinai, New York, New York. Glaucoma and retina questions (10 of each type) were randomly selected from the American Academy of Ophthalmology's Commonly Asked Questions. Deidentified glaucoma and retinal cases (10 of each type) were randomly selected from ophthalmology patients seen at Icahn School of Medicine at Mount Sinai–affiliated clinics. The LLM used was GPT-4 (version dated May 12, 2023). Data were collected from June to August 2023.

Main Outcomes and Measures
Responses were assessed via a Likert scale for medical accuracy and completeness. Statistical analysis involved the Mann-Whitney U test and the Kruskal-Wallis test, followed by pairwise comparison.

Results
The combined question-case mean rank for accuracy was 506.2 for the LLM chatbot and 403.4 for glaucoma specialists (n = 831; Mann-Whitney U = 27976.5; P < .001), and the mean rank for completeness was 528.3 and 398.7, respectively (n = 828; Mann-Whitney U = 25218.5; P < .001). The mean rank for accuracy was 235.3 for the LLM chatbot and 216.1 for retina specialists (n = 440; Mann-Whitney U = 15518.0; P = .17), and the mean rank for completeness was 258.3 and 208.7, respectively (n = 439; Mann-Whitney U = 13123.5; P = .005). The Dunn test revealed a significant difference between all pairwise comparisons, except specialist vs trainee in rating chatbot completeness. The overall pairwise comparisons showed that both trainees and specialists rated the chatbot's accuracy and completeness more favorably than those of their specialist counterparts, with specialists noting a significant difference in the chatbot's accuracy (z = 3.23; P = .007) and completeness (z = 5.86; P < .001).

Conclusions and Relevance
This study accentuates the comparative proficiency of LLM chatbots in diagnostic accuracy and completeness compared with fellowship-trained ophthalmologists in various clinical scenarios. The LLM chatbot outperformed glaucoma specialists and matched retina specialists in diagnostic and treatment accuracy, substantiating its role as a promising diagnostic adjunct in ophthalmology.
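The Mann-Whitney U statistics reported in the Results (e.g. U = 27976.5 for the glaucoma accuracy comparison) summarize how the pooled ranks of Likert ratings split between two rater groups. A minimal pure-Python sketch of how such a statistic is computed, using mid-ranks for tied ratings; the Likert scores below are synthetic illustrations, not the study's data:

```python
# Sketch of the Mann-Whitney U statistic for two groups of Likert ratings.
# Mid-ranks are assigned to ties, as standard rank-based tests do.

def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two samples, assigning mid-ranks to ties."""
    pooled = sorted(sample_a + sample_b)
    # Map each distinct value to the average of its 1-indexed rank positions.
    ranks = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Positions i+1 .. j are tied; average rank is (i + 1 + j) / 2.
        ranks[pooled[i]] = (i + 1 + j) / 2
        i = j
    rank_sum_a = sum(ranks[v] for v in sample_a)
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a  # the two U values always sum to n_a * n_b
    return u_a, u_b

# Synthetic 1-5 Likert accuracy ratings: chatbot answers vs. specialist answers.
chatbot = [5, 4, 5, 5, 4]
specialist = [3, 4, 3, 4, 3]
u_chatbot, u_specialist = mann_whitney_u(chatbot, specialist)
print(u_chatbot, u_specialist)  # 23.0 2.0 for these synthetic ratings
```

The p-values in the abstract would then come from the U statistic's null distribution (a normal approximation at these sample sizes); the Kruskal-Wallis test plays the analogous role when more than two rater groups are compared at once, before the pairwise Dunn tests.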

Updated: 2024-02-22