Potential of GPT-4 for Detecting Errors in Radiology Reports: Implications for Reporting Accuracy
Radiology (IF 19.7), Pub Date: 2024-04-16, DOI: 10.1148/radiol.232714
Roman Johannes Gertz, Thomas Dratsch, Alexander Christian Bunck, Simon Lennartz, Andra-Iza Iuga, Martin Gunnar Hellmich, Thorsten Persigehl, Lenhard Pennig, Carsten Herbert Gietzen, Philipp Fervers, David Maintz, Robert Hahnfeldt, Jonathan Kottlors

Background

Errors in radiology reports may occur because of resident-to-attending discrepancies, speech recognition inaccuracies, and high workloads. Large language models, such as GPT-4 (ChatGPT; OpenAI), may assist in generating reports.

Purpose

To assess the effectiveness of GPT-4 in identifying common errors in radiology reports, focusing on performance, time, and cost-efficiency.

Materials and Methods

In this retrospective study, 200 radiology reports (radiography and cross-sectional imaging [CT and MRI]) were compiled between June 2023 and December 2023 at a single institution. A total of 150 errors from five common error categories (omission, insertion, spelling, side confusion, and other) were intentionally inserted into 100 of the reports and served as the reference standard. Six radiologists (two senior radiologists, two attending physicians, and two residents) and GPT-4 were tasked with detecting these errors. Overall error detection performance, error detection in the five error categories, and reading time were assessed using Wald χ2 tests and paired-sample t tests.
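The abstract names the statistical tests but not their implementation. As a minimal sketch of the reading-time comparison, the paired-sample t test (and the paired Cohen d reported in the Results) could be computed as below; the timing arrays are synthetic stand-ins, since the study's per-report data are not part of the abstract.

import numpy as np
from scipy.stats import ttest_rel

# Hypothetical per-report reading times in seconds; the means and SDs
# mirror those in the Results, but the draws themselves are synthetic.
rng = np.random.default_rng(0)
gpt4_times = rng.normal(3.5, 0.5, size=200)
reader_times = np.clip(rng.normal(25.1, 20.1, size=200), 0.1, None)  # clip to avoid negative times in the toy data

t_stat, p_value = ttest_rel(gpt4_times, reader_times)

# Cohen d for paired samples: mean of the paired differences divided
# by the SD of the differences.
diff = gpt4_times - reader_times
cohen_d = diff.mean() / diff.std(ddof=1)
print(f"t = {t_stat:.2f}, P = {p_value:.3g}, Cohen d = {cohen_d:.2f}")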

Results

GPT-4 (detection rate, 82.7%; 124 of 150; 95% CI: 75.8, 87.9) matched the average detection performance of radiologists independent of their experience (senior radiologists, 89.3% [134 of 150; 95% CI: 83.4, 93.3]; attending physicians, 80.0% [120 of 150; 95% CI: 72.9, 85.6]; residents, 80.0% [120 of 150; 95% CI: 72.9, 85.6]; P value range, .522–.99). One senior radiologist outperformed GPT-4 (detection rate, 94.7%; 142 of 150; 95% CI: 89.8, 97.3; P = .006). GPT-4 required less processing time per radiology report than the fastest human reader in the study (mean reading time, 3.5 seconds ± 0.5 [SD] vs 25.1 seconds ± 20.1, respectively; P < .001; Cohen d = −1.08). The use of GPT-4 resulted in lower mean correction cost per report than the most cost-efficient radiologist ($0.03 ± 0.01 vs $0.42 ± 0.41; P < .001; Cohen d = −1.12).
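The abstract does not state which CI method was used, but a Wilson score interval reproduces every reported bound (e.g., 124 of 150 yields 75.8%, 87.9%), so the following sketch using statsmodels is a plausible reconstruction of the reported intervals:

from statsmodels.stats.proportion import proportion_confint

# Detected errors out of the 150 reference-standard errors.
detections = {
    "GPT-4": 124,
    "senior radiologists (group)": 134,
    "attending physicians (group)": 120,
    "residents (group)": 120,
    "best senior radiologist": 142,
}

for reader, k in detections.items():
    lo, hi = proportion_confint(k, 150, alpha=0.05, method="wilson")
    print(f"{reader}: {k / 150:.1%} (95% CI: {lo:.1%}, {hi:.1%})")
# GPT-4 -> 82.7% (95% CI: 75.8%, 87.9%), matching the abstract.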

Conclusion

The radiology report error detection rate of GPT-4 was comparable with that of radiologists, potentially reducing work hours and cost.

© RSNA, 2024

See also the editorial by Forman in this issue.


