
Generative LLMs as Automatic Proofreaders of Radiology Reports

Fine-tuned AI models are enhancing accuracy and safety in radiology reporting


Yifan Peng, PhD

Specially trained generative large language models (LLMs) may hold the key to an efficient proofreading process for radiology reporting.

This potential is especially relevant given the limitations of current technologies. Though useful, speech recognition programs are prone to error, particularly when transcribing complex radiology terminology or when dealing with background noise.

Even small errors in radiology reports can have serious consequences for patients, and identifying these errors can be a challenge.

“By helping radiologists avoid errors, LLMs can indirectly help protect patients from preventable medical mistakes, ensuring overall safer care,” said Yifan Peng, PhD, associate professor in the Department of Population Health Sciences and Department of Radiology at Weill Cornell Medicine in New York.

In a study published in Radiology, Dr. Peng and colleagues explore the capabilities of LLMs for error detection. “We imagine these models acting like a smart proofreader sitting beside the radiologist, automatically scanning reports for possible errors and highlighting areas that might need a second look,” Dr. Peng said. “This could make reporting faster, more accurate and less stressful, especially in busy hospital settings.”

“Language has a complementary and equally important role compared with visual perceptual inputs, since radiologists need to write reports and communicate complex findings comprehensibly, accurately and efficiently,” wrote the authors of a related Radiology editorial, Cristina Marrocchio, MD, and Nicola Sverzellati, MD, PhD, from the Department of Medicine and Surgery at the University of Parma in Italy.

Fine-Tuning Improves Model Performance

Dr. Peng and his team evaluated the error detection performance of three LLMs. They built a two-part dataset, the first of which included 828 pairs of synthetic chest radiography reports generated using GPT-4; each pair consisted of an error-free report and a report containing errors.

The second part of the dataset sampled 307 reports from the MIMIC chest radiography (MIMIC-CXR) database and used GPT-4 to create 307 corresponding error-containing synthetic reports.

The researchers focused on four types of errors common in radiology reports: negation errors (misinterpreting “no” and “not”), left/right errors, interval change errors (confusing time-related findings), and transcription errors. This two-part dataset was divided into two groups, with approximately 80% of the data used for training the models, and the remaining 20% set aside to test how well the models performed on data they hadn’t seen before.
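To picture how such a dataset might be organized and divided, the sketch below shows one minimal way labeled report pairs could be split roughly 80/20 into training and test sets. The record structure, field names and split routine are illustrative assumptions, not the authors' actual pipeline.

```python
import random

# Minimal sketch, assuming a simple record structure: each entry pairs an
# error-free report with its synthetic error-containing counterpart and the
# type of error that was injected. Field names are illustrative only.
dataset = [
    {"clean_report": "...", "error_report": "...", "error_type": "negation"},
    # ... the remaining report pairs from both parts of the dataset
]

random.seed(42)                        # reproducible shuffle
random.shuffle(dataset)

split_index = int(0.8 * len(dataset))  # roughly 80% for training
train_set = dataset[:split_index]      # used to fine-tune the models
test_set = dataset[split_index:]       # held out to measure performance
```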

Models were tested for their ability to accurately identify errors in the test set as well as in some real-world radiology reports.

The researchers evaluated GPT-4-1106-Preview and Llama-3-70B-Instruct, along with versions of Llama-3-70B-Instruct and BiomedBERT that were fine-tuned for greater task-specific accuracy, comparing how each performed on the test set.

"Fine-tuning is the next step after a model learns general language patterns,” Dr. Peng said. “During fine-tuning, the model undergoes additional training using smaller, targeted datasets relevant to particular tasks."

The fine-tuned Llama-3-70B-Instruct model performed best, achieving the highest F1 scores across all error types. An F1 score measures a model’s accuracy by combining recall (how many of the true errors it finds) and precision (how often its flagged errors are real). The fine-tuned model’s overall macro F1 score, averaged across the four error types, was 0.780.

The original “vanilla” Llama-3-70B-Instruct, however, had an F1 score of 0.538, suggesting that fine-tuning on relevant data significantly improves the model’s performance on a specialized task. GPT-4 and BiomedBERT achieved F1 scores of 0.683 and 0.657, respectively.
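As a rough illustration of how such scores are computed, the sketch below derives a per-error-type F1 from precision and recall and averages the results into a macro F1. The counts are made-up placeholders, not the study’s data.

```python
# Illustrative only: per-type F1 from precision and recall, then macro F1.
def f1(true_pos: int, false_pos: int, false_neg: int) -> float:
    precision = true_pos / (true_pos + false_pos)  # how often a flagged error is real
    recall = true_pos / (true_pos + false_neg)     # how many real errors are found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for the four error categories (not the study's numbers).
per_type_f1 = {
    "negation":        f1(40, 8, 12),
    "left/right":      f1(35, 10, 15),
    "interval change": f1(30, 12, 18),
    "transcription":   f1(38, 9, 13),
}

macro_f1 = sum(per_type_f1.values()) / len(per_type_f1)  # unweighted average across types
print(round(macro_f1, 3))
```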


Quality Over Quantity for AI Accuracy

Like fine-tuning, prompt engineering is also increasingly recognized as a valuable means of optimizing LLMs’ ability to perform specific tasks.

Prompt engineering involves the user providing a model with task-specific examples along with their query.

“This can help the model learn how the task is to be performed,” Dr. Peng said. “For example, prompt engineering can demonstrate to the model a user’s preferred response format.”

Dr. Peng and colleagues tested three prompting strategies on the fine-tuned Llama-3-70B-Instruct model, including:

  • Zero-shot prompting, in which the model was given no examples alongside its prompt.
  • One-shot prompting, in which the model was given one example of a radiology report with a labeled error.
  • Four-shot prompting, in which the model was given four example reports, each containing a different type of error.
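The practical difference between these strategies is simply how many worked examples are prepended to the report being checked. The sketch below shows one plausible way such prompts could be assembled; the instruction wording and the labeled example are assumptions for illustration, not the prompts used in the study.

```python
# Illustrative prompt assembly for zero-, one- and four-shot error detection.
INSTRUCTION = "Review the following radiology report and list any errors you find."

# Hypothetical labeled example (not taken from the study).
EXAMPLES = [
    ("Report: No pneumothorax is seen. Impression: Pneumothorax present.",
     "Negation error: the impression contradicts the finding."),
    # ... one example each for left/right, interval change and transcription errors
]

def build_prompt(report: str, n_shots: int) -> str:
    shots = "\n\n".join(
        f"Example report:\n{rep}\nLabeled error: {label}"
        for rep, label in EXAMPLES[:n_shots]
    )
    parts = [INSTRUCTION, shots, f"Report to check:\n{report}"]
    return "\n\n".join(p for p in parts if p)

zero_shot = build_prompt("...", n_shots=0)  # no examples
one_shot  = build_prompt("...", n_shots=1)  # one labeled example
four_shot = build_prompt("...", n_shots=4)  # one example per error type
```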

Surprisingly, one-shot and four-shot prompting did not improve the model’s overall F1 score compared to zero-shot prompting.

“We expected that giving the model example reports would always improve its accuracy, but in some cases the zero-shot setup worked just as well or even better,” Dr. Peng said. “We found that when we did use examples, the quality and relevance of those examples mattered more than the quantity, showing that thoughtful design of prompts makes a big difference in LLM performance.”

LLMs may have many potential roles in all steps of radiology, Drs. Marrocchio and Sverzellati noted. However, a thorough evaluation of their performance and generalizability is essential.

LLM Performance Generalizes to Real-World Data

To further validate the models’ accuracy, the researchers tested the performance of GPT-4-1106-Preview and the fine-tuned Llama-3-70B-Instruct in detecting errors in real-world radiology reports.

From a dataset of 120,025 thoracoabdominal chest radiography reports from patients at New York-Presbyterian/Weill Cornell Medical Center, 55,339 reports were randomly selected and de-identified for analysis by the two models.

“In this assessment, the models independently ‘voted’ on whether a report contained errors, increasing the likelihood of accurate error detection,” Dr. Peng said. “This voting mechanism resulted in 606 reports flagged by both models as containing a specific type of error.”

Two board-certified radiologists reviewed 200 of these 606 flagged reports (50 of each error type). Ninety-nine of the 200 reports were confirmed by both radiologists to contain the errors identified by the models, an accuracy rate of 0.495. In 163 cases, at least one of the two radiologists agreed with the models, for an error detection accuracy rate of 0.815.
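In other words, a report was kept only when both models agreed it contained an error, and accuracy was then measured against the radiologists’ reviews. The sketch below reproduces that arithmetic; the voting function is a simplified assumption, while the counts come from the figures reported above.

```python
# Illustrative: dual-model "voting" plus radiologist-confirmed accuracy.
def both_models_flag(gpt4_flags_error: bool, llama_flags_error: bool) -> bool:
    return gpt4_flags_error and llama_flags_error  # keep a report only if both models agree

# Reviewed subset described in the article: 200 flagged reports, 50 per error type.
reviewed_reports = 200
confirmed_by_both_radiologists = 99
confirmed_by_at_least_one = 163

print(confirmed_by_both_radiologists / reviewed_reports)  # 0.495
print(confirmed_by_at_least_one / reviewed_reports)       # 0.815
```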

“AI doesn’t need to replace doctors to make a big impact. When designed carefully and fine-tuned for a specific purpose, it can enhance accuracy, reduce errors and free up radiologists’ time for more complex tasks.”

— YIFAN PENG, PHD

Bar graphs show radiologist-confirmed accuracy of error detection across different categories in radiology reports. (A) Errors confirmed by both radiologists. (B) Errors confirmed by at least one radiologist.

https://doi.org/10.1148/radiol.242575 © RSNA 2025

AI Proofreaders: Powerful But Limited

Overall, the study demonstrates that LLMs can be trained to detect errors relevant for medical proofreading, but that their accuracy is highly dependent on the base model used, fine-tuning, and strategic prompt design.

“This study is an early but important step toward integrating trustworthy, locally controlled AI tools into clinical workflows,” Dr. Peng said. “Over time, the goal would be to check for consistency across other documents too—like lab results or prior studies—and to help standardize how findings are phrased. Eventually, these tools could even help make reports clearer and more patient friendly.”

While the preliminary outcomes of this study are promising, the authors acknowledge some limitations of their work.

“First, our models were focused specifically on chest X-ray reports, so we can’t assume they’ll perform the same way on other types of medical reports,” Dr. Peng said. “Also, the fine-tuning process may cause overfitting, meaning the model learns the training data too closely and might not generalize as well to new data. Lastly, implementing these models in real hospitals can be limited by computing requirements, cost and data privacy concerns.”

Fine-tuning demands substantial computing resources as well as large amounts of high-quality, annotated data. The researchers therefore opted to generate synthetic datasets, thus ensuring they had sufficient training data while also protecting patient privacy.

While the generation of synthetic reports helped address these issues, it introduced its own caveats, such as the potential for hallucinations, amplified human-generated bias, and poor generalizability, all of which would hamper a synthetically trained model’s performance on real-world data.

“It was a challenge to make sure that the synthetic (AI-generated) reports used for training were realistic,” Dr. Peng said. “To handle this, radiologists carefully reviewed some of these reports to confirm that the errors were valid and meaningful.”

Drs. Marrocchio and Sverzellati caution that before these tools are implemented in clinical practice, LLMs must be thoroughly evaluated for their ability to give consistent answers under varying conditions and for their overall reliability.

While cognizant of these enduring challenges, Dr. Peng is optimistic about the future of AI-based tools in medicine.

“AI doesn’t need to replace doctors to make a big impact. When designed carefully and fine-tuned for a specific purpose, it can enhance accuracy, reduce errors and free up radiologists’ time for more complex tasks,” Dr. Peng concluded. “We hope our research encourages others to explore how AI can be used responsibly and transparently in health care.”

For More Information

Access the Radiology article, “Generative Large Language Models Trained for Detecting Errors in Radiology Reports,” and the related editorial, “Will Generative Large Language Models Become Radiologists’ Invaluable Allies?”

Read previous RSNA News articles on the use of LLMs in medical imaging: