AI Chest X-Ray Model Analysis Reveals Race and Sex Bias

Study highlights risk of using foundation models in medical imaging AI


An AI chest X-ray foundation model for disease detection demonstrated racial and sex-related bias, leading to uneven performance across patient subgroups, and may be unsafe for clinical applications, according to a study published today in Radiology: Artificial Intelligence. The study aims to highlight the potential risks of using foundation models in the development of medical imaging artificial intelligence.

“There’s been a lot of work developing AI models to help doctors detect disease in medical scans,” said lead researcher Ben Glocker, PhD, professor of machine learning for imaging at Imperial College London in the U.K. “However, it can be quite difficult to get enough training data for a specific disease that is representative of all patient groups.”

Due to the difficulty of collecting large volumes of high-quality training data, the AI field has moved toward using deep-learning foundation models that have been trained for other purposes. Foundation models are AI neural networks trained on large, often unlabeled datasets. They can be adapted to handle jobs ranging from translating text to analyzing medical images.

“Despite their increasing popularity, we know little about potential biases in foundation models that could affect downstream uses,” Dr. Glocker said.

Dr. Glocker’s research team compared the performance of a recently published chest X-ray foundation model with that of a reference model built by the team, evaluating both on 127,118 chest X-rays with associated diagnostic labels. The pre-trained foundation model was built with more than 800,000 chest X-rays from India and the U.S.

Work Starts with the Data, Not the Model

The researchers completed a comprehensive performance analysis to determine how well the models performed for individual subgroups. The 42,884 patients (mean age, 63 years; 23,623 male) in the study group included Asian, Black and white patients.

Bias analysis showed significant differences between features related to disease detection across biological sex and race.

“Our bias analysis showed that the foundation model consistently underperformed compared to the reference model,” Dr. Glocker said. “We observed a decline in disease classification performance and specific disparities in protected subgroups.”


Comparison of disease detection performance across patient subgroups. Average classification performance across patient subgroups is shown in terms of Youden’s J statistic for the DenseNet-121 CheXpert model and three variants of the CXR foundation model. Classification performance is shown on four different labels of (A) ‘no finding’, (B) ‘pleural effusion’, (C) ‘cardiomegaly’, and (D) ‘pneumothorax’. The CXR foundation models consistently underperformed compared with the CheXpert model, with specific underperformance on the subgroup of female patients for ‘no finding’ and the subgroup of Black patients on ‘pleural effusion’. There was also a drastic decrease in overall performance across all subgroups for the CXR foundation models for ‘cardiomegaly’. CXR = chest radiography, MLP = multilayer perceptrons. ©RSNA 2023

In the features related to disease detection, significant differences were found between male and female patients and between Asian and Black patients. Compared with the average model performance across all subgroups, classification performance on the ‘no finding’ label dropped between 6.8% and 7.8% for female patients, and performance in detecting ‘pleural effusion’ dropped between 10.7% and 11.6% for Black patients.
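For readers unfamiliar with the metric named in the figure caption, Youden's J statistic is sensitivity plus specificity minus 1, so 1 indicates perfect classification and 0 indicates performance no better than chance. The sketch below shows one way a per-subgroup comparison of this kind could be computed. It is an illustration only, not the study's code: the function names `youden_j` and `subgroup_report`, the toy inputs, and the choice to report each subgroup's gap as a simple difference from the unweighted mean across subgroups are all assumptions.

```python
from collections import defaultdict


def youden_j(y_true, y_pred):
    """Youden's J = sensitivity + specificity - 1 for binary labels (0/1).

    Assumes the inputs contain at least one positive and one negative case.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity + specificity - 1


def subgroup_report(y_true, y_pred, groups):
    """Return ({group: (J, J - mean J)}, mean J across subgroups).

    The "gap" is a plain difference from the unweighted subgroup mean;
    this is an illustrative choice, not the definition used in the study.
    """
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    scores = {g: youden_j(ts, ps) for g, (ts, ps) in by_group.items()}
    mean_j = sum(scores.values()) / len(scores)
    return {g: (j, j - mean_j) for g, j in scores.items()}, mean_j


# Toy data (hypothetical): the classifier is perfect on group "A"
# and at chance level on group "B".
labels = [1, 1, 0, 0, 1, 1, 0, 0]
preds = [1, 1, 0, 0, 1, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
report, mean_j = subgroup_report(labels, preds, groups)
```

A subgroup whose gap is clearly negative is performing below the cross-subgroup average, which is the kind of disparity the analysis described above is designed to surface.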


“Dataset size alone does not guarantee a better or fairer model,” Dr. Glocker said. “We need to be very careful about data collection to ensure diversity and representativeness.”

He noted that it’s important that foundation models are published and shared.

“To minimize the risk of bias associated with the use of foundation models for clinical decision-making, these models need to be fully accessible and transparent,” he said.

Dr. Glocker is an advocate for comprehensive bias analysis as an integral part of the development and auditing of foundation models.

“AI is often seen as a black box, but that’s not entirely true,” he said. “We can open the box and inspect the features. Model inspection is one way of continuously monitoring and flagging issues that need a second look.”
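As a rough picture of what inspecting a model's features might involve, the sketch below compares the distribution of each feature dimension between two patient subgroups using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical cumulative distributions, ranging from 0 for identical samples to 1 for fully separated ones). This is a hypothetical illustration, not the procedure used in the study: the function names, the toy feature vectors, and the decision to test each raw dimension separately are all assumptions.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic for two lists of numbers."""
    a, b = sorted(a), sorted(b)
    # Evaluate both empirical CDFs at every observed value and keep the
    # largest absolute difference.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def feature_shift_by_dimension(features, groups, group_a, group_b):
    """KS statistic per feature dimension between two subgroups.

    `features` is a list of equal-length feature vectors and `groups`
    holds the subgroup label of each vector (both hypothetical inputs).
    """
    rows_a = [f for f, g in zip(features, groups) if g == group_a]
    rows_b = [f for f, g in zip(features, groups) if g == group_b]
    n_dims = len(features[0])
    return [
        ks_statistic([r[d] for r in rows_a], [r[d] for r in rows_b])
        for d in range(n_dims)
    ]


# Toy example: dimension 0 separates the two groups, dimension 1 does not.
toy_features = [[0.0, 5.0], [1.0, 6.0], [10.0, 5.0], [11.0, 6.0]]
toy_groups = ["a", "a", "b", "b"]
shifts = feature_shift_by_dimension(toy_features, toy_groups, "a", "b")
```

Dimensions with a large statistic are ones where the model encodes the two subgroups differently, which makes them candidates for the "second look" described in the quote.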

The work doesn’t start with the AI model; it starts with the data used to build it, Dr. Glocker noted.

“As we collect the next dataset, we need to, from day one, make sure AI is being used in a way that will benefit everyone,” he said.

For More Information

Access the Radiology: Artificial Intelligence study, “Risk of Bias in Chest Radiography Deep Learning Foundation Models.”
