AI models help diagnose medical conditions from images like X-rays. However, studies reveal they can perform unevenly across demographics, often less accurately for women and people of color. Surprisingly, MIT researchers found in 2022 that AI can predict a patient’s race from chest X-rays, a skill beyond even expert radiologists.
The research team discovered that the models most accurate at predicting demographics also showed the largest “fairness gaps”: discrepancies in diagnostic accuracy across racial and gender groups that often lead to incorrect results for women, Black patients, and other groups.
According to Marzyeh Ghassemi, an MIT professor involved in the study, machine-learning models excel at predicting demographics like race or gender. The new work shows that this ability correlates with how unevenly the models perform across diverse groups, a connection that had not been established before.
The researchers also found they could improve fairness by retraining the models to reduce biases. However, this “debiasing” method worked best when the models were tested on patients similar to those they were initially trained on, such as from the same hospital. When applied to patients from different hospitals, the fairness issues resurfaced.
Haoran Zhang, an MIT graduate student and one of the paper’s lead authors, emphasizes two main points: first, external models should be rigorously evaluated on your own data, because fairness assurances from developers may not hold for your specific patient population. Second, whenever possible, train models on your own data to ensure relevance and accuracy.
Yuzhe Yang, another MIT graduate student and lead author, collaborated on the study, which was published in Nature Medicine. Co-authors include Judy Gichoya of Emory University School of Medicine and Dina Katabi of MIT.
As of May 2024, the FDA has approved 882 AI-enabled medical devices, 671 of which are tailored for radiology. Since 2022, when researchers demonstrated AI’s capability to accurately predict race from diagnostic images, further studies have revealed that these models also excel at predicting gender and age despite not being explicitly trained for these tasks.
Ghassemi notes that many machine-learning models can predict demographics with accuracy that surpasses human experts in some respects. During training, however, the models inadvertently learn these non-medical factors, which may not align with clinical goals.
In their study, the researchers set out to investigate why AI models perform differently for different groups. In particular, they examined whether the models rely on demographic cues, a shortcut that can produce less accurate predictions for specific groups when the AI leans on demographic factors rather than on the image features that actually indicate disease.
Using chest X-ray data from Beth Israel Deaconess Medical Center, the researchers trained models to detect fluid buildup in lungs, collapsed lungs, or enlarged hearts. They then tested these models on new X-rays not used during training. While the models generally performed well, they showed “fairness gaps” — differences in accuracy rates between men and women and between white and Black patients.
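To make the idea of a “fairness gap” concrete, here is a minimal sketch of how one could be measured on a held-out test set: compute a performance metric separately for each demographic subgroup and take the spread between the best and worst subgroup. The data, the AUROC metric, and the group labels below are illustrative placeholders, not the study’s actual setup.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hypothetical held-out test set: model scores, true disease labels, and a
# demographic attribute for each X-ray (all randomly generated placeholders).
scores = rng.random(1000)                       # predicted probability of disease
labels = rng.integers(0, 2, size=1000)          # ground-truth disease labels
groups = rng.choice(["group_a", "group_b"], size=1000)

def subgroup_auroc(scores, labels, groups):
    """AUROC computed separately within each demographic subgroup."""
    return {g: roc_auc_score(labels[groups == g], scores[groups == g])
            for g in np.unique(groups)}

per_group = subgroup_auroc(scores, labels, groups)
fairness_gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"fairness gap: {fairness_gap:.3f}")
```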
The models accurately predicted the gender, race, and age of the X-ray subjects. Interestingly, there was a strong link between how well the models predicted demographics and the size of their fairness gaps. This suggests that the models use demographic information as a shortcut when making disease predictions.
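As an illustration of the kind of relationship described above, the short sketch below correlates, across a handful of hypothetical models, how well each encodes demographics with the size of its fairness gap. All numbers are invented purely to show the computation, not results from the study.

```python
import numpy as np

# Hypothetical results for five models: how accurately each predicts a
# demographic attribute, and the fairness gap each shows on diagnosis.
demographic_auroc = np.array([0.68, 0.74, 0.81, 0.86, 0.92])
fairness_gap = np.array([0.03, 0.05, 0.07, 0.09, 0.12])

# Pearson correlation between demographic encoding and fairness gap.
r = np.corrcoef(demographic_auroc, fairness_gap)[0, 1]
print(f"r = {r:.2f}")  # values near 1 indicate the strong link reported above
```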
To address these fairness gaps, the researchers tested two approaches. First, they trained some models to prioritize “subgroup robustness,” rewarding them for better performance on the subgroup with the worst accuracy and penalizing higher error rates for one group compared to others. Second, they employed “group adversarial” techniques to remove demographic cues from the images entirely. Both methods proved effective in reducing fairness gaps.
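The first of these ideas can be sketched in a few lines. Assuming a standard PyTorch classifier over extracted image features, the loss below simply takes the worst per-subgroup average loss in each batch, in the spirit of subgroup-robust (group DRO-style) training; it is an illustrative simplification, not the study’s exact training code.

```python
import torch
import torch.nn.functional as F

def worst_group_loss(logits, labels, groups):
    """Return the largest per-subgroup mean loss in the batch."""
    per_sample = F.cross_entropy(logits, labels, reduction="none")
    group_losses = [per_sample[groups == g].mean() for g in torch.unique(groups)]
    return torch.stack(group_losses).max()

# Example batch with made-up shapes: 8 feature vectors, 2 classes, 2 groups.
model = torch.nn.Linear(512, 2)
features = torch.randn(8, 512)
labels = torch.randint(0, 2, (8,))
groups = torch.randint(0, 2, (8,))

loss = worst_group_loss(model(features), labels, groups)
loss.backward()  # optimizing this penalizes high error on any single subgroup
```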
According to Marzyeh Ghassemi, these methods can mitigate fairness issues without sacrificing overall performance when dealing with data from similar distributions. Subgroup robustness encourages models to be sensitive to mispredictions in specific groups. In contrast, group adversarial methods aim to eliminate group information.
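The group adversarial idea is commonly implemented with a gradient-reversal layer: an auxiliary head tries to predict the demographic group from the model’s features, while the reversed gradient pushes the feature extractor to discard that information. The module names and sizes below are assumptions for illustration, not the study’s architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

encoder = nn.Linear(512, 128)      # stand-in for an image feature extractor
disease_head = nn.Linear(128, 2)   # predicts the diagnostic label
group_head = nn.Linear(128, 2)     # adversary: predicts the demographic group

features = encoder(torch.randn(8, 512))
disease_loss = F.cross_entropy(disease_head(features), torch.randint(0, 2, (8,)))
group_loss = F.cross_entropy(
    group_head(GradReverse.apply(features, 1.0)), torch.randint(0, 2, (8,))
)
(disease_loss + group_loss).backward()  # encoder is rewarded for hiding group info
```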
These debiasing approaches only worked well when the models were tested on data similar to their training data, such as the Beth Israel Deaconess Medical Center dataset. When the “debiased” models were evaluated on data from five other hospitals, their overall accuracy remained high, but significant fairness gaps reappeared.
Haoran Zhang highlighted that debiasing a model on one dataset doesn’t guarantee fairness when it is applied to patients from a different hospital. This is concerning because many hospitals use models trained on data from other institutions, and those models may not generalize well to new patient populations.
Marzyeh Ghassemi emphasized that even state-of-the-art models optimized for a specific dataset may not perform optimally for all patient groups in new settings. She noted that models are often trained on one hospital’s data and then deployed broadly, which can produce inaccurate results for specific groups.
The researchers observed that models debiased with group adversarial methods showed slightly better fairness on new patient groups than those trained for subgroup robustness. They plan to develop and test additional methods for building models that make fair predictions across diverse datasets.
The study suggests hospitals should thoroughly evaluate AI models on their patient populations before widespread deployment to ensure accurate and fair outcomes for all groups.
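In practice, that evaluation could be as simple as a gate in the deployment pipeline: compute the model’s metric per subgroup on the hospital’s own labeled data and block deployment if the gap exceeds a chosen tolerance. The threshold, metric, and numbers below are arbitrary placeholders, not a recommendation from the study.

```python
def audit_before_deployment(per_group_metric, max_gap=0.05):
    """Return True if the gap between best and worst subgroup is within tolerance."""
    gap = max(per_group_metric.values()) - min(per_group_metric.values())
    print(f"subgroup metrics: {per_group_metric}, gap = {gap:.3f}")
    return gap <= max_gap

# Made-up local validation numbers for illustration.
if audit_before_deployment({"female": 0.86, "male": 0.90, "Black": 0.84, "white": 0.91}):
    print("within tolerance: proceed to deployment")
else:
    print("gap too large: retrain or debias on local data first")
```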
Journal reference:
- Yang, Y., Zhang, H., Gichoya, J.W., et al. The limits of fair medical imaging AI in real-world generalization. Nature Medicine (2024). DOI: 10.1038/s41591-024-03113-4.