Scientists at the National Institutes of Health found that an AI model could answer a medical quiz, built from clinical photographs and brief text summaries, with high accuracy. However, physician graders found that the AI often made mistakes when describing the images and explaining the reasoning behind its answers. The study was published in npj Digital Medicine.
The study provides insight into the possible applications of AI in medicine. Researchers from Weill Cornell Medicine in New York City and the National Library of Medicine (NLM) of the National Institutes of Health led the investigation.
Integration of AI into health care holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner. However, as this study shows, AI is not advanced enough yet to replace human experience, which is crucial for accurate diagnosis.
Stephen Sherry, Ph.D., Acting Director, National Library of Medicine
Both the AI model and medical professionals responded to Image Challenge questions posed by the New England Journal of Medicine (NEJM). The online quiz required users to select the correct diagnosis from multiple-choice answers based on actual clinical photos and a brief narrative description of the patient’s symptoms and presentation.
Researchers asked the AI model to answer 207 Image Challenge questions and to provide a written justification for each answer. The justification needed to include a step-by-step explanation of the decision-making process, a summary of relevant medical knowledge, and a description of the image.
Nine medical specialists from various institutions participated in the study. They answered the questions in two formats: “closed-book” (without external resources) and “open-book” (using external resources).
Afterward, the doctors were given the correct answers, along with the AI’s responses and justifications. They were then asked to evaluate the AI model's ability to describe the image, compile relevant medical information, and provide a detailed explanation.
The findings revealed that both the AI model and the doctors performed well in identifying the correct diagnosis. Interestingly, in closed-book scenarios, the AI model identified the correct diagnosis more frequently than the doctors. However, in open-book conditions, doctors outperformed the AI model, especially on more challenging questions.
Despite making correct diagnoses, the AI model often struggled with describing the medical images and explaining its reasoning. For example, in one case involving a patient's arm with two lesions, the AI model failed to recognize that both lesions could be linked to the same diagnosis. The lesions were shown from different angles, giving the appearance of varying colors and shapes, which led the AI to misinterpret their connection.
These findings underscore the importance of further evaluating multimodal AI technology before it is implemented in clinical settings.
This technology has the potential to help clinicians augment their capabilities with data-driven insights that may lead to improved clinical decision-making. Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine.
Zhiyong Lu, Ph.D., Study Corresponding Author and Senior Investigator, National Library of Medicine
The study used the GPT-4V (Generative Pre-trained Transformer 4 with Vision) AI model, a multimodal system capable of processing both text and images. Although the study was small, it demonstrates the potential of such models to assist doctors in making medical decisions. However, the researchers emphasize the need for further research to evaluate how these models compare with human diagnostic abilities.
Co-authors of the study are affiliated with the NIH’s National Eye Institute and Clinical Center; the University of Pittsburgh; UT Southwestern Medical Center in Dallas; New York University Grossman School of Medicine; Harvard Medical School and Massachusetts General Hospital; Case Western Reserve University School of Medicine; the University of California San Diego; and the University of Arkansas.
Journal Reference:
Jin, Q., et al. (2024). Hidden flaws behind expert-level accuracy of multimodal GPT-4 vision in medicine. npj Digital Medicine. https://doi.org/10.1038/s41746-024-01185-7