In "Evaluating General-Purpose LLMs for Patient-Facing Use: Dermatology-Centered Systematic Review and Meta-Analysis" (medRxiv, 2025), the data tells a fascinating story: large language models (LLMs) are improving in medical reasoning, empathy, and safety - but they're not perfect, and trust takes time to earn. Which, come to think of it, sounds a lot like the long human history of hoping for miracle healers.
Long before stethoscopes, scalpels, and sterile gloves, our first “doctors” were magicians - or at least, that’s what everyone believed. Prehistoric healers waved bones, mumbled incantations, and applied sometimes questionable herbal pastes. Yet enough patients recovered to keep the legend alive.
Fast forward a few millennia and not much has changed… except the props. The bone rattle has been replaced by a diagnostic app. The “spirit-cleansing smoke” is now an MRI scan. And our new shamans? They’re called AI engineers.
Just like in the old days, we still crave the miracle cure, the instant fix, the all-knowing healer. Our dream is a tireless personal doctor who remembers every ache, every allergy, every bit of medical literature (plus the plot of every episode of Grey’s Anatomy).
When ChatGPT burst into the public spotlight in late 2022, some were fascinated and some were wary. Could a chatbot really diagnose a rash? Suggest a safe treatment? Explain it all in plain language?
Early studies, including those reviewed in the paper, painted a mixed picture. In 2022, the mood was skeptical. By 2023, optimism surged as newer models like GPT-4, Claude, and Gemini started showing measurable gains in accuracy, empathy, and communication. But by 2025, the mood had shifted again - not to cynicism, but to a more critical view.
The truth is, AI in medicine is a lot like the magic of old: it works impressively well in certain contexts, but not always when or how you expect. LLMs are now better at interpreting images, offering solid medication safety advice, and even admitting when they don’t know - a kind of digital humility our ancestors probably wished their witch doctors had. But they still have limits. Even when an AI aces a medical board exam and offers great second opinions, patients using it alone don’t necessarily make better decisions.
That’s why the paper calls for evaluator-aware, patient-in-the-loop frameworks - ways of measuring not just whether the AI gets the right answer, but whether it helps real people make better choices. Because in healthcare, as in magic, the spell only works if it actually helps the patient in the real world.
REFERENCE
Gabashvili, I. S. Evaluating General-Purpose LLMs for Patient-Facing Use: Dermatology-Centered Systematic Review and Meta-Analysis. medRxiv 2025.08.11.25333149; doi: https://doi.org/10.1101/2025.08.11.25333149. Posted August 11, 2025.