Please don’t trust your chatbot for medical advice

Remember how I used to say that large language models are “frequently wrong, never in doubt”, and how I warned three years ago on 60 Minutes that they were purveyors of “authoritative bullshit” that should not be trusted?

That’s still true – and it very much applies in medicine.

And that matters, a lot. Because a large fraction of the population has begun to turn to chatbots for medical advice.

Two relevant new studies are reported today in the Washington Post, in a damning article.

The first new study, published in a peer-reviewed BMJ journal (affiliated with the British Medical Association) and entitled “Generative artificial intelligence-driven chatbots and medical misinformation: an accuracy, referencing and readability audit”, examined five popular chatbots (Gemini, DeepSeek, Meta AI, ChatGPT and Grok) about one year ago. The researchers prompted each with 10 questions on topics ranging from cancer to vaccines and nutrition, in open-ended dialogues, and reported that nearly half of the responses were highly problematic. Worse, “chatbot outputs were consistently expressed with confidence and certainty”. The responses were also filled with hallucinations and fabricated citations.

All of this – the hallucinations, mistakes, and overconfidence – is entirely typical of LLMs, and entirely problematic in medicine. As the authors put it, in somewhat academic language, but entirely accurately, “continued deployment without public education and oversight risks amplifying misinformation.”

The second new study, published in JAMA Network Open (affiliated with the American Medical Association) and called “Large Language Model Performance and Clinical Reasoning Tasks”, looked at 21 frontier models across 29 questions, and reported that “despite progress, current LLMs remain limited in early diagnostic reasoning and cannot yet be relied on for unsupervised patient-facing clinical decision-making.”

And the Post article actually only reported part of the new scientific literature on LLMs and medicine. Two other new studies that they missed only add to the concerns.

One, published in Nature Medicine, was called “Reliability of LLMs as medical assistants for the general public: a randomized preregistered study”. This one focused on “whether LLMs can assist members of the public in identifying underlying conditions and choosing a course of action”. Again the results were both clear and troubling. LLMs “identified relevant conditions in fewer than 34.5% of cases… no better than [a] control group”. Here the problem wasn’t so much that the LLMs lacked access to proper information — the same study showed that the models could do better in the hands of trained physicians — but that patients don’t know how to guide the LLMs to the right places.

In a recurring theme, we see that LLMs don’t know what they don’t know; they work decently well with the information they’ve got but don’t know how to conduct clinical interviews, and in the hands of the lay public can easily give bad advice because the proper questions never get asked, either by the patient or the LLMs. (An expert doctor might use the LLM to better effect, by asking the right questions.)

Still another new study, also published recently in Nature Medicine and entitled “ChatGPT Health performance in a structured test of triage recommendations”, found that “Among gold-standard emergencies, the system undertriaged 52% of cases” and concluded that “These findings reveal missed high-risk emergencies and inconsistent activation of crisis safeguards, raising safety concerns that warrant prospective validation before consumer-scale deployment of artificial intelligence triage systems.”

As a scientist, I am always looking for converging evidence. Four studies in four journals published in the space of a few months reaching essentially the same conclusion is a crystal clear indicator that chatbots, especially when used by amateurs, simply cannot be trusted.

On a personal note, my friend Ben Riley lost his father recently, and Teddy Rosenbluth of The New York Times wrote a long, moving article about how his father was misled by A.I. regarding his leukemia.

I hope you will get a chance to read that, and also Ben’s own blog about the sad situation.

There will always be better models, but for now, and until proven otherwise, we should not take the apparent “confidence” of large language models — itself an illusion created by how they are trained — to mean that we should trust them with our lives.

Please consider becoming a free or paid subscriber. Your support would be greatly appreciated.
