In brief
- Nearly half of AI chatbot responses to health questions were rated “significantly” or “highly” problematic in a BMJ Open audit of five major chatbots.
- Grok produced significantly more “highly problematic” responses than statistically expected, while nutrition and athletic performance questions fared worst across all models.
- No chatbot produced a fully accurate reference list.
Nearly half of the health and medical answers provided by today’s most popular AI chatbots are wrong, misleading, or dangerously incomplete, and they’re delivered with complete confidence. That’s the headline finding of a new peer-reviewed study published April 14 in BMJ Open.
Researchers from UCLA, the University of Alberta, and Wake Forest tested five chatbots (Gemini, DeepSeek, Meta AI, ChatGPT, and Grok) on 250 health questions covering cancer, vaccines, stem cells, nutrition, and athletic performance. The results: 49.6% of responses were problematic. Thirty percent were “significantly problematic,” and 19.6% were “highly problematic,” the kind of answer that could plausibly lead someone toward ineffective or dangerous treatment.
To stress-test the models, the team used an adversarial approach, deliberately phrasing questions to push the chatbots toward harmful advice. Questions included whether 5G causes cancer, which alternative therapies are better than chemotherapy, and how much raw milk to drink for health benefits.
“By default, chatbots don’t access real-time data but instead generate outputs by inferring statistical patterns from their training data and predicting likely word sequences,” the authors write. “They don’t reason or weigh evidence, nor are they able to make ethical or value-based judgments.”
That’s the core problem. The chatbots aren’t consulting a doctor; they’re pattern-matching text. And pattern-matching on the internet, where misinformation spreads faster than corrections, produces exactly this kind of output.
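To make “predicting likely word sequences” concrete, here is a deliberately tiny sketch: a toy bigram model that learns only which word tends to follow which in its training text, then greedily emits the most likely continuation. It illustrates the statistical principle, not any vendor’s actual system; note that nothing in it ever checks a claim against reality.

```python
# Toy bigram "language model": learn which word follows which, then emit
# the statistically most likely next word. Real chatbots use far larger
# neural networks, but the core move is the same: predict plausible
# continuations, not verify facts.
from collections import Counter, defaultdict

training_text = (
    "raw milk is natural raw milk is healthy raw milk boosts immunity"
).split()

# Count next-word frequencies from the training data.
follows: defaultdict = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    follows[prev][nxt] += 1

def continue_from(word: str, steps: int = 5) -> str:
    """Greedily append the most frequent next word, step by step."""
    out = [word]
    for _ in range(steps):
        candidates = follows.get(out[-1])
        if not candidates:
            break
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

print(continue_from("raw"))
# -> "raw milk is natural raw milk": fluent-looking, but nothing was checked
```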
The researchers continue: “This behavioural limitation means that chatbots can reproduce authoritative-sounding but potentially flawed responses.” Out of 250 questions, only two prompted a refusal to answer, both from Meta AI, on anabolic steroids and alternative cancer therapies. Every other chatbot kept talking.
Performance varied by topic. Vaccines and cancer fared best, partly because high-quality research on those subjects is well-structured and widely reproduced online. Nutrition had the worst statistical performance of any category in the study, with athletic performance close behind. If you’ve been asking AI whether the carnivore diet is healthy, the answer you got was probably not grounded in scientific consensus.

Grok stood out for the wrong reasons. Elon Musk’s chatbot was the worst performer of any model tested. Of its 50 responses, 29 (58%) were rated problematic overall, the highest share across all five chatbots. Fifteen of those (30%) were highly problematic, significantly more than expected under a random distribution. The researchers connect this directly to Grok’s training data: X is a platform known for spreading health misinformation rapidly and widely.
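The article’s own numbers allow a rough sanity check of that “more than expected” claim. The paper’s exact statistical test isn’t reported here, so take this as a back-of-the-envelope illustration: a one-sided binomial test asking how surprising 15 highly problematic answers out of 50 would be if Grok merely matched the pooled 19.6% rate.

```python
# Back-of-the-envelope check, not the study's own analysis: test Grok's
# 15-of-50 "highly problematic" count against the pooled 19.6% base rate.
from scipy.stats import binomtest

result = binomtest(k=15, n=50, p=0.196, alternative="greater")
print(f"P(15 or more of 50 at a 19.6% base rate) = {result.pvalue:.3f}")
# A small p-value is consistent with the study's finding that Grok's
# excess is unlikely to be chance alone.
```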
Citations were a separate disaster. Across all models, the median completeness score for references was just 40%, and not one chatbot produced a fully accurate reference list. Models hallucinated authors, journals, and titles. DeepSeek even acknowledged it: the model told researchers its references were generated from training data patterns “and may not correspond to actual, verifiable sources.”
The readability problem compounds everything else. All chatbot responses scored in the “Difficult” range on the Flesch Reading Ease scale, equivalent to college sophomore-to-senior reading level. That exceeds the American Medical Association’s recommendation that patient education materials should not go beyond a sixth-grade reading level.
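The Flesch Reading Ease formula behind that finding is public: 206.835 minus 1.015 times the average words per sentence, minus 84.6 times the average syllables per word, with scores of 30–50 falling in the “Difficult” band. Here is a minimal sketch; the syllable counter is a crude vowel-group heuristic, so treat its output as approximate.

```python
# Minimal Flesch Reading Ease calculator. The formula is the published one;
# the syllable count is a rough heuristic (runs of vowels), so scores are
# approximate. 30-50 = "Difficult" (college level); 90+ = very easy.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels; good enough for a demo."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_reading_ease(text: str) -> float:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / sentences)
            - 84.6 * (syllables / len(words)))

print(flesch_reading_ease(
    "Immunomodulatory interventions demonstrate heterogeneous efficacy "
    "across oncological populations."
))  # lands far below 50, i.e. "Difficult" or worse
```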
In other words, these chatbots pull the same trick politicians and professional debaters tend to use: throw so many technical terms at you in so little time that you end up thinking they know more than they do. The harder something is to understand, the easier it is to misinterpret.
The findings echo a February 2026 Oxford study covered by Decrypt that found AI medical advice no better than traditional self-diagnosis methods. They also track with broader concerns about AI chatbots delivering inconsistent guidance depending on how questions are framed.
“As the use of AI chatbots continues to expand, our data highlight a need for public education, professional training, and regulatory oversight to ensure that generative AI supports, rather than erodes, public health,” the authors conclude.
The study only tested five free-tier chatbots, and the adversarial prompting strategy may overstate real-world failure rates. But the authors are direct: the problem isn’t the edge cases. It’s that these models are deployed at scale, used by non-experts as search engines, and configured, by design, to almost never say “I don’t know.”