General Discussion
A very pro-AI account on both Bluesky and X posted about a "disturbing" Stanford paper on LLMs' failures at reasoning
The account is @godofprompt on both platforms. But they quickly stopped using Bluesky after getting almost no response there, even though they have 233,000 followers on X.
Their post on this disturbing (for AI fans) new study is only on X. Since it's a social media post I'll quote it in full, but break the X link with spaces. The tweet below was posted at 2:40 AM ET this morning.
https:// x. com/ godofprompt/ status/2020764704130650600
@godofprompt
🚨 Holy shit Stanford just published the most uncomfortable paper on LLM reasoning I've read in a long time.
This isn't a flashy new model or a leaderboard win. It's a systematic teardown of how and why large language models keep failing at reasoning even when benchmarks say they're doing great.
The paper does one very smart thing upfront: it introduces a clean taxonomy instead of more anecdotes. The authors split reasoning into non-embodied and embodied.
Non-embodied reasoning is what most benchmarks test and its further divided into informal reasoning (intuition, social judgment, commonsense heuristics) and formal reasoning (logic, math, code, symbolic manipulation).
Embodied reasoning is where models must reason about the physical world, space, causality, and action under real constraints.
Across all three, the same failure patterns keep showing up.
> First are fundamental failures baked into current architectures. Models generate answers that look coherent but collapse under light logical pressure. They shortcut, pattern-match, or hallucinate steps instead of executing a consistent reasoning process.
> Second are application-specific failures. A model that looks strong on math benchmarks can quietly fall apart in scientific reasoning, planning, or multi-step decision making. Performance does not transfer nearly as well as leaderboards imply.
> Third are robustness failures. Tiny changes in wording, ordering, or context can flip an answer entirely. The reasoning wasn't stable to begin with; it just happened to work for that phrasing.
One of the most disturbing findings is how often models produce unfaithful reasoning. They give the correct final answer while providing explanations that are logically wrong, incomplete, or fabricated.
This is worse than being wrong, because it trains users to trust explanations that don't correspond to the actual decision process.
Embodied reasoning is where things really fall apart. LLMs systematically fail at physical commonsense, spatial reasoning, and basic physics because they have no grounded experience.
Even in text-only settings, as soon as a task implicitly depends on real-world dynamics, failures become predictable and repeatable.
The authors don't just criticize. They outline mitigation paths: inference-time scaling, analogical memory, external verification, and evaluations that deliberately inject known failure cases instead of optimizing for leaderboard performance.
But they're very clear that none of these are silver bullets yet.
The takeaway isn't that LLMs can't reason.
It's more uncomfortable than that.
LLMs reason just enough to sound convincing, but not enough to be reliable.
And unless we start measuring how models fail, not just how often they succeed, we'll keep deploying systems that pass benchmarks, fail silently in production, and explain themselves with total confidence while doing the wrong thing.
That's the real warning shot in this paper.
Paper: Large Language Model Reasoning Failures
Link to the paper: https://arxiv.org/abs/2602.06176
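For anyone wondering what "measuring how models fail, not just how often they succeed" could look like in practice, here's a rough sketch in Python. It isn't from the paper, and ask_model is a hypothetical placeholder for whatever chatbot or API someone uses; the idea is just the tweet's "robustness failures" point, asking the same question several ways and counting how often the answer flips.

# Rough sketch, not from the paper: measure how often small rewordings flip the answer.
# ask_model() is a hypothetical placeholder -- replace it with a real LLM call.

def ask_model(question: str) -> str:
    # Canned answer so the sketch runs end to end; swap in a real chatbot/API call here.
    return "$0.05"

def flip_rate(question: str, rewordings: list[str]) -> float:
    # Fraction of rewordings whose answer disagrees with the answer to the original question.
    baseline = ask_model(question).strip().lower()
    if not rewordings:
        return 0.0
    flips = sum(1 for r in rewordings if ask_model(r).strip().lower() != baseline)
    return flips / len(rewordings)

question = "A bat and a ball cost $1.10 in total. The bat costs $1.00 more than the ball. How much does the ball cost?"
rewordings = [
    "Together a ball and a bat cost $1.10, and the bat is $1.00 more expensive than the ball. What does the ball cost?",
    "The bat costs a dollar more than the ball; combined they cost $1.10. What is the price of the ball?",
]
print(f"Answer flip rate across rewordings: {flip_rate(question, rewordings):.0%}")

A model whose answer flips when the wording changes never had a stable reasoning process behind it, which is exactly the failure the tweet describes.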
SheltieLover
(78,397 posts)
highplainsdem
(60,895 posts)
SheltieLover
(78,397 posts)
Ty for sharing!
Iris
(16,861 posts)
That's not my understanding of how it works at all.
EdmondDantes_
(1,553 posts)
Some of the more gung ho AI people do. And it's definitely part of the vernacular around AI. There are lots of similar words used to describe what an AI is doing when an answer is being generated.
highplainsdem
(60,895 posts)
reasoning. Some people are still really impressed by the AI supposedly showing its reasoning, like a schoolkid showing their work solving a math problem.
But it came out nearly a year ago that the new "reasoning" AI models actually hallucinate more than older AI models that don't show their reasoning. See this thread I posted last April and the article it's about:
OpenAI's new reasoning AI models hallucinate more
https://www.democraticunderground.com/100220267171
https://techcrunch.com/2025/04/18/openais-new-reasoning-ai-models-hallucinate-more/
Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro outside of ChatGPT, then copied the numbers into its answer. While o3 has access to some tools, it can't do that.
This new study took a more thorough look at the reasoning failures.
I thought the stunned and apparently scared reaction from the pro-AI account was worth posting here, especially this:
This is worse than being wrong, because it trains users to trust explanations that don't correspond to the actual decision process.
-snip-
The takeaway isn't that LLMs can't reason.
It's more uncomfortable than that.
LLMs reason just enough to sound convincing, but not enough to be reliable.
And unless we start measuring how models fail, not just how often they succeed, we'll keep deploying systems that pass benchmarks, fail silently in production, and explain themselves with total confidence while doing the wrong thing.
As a genAI nonbeliever, my first response to reading that was to laugh at anyone not already understanding it.
It's been known for years that genAI models make lots of mistakes while still sounding convincing and authoritative. That's why even AI companies peddling this inherently flawed tech admit it's important to check AI answers because they're often wrong.
But people who like to use AI tend to push that warning aside, and those gullible people are even more impressed when an AI "shows its reasoning."
This new paper exposes just how foolish it is for an AI user to do that.
This is worse than being wrong, because it trains users to trust explanations that don't correspond to the actual decision process.
You have to be gullible to let a machine that can't reason "train" you to trust it. But it's been known for years that the more someone uses chatbots, the less likely they are to bother checking the AI results. Plus chatbots are designed to persuade and manipulate - to keep AI users engaged - and there are a lot of gullible AI users out there. Which is why we hear more and more stories about what's often called AI psychosis, where a chatbot gradually pushes a too-trusting user into delusions that can result in breakdowns and even suicide.
AI fans like to believe that isn't likely to happen to them. They also like to believe their favorite AI models really are trustworthy. This new study blows up that assumption of trustworthiness.
But because it blows up those assumptions, there will probably be a lot of AI users who will refuse to read it or believe the conclusions.
Happy Hoosier
(9,453 posts)
AI does a great job of finding and collating evidence. What it is terrible at, at least right now, is assessing how reliable individual pieces of evidence are. This can lead to AI coming to obviously erroneous conclusions. It's pretty easy to get AI to conclude the Earth is flat, for example.
cachukis
(3,760 posts)
My thinking on the fallacy of AI becoming intelligent is its incapacity for humility.
It is working to become Spock, but without a Kirk rebuttal.
I'm thinking it is going to be a more advanced computer, building on video games, for day-to-day human engagement.
It will advance and fail on practical application, but it will get better over time.
I fear the laziness of most brains will succumb to its easy solutions. We are seeing how the entertainment value of social media has overtaken credentialed knowledge.
People are now creating their own Chat friends.
I fear the technological advantages pursued for business goals will cause long-term systemic disruptions in social systems that will be difficult to escape.
Can we draw guidelines to handle the dichotomy?
I worry the money runs the show.
Response to cachukis (Reply #3)
Whiskeytide
This message was self-deleted by its author.
Whiskeytide
(4,647 posts)
If Spock really had no humanity or humility or compassion, he likely would have simply killed all of the crew on the Enterprise because they kept creating illogical and dramatic plot lines for the show.
cachukis
(3,760 posts)
He professed to be objective; facts were facts.
My point was that Kirk's humanity did drive the resolution to the conflicts and Spock had to accept the logic that led to the results.
I suggest AI is limited to regurgitation using LLMs, rather than postulating from human experience.
It will analytically process countless problem-solution dichotomies, but if we allow it to replace the human experience, we deserve our end.
rog
(931 posts)
... using an LLM to help summarize, organize, simplify, etc., sources that the user supplies, without asking the model to reason.
highplainsdem
(60,895 posts)
LLM results.
And if you check LLM results, as you always should, you aren't saving much time and might be spending more time on that task.
hatrack
(64,544 posts)
You spend just as much time as before, except instead of getting your coding done, you spend that time checking for errors in AI coding.
highplainsdem
(60,895 posts)
hatrack
(64,544 posts)
Once you've developed those muscles (so to speak), they're use-or-lose.
rog
(931 posts)
I only use AI to organize, summarize, consolidate, etc., using only a limited set of data, which I provide. I think that's something computers do very well. I don't depend on computers for 'reasoning' - that's my task.
highplainsdem
(60,895 posts)
keyword or date shouldn't require AI at all. I'm not sure what sort of consolidation you'd want the AI to do. But I wouldn't trust AI for a summary, because summarizing does require an ability to reason.
And no matter how limited a data set is, an LLM is capable of hallucinating.
rog
(931 posts)
AI did a great job of confirming my own initial reading of a CT scan that was done a couple of weeks ago, and also organizing the information into an easily digested form. With that information I was able to formulate a list of relevant questions regarding the report *before* the consultation I had today. Again, I did not ask for a diagnosis -- it was just an aid to confirm my own reading of very technical information, and I was able to look up a few things I wanted to confirm further. My surgeon was very appreciative that I was prepared and organized, and that I was able to understand and respond to what he was telling me regarding this test.
BTW, the CT scan debunked a couple of concerning findings in a previous ultrasound, so this was a very good consultation all around. Modern technology is pretty amazing.
I would never recommend relying on AI as a primary source, nor would I rely on it for accurate diagnosis, but it remains a valuable tool alongside other modern technology. That said, I'm skeptical and not a little nervous that this huge hospital seems to be relying on AI more and more, and very quickly. It's obviously getting better, but as the article we're discussing points out, it can't be trusted. I would hope that the really good docs don't trust it either, even as it continues to improve.
hunter
(40,490 posts)
I really don't understand how anyone attributes "intelligence" to these automated plagiarism machines.
There are some aspects of this paper that bother me. For example, I think it's absurd to talk about such things as "LLM Reasoning Failures" when there's no reasoning going on at all.
Are we all so conditioned by our education that we think answering questions or writing short essays for an exam is some kind of "reasoning"? It's not.
I'll give an example: Sometimes I meet Evangelical Christian physicians who tell me they don't "believe in" evolution. They might even "believe" that the earth is merely thousands of years old and not billions. They've obviously passed biology exams to become physicians, and they've witnessed the troublesome quirks of the human body that can only be explained by evolution, yet they've never applied any of that to their own internal model of reality. There's an empty space where those models ought to exist. (Or possibly they are lying to themselves, which is the worst sort of lie.)
With AI it's all empty space. The words go in and the words come out without anything in between.
Whenever I write, I'm always concerned that I'm letting the language in my head do my thinking for me; that I'm being the meat-based equivalent of an LLM. If I'm doing that, I don't really have anything to say. I want all my writing to represent my own internal models of reality as shaped by my own experiences.
LLMs don't have any experiences.
highplainsdem
(60,895 posts)
sakabatou
(45,952 posts)
odins folly
(567 posts)
The Dunning-Kruger effect. It's a machine. It takes in info, compares that to what it has been fed before, and regurgitates an answer. But it doesn't have the ability to reason and speculate a correct answer, and it shows confidence that that is the correct answer.
And its programming doesn't allow it to believe it could be wrong. It doesn't and can't know what it doesn't know.
purr-rat beauty
(1,118 posts)
...half the SB commercials looked cheap
Plus they were boooooooring
617Blue
(2,239 posts)
Happy Hoosier
(9,453 posts)
At least not yet. They cannot effectively assess the proper weight of evidence. They just try to determine what actual rational actors think.
hunter
(40,490 posts)
There is no "they" in AI and it's not "trying" to do anything.
It's as sentient as the filter paper in your chemistry lab or coffee maker and there are still gaping holes and tears in that filter paper that let a lot of nonsense through.
Patching those holes and tears will never make the machines "intelligent."
I think we've still got a long way to go before we create a machine that's actually intelligent. It's one of those technologies that hangs just beyond our grasp like fusion power plants or manned trips to Mars. I think it's going to remain so for a long, long time no matter how many hucksters are trying to sell us futures in it today.
Happy Hoosier
(9,453 posts)
First of all, "they" is just a pronoun that can be used in the plural. When I say "they," I am referring to the AI models. I use the same to refer to any group of things. Example: Q: Where are the cans of paint we bought? A: THEY are in the garage.
Secondly, we refer to software "trying" to do stuff all the time. It's a shorthand. Non-conscious actors cannot have any intentions, of course, but software CAN be written to have objectives and to attempt to achieve those objectives. This doesn't imply conscious intention. For example, some software my team has developed "tries," that is, "attempts," to resolve ambiguities in GPS measurements to achieve a highly accurate fix. It calculates probabilities, picks a particular fix, and then "tries" to evaluate this decision to see if it was correct (a rough sketch of that idea is below). No intention here. It's math. All software is math. But so are our brains. There WILL be a time when this process is so complex that something like consciousness and intention will emerge. I think that because I am a methodological naturalist (I don't think souls exist). But we're not there yet.
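To make the GPS example concrete, here is a hypothetical sketch in Python (not our actual software; the names and numbers are made up). The program "tries" only in the sense that it scores candidate fixes, picks the most probable one, and then checks its own pick against the runner-up.

# Hypothetical sketch only, not real GPS code. "Trying" here is just scoring
# candidates, picking the best one, and self-checking the choice. No intention involved.

from dataclasses import dataclass

@dataclass
class CandidateFix:
    label: str        # stand-in for a set of candidate ambiguity values
    residual: float   # how badly this candidate disagrees with the measurements

def pick_fix(candidates, ratio_threshold=2.0):
    # Rank candidates by residual, take the best, then "try" to validate the choice:
    # accept it only if it beats the runner-up by a comfortable margin.
    ranked = sorted(candidates, key=lambda c: c.residual)
    best, runner_up = ranked[0], ranked[1]
    accepted = runner_up.residual / max(best.residual, 1e-9) >= ratio_threshold
    return best, accepted

candidates = [
    CandidateFix("fix A", residual=0.12),
    CandidateFix("fix B", residual=0.55),
    CandidateFix("fix C", residual=0.61),
]
best, accepted = pick_fix(candidates)
print(f"Chose {best.label}; confident enough to accept it: {accepted}")

A score, a pick, and a self-check. That's all "tries" means in this context.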
highplainsdem
(60,895 posts)
A new review underscores the breadth of the problem, and shows that close to a trillion dollars hasn't changed that
Gary Marcus
Feb 10, 2026
As you may know, I have been harping on reasoning as a core challenge for deep learning for well over a decade, at least since a December 2012 New Yorker article:
-snip-
By now many, many others have pounded on the same point.
Silicon Valley's response has always been to dismiss us. "We got this covered," the Valley CEOs will tell you. Pay no attention to Gary or any of the other academics from Subbarao Kambhampati to Judea Pearl to Ernest Davis to Ken Forbus to Melanie Mitchell to Yann LeCun to Francesca Rossi, and many others (including more recently Ilya Sutskever), who are also skeptical. Ignore the big AAAI survey that said that LLMs won't get us to AGI. Surely that Apple study must be biased too. Give us money. Lots of money. Lots and lots of money. Scale Scale Scale. AGI is coming next year!
Well, AGI still hasn't come (even though they keep issuing the same promises, year after year). LLMs still hallucinate and continue to make boneheaded errors.
-snip-
More at the link.
