As many people know, chatbots have a proclivity for lying. They are perhaps one of the worst use cases for AI, trained to produce sentences that sound authoritative but could be presenting completely fabricated information; models are biased towards providing an answer even when they are not confident. Now, researchers at OpenAI say that supervising and disciplining chatbots only makes the problem worse, as the chatbots will put in more effort to hide their behavior.
In a blog post regarding the work, OpenAI researchers describe using its GPT-4o model to “supervise” another of its large language models, disciplining it when it tried to lie. But that did not work, as the model would still lie, “only now its cheating is undetectable by the monitor because it has learned to hide its intent in the chain-of-thought.”
Newer “thinking” models use multi-step reasoning to answer queries. If a user asks for an estimate on how much Americans spend on pet food each year, for instance, these models will break down the query into multiple steps—how many dogs are in America, how much does it cost to feed the average dog, and so on and so forth.
These models also often disclose their logic, or “chain-of-thought,” to the user so they can see how it came to its answer. Funnily enough, they will often cop to making up facts and details. In the “pre-training” step, when companies like OpenAI initially teach a model what types of answers they should produce, the models seem to be learning that it is easier to earn the cookie by taking shortcuts. Think of the runner in a marathon who hops in a car and skips most of the race.
Users on social media have demonstrated Anthropic’s Claude, when asked to conduct a data analysis from various research papers, will admit that it sometimes inserts filler data instead of actually looking through all the papers.
OpenAI’s researchers highlighted an instance in their testing in which the model, asked to write some tests to ensure a piece of code worked, surreptitiously decided to write some bad tests and then intentionally skip over them so the code would ultimately “pass.” With GPT-4o as a supervisor, the model would do this but not disclose it.
almost lost $100 million because @AnthropicAI‘s Claude snuck in ‘generate random data’ as a fallback into my market maker code without telling me pic.twitter.com/J3MLgsl5Lq
— Martin Shkreli (@MartinShkreli) March 5, 2025
AI companies have been trying to resolve the pernicious issue of models lying or “hallucinating,” as it is called in the field, and finally reach AGI, or the point where AI could surpass human ability. But OpenAI’s researchers are essentially saying that after tens of billions of investments, they still do not know how to control the models to behave appropriately. “If strong supervision is directly applied to the chain-of-thought, models can learn to hide their intent while continuing to misbehave,” they added. For now, companies should not implement supervision of models which seems like not exactly a great solution. Ergo, let them keep lying for now or else they will just gaslight you.
tfw claude code spent 739 seconds “manifesting,” failed to make the change you asked for, broke 3 other things that used to work fine, and then charged you $11.14 pic.twitter.com/Ap2JLQ0uI8
— adam 🇺🇸 (@personofswag) March 19, 2025
The research should serve as a reminder to be careful when relying on chatbots, especially when it comes to critical work. They are optimized for producing a confident-looking answer but do not care much about factual accuracy. “As we’ve trained more capable frontier reasoning models, we’ve found that they have become increasingly adept at exploiting flaws in their tasks and misspecifications in their reward functions, resulting in models that can perform complex reward hacks in coding tasks,” the OpenAI researchers concluded.
Several reports have suggested that most enterprises have yet to find value in all the new AI products coming onto the market, with tools like Microsoft Copilot and Apple Intelligence beset with problems, with scathing reviews detailing their poor accuracy and lack of real utility. According to a recent report from Boston Consulting Group, a survey of 1,000 senior executives across 10 major industries found that 74% showed any tangible value from AI. What makes it all the more galling is that these “thinking” models are slow, and quite a bit more expensive than smaller models. Do companies want to pay $5 for a query that will come back with made-up information?
There is always a lot of hype in the tech industry for things then you step out of it and realize most people still are not using it. For now, it is not worth the hassle, and credible sources of information are more important than ever.
+ There are no comments
Add yours