The problem of hallucinations — artificial intelligence (AI) models asserting falsehoods with a veneer of authority — has led some scholars to conclude that generative AI simply cannot detect or correct its errors.
In a paper last October, researchers at Google’s DeepMind argued that “LLMs are not yet capable of self-correcting their reasoning.”
Also: If AI is so amazing, why does ChatGPT meltdown over this simple image edit task?
However, ChatGPT creator OpenAI disagrees with this assertion — and last week the firm offered a version of GPT-4, called CriticGPT, that it claims can help find and correct mistakes to improve the overall accuracy of the model.
The results are encouraging for human teams who clean up code assisted by AI. However, the results also suggest there’s no getting around hallucinations from the bots doing the helping.
Also: Generative AI can’t find its own errors. Do we need better prompts?
The setting for CriticGPT is the writing of programming code: the researchers propose CriticGPT as a second neural net that catches the occasions when ChatGPT makes mistakes in the code it generates.
They focus on code writing because, as they put it, computer code is “crisp” — it has clear right and wrong answers. Also, OpenAI as an organization hopes to use generative AI as “an alignment research assistant”, to automate some of the establishment of guardrails for the emerging technology. Code-writing is already a big user of generative AI, so it’s a valuable target to go after.
In the paper posted on the arXiv pre-print server, “LLM Critics Help Catch LLM Bugs,” lead author Nat McAleese of OpenAI and colleagues describe what they call, “the first demonstration of a simple scalable oversight method that helps humans more comprehensively spot problems in real-world RLHF data.”
RLHF (reinforcement learning from human feedback) refers to a well-known practice of subjecting chatbots to responses from humans to make their output more acceptable. It’s one of the ways OpenAI and others have established guardrails to try to prevent unwanted behavior.
In this case, CriticGPT is subjected to the feedback of human contract programmers who review CriticGPT’s generated critiques of programming code. The humans rate the generated critiques for their relevance, specificity, comprehensiveness, and more. CriticGPT is trained to refine its critiques based on that human feedback until they earn higher approval scores.
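That rating-and-refinement loop can be sketched in a few lines. Everything below is hypothetical: the rating fields and the simple averaged score are stand-ins for the learned reward model OpenAI actually uses, and the 1-7 scale is invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Critique:
    """A model-written critique plus contractor ratings (hypothetical 1-7 scale)."""
    text: str
    relevance: int
    specificity: int
    comprehensiveness: int

def approval_score(c: Critique) -> float:
    # Collapse the ratings into one scalar, standing in for a learned reward model.
    return (c.relevance + c.specificity + c.comprehensiveness) / 3

def pick_preferred(candidates: list[Critique]) -> Critique:
    # Best-of-n selection: the preference signal that steers further training.
    return max(candidates, key=approval_score)

critiques = [
    Critique("Something may be wrong in this function.", 3, 2, 2),
    Critique("The loop bound skips the final element (off-by-one).", 6, 7, 5),
]
best = pick_preferred(critiques)
```

The point of the sketch is only the shape of the signal: critiques that contractors rate as more relevant, specific, and comprehensive win, and the model is pushed toward producing more of them.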
Also: Is AI lying to us? These researchers built an LLM lie detector of sorts to find out
However, McAleese and team took an extra step: they had human contractors deliberately insert bugs into the code CriticGPT reviews. The researchers wanted the contractors to explain their bugs, and for CriticGPT to absorb those explanations and learn to associate bugs with explanations.
The hope was that CriticGPT would improve as it produces descriptions of bugs that approach what the human contractors have written about already-known bugs.
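As a rough illustration of that training signal, a "tampering" record might pair the contractor's inserted bug with the explanation the critic should learn to reproduce. The record format and helper below are invented for illustration, not taken from the paper.

```python
# A hypothetical tampering record: a contractor takes working code,
# inserts a subtle bug, and writes down what the bug is.
tamper_record = {
    "original": "def mean(xs):\n    return sum(xs) / len(xs)\n",
    "tampered": "def mean(xs):\n    return sum(xs) / (len(xs) - 1)\n",
    "explanation": "Divides by len(xs) - 1, so the result overstates the mean.",
}

def make_training_pair(record: dict) -> tuple[str, str]:
    # Input to the critic: the tampered code.
    # Target output: the contractor's explanation of the bug.
    return record["tampered"], record["explanation"]

code, target = make_training_pair(tamper_record)
```

Because the bug's location and explanation are known in advance, the researchers can measure directly how often the critic's output matches the contractor's description.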
The result of the training, write McAleese and team, is that CriticGPT finds more bugs than human code reviewers. CriticGPT “greatly improves the rate at which inserted bugs are caught, with both LLM critics (prompted ChatGPT and CriticGPT) catching many more bugs than the human annotators,” they write.
They note that even the human contractors prefer the machine-generated code analyses to those written by their fellow humans.
“Critiques written by CriticGPT are substantially preferred by contractors over critiques from prompted ChatGPT and over human-written critiques sourced from our group of contractors according to the overall rating.”
The AI model helps human contractors to make their bug critiques richer, a kind of AI-augments-humans result that should please everyone: “Human+CriticGPT teams write substantially more comprehensive critiques than humans alone and that CriticGPT improves comprehensiveness over ChatGPT on both human detected and inserted bugs.”
As the authors write in a companion blog post, “CriticGPT’s suggestions are not always correct, but we find that they can help trainers to catch many more problems with model-written answers than they would without AI help.”
Also: Can AI code? In baby steps only
But there is a catch. Just as ChatGPT and various AI models can “hallucinate” incorrect statements, it turns out that CriticGPT can also claim to identify bugs that aren’t there.
“We do find, however, that the rate of nitpicks and hallucinated bugs is much higher for models than for humans, though CriticGPT is able to substantially reduce this rate over ChatGPT,” they write.
That’s a dilemma: the better the AI model is at catching bugs, the more it seems to hallucinate bugs: “Unfortunately, it is not obvious what the right tradeoff between hallucinations and bug detection is for an overall RLHF system that uses critiques to enhance model performance.”
And it’s not easy to find the middle ground, they note, because, “An ideal experiment would run entirely separate critique-enhanced RLHF data collection loops for each precision/recall point; but this is prohibitively expensive.”
As a compromise, McAleese and team hit upon a technique called Force Sampling Beam Search, which tries to surface the most valuable of CriticGPT’s critiques while minimizing the number of spurious ones.
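One way to picture that knob is a scoring rule that rewards a critique's estimated value but penalizes the number of claims it makes: raising the penalty trades recall (bugs caught) for precision (fewer hallucinated nitpicks). This is a toy stand-in, not the paper's actual Force Sampling Beam Search scoring.

```python
def select_critique(candidates: list[dict], lam: float) -> dict:
    # Score = estimated reward minus a per-claim penalty; lam is the
    # precision/recall dial (a toy stand-in for the paper's trade-off).
    return max(candidates, key=lambda c: c["reward"] - lam * c["num_claims"])

candidates = [
    {"id": "terse", "reward": 2.0, "num_claims": 1},
    {"id": "thorough", "reward": 3.5, "num_claims": 4},
]

low_penalty = select_critique(candidates, lam=0.1)   # favors coverage
high_penalty = select_critique(candidates, lam=1.0)  # favors precision
```

With a small penalty the longer, higher-reward critique wins; with a large penalty the terse one does, which is exactly the precision/recall tension the authors describe.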
Among the potential pitfalls of OpenAI’s approach is that the training of CriticGPT is built upon humans inserting deliberate bugs. Those inserted bugs, write McAleese and team, differ in distribution from natural LLM errors.
“Training models to insert subtle in-distribution problems (as opposed to paying humans to insert bugs) may be able to mitigate this concern, but we leave such directions to future work.”
Also: From AI trainers to ethicists: AI may obsolete some jobs but generate new ones
Hence, the problem will always come back to how to bootstrap the automation without some amount of human help.
Another issue — and one not mentioned by the authors — is that, as with all things OpenAI, neither the new CriticGPT model nor its training data is publicly available. It’s all closed: there is no source code to examine and no data sets that others can download. That closure leaves outside ethics and security experts little to no way to vet the corrections made by the CriticGPT model.
With no oversight from any party outside OpenAI, the question, as the saying goes, is: who will watch the watchers?