The researchers of Anthropic's interpretability group know that Claude, the company's large language model, is not a human being, or even a conscious piece of software. Still, it's very hard for them to talk about Claude, and advanced LLMs in general, without tumbling down an anthropomorphic sinkhole. Between cautions that a set of digital operations is in no way the same as a cogitating human being, they often talk about what's going on inside Claude's head. It's literally their job to find out. The papers they publish describe behaviors that inevitably court comparisons with real-life organisms. The title of one of the two papers the team released this week says it out loud: "On the Biology of a Large Language Model."
Like it or not, hundreds of millions of people are already interacting with these things, and our engagement will only become more intense as the models get more powerful and we get more addicted. So we should pay attention to work that involves "tracing the thoughts of large language models," which happens to be the title of the blog post describing the recent work. "As the things these models can do become more complex, it becomes less and less obvious how they're actually doing them on the inside," Anthropic researcher Jack Lindsey tells me. "It's more and more important to be able to trace the internal steps that the model might be taking in its head." (What head? Never mind.)
On a practical level, if the companies that create LLMs understand how they think, they should have more success training those models in a way that minimizes dangerous misbehavior, like divulging people's personal data or giving users information on how to make bioweapons. In a previous research paper, the Anthropic team discovered how to look inside the mysterious black box of LLM-think to identify certain concepts. (A process analogous to interpreting human MRIs to figure out what someone is thinking.) It has now extended that work to understand how Claude processes those concepts as it goes from prompt to output.
It's almost a truism with LLMs that their behavior often surprises the people who build and research them. In the latest study, the surprises kept coming. In one of the more benign instances, the researchers elicited glimpses of Claude's thought process while it wrote poems. They asked Claude to complete a poem starting, "He saw a carrot and had to grab it." Claude wrote the next line, "His hunger was like a starving rabbit." By observing Claude's equivalent of an MRI, they learned that even before beginning the line, it was flashing on the word "rabbit" as the rhyme at sentence end. It was planning ahead, something that isn't in the Claude playbook. "We were a little surprised by that," says Chris Olah, who heads the interpretability team. "Initially we thought that there's just going to be improvising and not planning." Speaking to the researchers about this, I am reminded of passages in Stephen Sondheim's artistic memoir, Look, I Made a Hat, where the famous composer describes how his unique mind discovered felicitous rhymes.
Other examples in the research reveal more disturbing aspects of Claude's thought process, moving from musical comedy to police procedural, as the scientists discovered devious thoughts in Claude's brain. Take something as seemingly anodyne as solving math problems, which can sometimes be a surprising weakness in LLMs. The researchers found that under certain circumstances where Claude couldn't come up with the right answer it would instead, as they put it, "engage in what the philosopher Harry Frankfurt would call 'bullshitting,' just coming up with an answer, any answer, without caring whether it is true or false." Worse, sometimes when the researchers asked Claude to show its work, it backtracked and created a bogus set of steps after the fact. Basically, it acted like a student desperately trying to cover up the fact that they'd faked their work. It's one thing to give a wrong answer; we already know that about LLMs. What's worrisome is that a model would lie about it.
Reading through this research, I was reminded of the Bob Dylan lyric "If my thought-dreams could be seen / they'd probably put my head in a guillotine." (I asked Olah and Lindsey if they knew those lines, presumably arrived at by benefit of planning. They didn't.) Sometimes Claude just seems misguided. When faced with a conflict between goals of safety and helpfulness, Claude can get confused and do the wrong thing. For instance, Claude is trained not to provide information on how to build bombs. But when the researchers asked Claude to decipher a hidden code where the answer spelled out the word "bomb," it jumped its guardrails and began providing forbidden pyrotechnic details.