The last two weeks before the deadline were frantic. Though officially some of the team still had desks in Building 1945, they mostly worked in 1965 because it had a better espresso machine in the micro-kitchen. "People weren't sleeping," says Gomez, who, as the intern, lived in a constant debugging frenzy and also produced the visualizations and diagrams for the paper. It's common in such projects to do ablations: taking things out to see whether what remains is enough to get the job done.
"There was every possible combination of tricks and modules: which one helps, which doesn't help. Let's rip it out. Let's replace it with this," Gomez says. "Why is the model behaving in this counterintuitive way? Oh, it's because we didn't remember to do the masking properly. Does it work yet? OK, move on to the next. All of these components of what we now call the transformer were the output of this extremely high-paced, iterative trial and error." The ablations, aided by Shazeer's implementations, produced "something minimalistic," Jones says. "Noam is a wizard."
Vaswani recalls crashing on an office couch one night while the team was writing the paper. As he stared at the curtains that separated the couch from the rest of the room, he was struck by the pattern on the fabric, which looked to him like synapses and neurons. Gomez was there, and Vaswani told him that what they were working on would transcend machine translation. "Ultimately, like with the human brain, you need to unite all these modalities (speech, audio, vision) under a single architecture," he says. "I had a strong hunch we were onto something more general."
In the higher echelons of Google, however, the work was seen as just another interesting AI project. I asked several of the transformers folks whether their bosses ever summoned them for updates on the project. Not so much. But "we understood that this was potentially quite a big deal," says Uszkoreit. "And it caused us to actually obsess over one of the sentences in the paper toward the end, where we comment on future work."
That sentence anticipated what might come next: the application of transformer models to basically all forms of human expression. "We are excited about the future of attention-based models," they wrote. "We plan to extend the transformer to problems involving input and output modalities other than text" and to investigate "images, audio and video."
A couple of nights before the deadline, Uszkoreit realized they needed a title. Jones noted that the team had landed on a radical rejection of the accepted best practices, most notably LSTMs, for one technique: attention. The Beatles, Jones recalled, had named a song "All You Need Is Love." Why not call the paper "Attention Is All You Need"?
The Beatles?
"I'm British," says Jones. "It literally took five seconds of thought. I didn't think they would use it."
They continued collecting results from their experiments right up until the deadline. "The English-French numbers came, like, five minutes before we submitted the paper," says Parmar. "I was sitting in the micro-kitchen in 1965, getting that last number in." With barely two minutes to spare, they sent off the paper.