OpenAI, Google, Meta and Anthropic all rely deeply on content from premium publishers to train the large language models, or LLMs, at the heart of their AI efforts, even as these companies have regularly underplayed their use of such copyrighted content, according to new research released this week from online publishing giant Ziff Davis.
Ziff Davis owns CNET, as well as a host of other brands, including IGN, PCMag, Mashable and Everyday Health.
A paper detailing the research, authored by Ziff Davis’ George Wukoson, lead attorney on AI, and Chief Technology Officer Joey Fortuna, reports that AI companies intentionally filtered out low-quality content in favor of high-quality, human-made content to train their models. Given that AI companies want their models to perform well, it makes sense they’d favor quality content in their training data. AI companies used websites’ domain authority, or essentially their ranking in Google search, to make those distinctions. Generally, sources that rank higher on Google tend to be of higher quality and trustworthiness.
The companies behind popular AI chatbots like ChatGPT and Gemini have been secretive about where they’re sourcing the information that powers the answers the bots are giving you. That’s not helpful for consumers, who don’t get visibility into the sources, their reliability, and whether the training data might be biased or perpetuate harmful stereotypes.
But it’s also a point of significant dispute with publishers, who say AI companies are basically pirating the copyrighted work they own, without permission or compensation. Though OpenAI has licensed content from some publishers as it transforms from a nonprofit into a for-profit company, other media companies are suing the maker of ChatGPT for copyright infringement.
“Major LLM developers no longer disclose their training data as they once did. They are now more commercial and less transparent,” Wukoson and Fortuna wrote.
OpenAI, Google, Meta and Anthropic didn’t immediately respond to requests for comment.
Publishers including The New York Times have sued Microsoft and OpenAI for copyright infringement, while Wall Street Journal and New York Post publisher Dow Jones is suing Perplexity, another generative AI startup, on similar grounds.
Big Tech has seen tremendous valuations amid the AI revolution. Google is currently valued at about $2.2 trillion, and Meta is valued at about $1.5 trillion, in part because of their work with generative AI. Investors currently value startups OpenAI and Anthropic at $157 billion and $40 billion, respectively. News publishers, meanwhile, have been forced into waves of layoffs over the past few years as they struggle in a highly competitive online media environment, trying to cut through the noise of online search, AI-generated “slop” and social media to find audiences.
Meta CEO Mark Zuckerberg said creators and publishers “overestimate the value of their specific content,” in an interview with The Verge earlier this year.
Meanwhile, some AI companies have inked licensing deals with publishers to feed their LLMs with up-to-date news articles. OpenAI signed deals with the Financial Times, DotDash Meredith, Vox and others earlier this year. Meta and Microsoft have also cut deals with publishers. Ziff Davis hasn’t signed a similar deal.
Based on an analysis of disclosures made by AI companies for their older models, Wukoson and Fortuna found that URLs from top-end publishers such as Axel Springer (Business Insider, Politico), Future PLC (TechRadar, Tom’s Guide), Hearst (San Francisco Chronicle, Men’s Health), News Corp (The Wall Street Journal), The New York Times Company, The Washington Post and others made up 12.04% of the training data, at least in the OpenWebText2 dataset. OpenWebText2 was used to train GPT-3, the underlying technology for ChatGPT, though the latest version of ChatGPT isn’t built directly on top of GPT-3.
None of OpenAI, Google, Anthropic or Meta has disclosed the data used to train their most recent models.
Each of several trends discussed in the research paper “reflects decisions made by LLM companies to prioritize high-quality web text datasets in training LLMs, resulting in revolutionary technological breakthroughs driving enormous value for those companies,” Wukoson and Fortuna wrote.