The Race to Block OpenAI’s Scraping Bots Is Slowing Down

It’s too soon to say how the spate of deals between AI companies and publishers will shake out. OpenAI has already scored one clear win, though: Its web crawlers aren’t getting blocked by top news outlets at the rate they once were.

The generative AI boom sparked a gold rush for data, and a subsequent data-protection rush (for most news websites, anyway) in which publishers sought to block AI crawlers and prevent their work from becoming training data without consent. When Apple debuted a new AI agent this summer, for example, a slew of top news outlets swiftly opted out of Apple’s web scraping using the Robots Exclusion Protocol, or robots.txt, the file that allows webmasters to control bots. There are so many new AI bots on the scene that it can feel like playing whack-a-mole to keep up.

OpenAI’s GPTBot has the most name recognition and is also more frequently blocked than competitors like Google AI. The number of high-ranking media websites using robots.txt to “disallow” OpenAI’s GPTBot dramatically increased from its August 2023 launch until that fall, then steadily (but more gradually) rose from November 2023 to April 2024, according to an analysis of 1,000 popular news outlets by Ontario-based AI detection startup Originality AI. At the peak, just over a third of the websites blocked GPTBot; that share has now dropped closer to a quarter. Within a smaller pool of the most prominent news outlets, the block rate is still above 50 percent, but it’s down from heights earlier this year of almost 90 percent.
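As a rough illustration of what “disallowing” a crawler looks like in practice, the sketch below parses a made-up robots.txt file with Python’s standard `urllib.robotparser` module and asks whether a given bot may fetch the site. The file contents, the `is_blocked` helper, and the `example.com` URL are assumptions for illustration, not taken from any outlet’s actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: turns away GPTBot, leaves every other bot unrestricted.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_blocked(robots_txt: str, user_agent: str,
               url: str = "https://example.com/") -> bool:
    """Return True if the robots.txt text disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


print(is_blocked(ROBOTS_TXT, "GPTBot"))        # GPTBot is disallowed site-wide
print(is_blocked(ROBOTS_TXT, "SomeOtherBot"))  # falls under the permissive "*" rule
```

Note that this only reports what the file requests; whether a crawler honors the request is up to the crawler.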

But last May, after Dotdash Meredith announced a licensing deal with OpenAI, that number dipped significantly. It dipped again at the end of May when Vox announced its own arrangement, and once more this August when WIRED’s parent company, Condé Nast, struck a deal. The trend toward increased blocking appears to be over, at least for now.

These dips make obvious sense. When companies enter into partnerships and give permission for their data to be used, they’re no longer incentivized to barricade it, so it would follow that they would update their robots.txt files to permit crawling; make enough deals and the overall percentage of sites blocking crawlers will almost certainly go down. Some outlets unblocked OpenAI’s crawlers on the very same day that they announced a deal, like The Atlantic. Others took a few days to a few weeks, like Vox, which announced its partnership at the end of May but which unblocked GPTBot on its properties toward the end of June.
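The arithmetic behind a falling block rate can be sketched in a few lines. Everything here is hypothetical: the four site names, their robots.txt contents, and the `blocks_gptbot` and `block_rate` helpers are invented for illustration and do not reproduce Originality AI’s methodology or data.

```python
from urllib.robotparser import RobotFileParser

# Two hypothetical robots.txt policies toward GPTBot.
BLOCK = "User-agent: GPTBot\nDisallow: /\n"
ALLOW = "User-agent: GPTBot\nAllow: /\n"


def blocks_gptbot(robots_txt: str) -> bool:
    """True if this robots.txt text disallows GPTBot from the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("GPTBot", "/")


def block_rate(robots_by_site: dict) -> float:
    """Fraction of sites in the mapping whose robots.txt blocks GPTBot."""
    blocked = sum(blocks_gptbot(text) for text in robots_by_site.values())
    return blocked / len(robots_by_site)


# Made-up sample: two of four sites block GPTBot.
before = {
    "site-a.example": BLOCK,
    "site-b.example": BLOCK,
    "site-c.example": ALLOW,
    "site-d.example": ALLOW,
}
# site-a signs a deal and updates its file to permit crawling.
after = {**before, "site-a.example": ALLOW}

print(block_rate(before))  # 0.5
print(block_rate(after))   # 0.25
```

One publisher flipping a single file moves the aggregate figure, which is why a run of deals shows up as a series of dips.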

Robots.txt is not legally binding, but it has long functioned as the standard that governs web crawler behavior. For most of the internet’s existence, people running webpages expected each other to abide by the file. When a WIRED investigation earlier this summer found that the AI startup Perplexity was likely choosing to ignore robots.txt commands, Amazon’s cloud division launched an investigation into whether Perplexity had violated its rules. It’s not a good look to ignore robots.txt, which likely explains why so many prominent AI companies, including OpenAI, explicitly state that they use it to determine what to crawl. Originality AI CEO Jon Gillham believes that this adds extra urgency to OpenAI’s push to make agreements. “It’s clear that OpenAI views being blocked as a threat to their future ambitions,” says Gillham.
