AI feels to many like all take and no give. But there’s now so much money in AI, and the technological state of the art is changing so fast that many site owners can’t keep up. And the fundamental agreement behind robots.txt, and the web as a whole — which for so long amounted to “everybody just be cool” — may not be able to keep up either.
Koster cautioned against arguing about whether robots are good or bad, because it doesn’t matter: they’re here, and they’re not going away.
Amazon’s crawlers traipse the web looking for product information, and according to a recent antitrust suit, the company uses that information to punish sellers who offer better deals away from Amazon.
AI companies like OpenAI are crawling the web in order to train large language models that could once again fundamentally change the way we access and share information.
The ability to download, store, organize, and query the modern internet gives any company or developer something like the world’s accumulated knowledge to work with. In the last year or so, the rise of AI products like ChatGPT, and the large language models underlying them, has made high-quality training data one of the internet’s most valuable commodities. That has caused internet providers of all sorts to reconsider the value of the data on their servers, and rethink who gets access to what. Being too permissive can bleed your website of all its value; being too restrictive can make you invisible. And you have to keep making that choice with new companies, new partners, and new stakes all the time.
There are a few breeds of internet robot. You might build a totally innocent one to crawl around and make sure all your on-page links still lead to other live pages; you might send a much sketchier one around the web harvesting every email address or phone number you can find. But the most common, and currently the most controversial, is the simple web crawler. Its job is to find and download as much of the internet as it possibly can.
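A toy version makes the mechanics plain. A crawler is, at heart, a loop: fetch a page, pull out its links, add them to a queue, and repeat. The sketch below (standalone Python with an arbitrary page limit, not any real company’s crawler) does exactly that and nothing more.

    # A toy crawler, for illustration only: start from one page and keep
    # downloading whatever links turn up, breadth-first, until a limit is hit.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        # Collects the href of every <a> tag on a page.
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, limit=50):
        queue, seen = deque([start_url]), set()
        while queue and len(seen) < limit:
            url = queue.popleft()
            if url in seen:
                continue
            seen.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "ignore")
            except Exception:
                continue  # dead link, timeout, not HTML: skip it and move on
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:
                queue.append(urljoin(url, link))
        return seen  # every URL the crawler managed to visit

A real crawler adds politeness (rate limits, the robots.txt check described below), storage, and scale, but the shape is the same.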
In the last year or so, though, the rise of AI has upended that equation. For many publishers and platforms, having their data crawled for training data felt less like trading and more like stealing.
The BBC’s director of nations, Rhodri Talfan Davies, announced last fall that the BBC would also be blocking OpenAI’s crawler. The New York Times blocked GPTBot as well, months before launching a suit against OpenAI alleging that OpenAI’s models “were built by copying and using millions of The Times’s copyrighted news articles, in-depth investigations, opinion pieces, reviews, how-to guides, and more.” A study by Ben Welsh, the news applications editor at Reuters, found that 606 of 1,156 surveyed publishers had blocked GPTBot in their robots.txt files.
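The block itself could hardly be simpler. In practice it amounts to two lines of plain text in a site’s robots.txt file: one naming OpenAI’s crawler, one telling it to stay away from everything.

    User-agent: GPTBot
    Disallow: /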
There are also crawlers used for both web search and AI. CCBot, which is run by the organization Common Crawl, scours the web for search engine purposes, but its data is also used by OpenAI, Google, and others to train their models. Microsoft’s Bingbot is both a search crawler and an AI crawler. And those are just the crawlers that identify themselves — many others attempt to operate in relative secrecy, making it hard to stop or even find them in a sea of other web traffic. For any sufficiently popular website, finding a sneaky crawler is needle-in-haystack stuff.
In large part, GPTBot has become the main villain of robots.txt because OpenAI allowed it to happen. The company published and promoted a page about how to block GPTBot and built its crawler to loudly identify itself every time it approaches a website. Of course, it did all of this after training the underlying models that have made it so powerful, and only once it became an important part of the tech ecosystem.
But robots.txt is not a legal document — and 30 years after its creation, it still relies on the goodwill of all parties involved. Disallowing a bot in your robots.txt file is like putting up a “No Girls Allowed” sign on your treehouse: it sends a message, but it’s not going to stand up in court. Any crawler that wants to ignore robots.txt can simply do so, with little fear of repercussions.
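You can see how voluntary the arrangement is in how a polite crawler actually uses the file. Python’s standard library ships a robots.txt parser, and the well-behaved path looks roughly like this (the URLs are placeholders, not real targets):

    # How a polite crawler consults robots.txt before fetching a page.
    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()  # download and parse the site's rules

    if rp.can_fetch("GPTBot", "https://example.com/some-article"):
        print("allowed: go ahead and fetch the page")
    else:
        print("disallowed: a polite crawler stops here")

The whole system hinges on that can_fetch call being made at all; a crawler that never asks never hears no.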
As the AI companies continue to multiply, and their crawlers grow more unscrupulous, anyone wanting to sit out or wait out the AI takeover has to take on an endless game of whac-a-mole. They have to stop each robot and crawler individually, if that’s even possible, while also reckoning with the side effects. If AI is in fact the future of search, as Google and others have predicted, blocking AI crawlers could be a short-term win but a long-term disaster.
Even as AI companies face regulatory and legal questions over how they build and train their models, those models continue to improve and new companies seem to start every day.