Reddit has announced that it is updating its Robots Exclusion Protocol file (robots.txt), which tells automated crawlers whether they are allowed to access the site.
Traditionally, the robots.txt file was used to permit search engines to scan a site and direct people to its content. With the rise of artificial intelligence, however, websites are increasingly being scraped and used to train models without crediting the original source of the content.
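The mechanism at issue is simple: a robots.txt file lists per-crawler rules, and well-behaved bots check them before fetching pages. As a minimal sketch, the hypothetical policy below permits one named search crawler and disallows everyone else, then verifies the rules with Python's standard-library parser (the user-agent names and URL are illustrative, not Reddit's actual configuration):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: allow one search crawler, block all other bots.
robots_txt = """\
User-agent: Googlebot
Allow: /

User-agent: *
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A compliant crawler would check can_fetch() before requesting a page.
print(parser.can_fetch("Googlebot", "https://example.com/r/news"))      # True
print(parser.can_fetch("SomeAIScraper", "https://example.com/r/news"))  # False
```

Note that this check is entirely voluntary: nothing technically stops a crawler from skipping it, which is why robots.txt alone cannot block scraping.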
In addition to the updated robots.txt file, Reddit will continue to enforce rate limiting and to block unknown bots and crawlers from accessing its platform. The company told TechCrunch that bots and crawlers will be rate-limited or blocked if they do not adhere to Reddit’s content policy and do not have an agreement with the platform.
Reddit states that the update should not impact the majority of users or well-intentioned entities, such as researchers and organizations like the Internet Archive. Instead, the update aims to deter artificial intelligence companies from training their large language models on its content. Of course, since compliance is voluntary, artificial intelligence bots can simply ignore Reddit’s robots.txt file.
This announcement comes a few days after an investigation by Wired magazine, which found that the emerging AI-powered search company Perplexity was scraping content without permission. Wired found that Perplexity was disregarding requests not to scrape its website, even though Wired had blocked the company in its robots.txt file. Perplexity’s CEO, Aravind Srinivas, responded to the allegations by stating that the robots.txt file is not a legal framework.
The upcoming changes will not affect companies that already have agreements with Reddit. For example, Reddit has a $60 million deal with Google that allows the search giant to train its AI models on content from the platform. With these changes, Reddit is signaling to other companies that wish to use its data for AI training that they will need to pay.
Reddit stated in a blog post: “Anyone accessing Reddit content must adhere to our policies, which include those designed to protect Reddit users. We are selective in choosing who we collaborate with and trust to have broad access to Reddit’s content.”
The announcement did not come as a surprise, as Reddit issued a new policy a few weeks ago aimed at guiding how commercial entities and other partners access and use Reddit data.