Reddit will block the Internet Archive

cpvr

Reddit says that it has caught AI companies scraping its data from the Internet Archive’s Wayback Machine, so it’s going to start blocking the Internet Archive from indexing the vast majority of Reddit. The Wayback Machine will no longer be able to crawl post detail pages, comments, or profiles; instead, it will only be able to index the Reddit.com homepage, which effectively means Internet Archive will only be able to archive insights into which news headlines and posts were most popular on a given day.
“Internet Archive provides a service to the open web, but we’ve been made aware of instances where AI companies violate platform policies, including ours, and scrape data from the Wayback Machine,” spokesperson Tim Rathschmidt tells The Verge.
The Internet Archive’s mission is to keep a digital archive of websites on the internet and “other cultural artifacts,” and the Wayback Machine is a tool you can use to look at pages as they appeared on certain dates, but Reddit believes not all of its content should be archived that way.
“Until they’re able to defend their site and comply with platform policies (e.g., respecting user privacy, re: deleting removed content) we’re limiting some of their access to Reddit data to protect redditors,” Rathschmidt says.
The limits will start “ramping up” today, and Reddit says it reached out to the Internet Archive “in advance” to “inform them of the limits before they go into effect,” according to Rathschmidt. He says Reddit has also “raised concerns” about the ability of people to scrape content from the Internet Archive in the past.
Reddit has a recent history of cutting off access to scraper tools as AI companies have begun to use (and abuse) them en masse, but it’s willing to provide that data if companies pay. Reddit struck a deal with Google for both Google Search and AI training data early last year, and a few months later, it started blocking major search engines from crawling its data unless they pay. It also said its infamous API changes from 2023, which forced some third-party apps to shut down, leading to protests, were because those APIs were abused to train AI models.
Reddit also struck an AI deal with OpenAI, but it sued Anthropic in June, claiming Anthropic was still scraping from Reddit even after Anthropic said it wasn’t scraping anymore.
The Internet Archive didn’t immediately respond to a request for comment.

Source: Reddit will block the Internet Archive
 
As I said over on Administrata, this isn't a good thing. I'm hoping people will stop using Reddit to host their communities, because once their subreddits are gone, all the content will be too, since the Internet Archive won't be able to back up the posts. I doubt people will, but I bet the people on subreddits that preserve websites and other media aren't happy about this.
 
Welcome to the chaos ages everyone.
 
With forums (or any website that you host, really), you can use the almighty htaccess to define rules that tell the AI bots to "TAKE A HIKE!" and deny them access! That's what I use to keep AI companies from inflating view counters and overwhelming resources! Look around for htaccess scripts (can you call them that?) that will ban "bad bots"! It's much more effective than robots.txt, since the bad bots (and even some of the good ones, if you want them blocked) never honor it!

A good weapon (though I'm not sure) is to force an HTTP 410 Gone response (a much more powerful variant of the classic 404 Not Found, or so I read) on the bad bots. A 410 tells a bot that the webpage is actually gone and that it shouldn't come back for it, so it may even cause the bot to delete whatever it scraped! Though again, I'm not sure if what I said is fact or not! Do your own research!

By the way, I like to get creative and direct those naughty bots to a lovely HTTP 418 I'm a teapot rather than a 403 Forbidden! Keeps them out, but they get teatime on the plus side!
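
For anyone wanting a starting point, here is a minimal sketch of the kind of htaccess rules described above. It assumes Apache with mod_rewrite enabled and a host that allows .htaccess overrides. The user-agent names in the patterns (GPTBot, CCBot, ClaudeBot, Bytespider) are illustrative examples rather than a vetted blocklist, so swap in whatever actually shows up in your own logs.

```apache
# Sketch only: assumes Apache with mod_rewrite available and
# AllowOverride set so that .htaccess rewrite rules are honored.
<IfModule mod_rewrite.c>
    RewriteEngine On

    # Answer 410 Gone to user agents matching these (illustrative) names.
    # [NC] makes the match case-insensitive; [G] sends 410 Gone.
    RewriteCond %{HTTP_USER_AGENT} (GPTBot|CCBot|ClaudeBot) [NC]
    RewriteRule ^ - [G]

    # Teapot variant: a non-3xx code in [R=...] sends that status
    # and stops rewriting. Use [F] instead for a plain 403 Forbidden.
    RewriteCond %{HTTP_USER_AGENT} Bytespider [NC]
    RewriteRule ^ - [R=418]
</IfModule>
```

Keep in mind this only filters on the User-Agent header, which a scraper can fake, and nothing guarantees that a bot receiving a 410 will delete what it already collected.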

But again, look up the stuff yourself because I'm not sure of the validity of what I posted! Maybe some of you can even correct me! Enlighten me, will 'ya?
 
Maybe I’m biased from what I’ve seen regarding drama over specific subreddits, but it’s hard not to think this is for nefarious reasons. I’ve seen times where they did shady shit, deleted it, and got pissed someone had it archived.
 