Preserving The Web Is Not The Problem. Losing It Is.
from the libraries-matter dept
Recent reporting by Nieman Lab describes how some major publishers and platforms, including The Guardian, The New York Times, and Reddit, are limiting or blocking access to their content in the Internet Archive’s Wayback Machine. According to the article, these organizations are doing so largely out of concern that generative AI companies are using the Wayback Machine as a backdoor for large-scale scraping.
These concerns are understandable, but unfounded. The Wayback Machine is not intended to be a backdoor for large-scale commercial scraping and, like others on the web today, we expend significant time and effort working to prevent such abuse. Whatever legitimate concerns people may have about generative AI, libraries are not the problem, and blocking access to web archives is not the solution; doing so risks serious harm to the public record.
The Internet Archive, a 501(c)(3) nonprofit public charity and a federal depository library, has been building its archive of the world wide web since 1996. Today, the Wayback Machine provides access to thirty years’ worth of web history and culture. It has become an essential resource for journalists, researchers, courts, and the public.
For three decades the Wayback Machine has peacefully coexisted with the development of the web, including the websites mentioned in the article. Our mission is simple: to preserve knowledge and make it accessible for research, accountability, and historical understanding.
As tech policy writer Mike Masnick recently warned, blocking preservation efforts risks a profound unintended consequence: “significant chunks of our journalistic record and historical cultural context simply… disappear.” He notes that when trusted publications are absent from archives, we risk creating a historical record biased against quality journalism.
There is no question that generative AI has changed the landscape of the world wide web. But it is important to be clear about what the Wayback Machine is, and what it is not.
The Wayback Machine is built for human readers. We use rate limiting, filtering, and monitoring to prevent abusive access, and we watch for and actively respond to new scraping patterns as they emerge.
We acknowledge that systems can always be improved. We are actively working with publishers on technical solutions to strengthen our systems and address legitimate concerns without erasing the historical record.
What concerns me most is the unintended consequence of these blocks. When libraries are blocked from archiving the web, the public loses access to history. Journalists lose tools for accountability. Researchers lose evidence. The web becomes more fragile and more fragmented, and history becomes easier to rewrite.
Generative AI presents real challenges in today’s information ecosystem. But preserving the time-honored role of libraries and archives in society has never been more important. We’ve worked alongside news organizations for decades. Let’s continue working together in service of an open, referenceable, and enduring web.
Mark Graham is the Director of the Wayback Machine at the Internet Archive.