Major News Outlets Block the Wayback Machine, Drawing Pushback From Journalists Who Rely on the Archival Tool
Earlier this month, USA Today published a standout investigative report exposing how U.S. Immigration and Customs Enforcement (ICE) delayed releasing critical data about the harmful impacts of its detention policies. Reporters behind the story used the Internet Archive’s Wayback Machine to aggregate and analyze ICE’s own public detention records, tracing how the agency’s policies shifted under the Trump administration. The investigation is just one of thousands of examples of how the Wayback Machine — which crawls and permanently stores snapshots of public web pages — preserves information for the public good. Even so, Wayback Machine director Mark Graham calls the situation “a little ironic.”
USA Today Co., the media conglomerate previously known as Gannett that owns the USA Today national masthead and more than 200 additional news outlets, blocks the Wayback Machine from archiving any of its work. “They’re able to pull together their story research because the Wayback Machine exists. At the same time, they’re blocking access,” Graham says.
This is not an isolated case. Earlier this year, Nieman Lab reported that a growing number of major journalism organizations have moved to restrict the Wayback Machine from archiving their reporting. An analysis from artificial intelligence detection startup Originality AI found that 23 leading news sites currently block ia_archiverbot, the crawler the Internet Archive uses for the Wayback project. Social platform Reddit also blocks the crawler. Other outlets use less direct restrictions: The Guardian does not block the crawler itself, but it excludes its content from the Internet Archive’s public API and removes its articles from the Wayback Machine’s user interface, making archived versions of its reporting far harder for ordinary users to access.
Outlets defend their policies with overlapping justifications. USA Today Co. spokesperson Lark-Marie Anton emphasizes that “this effort is not about specifically blocking the Internet Archive,” and frames the policy as part of a broader company push to block all unapproved scraping bots. Robert Hahn, The Guardian’s director of business affairs and licensing, says the outlet has been in talks with the Internet Archive over “concerns over potential misuse by AI companies of content sets crawled for preservation purposes.”
That AI-focused concern has become the most common justification for blocks. Other publishers argue they are acting to stop AI firms from using the Internet Archive’s massive trove of content to train generative AI models without permission, in violation of copyright. New York Times spokesperson Graham James says “the issue is that Times content on the Internet Archive is being used by AI companies in violation of copyright law to directly compete with us.” (The Times declined to clarify whether this unauthorized use has already occurred, or if it is a hypothetical future risk.) Reddit has also cited AI-related concerns as the reason for its block.
Publishers and AI developers are locked in a wide-ranging court fight over whether AI models may be trained on copyrighted content without explicit permission; more than 100 separate copyright lawsuits over the issue are pending in U.S. courts. Because the Wayback Machine hosts one of the largest organized collections of public web content in existence, it is seen as a particularly attractive data source for AI firms.
Now, working journalists are pushing back against this trend. This week, advocacy groups including the Electronic Frontier Foundation and Fight for the Future rallied journalists around the Wayback Machine’s mission. The coalition gathered more than 100 signatures from active journalists who recognize the tool’s irreplaceable value, and presented an open letter of support to the Internet Archive. Signatories range from high-profile broadcast journalist Rachel Maddow to independent reporters including Kat Tenbarge of Spitfire News and Taylor Lorenz of User Mag.
“In previous generations, journalists would turn to the physical archives of a local newspaper or of a local public library to access historical reporting and follow the threads of the present back into history,” the letter reads. “With many newspapers closed, and no clear path for local public libraries to preserve digital-only reporting, the work of safeguarding journalism’s record increasingly falls to the Internet Archive.”
Laura Flynn, a signatory and supervising podcast producer at The Intercept, says the Internet Archive has been an “essential tool” throughout her career, playing a key role in fact-checking and tracking down hard-to-find old audio clips. Micco Caporale, a writer at the Chicago Reader and another signatory, says the Wayback Machine is critical for his work writing about legacy bands and cultural figures, giving him access to old fan sites and early coverage that would otherwise be lost forever.
Caporale adds that the tool has also been transformative for his work as a union organizer. “I’ve also been using the Wayback Machine a ton in my union organizing work to find old job listings so we know what the company claimed to hire people for vs. what duties they actually assigned or to see how different positions have been retooled at different points,” Caporale says. “These posts also help us keep track of pay fluctuations across the organization over time.”
The Internet Archive, now 30 years old, has archived more than a trillion public web pages. The non-profit has already navigated several major legal battles since 2020; most recently, it settled a lawsuit with a group of major music publishers that sought up to $700 million in damages over the Archive’s Great 78s project, which preserves vintage recorded music. While no major financial penalty is currently at stake from the wave of crawler blocks, the growing trend still poses a serious threat to the Archive’s core mission.
No other widely available public tool matches the scope and accessibility of the Wayback Machine. If it continues to be locked out of major news sources, its preservation work could erode to the point that early digital historical records become far harder to access, or are lost entirely. This risk is not abstract: the Wayback Machine has already been used to hold the outlets that now block it accountable. In 2016, The New York Times faced public scrutiny for unacknowledged editorial changes it made to an article about then-U.S. senator and presidential candidate Bernie Sanders. Those revisions were first tracked and documented using the Wayback Machine. If a similar controversy arose today, media accountability reporters would likely struggle to access older versions of Times articles the same way.
A weakened, restricted Wayback Machine is not just a setback for accountability journalism — it also harms the U.S. legal system, as archived pages from the tool are regularly cited as evidence in court cases across the country.
Graham says he has not given up hope that publishers currently blocking the crawler will eventually reverse their decisions. He notes the non-profit is “in conversation” with the Times and other outlets that have implemented blocks. But for now, Graham says, “there’s no question that the general locking-down of more and more of the public web is impacting society’s ability to understand what’s going on in our world.”
Updated: 4/14/26, 12:25 pm EST: This story has been updated to include a citation from Nieman Lab.