Data scraping, AI and the battle for the open web
October 2025 | SPOTLIGHT | RISK MANAGEMENT
Financier Worldwide Magazine
On 13 July 2025, The Economist declared, “AI is killing the internet”, echoing a growing refrain from some online retailers and digital news platforms. They argue that artificial intelligence’s (AI’s) ability to summarise internet search results harms their business by delivering answers without requiring users to click a link or visit their websites.
For users, getting the information they need without wading through clickbait links, pop-up ads or recycled content is a benefit. But for businesses whose revenues depend on pay-per-click advertising and affiliate marketing, AI search is a threat.
To protect these revenue streams, some organisations are pursuing efforts to block web crawlers and replace the pay-per-click model with pay-per-crawl fees. Advocates for an open web contend these initiatives undermine openness, decentralisation and non-discrimination – principles core to the world wide web’s creation.
For decades, small and medium-sized enterprises (SMEs) have relied on data scraping for competitive intelligence, customer sentiment, market development and financial analysis. Academic institutions rely on web crawling and data scraping for research. Non-profits routinely crawl and scrape data for societal good, such as archiving public websites, monitoring hate speech and helping to prevent human trafficking.
Against a backdrop of evolving economic models and developing legal and regulatory frameworks, the diversity of stakeholder priorities highlights the tension between competing values.
Two of the leading industry initiatives to restrict crawling and scraping rely on machine-readable code to prevent automated data collection programmes (also known as bots) from accessing public websites.
Cloudflare, a San Francisco-based company that provides content-delivery services to approximately 20 percent of all websites, recently launched a bot-blocking tool that denies select crawlers access to its customers’ websites unless they pay a per-crawl fee. Customers can choose which bots to block, but the tool defaults to blocking bots Cloudflare labels ‘AI Search’.
The purpose, as Cloudflare explained, is to build a new economic model that compensates publishers for access to their publicly available web data. Matthew Prince, chief executive of Cloudflare, described his preferred outcome as “a world where humans get content for free, and bots pay a tonne for it”, underscoring the monetisation shift.
Farzaneh Badiei, founder of Digital Medusa and former director of social media governance initiative at Yale Law School, counters that Cloudflare’s new model “sets a dangerous precedent for how we control access to online knowledge, and risks reinforcing gatekeeping structures that threaten the open Internet”.
Similarly, Luke Hogg, director of outreach, and Tim Hwang, general counsel, at the Foundation for American Innovation, criticise Cloudflare for capitalising on “fear of a new technology to undermine the very principles of the open internet”. They also point to competition concerns raised by centralising control and reinforcing data silos.
Separately, a working group at the Internet Engineering Task Force (IETF) is drafting protocols for website operators to specify whom they approve to use their public web data for AI training and whom they wish to disallow. The working group originally focused on developing vocabulary to express “opt-outs” contemplated under article 4 of the European Union’s (EU’s) Directive on Copyright in the Digital Single Market.
Under article 4, copyrighted data may be used for AI training under a “legitimate interest” test unless a copyright owner expressly reserves their rights. But expressing a digital rights reservation across platforms is not simple. One solution identified by the working group is to create a standardised, machine-readable opt-out vocabulary that can be appended to a well-established IETF standard known as the ‘robots exclusion protocol’, or robots.txt.
First proposed in 1994, robots.txt was intended to ensure technical stability of websites by notifying crawlers which files should be avoided. It explicitly rejected any notion of enforceability. Later, it was modified to improve access to websites by internet search engines but continued to exist as a voluntary means to share information. In the US, courts have repeatedly acknowledged its limitations as an enforcement mechanism, treating robots.txt files as requests rather than technical barriers.
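For crawler operators who choose to honour such requests, the check is simple to implement. The short Python sketch below, which uses the standard library’s urllib.robotparser together with a placeholder domain and user agent (both assumptions for illustration), shows where compliance actually lives: in the crawler’s own code, not in the file.

```python
# Minimal sketch: voluntarily honouring robots.txt before fetching a page.
# The domain and user agent below are placeholders, not real endpoints.
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"   # hypothetical crawler name
TARGET = "https://www.example.com/products/page1.html"

parser = RobotFileParser()
parser.set_url("https://www.example.com/robots.txt")
parser.read()                         # fetch and parse the site's robots.txt

if parser.can_fetch(USER_AGENT, TARGET):
    # Proceed with the request (actual fetching and parsing omitted).
    print("robots.txt permits fetching", TARGET)
else:
    # Compliance is voluntary: the crawler, not the file, enforces the answer.
    print("robots.txt asks crawlers not to fetch", TARGET)
```

Because nothing on the server side enforces the answer, robots.txt remains, as the US courts have characterised it, a request rather than a technical barrier.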
The EU, however, is poised to take a different approach. Under the EU AI Act, AI developers who sign on to the EU AI Act Code of Practice will be required to comply with all robots.txt requests, including any modifications to robots.txt adopted by the IETF. For this reason, a decision by the IETF Working Group to adopt new robots.txt vocabulary takes on greater legal significance. Although no company is required to commit to the Code, companies such as Google and OpenAI have signalled their intent to do so.
Supporters of AI opt-outs argue that without them, publishers will withdraw valuable content from the open web. In their research paper ‘Consent in Crisis: The Rapid Decline of the AI Data Commons’, Shayne Longpre and his co-authors found that, after the introduction of ChatGPT, efforts by website owners to restrict access to public web data increased significantly.
This conclusion was based on an increase in the number of robots.txt files, the number of websites with prohibitions on data scraping in their terms of service and new paywalls. More recently, several news agencies, including The Washington Post and Business Insider, noted sharp declines in online traffic since the advent of AI search tools.
How much of this decline is due to AI search, as opposed to a generational shift to platforms like X and TikTok for news, remains to be seen. But publisher sentiment is clear. As the chief executive of Snopes.com notes on the Cloudflare website: “If the shift toward AI continues to erode web traffic, I worry that most premium publishers will have no choice but to adopt a subscription-only model. The whole Internet behind a paywall isn’t good for anyone.”
Opponents of restrictive standards warn that pay-per-crawl requirements favour large companies that can afford costly licence agreements and will create barriers for SMEs. As Mr Hogg and Mr Hwang explain: “smaller sites that lack the clout to cut deals could simply vanish from the datasets that power the next wave of technology”. In addition to exacerbating economic disparities, pay-per-crawl may make it harder for non-profits, government watchdogs and academics to operate.
The current IETF proposal may also lead to copyright overreach. When data hosts use robots.txt files to restrict use of content in particular files, they are likely to block the use of information that is not copyrightable in the first place, such as product and pricing data and customer reviews. Copyright overreach may also occur when a platform provider hosts data it does not own and for which it holds no copyright. Examples include user-created posts and content that creators wish to share without restriction. In such cases, the inclusion of a robots.txt file may be comparable to a false copyright takedown notice.
In the US, the law generally supports the right to responsibly scrape public web data. For example, courts have held that the Computer Fraud and Abuse Act does not bar scraping publicly available data, even when instructions in a robots.txt file were ignored and a CAPTCHA bypassed; that a website’s terms of service prohibiting scraping did not give rise to liability where the data scraping tool was not logged on to the website while scraping publicly available information; and that copying copyrighted materials for AI training is transformative and generally constitutes fair use.
As the importance and prevalence of data scraping have grown, industry standards for responsible data scraping have emerged to address the challenges publishers and creators have identified. Organisations such as the Alliance for Responsible Data Collection (ARDC) have developed ‘Technical Standards and Governance Guidance for Responsible Data Collection’ based on input from a range of for-profit and non-profit entities that crawl, scrape or rely upon scraped web data.
Examples of responsible data collection standards include measures that protect website and platform stability, such as the use of rate limits to prevent harm to websites, and the retention of records documenting the dates and times of data collections, the domains from which data is sourced, the purpose of the collection and behaviours such as whether robots.txt instructions were followed. In addition, ARDC guidelines provide a compliance framework for organisations that conduct frequent scraping or crawling, such as universities or market research entities.
For larger organisations, policies defining acceptable use, approval processes and provisions for investigating complaints ensure a consistent approach tailored to each organisation’s unique purpose and needs. Organisations like ARDC provide alternative solutions to the challenges of surging demand for data: rather than blocking access and innovation, they provide practical guidance for organisations committed to responsible scraping and crawling.
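As an illustration only, the Python sketch below combines the practices described above: a robots.txt check, a rate limit and a collection log recording when, where and why data was gathered. The pacing, log fields, crawler name and purpose label are assumptions made for the example, not ARDC requirements.

```python
# Illustrative sketch of a rate-limited collection loop that checks robots.txt
# and keeps a record of each attempt. Field names and pacing are assumptions.
import csv
import time
from datetime import datetime, timezone
from urllib.parse import urlsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-research-bot"   # hypothetical crawler name
RATE_LIMIT_SECONDS = 5                # assumed pause between requests
PURPOSE = "market research"           # recorded purpose of the collection


def collect(urls, log_path="collection_log.csv"):
    """Fetch each URL only if robots.txt allows it, and log every attempt."""
    with open(log_path, "w", newline="") as f:
        log = csv.writer(f)
        log.writerow(["timestamp_utc", "domain", "url", "purpose", "robots_followed"])
        for url in urls:
            domain = urlsplit(url).netloc
            robots = RobotFileParser()
            robots.set_url(f"https://{domain}/robots.txt")
            robots.read()                                  # fetch and parse robots.txt
            allowed = robots.can_fetch(USER_AGENT, url)
            if allowed:
                # Identify the crawler honestly; parsing and storage omitted.
                urlopen(Request(url, headers={"User-Agent": USER_AGENT}))
            log.writerow([datetime.now(timezone.utc).isoformat(),
                          domain, url, PURPOSE, allowed])
            time.sleep(RATE_LIMIT_SECONDS)                 # rate limit between requests


collect(["https://www.example.com/reviews"])
```

A log of this kind gives an organisation the documentation contemplated by responsible-collection guidance should questions about a particular collection arise later.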
The debate over data scraping, AI access and the open web is more than a clash between website operators and AI developers. It is a reflection of the importance of public web data to the global economy.
As the influence of AI on society expands exponentially, courts, policymakers and standards organisations will need to carefully weigh the range of interests at stake and the potential for unintended consequences. Decisions made today to address one set of pressing concerns may have enduring impacts on the future of the public web.
Jo Levy is a partner at The Norton Law Firm. She can be contacted on +1 (510) 239 3588 or by email: jlevy@nortonlaw.com.
© Financier Worldwide