Come 15 September, multipurpose crawlers used by the likes of Google, Microsoft and Apple will be blocked by default according to Cloudflare’s new rules.
IT and network services provider Cloudflare has announced new rules designed to give website owners more control over the types of web crawlers that will be allowed on or blocked from their sites – along with plans to block multipurpose crawlers by default on ad-supported pages.
Traditionally, search engines and websites maintained a sort of “symbiotic relationship”, as Cloudflare puts it, whereby website owners allowed search engines to crawl their sites and, in return, search engines sent users back to their pages.
The company explained that this crawl-to-referral process, when balanced, would help sites generate the pageviews needed to sustain advertising, affiliate revenue and subscriptions.
However, the rise of AI crawlers and agents changed things, as AI chatbots scrape sites to synthesise answers and bypass original sources – often leading to imbalanced crawl-to-referral ratios. Cloudflare’s own research from last year noted ratios ranging from 118:1 up to nearly 50,000:1 – meaning an AI crawler could have scraped a site tens of thousands of times and only sent back a single user.
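To make those figures concrete, the ratio is simply the number of crawler requests divided by the number of visitors referred back. The short sketch below is illustrative only: the function name and the sample counts are assumptions for this example, not Cloudflare's own tooling or data.

```python
def crawl_to_referral_ratio(crawls: int, referrals: int) -> float:
    """Return how many crawler requests were made per referred visitor.

    Hypothetical helper for illustration; not part of any Cloudflare API.
    """
    if referrals <= 0:
        raise ValueError("referrals must be a positive number")
    return crawls / referrals


# Sample counts chosen to match the two ends of the range quoted above.
low_end = crawl_to_referral_ratio(crawls=118, referrals=1)
high_end = crawl_to_referral_ratio(crawls=50_000, referrals=1)

print(f"{low_end:,.0f}:1")
print(f"{high_end:,.0f}:1")
```

At the upper end, a site would serve roughly 50,000 crawler requests for every one visitor it received in return.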
Nowadays, many of these crawlers are used for multiple purposes – including AI training and search indexing – which puts website owners in a difficult position, as turning off all automation and crawler access to their sites could diminish their chances of showing up in search results.
Cloudflare hopes to tackle this issue with its new rules, which include options for managing crawler access by establishing three categories of crawler purposes: Search, Agent and Training.
‘Search’ refers to crawlers that are used for search indexing, ‘Agent’ refers to automated behaviours used by the likes of chatbots and browser-use agents, and ‘Training’ refers to crawlers that scrape content for fine-tuning AI models.
With these classifications, website owners will be able to selectively allow or block crawlers according to their purpose – meaning that a website owner who wants to allow Search crawlers but block Agent and Training crawlers will now be able to do so.
As part of these new rules, Cloudflare will also block Training and Agent crawlers by default on pages that display ads.
The default block settings, which will apply to any new domain onboarded to Cloudflare from 15 September, won’t apply to crawlers used for search indexing, while multipurpose crawlers – specifically those used for both search and training purposes – will be allowed or blocked “according to all of their behaviours”.
As a result, multipurpose crawlers used by the likes of Google, Microsoft and Apple will be blocked by default come 15 September.
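One way to read the “according to all of their behaviours” rule is that a crawler is allowed only if none of its declared purposes is blocked. The sketch below models that reading; the names `Purpose`, `DEFAULT_BLOCKED` and `is_allowed` are hypothetical, and the logic is an assumption drawn from the description above rather than Cloudflare's actual implementation.

```python
from enum import Enum
from typing import Iterable


class Purpose(Enum):
    """The three crawler purposes named in Cloudflare's new rules."""
    SEARCH = "search"
    AGENT = "agent"
    TRAINING = "training"


# Assumed default for ad-supported pages: Agent and Training are blocked,
# Search is not. Site owners can change this set.
DEFAULT_BLOCKED = frozenset({Purpose.AGENT, Purpose.TRAINING})


def is_allowed(purposes: Iterable[Purpose],
               blocked: frozenset = DEFAULT_BLOCKED) -> bool:
    """Allow a crawler only if none of its purposes is blocked."""
    declared = set(purposes)
    if not declared:
        raise ValueError("a crawler must declare at least one purpose")
    return declared.isdisjoint(blocked)


# A search-only crawler passes the assumed defaults.
print(is_allowed({Purpose.SEARCH}))
# A crawler used for both search and training is blocked.
print(is_allowed({Purpose.SEARCH, Purpose.TRAINING}))
# An owner who opts out of the defaults lets the same crawler through.
print(is_allowed({Purpose.SEARCH, Purpose.TRAINING}, blocked=frozenset()))
```

Under this reading, an operator that splits its automation into separate Search, Agent and Training crawlers would keep its search crawler allowed by default, which is the separation Cloudflare says it wants to encourage.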
“We believe it should be simple for all website owners to manage access for these three AI-centred use cases,” read a blogpost by Cloudflare. “We believe that bot operators should separate their crawlers because that creates more transparency for website owners, allowing them to better understand why a given crawler is visiting them as well as to better manage the access they extend to that crawler.
“If a company runs automation that builds Search indexes, acts as an Agent, and collects data to Train their models, then we strongly encourage that company to separate the automation into three separate crawlers.”
In the lead-up to the September default deadline, Cloudflare customers can opt out of the default settings if they want to.
Cloudflare’s new rules are the latest in the company’s attempts to curb crawler misuse.
This time last year, the company introduced new crawler controls for website owners, including a ‘pay per crawl’ system designed to integrate with existing web infrastructure and leverage HTTP status codes and established authentication mechanisms to create a framework for paid content access.
The year before that, Cloudflare introduced a tool that allowed website owners to block all bots at once.