Author: Ted Hardie
Contact: ted.ietf@gmail.com
Title: Crawlers, adversaries, and exclusions: thoughts on content ingest and creator's rights

Abstract: This paper looks briefly at the history of information retrieval on the early Web and explores the relationships between content creators and search engines that developed there. It then examines the differences between those relationships and the ones that have emerged from the ingestion of content to train generative AI models. It concludes with some thoughts on the implications of those differences for evolving mechanisms like robots.txt to cover the new use case.

1. Metadata on the early Web

Prior to the widespread deployment of the Web, many of the resources on the Internet were in curated collections, maintained and interlinked by librarians or their equivalents in research institutions and universities. When the Web overtook Gopher, the population making content available shifted to subject-matter experts sharing specific content, rather than people more used to providing institutional context. When I was the head of NASA's webmasters' working group during the mid-1990s, we estimated that there were more than four thousand servers inside the network, run by individuals or small teams intent on sharing information on specialized topics with external collaborators or the public at large. It was in that highly decentralized era that many of the patterns related to search and discovery on the Web were set.

One consequence of this shift to the Web was that methods for finding and describing information needed to be reconsidered as well. Veronica, an early search engine for Gopher, relied on menu information; WAIS, which used free-text search, presumed the creation of index files for some critical functionality. One venue for exploring the difficulty of translating those approaches to the Web was the Dublin Core Workshop series (documented in RFC 2413 and updated in RFC 5013, after the creation of the Dublin Core Metadata Initiative). It is interesting to note that one of the core metadata elements was "rights", described in RFC 5013 as "Information about rights held in and over the resource. Typically, rights information includes a statement about various property rights associated with the resource, including intellectual property rights." Had the Dublin Core become a bedrock part of search and discovery, we might presume that the rights related to any piece of content could be expressed to any type of retrieval system relatively easily.

That is not, however, what happened. Search engines which used metadata for classification discovered early on that content creators could and did use metadata tags that were completely unrelated to the actual content. This set up an adversarial relationship between content creators, who were trying to draw attention to their sites by being very liberal in their descriptions of the content, and search engine operators, who were trying to provide relevant content. While an entire system was eventually specified for using metadata to direct queries within cooperating systems (see RFC 2651, RFC 2655, and RFC 2656), it was a niche approach by the time the specifications appeared. Instead, sites were crawled by search engines which followed each hyperlink, collected all the data available, and produced their own indexes. As early as 1994, this occasionally created excessive load on the content servers.
In response, Martijn Koster created robots.txt as a mechanism for specifying how crawlers would be permitted to behave at a particular site. This became a de facto standard (ultimately also published as RFC 9309, after having been in use for nearly 30 years). While it has expanded considerably over time from simple rate-limiting, its core mechanism remains the same: a product token matching the user-agent string is used to identify a matching set of rules, which the relevant crawler is expected to obey. A default set of behaviors is also generally included. (A short sketch of such a file appears below.)

While this has not been without flaws, this approach worked through a very long period in part because there were implicitly other remedies available: the IP address ranges associated with a crawler could be blocked; a crawler failing to obey the rules could be fingerprinted by the content server and served null or garbage data; and a crawler that ignored limits related to confidential information might find itself dealing with the relevant regulatory regimes (in the US, the California Consumer Privacy Act or the Computer Fraud and Abuse Act; in Europe, the GDPR). It also worked in part because each party had an interest in the other's success; while there were adversarial elements, the core of the relationship between content and search was cooperative, as web sites wanted the traffic the search engines supplied.

2. Exclusion and use for model training data

As we look at the newer use case, there are some obvious similarities. There is a set of crawlers and a set of content sites. In some cases, over-eager crawlers have created excessive load on content sites. Re-using robots.txt to handle over-eager crawlers seems like a return to the roots of the standard and a natural extension of the existing, well-understood system. But the relationship between the sites and the crawlers looks quite different from the one that held during the distributed web era in which robots.txt emerged.

First, much of the content is platform-based, and in some cases the platform owners and the AI model developers are the same entity. Developing rules for all of the content on a platform and sharing them via a site-wide file seems likely to result either in serious scaling problems or in a flattening of individual preferences into site-wide norms, which may not match the desires of the content creators. Especially where the platform owner and the AI model developer are the same, any conflict about the desired usage of content is heavily weighted toward the platform owner. Related issues have already arisen: updated terms of service have caused concern among multiple user communities, as Slack, Discord, X, and others have each tried to update their terms to grant rights for model development.

Second, web content creators historically collaborated with search engines in order to garner attention. Because that attention has value, the search engines could rely on a fundamental level of cooperation, even if they had to deal with some adversarial behavior attempting to gather more attention than the content actually warranted. In the current situation, however, the adversarial relationship runs the other way. The content has value, and the content creator or publisher must deal with adversarial behavior by the crawler or the developer of the AI system. Where web search engines bring attention to the content, generative AI models are intended to synthesize new utterances in response to prompts; they do not bring attention to the origins of the data.
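To make the robots.txt mechanism described in section 1 concrete, the sketch below shows the general shape of such a file; the product token "ExampleSearchBot" and the paths are invented for illustration, and compliance remains voluntary on the crawler's side.

    # Group for a specific crawler, selected by its product token
    User-agent: ExampleSearchBot
    Disallow: /drafts/
    Allow: /

    # Default group for any crawler without a more specific match
    User-agent: *
    Disallow: /private/

A crawler is expected to use the group whose user-agent line matches its own product token, falling back to the "*" group when nothing more specific matches.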
Indeed, in some cases the new synthesized utterances replace the work that would have been done by content creators, lowering the potential earnings of the creator rather than providing an avenue of discovery or attention-based monetization. The need for exclusion has thus also changed: it is now a need to protect the rights of content creators within a system where the content also remains available for its current uses.

3. Evolution, scaling, and change

At the time of writing, a site manager or content creator can use robots.txt to manage some crawlers with these two very different purposes, albeit at the cost of significant additional work. In order to do that within the confines of RFC 9309, each crawler's intent must be identified and the file updated to allow or deny different parts of the site based on that intent. The exact same content for which an "allow" is appropriate for search engines may now need to be marked as "deny" for AI model training crawlers (a sketch of such a file appears below). This approach relies on each crawler's intent being easy to identify, as well as on the site being constructed in a way that cleanly separates data which should be permitted from data which should be denied for this new purpose. As it always has, it also relies on the crawler voluntarily following the robots.txt standard. There are reports that a number of these crawlers, including Anthropic's ClaudeBot, do not. ClaudeBot also apparently changes source IPs within the AWS cloud IP space so often that IP-level blocking does not persist. This is a very basic breakdown in the fundamental level of cooperation that has existed between crawlers and content creators in the past.

This approach also fails when a search engine chooses to combine the crawlers for search and for AI model training. The primary example is Google, which has used its search engine data to train AI models. As a result, many sites which previously opted in to its search have also been opted in to its model training. To allow sites to distinguish between the two use cases, Google has indicated support for a new product token, "Google-Extended", through which a site can opt out of LLM training and similar uses while remaining available to search. Which uses those are is, however, subject to Google's interpretation, and Google continued to use search data for its Search Generative Experience at least for a time. The penetration of the new token is also limited; six months after its announcement, Originality.ai found that only 10% of the top thousand websites were using it [1].

These experiences indicate that there likely would be some value in creating a standard extension to robots.txt for AI model uses, but that its value would be limited. For large-scale content providers who wish to opt out completely, a single, standard approach would remove the need to track whether AI model training was within each crawler's intent. It would also give evidence of their preference when taking non-technical steps, such as sending a cease-and-desist letter. As a next step, it makes sense.

There are, however, many cases for which this "opt out completely" switch is not sufficient. The guidelines for using NASA images and media [2] provide an interesting test case, because NASA content is generally not subject to copyright in the United States. It is free for most uses, including display, the creation of simulations, and use on the Web. While that would seem to make it possible to assume blanket permission for this new use case, a quick look at the actual guidelines shows that things are not so simple.
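Before looking at the NASA guidelines in detail, the sketch below illustrates the dual-purpose configuration described at the start of this section, together with the kind of per-directory exception bookkeeping that a mixed-content site faces. The paths are invented; Googlebot, Google-Extended, GPTBot, and ClaudeBot are real product tokens, but whether a given crawler honors such rules is, as noted above, voluntary.

    # Search crawling: content remains available
    User-agent: Googlebot
    Allow: /

    # AI-training crawlers: excluded by default, with a
    # longest-match exception for content whose rights status
    # is believed to permit this use
    User-agent: Google-Extended
    User-agent: GPTBot
    User-agent: ClaudeBot
    Disallow: /
    Allow: /public-domain/

Every change in a directory's rights status requires revisiting a file like this one, which is exactly the maintenance burden the NASA guidelines illustrate.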
To see why, consider that NASA may host third-party content which is subject to copyright; the copyright holder will have agreed to making that content accessible within the NASA site, but the other rights would need to be sought from that holder. That, in turn, means blanket permission to access specific directories may be difficult to grant. NASA also forbids the use of its imagery when it is intended to imply an endorsement and requires additional steps when the imagery includes an identifiable person. Both of these imply significant exclusions or conditions as well. Because much of the site's content changes quite rapidly, requiring NASA either to separately disallow each individual exception or to re-architect the site to partition it by permitted use would be an extremely daunting requirement [3].

The NASA case is interesting in part because it is likely to be the inverse of what would be typical at other sites, where the presence of material subject to copyright might mean disallowing almost all content from crawlers while allowing a few exceptions. It highlights that using robots.txt's longest-match rules and exception lists to handle mixed content will be difficult to get right and even harder to maintain. Continuing to evolve robots.txt along these lines may give some relief, but the core issue is that the rights associated with content generally adhere to the content itself rather than to where it sits in the directory structure of a website. Extensions of this kind do not resolve that tension.

4. The steps beyond the next step

That tension also raises the question that I hope this workshop can address: what is the step beyond extending robots.txt to make blanket exclusion easier? What will let us have a richer vocabulary than "permit that directory" and "deny that content in it"?

The answer, when it comes, will only be partly technical. We need to start from an agreement on what the specific rights held in relation to a piece of content mean for different use cases. That is likely a question for legislators and the courts, though good-faith efforts by those creating models will certainly help. Those answers can drive technical work on binding the new, richer vocabulary to different types of media. That will be especially difficult for streaming media and other synthesized content, but it can be done and, with time and care, done in ways that will be easy for new content to use [4].

This brings us to a question that is among the hardest: what about content for which there are no assertions, either in robots.txt or in this new binding? Again, that is probably ultimately a question for legislators or the courts, but in this author's opinion the only sensible starting point is to assume that if permission has not been explicitly granted, it is absent. That is, ultimately, the nature of consent. Unfortunately, that stance runs counter to the business interests of almost everyone attempting to build generative AI models from public data, and we run a real risk that courts or legislatures will hold that opt-out is sufficient. That would provide cover for non-compliant crawlers if robots.txt is the only mechanism available, exactly because generative AI systems do not bring attention to the source of the content. A piece of content that was opted out at one site or in one instance may also be present elsewhere on the Web, so it will be difficult or impossible to assess whether a crawler failed to comply or had sourced the content from elsewhere, from a site with no opt-out.
The inclusion of equivalents to the cartographer's "trap streets" [5] can help to some degree, but the better long-term solution is based on an opt-in approach in which affirmative consent for the use must be given. Then, if content is shown to have been used by a generative model, the content creator can require the model's maintainer to show evidence of that consent, so that the creator can proceed against the party which manufactured that consent.

5. Conclusion

There has been a fundamental change in the nature of the relationship between content creator and crawler. While it may be possible to extend existing systems like robots.txt to manage some aspects of the new crawler behavior in the short term, we must recognize that fundamental change in order to tackle the long-term issues. Much of the work needed to tackle those issues will not be technical, as it will require a delineation of what uses may be assumed for content made available via the Web and what rights may be asserted to limit those uses. The technical work of binding those assertions to the media is secondary to that, and it will likely require fundamentally different approaches from that of the existing robots exclusion standard.

________________
[1] https://www.businessinsider.com/google-extended-ai-crawler-bot-ai-training-data-2024-3
[2] https://www.nasa.gov/nasa-brand-center/images-and-media/
[3] Note that the author worked as a contractor for NASA many years ago but has no current connection to the agency. This conclusion is therefore speculative, though it is based on the agency's reaction to the kidcode proposals (https://datatracker.ietf.org/doc/draft-borenstein-kidcode/), which would have created a similar unfunded mandate.
[4] It is technically feasible for HTML or JS now, by including Dublin Core-style metadata in undisplayed portions of the content. For other media types, the binding might require either something similar to multipart/mixed or an externalized assertion bound to a hash or content identifier. In both cases, these assertions might also be signed. Getting this right requires engineering work, but that work will fail if the assertions do not match the needs.
[5] https://en.wikipedia.org/wiki/Trap_street