Workgroup: ai-control
Internet-Draft: draft-jimenez-tbd-robotstxt-update-00
Published: November 2024
Intended Status: Informational
Expires: 10 May 2025
Author: J. Jimenez, Ericsson

Robots.txt update proposal

Abstract

This document proposes updates to the robots.txt standard to accommodate AI-specific crawlers, introducing a syntax for user-agent identification and policy differentiation. It aims to enhance the management of web content access by AI systems, distinguishing between training and inference activities.

About This Document

This note is to be removed before publishing as an RFC.

Status information for this document may be found at https://datatracker.ietf.org/doc/draft-jimenez-tbd-robotstxt-update/.

Discussion of this document takes place on the ai-control Working Group mailing list (mailto:[email protected]), which is archived at https://mailarchive.ietf.org/arch/browse/ai-control/. Subscribe at https://www.ietf.org/mailman/listinfo/ai-control/.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 10 May 2025.

Table of Contents

1.  Introduction
  1.1.  Terminology
  1.2.  User-Agent Update
  1.3.  Robots.txt Update
Acknowledgements
Normative References
Author's Address

1. Introduction

The current robots.txt standard [RFC9309] is inadequate for filtering AI crawlers because it relies on matching user-agent names and offers only a limited syntax. In particular, it cannot differentiate crawlers by the intended use of the gathered data, such as storage, indexing, training, or inference.

We submitted the following proposal to the AI-Control workshop: https://www.ietf.org/slides/slides-aicontrolws-ai-robotstxt-00.pdf. Based on further discussion, the following text may describe a solution to the problems identified at the workshop.

1.1. Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

This specification makes use of the following terminology:

Crawler:

A traditional web crawler. This also covers crawlers operated by AI companies that do not use the gathered content to train any model, LLMs or otherwise, as their purpose is purely real-time data retrieval for inference.

AI Crawler:

A specialized type of crawler employed by AI companies, which utilizes the gathered content exclusively for training purposes rather than for inference.

1.2. User-Agent Update

Crawlers are normally identified by the HTTP User-Agent request header, by the source IP address of the request, or by the reverse DNS hostname of that address.
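As an illustration, the reverse-DNS check is usually paired with a forward lookup to prevent spoofing. The following is a minimal sketch, assuming a hypothetical operator domain; the IP address and the suffix crawler.example.com are illustrative placeholders, not taken from any real operator:

  import socket

  def verify_crawler(ip: str, expected_suffix: str) -> bool:
      """Reverse-resolve the source IP, check that the hostname
      falls under the operator's domain, then forward-resolve the
      hostname and confirm it maps back to the same IP."""
      try:
          hostname = socket.gethostbyaddr(ip)[0]
          if not hostname.endswith(expected_suffix):
              return False
          return ip in socket.gethostbyname_ex(hostname)[2]
      except OSError:
          return False

  # Hypothetical usage with documentation addresses and names:
  # verify_crawler("192.0.2.1", ".crawler.example.com")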

A draft that defines a syntax for user-agent names would be necessary. The syntax has to be extensible, so that not only AI crawlers but potentially other crawlers can use it. It should not be mandatory for clients to implement, as the scheme should remain backwards compatible.

An absolutely minimal syntax would be similar to what we already see in the wild: most AI companies append the characters "-ai" to the user-agent name to indicate that the crawler ingests the content into an AI system, for example:

  User-agent: company1-ai
  User-agent: company2-ai

Alternatively, we could reuse existing identifiers such as URN namespaces (e.g., urn:rob:...), CRIs, or cryptographically derived identifiers; there are dozens of options at the IETF, so it is a matter of choosing the right one.

The -ai suffix would indicate that the crawler using it gathers content for training. In this draft we treat inference as a separate process akin to normal web crawling, and thus already covered by the existing standard.
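A minimal sketch of how a server might classify crawlers under this convention follows; the helper name is illustrative, and the check assumes the suffix convention is applied to the product token itself:

  import re

  # Matches any user-agent product token ending in "-ai",
  # marking the crawler as an AI-training agent under this proposal.
  AI_TRAINING_RE = re.compile(r"-ai$", re.IGNORECASE)

  def is_ai_training_agent(product_token: str) -> bool:
      return AI_TRAINING_RE.search(product_token) is not None

  assert is_ai_training_agent("company1-ai")
  assert not is_ai_training_agent("examplebot")  # ordinary crawler
  assert not is_ai_training_agent("aircrawler")  # "ai" is not a suffix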

This approach differs from draft-canel-robots-ai-control, as it does not require a new field in the robots.txt ABNF such as the one shown below:

  User-Agent-Purpose: EXAMPLE-PURPOSE-1

1.3. Robots.txt Update

The ABNF of [RFC9309] should be updated to accommodate the new User-Agent syntax. If we continue with the -ai convention above, regular expressions could indicate different policies for AI crawlers, for example (a matching sketch follows this list):

  • Disallow all AI training:

  User-Agent: .*?-ai$
  Disallow: /

  • Allow all images for training, but disallow training on /maps for all agents that do AI training:

  User-Agent: .*?-ai$
  Allow: /images
  Disallow: /maps*

  • Allow /local for cohere-ai:

  User-Agent: cohere-ai
  Allow: /local
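The following is a hedged sketch of how a parser might apply such regex-capable User-Agent lines. [RFC9309] currently matches product tokens literally, so regex-based group selection is an assumption of this proposal; rule precedence (the longest matching path wins, with Allow winning ties) follows [RFC9309]. The group data mirrors the examples above, and the trailing "*" in "/maps*" is dropped because path values already match as prefixes:

  import re

  # Each group: (user-agent regex, [(verb, path-prefix), ...]).
  GROUPS = [
      (r".*?-ai$", [("Disallow", "/")]),
      (r".*?-ai$", [("Allow", "/images"), ("Disallow", "/maps")]),
      (r"cohere-ai", [("Allow", "/local")]),
  ]

  def allowed(agent: str, path: str) -> bool:
      """Return True if `agent` may fetch `path` under GROUPS.
      The longest matching path rule wins and Allow wins ties;
      if no rule matches, access is allowed."""
      best_len, best_verb = -1, "Allow"
      for pattern, rules in GROUPS:
          if not re.fullmatch(pattern, agent, re.IGNORECASE):
              continue
          for verb, prefix in rules:
              if not path.startswith(prefix):
                  continue
              if len(prefix) > best_len:
                  best_len, best_verb = len(prefix), verb
              elif len(prefix) == best_len and verb == "Allow":
                  best_verb = verb
      return best_verb == "Allow"

  assert not allowed("company1-ai", "/maps/street")  # training denied
  assert allowed("company1-ai", "/images/logo.png")  # images allowed
  assert allowed("examplebot", "/maps/street")       # normal crawler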

This proposal also differs from the new control rules DisallowAITraining and AllowAITraining proposed by draft-canel-robots-ai-control. From a semantic perspective, it is problematic to create purpose-specific rules such as DisallowThisProperty and DisallowAnotherProperty that have the same meaning and effect as the existing verbs Disallow and Allow.

In our proposal, the information about the agent's purpose is carried in the User-Agent name itself, which makes it possible to filter out AI-training agents using a simple regular expression and the existing semantics.

Acknowledgements

The author would like to thank Jari Arkko for his review and feedback on short notice.

Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/rfc/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/rfc/rfc8174>.
[RFC9309]
Koster, M., Illyes, G., Zeller, H., and L. Sassman, "Robots Exclusion Protocol", RFC 9309, DOI 10.17487/RFC9309, September 2022, <https://www.rfc-editor.org/rfc/rfc9309>.

Author's Address

Jaime Jimenez
Ericsson