Internet-Draft | UTF8=ACCEPT | September 2024 |
Gulbrandsen & Yao | Expires 22 March 2025 | [Page] |
This document specifies rules for email addresses that are flexible enough to express the addresses typically used with SMTPUTF8, while avoiding confusing or risky elements.¶
This is one of a pair of documents: this contains recommendations for what addresses humans should use, such address provisioning sytems can restrain themselves to addresses that email valdidators accept. (This set can also be described in other ways, including "simple to cut-and-paste".) Its companion defines simpler rules, accepts more addresses, and is intended for software like MTAs.¶
NOTE: The term 'nice' is not ideal here, and must be reconsidered or replaced before this is issued as an RFC. "Well-formed" is a candidate. "Nice" will do for now: better to argue about substance than wording.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 22 March 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
[RFC6530]-[RFC6533] and [RFC6854]-[RFC6858] extend various aspects of the email system to support non-ASCII both in localparts and domain parts. In addition, some email software supports unicode in domain parts by using encoded domain parts in the SMTP transaction ("RCPT TO:[email protected]") and presenting the unicode version (dømi.fo in this case) in the user interface.¶
The email address syntax extension is in [RFC6532], and allows almost all UTF8 strings as localparts. While this certainly allows everything users want to use, it is also flexible enought to allow many things that users and implementers find surprising and sometimes worrying.¶
The flexibility has caused considerable reluctance to support the full syntax in contexts such as web form address validation.¶
This document attempts to describe rules that:¶
includes the addresses that users generally want to use for themselves and organizations want to provision for their employees.¶
includes the addresses that have been used significantly, even if not exactly what users wanted.¶
excludes confusable things.¶
excludes things that have been described as security risks.¶
Looks safe at first glance to implementers (including ones with little unicode expertise).¶
These goals are somewhat aspirational. For example, lower-case L and upper-case i are confusable and cannot realistically be disallowed, the Protocol PoLIce would arrest us all.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Script, in this document, refers to the unicode script property (see [UAX24]). Each code point is assigned to one script ("a" is Latin), except that some are assigned to "Common" or a few other special values. Fraktur and /etc/rc.local aren't scripts in this document, but Latin is.¶
Latin refers those code points that have the script property "Latin" in Unicode. Orléans in France and Münster in Germany both have Latin names in this document. It also refers to combinations of those code points and combining characters, and to strings that contain no code points from other scripts.¶
Han, Cyrillic etc. refer to those code points that have the respective script property in Unicode, as well as to strings that contain no code points from other scripts.¶
ASCII refers to the first 128 code points within unicode, which includes the letters A-Z but not É or Ü. It also refers to strings that contain only ASCII code points.¶
Non-ASCII refers to unicode code points except the first 128, and also to strings that contain at least one such code point.¶
By way of example, the address info@dømi.fo is latin and non-ASCII, its localpart is latin and ASCII, and its domain part is latin and non-ASCII. 中国 is a Han string in this document, but 阿Q正传 is neither a Latin string nor a Han string, because it contains a Latin Q and three Han code points.¶
This section is informative.¶
In the countries the author has visited, the norm is that some users and organizations want to use their daily script(s), but not others. An Indian student in Japan or an Indian immigrant in an Arabic country will generally have his/her name spelt with either Latin letters or the host country's script, because that's how the university or company works.¶
While the syntax defined in [RFC6532] technically supports addresses with an Indic localpart and a Japanese or Arabic domain part, organizations such as a university or a company don't want to use that kind of naming: The script that's used for the organization's domain is the one it wants to use when provisioning email addresses.¶
Note that even when an organization emphatically doesn't want to provision localparts from scripts other than its own, that organization generally wants the ability to correspond worldwide.¶
There are a few countries in which ASCII localparts are used with non-ASCII domains (reputedly because of email software that supports [RFC3490]/[RFC5891] but not [RFC6532]). So far, the author has only seen this in countries that use left-to-right scripts such as Cyrillic and Han.¶
In some countries, ASCII non-letters is used together with non-Latin scripts. Arabic is an example: the Arabic digits 0-9 are used often (more so in some countries than others). Chinese domain names use the ASCII dot instead of the Chinese full-width dot. ASCII punctuation (notably the hyphen) is used with several scripts.¶
In the author's experience, left-to-right and right-to-left writing is mixed in only a few cases (that are relevant to email addresses):¶
This section is informative.¶
The use of non-ASCII in domain names is restricted by several factors: IDNA2008 rules, registry rules and web browsers.¶
For a domain such as example.com, IDNA2008 restricts both labels (slightly), ICANN policy restricts "com" and the .com registry's policy restricts "example", and the browsers' rules apply to both. For a domain such as e1.example.com, IDNA2008 and the browsers' rules apply to "e1" as well. Neither ICANN's nor the .com registry's rules apply to "e1".¶
Using non-ASCII more or less requires IDNA2008, ie. if a the domain name is to be practically usable, it needs to be usable with IDNA2008, which restricts the set of code points slightly.¶
The LGR, Label Generation Rules, are a set of rules largely developed by per-language and per-script committees and collected by ICANN. Most notably, a script or language has to be used by a current community in order to get an LGR. An LGR contains that which is currently used by the community.¶
Registering a domain name in a public TLD requires adhering to the registry's policy, which is generally either an LGR, a small registry-chosen set of code points, or restricted by rules outside the DNS (such as for .bank, which is effectively restricted by what names banking regulators accept).¶
The merged repertoire of all code points in all LGRs is called the Common LGR, and contains more than 30,000 code points at the time of writing.¶
The main browsers have rules for which domains are displayed in a human-readable manner in the address bar. (One author has a runic domain, all of the main browsers display xn--something in the address bar.) Each of the three big browser vendors maintains its own rule set. At the time of writing, all three may be described as broadly similar to the Common LGR and different in detail.¶
Based on the above descriptions, the following rules are formulated for nice email addresses.¶
A nice address MUST NOT contain an a-label (e.g. xn--dmi-0na).¶
If the address matches the grammar in [RFC5322], then in order to be nice it MUST also match the grammar specified by the W3C WHATWG [TYPE_EMAIL] spec.¶
A nice address MUST contain code points within the PRECIS IdentifierClass.¶
A nice address MUST consist entirely of a sequence of composite characters, ZWJ and ZWNJ. ("c" followed by "combining hook below" is an example of a composite character, "d" is another example; see [RFC6365] for the definition.)¶
If an address contains any right-to-left code points, then it MAY contain ASCII digits and MUST NOT contain any other left-to-right code points.¶
If an address contains any non-ASCII code point, then one of the following conditions MUST apply:¶
6.1. All code points share one unicode script property, or have the script property Common, or are ASCII digits (or ./@).¶
6.2. The localpart consists entirely of ASCII, and the domain part consists of code points that share one unicode script property, or have the script property Common, or are ASCII digits (or ./@).¶
6.3. All code points are have the script properties Han or Common, or are ASCII.¶
The address [email protected] is nice, because 1) it does not contain any a-label, 2) it matches the WHATWG [TYPE_EMAIL] spec, 3) it consists entirely of permissible code points, 4) it consists of 19 composite characters, and the last two conditions do not apply.¶
The address dømi@dømi.fo is nice, because 1) it does not contain any a-label, 2) does not apply, 3) it consists entirely of permissible code points, 4) it consists of 12 composite characters, 5) does not apply and 6) it consists entirely of 'Latin' and 'Common' code points (and ./@).¶
The address U+200E '@' U+200F '.' U+200E is not nice, because 4) U+200E and U+200F are not parts of composite characters.¶
The address U+627 '1' '@' U+627 '2' '.' U+627 '3' is nice. (U+627 is an arabic letter, written right-to-left.) It is nice because 1) it does not contain any a-label, 2) does not apply, 3) it consists entirely of permissible code points, 4) it consists of 8 composite characters, 5) the only right-to-left code points used are ASCII digits and 6) all code points are 'Arabic' or ASCII digits (or @/).¶
(Note that it's nice even though top-level domain used does not exist, because '3' is not permitted by the current top-level domain name rules. Checking domain existence is simple if one assumes that internet access is available and the address is valid at the time, but this document does not assume assume either of those.)¶
The address [email protected] is not nice, because 1) it contains the a-label xn--dmi-0na.¶
The address 名字@例子.中国 is nice, because 1) it does not contain any a-label, 2) does not apply, 3) it consists entirely of permissible code points, 4) it consists of 8 composite characters, 5) does not apply and 6) it consists entirely of 'Han' code points (and ./@).¶
The address info@例子.中国 is nice, because 1) it does not contain any a-label, 2) does not apply, 3) it consists entirely of permissible code points, 4) it consists of 8 composite characters, 5) does not apply and 6) the localpart is ASCII and the domain part consists entirely of 'Han' code points (and .).¶
The address dømi@例子.中国 is not nice, because 6) it contains both 'Latin' and 'Han' code points and the localpart is non-ASCII.¶
The address 阿Q@阿Q正传@.中国 is nice, because 1) it does not contain any a-label, 2) does not apply, 3) it consists entirely of permissible code points, 4) it consists of composite characters, 5) does not apply and 6) the consists entirely of ASCII and Han code points (and .).¶
This document does not require any actions from the IANA.¶
When a program renders a unicode string on-screen or audibly and includes a substring supplied by a potentially malevolent source, the included substring can affect the rendering of a surprisingly large part of the overall string.¶
This document describes rules that make it difficult for an attacker to use email addresses for such an attack. Implementers should be aware of other possible vectors for the same kind of attack, such as subject fields and email address display-names.¶
If an address is signed using DKIM and (against the rules of this document) mixes left-to-right and right-to-left writing, parts of both the localpart and the domain part can be rendered on the same side of the '@'. This can create the appearance that a different domain signed the message.¶
The rules in this document permit a number of code points that¶
The authors wish to thank John C. Klensin, [your name here, please] [oh wow, the ack section is already outdated]¶
Dømi.fo and 例子.中国 are reserved by nic.fo and CNNIC for use in examples and documentation.¶
阿Q正传@ is a famous Chinese novella, 阿Q is the main character.¶
This section is informative. Each of the six conditions has a separate rationale.¶
A-labels are confusing for many readers, and can potentially be used to confuse and attack readers. Being visibly safe is one of the five goals.¶
Many existing address validators use the WHATWG rules; if this specification is exactly compatible with WHATWG [TYPE_EMAIL] for all the addresses that WHATWG covers, then it's possible to extend a WHATWG-compliant validator without risk of accidentally rejecting formerly accepted addresses.¶
The PRECIS IdentifierClass is one of several similar sets, UAX31 and the Common LGR being others. I'm quite uncertain which is most appropriate, there are arguments in favour of each.¶
Unicode contains many code points that could perhaps be used for attacks. Whether they could be used for attacks is not important, since one of the goals is to be safe at first glance even to implementers with limited knowledge of unicode. By constraining the repertoire to the plainest code points, the specification gains safety at first glance.¶
Mixed-direction text can be confusing, and confusion has been used to attack users before. This rule tries to gain safety at first glance by constraining mixed-direction text in addresses to that which is known to be necessary.¶
This rule permits addresse that are e.g. all-Chinese or all-Thai, and rejects addresses that mix e.g. Thai and Chinese. This restricts the scope for visually confusable code points. Since some communities are known to mix ASCII localparts with IDNs, combining left-to-right text with ASCII is allowed, at least in the way that's currently used.¶
Note that metal umlauts ("Mötley Crüe") are allowed (see [UMLAUT]). This is an unintentional feature.¶
Please remove all mentions of the Protocol Police before publication (including this sentence).¶
Please remove the Open Issues section.¶
The use of the Common MSR is not ideal. It does allow a fairly good similarity between what's allowed in a domain name and in localparts, and it is a set of codepoints that includes most of what should be included and excludes most of what should be excluded. However, it was designed for the DNS, not for localparts. Ten thousand working hours have gone into it, none of the 10k into considering whether the use for localparts has particular problems.¶
The term "nice" is vague and inappropriate.¶
The relationship with RFC 8265 needs to be explained somewhere. A starting point: This document tries to guard against confusing addresses, in the sense that they confuse humans. Computers can also be confused; this document relies on RFC 8265 to make two confusable addresses practically equal. If two addresses look confusable, but test as identical according to 8264/5, then the confusability shouldn't be a problem.¶
Metal umlauts might be a problem. Accents are used sometimes with non-latin, but ver seldom and might be seen as surprising to native users of e.g. cyrillic, even if Åквариум exists. https://krebsonsecurity.com/2022/11/disneyland-malware-team-its-a-puny-world-after-all/ is worth considering. Not entirely clear how to subdivide the Common script so it can be used with some scripts, not with others.¶
5.¶