Workgroup:
BESS Workgroup
Internet-Draft:
draft-rabnag-bess-evpn-anycast-aliasing-03
Published:
November 2024
Intended Status:
Standards Track
Expires:
17 May 2025
Authors:
J. Rabadan, Ed.
Nokia
K. Nagaraj
Nokia
A. Nichol
Arista
A. Sajassi
Cisco Systems
W. Lin
Juniper Networks
J. Tantsura
Nvidia

EVPN Anycast Multi-Homing

Abstract

The current Ethernet Virtual Private Network (EVPN) all-active multi-homing procedures in Network Virtualization Over Layer-3 (NVO3) networks provide the Split Horizon filtering, Designated Forwarder Election and Aliasing functions that the network needs to handle traffic to and from a multi-homed CE efficiently. In particular, the Aliasing function addresses the load balancing of unicast packets from remote Network Virtualization Edge (NVE) devices to the NVEs that are multi-homed to the same CE, irrespective of whether the CE's MAC/IP information has been learned on those NVEs. This document describes an optional optimization of the EVPN multi-homing Aliasing function - EVPN Anycast Multi-homing - that is specific to the use of EVPN with NVO3 tunnels (i.e., IP tunnels) and, in typical Data Center designs, may provide some benefits that are discussed in the document.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 17 May 2025.

1. Introduction

Ethernet Virtual Private Network (EVPN) is the de facto standard control plane in Network Virtualization Over Layer-3 (NVO3) networks deployed in multi-tenant Data Centers [RFC8365][RFC9469]. EVPN provides Network Virtualization Edge (NVE) auto-discovery, tenant MAC/IP dissemination and advanced features required by NVO3 networks, such as all-active multi-homing. The current EVPN all-active multi-homing procedures in NVO3 networks provide the Split Horizon filtering, Designated Forwarder Election and Aliasing functions that the network needs to handle traffic to and from a multi-homed CE efficiently. In particular, the Aliasing function addresses the load balancing of unicast packets from remote NVEs to the NVEs that are multi-homed to the same CE, irrespective of whether the CE's MAC/IP information has been learned on those NVEs. This document describes an optional optimization of the EVPN multi-homing Aliasing function - EVPN Anycast Multi-homing - that is specific to the use of EVPN with NVO3 tunnels (i.e., IP tunnels) and, in typical Data Center designs, may provide savings in data plane and control plane resources in the routers, as well as lower convergence times in case of failures.

1.1. Terminology and Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

  • A-D per EVI route: EVPN route type 1, Auto-Discovery per EVPN Instance route. Route used for aliasing or backup signaling in EVPN multi-homing procedures [RFC7432].

  • A-D per ES route: EVPN route type 1, Auto-Discovery per Ethernet Segment route. Route used for mass withdraw in EVPN multi-homing procedures [RFC7432].

  • BUM traffic: Broadcast, Unknown unicast and Multicast traffic.

  • CE: Customer Edge, e.g., a host, router, or switch.

  • Clos: a multistage network topology described in [CLOS1953], where all the edge nodes (or Leaf routers) are connected to all the core nodes (or Spines). Typically used in Data Centers.

  • ECMP: Equal Cost Multi-Path.

  • ES: Ethernet Segment. When a Tenant System (TS) is connected to one or more NVEs via a set of Ethernet links, then that set of links is referred to as an 'Ethernet segment'. Each ES is represented by a unique Ethernet Segment Identifier (ESI) in the NVO3 network and the ESI is used in EVPN routes that are specific to that ES.

  • EVI: EVPN Instance. A Layer-2 Virtual Network that uses an EVPN control plane to exchange reachability information among the member NVEs. It corresponds to a set of MAC-VRFs of the same tenant. See MAC-VRF in this section.

  • GENEVE: Generic Network Virtualization Encapsulation, an NVO3 encapsulation defined in [RFC8926].

  • IP-VRF: an IP Virtual Routing and Forwarding table, as defined in [RFC4364]. It stores IP Prefixes that are part of the tenant's IP space, and are distributed among NVEs of the same tenant by EVPN. Route Distinguisher (RD) and Route Target(s) (RTs) are required properties of an IP-VRF. An IP-VRF is instantiated in an NVE for a given tenant, if the NVE is attached to multiple subnets of the tenant and local inter-subnet-forwarding is required across those subnets.

  • IRB: Integrated Routing and Bridging interface. It refers to the logical interface that connects a Broadcast Domain instance (or a BT) to an IP-VRF and allows packets to be forwarded to destinations in a different subnet.

  • MAC-VRF: a MAC Virtual Routing and Forwarding table, as defined in [RFC7432]. The instantiation of an EVI (EVPN Instance) in an NVE. Route Distinguisher (RD) and Route Target(s) (RTs) are required properties of a MAC-VRF and they are normally different from the ones defined in the associated IP-VRF (if the MAC-VRF has an IRB interface).

  • MPLS and non-MPLS NVO3 tunnels: Network Virtualization Overlay tunnels that do or do not use Multi-Protocol Label Switching. Network Virtualization Overlay tunnels use an IP encapsulation for overlay frames, where the source IP address identifies the ingress NVE and the destination IP address identifies the egress NVE.

  • NLRI: BGP Network Layer Reachability Information.

  • NVE: Network Virtualization Edge device, a network entity that sits at the edge of an underlay network and implements Layer-2 and/or Layer-3 network virtualization functions. The network-facing side of the NVE uses the underlying Layer-3 network to tunnel tenant frames to and from other NVEs. The tenant-facing side of the NVE sends and receives Ethernet frames to and from individual Tenant Systems. In this document, an NVE could be implemented as a virtual switch within a hypervisor, a switch or a router, and runs EVPN in the control-plane. This document uses the terms NVE and "Leaf router" interchangeably.

  • NVO3 tunnels: Network Virtualization Over Layer-3 tunnels. In this document, NVO3 tunnels refer to a way to encapsulate tenant frames or packets into IP packets whose IP Source Addresses (SA) or Destination Addresses (DA) belong to the underlay IP address space, and identify NVEs connected to the same underlay network. Examples of NVO3 tunnel encapsulations are VXLAN [RFC7348], GENEVE [RFC8926] or MPLSoUDP [RFC7510].

  • SRv6: Segment Routing with an IPv6 data plane [RFC8986].

  • TS: Tenant System. A physical or virtual system that can play the role of a host or a forwarding element such as a router, switch, firewall, etc. It belongs to a single tenant and connects to one or more Broadcast Domains of that tenant.

  • VNI: Virtual Network Identifier. Irrespective of the NVO3 encapsulation, the tunnel header always includes a VNI that is added at the ingress NVE (based on a mapping table lookup) and identifies the BT at the egress NVE. This identifier is called VNI in VXLAN or GENEVE, VSID in NVGRE, and Label in MPLSoGRE or MPLSoUDP. This document refers to VNI as a generic Virtual Network Identifier for any NVO3 encapsulation.

  • VTEP: VXLAN Termination End Point. A loopback IP address of the destination NVE that is used in the outer destination IP address of VXLAN packets directed to that NVE.

  • VXLAN: Virtual eXtensible Local Area Network, an NVO3 encapsulation defined in [RFC7348].

1.2. Problem Statement

Figure 1 depicts the typical Clos topology in multi-tenant Data Centers, simplified to show only three Leaf routers and two Spines forming a 3-stage Clos topology. The NVEs or Leaf routers run EVPN for NVO3 tunnels, as in [RFC8365]. We assume VXLAN is used as the NVO3 tunnel, given that VXLAN is highly prevalent in multi-tenant Data Centers. This diagram is used as a reference throughout this document. Note that in very large scale Data Centers, the number of Tenant Systems, Leaf routers and Spines (in multiple layers) may be significantly higher than what is illustrated in Figure 1.

          +-------+   +-------+
          |Spine-1|   |Spine-2|
          |       |   |       |
          +-------+   +-------+
           |  |  |     |  |  |
       +---+  |  |     |  |  +---+
       |      |  |     |  |      |
       |  +------------+  |      |
       |  |   |  |        |      |
       |  |   |  +------------+  |
       |  |   |           |   |  |
       |  |   +---+  +----+   |  |
   L1  |  |    L2 |  |     L3 |  |
    +-------+   +-------+   +-------+
    | +---+ |   | +---+ |   | +---+ |
    | |BD1| |   | |BD1| |   | |BD1| |
    | +---+ |   | +---+ |   | +---+ |
    +-------+   +-------+   +-------+
       | |         | |          |
       | +---+ +---+ |          |
       |     | |     |          |
       |    +---+    |        +---+
       |    |TS1|    |        |TS3|
       |    +---+    |        +---+
       |    ES-1     |
       +-----+ +-----+
             | |
            +---+
            |TS2|
            +---+
            ES-2

Figure 1: Simplified Clos topology in Data Centers

In the example of Figure 1, the Tenant Systems TS1 and TS2 are multi-homed to Leaf routers L1 and L2, and Ethernet Segment Identifiers ESI-1 and ESI-2 represent the TS1 and TS2 Ethernet Segments in the EVPN control plane for the Split Horizon filtering, Designated Forwarder and Aliasing functions [RFC8365].

Taking Tenant Systems TS1 and TS3 as an example, the EVPN all-active multi-homing procedures guarantee that, when TS3 sends unicast traffic to TS1, Leaf L3 does per-flow load balancing towards Leaf routers L1 and L2. As explained in [RFC7432] and [RFC8365], this is possible because Leaf routers L1 and/or L2 advertise TS1's MAC address in an EVPN MAC/IP Advertisement route that includes ESI-1 in the Ethernet Segment Identifier field. When the route is imported in Leaf L3, TS1's MAC address is programmed with a destination associated to the ESI-1 next hop list. This ESI-1 next hop list is created based on the reception of the EVPN A-D per ES and A-D per EVI routes for ESI-1 received from Leaf routers L1 and L2. Assuming the Ethernet Segment ES-1 links are operationally active, Leaf routers L1 and L2 advertise the EVPN A-D per ES/EVI routes for ESI-1 and Leaf L3 adds L1 and L2 to its next hop list for ESI-1. Unicast flows from TS3 to TS1 are therefore load balanced to Leaf routers L1 and L2, and L3's ESI-1 next hop list is what we refer to as the "overlay ECMP-set" for ESI-1 in Leaf L3. In addition, once Leaf L3 selects one of the next hops in the overlay ECMP-set, e.g., L1, Leaf L3 does a route lookup of the L1 address in the Base router route table. The lookup yields a list of two next hops, Spine-1 and Spine-2, which we refer to as the "underlay ECMP-set". Therefore, for a given unicast flow to TS1, Leaf L3 does per-flow load balancing at two levels: a next hop in the overlay ECMP-set is selected first, e.g., L1, and then a next hop in the underlay ECMP-set is selected, e.g., Spine-1.
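The two-level selection can be sketched in a few lines of code. The following Python fragment is an illustration only, not part of the procedures: the state, flow keys, next-hop names and hash choice are assumptions made for the example.

   import hashlib

   # Hypothetical state on Leaf L3, following the example above.
   overlay_ecmp_set = {"ESI-1": ["L1", "L2"]}       # from A-D per ES/EVI routes
   underlay_ecmp_set = {"L1": ["Spine-1", "Spine-2"],   # from the Base router
                        "L2": ["Spine-1", "Spine-2"]}   # route table

   def pick(flow_key: str, choices: list) -> str:
       """Per-flow selection: stable for a given flow key."""
       digest = hashlib.sha256(flow_key.encode()).digest()
       return choices[digest[0] % len(choices)]

   def forward(esi: str, flow_key: str):
       # First level: select the egress VTEP in the overlay ECMP-set.
       vtep = pick(flow_key, overlay_ecmp_set[esi])
       # Second level: select the underlay next hop towards that VTEP.
       next_hop = pick(flow_key + vtep, underlay_ecmp_set[vtep])
       return vtep, next_hop

   print(forward("ESI-1", "TS3->TS1"))   # e.g., ('L1', 'Spine-1')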

While aliasing [RFC7432] provides an efficient method to load balance unicast traffic to the Leaf routers attached to the same all-active Ethernet Segment, there are some challenges in very large Data Centers where the number of Ethernet Segments and Leaf routers is significant:

  1. Control Plane Scale: In a large Data Center environment, the number of multi-homed compute nodes can grow into the thousands, where each compute node requires a unique ES and hosts tens of EVIs per ES. In the aliasing model defined in [RFC7432], there is a requirement to advertise EVPN A-D per EVI routes for each active EVI on each Ethernet Segment. The resultant EVPN state that Route Reflectors, Data Center Gateways and Leaf routers need to process becomes significant and only grows as Ethernet Segments, Broadcast Domains and Leaf routers are added. Removing the need to advertise the EVPN A-D per EVI routes would therefore offer a considerable advantage in overall route scale and processing overhead.

  2. Convergence and Processing overhead: In accordance with [RFC8365], each node of an Ethernet Segment acts as an independent VTEP and therefore an independent EVPN next hop. In a typical Data Center leaf-spine topology this results in ECMP being performed in both the underlay and the overlay ECMP-sets. Consequently, convergence at scale during a failure can be slow and CPU intensive, as all Leaf routers are required to process the overlay state change caused by the EVPN route(s) being withdrawn at the point of failure and update their overlay ECMP-set accordingly. Performing the load balancing with just the underlay ECMP-set offers the potential to dramatically reduce this network-wide state churn and processing overhead, while providing faster convergence at scale by limiting the scope of the re-convergence to just the intermediate Spine nodes.

  3. Inefficient underlay forwarding during a failure: a further consequence of ECMP using the overlay ECMP-set is the potential for in-flight packets sent by remote Leaf routers to be rerouted in an inefficient way. As an example, suppose the link L1-to-Spine-1 in Figure 1 fails. In-flight VXLAN packets already sent from L3 with destination VTEP L1 arrive at Spine-1 and are rerouted via, e.g., L2->Spine-2->L1->TS1, while they could go directly via L2->TS1, since TS1 is also connected to Leaf L2. After the underlay routing protocol converges, all VXLAN packets with destination VTEP L1 are correctly sent to Spine-2 and Leaf L3 removes Spine-1 from the underlay ECMP-set for Leaf L1.

There are existing proprietary multi-chassis Link Aggregation Group implementations, collectively known as MC-LAG, that attempt to work around the above challenges by using the concept of "Anycast VTEPs", i.e., a shared loopback IP address that the Leaf routers attached to the same multi-homed Tenant System can use to terminate VXLAN packets. As an example, in Figure 1, if Leaf routers L1 and L2 used an Anycast VTEP address "anycast-IP1" to terminate VXLAN packets destined to multi-homed Tenant Systems:

  • Leaf L3 would not need to create an overlay ECMP-set for packets to TS1 or TS2, since the use of anycast-IP1 in the underlay ECMP-set would guarantee the per-flow load balancing to the two Leaf routers.

  • In the same failure example as above for link L1-to-Spine-1 failure, Spine-1 would reroute VXLAN packets directly to Leaf L2, since L2 also advertises the anycast-IP1 address that is used from Leaf L3 to send packets to TS1 or TS2.

  • In addition, if Leaf routers L1 and L2 used proprietary MC-LAG techniques, no EVPN A-D per EVI routes would be needed, hence the number of EVPN routes would be significantly decreased in a large scale Data Center.

However, the use of proprietary MC-LAG technologies in EVPN NVO3 networks is being abandoned due to the superior functionality of EVPN Multi-Homing, including mass withdraw [RFC7432], advanced Designated Forwarder election [RFC8584] and weighted load balancing [I-D.ietf-bess-evpn-unequal-lb], to name a few features.

1.3. Solution Overview

This document specifies an EVPN Anycast Multi-homing extension that can be used as an alternative to EVPN Aliasing [RFC7432]. The EVPN Anycast Multi-homing procedures described in this document may optionally replace the per-flow overlay ECMP load balancing with a simplified per-flow underlay ECMP load balancing, in a similar way to proprietary MC-LAG solutions, but in a standard way that keeps the advantages of EVPN Multi-Homing, such as the Designated Forwarder Election, Split Horizon filtering and the mass withdraw function (all of them described in [RFC8365] and [RFC7432]). The solution uses the A-D per ES routes to advertise the Anycast VTEP address to be used when sending traffic to the Ethernet Segment, and suppresses the use of A-D per EVI routes for the Ethernet Segments configured in this mode. This solution addresses the challenges outlined in Section 1.2.

The solution is valid for all NVO3 tunnels, or even for IP tunnels in general. Sometimes the description uses VXLAN as an example, given that VXLAN is highly prevalent in multi-tenant Data Centers. However, the examples and procedures are valid for any NVO3 or IP tunnel type.

2. BGP EVPN Extensions

This specification makes use of two BGP extensions that are used along with the A-D per ES routes [RFC7432].

The first extension is the "A" or "Anycast Multi-homing mode" flag, for which IANA is requested to allocate bit 2 of the EVPN ESI Multihoming Attributes registry for the 1-octet Flags field in the ESI Label Extended Community, as follows:

   0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Type=0x06     | Sub-Type=0x01 | Flags(1 octet)|  Reserved=0   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Reserved=0   |          ESI Label                            |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Flags field:

        0 1 2 3 4 5 6 7
       +-+-+-+-+-+-+-+-+
       |SHT|A|     |RED|
       +-+-+-+-+-+-+-+-+

Figure 2: ESI Label Extended Community and Flags

Where the following Flags are defined:

Table 1: Flags Field

   Name   Meaning                      Reference
   RED    Multihomed redundancy mode   [I-D.ietf-bess-rfc7432bis]
   SHT    Split Horizon type           [I-D.ietf-bess-evpn-mh-split-horizon]
   A      Anycast Multi-homing mode    This document

When the NVE advertises an A-D per ES route with the A flag set, it indicates that the Ethernet Segment is working in Anycast Multi-homing mode. The A flag is set only if RED = 00 (All-Active redundancy mode), and MUST NOT be set if RED is different from 00.
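As an illustration only, the following sketch packs and validates the Flags octet of Figure 2. It assumes the usual IETF bit numbering, in which bit 0 is the most significant bit of the octet, so SHT maps to mask 0xC0, the A flag to mask 0x20 and RED to mask 0x03; the function names are hypothetical and not part of this specification.

   SHT_MASK = 0xC0   # bits 0-1: Split Horizon type
   A_FLAG   = 0x20   # bit 2:    Anycast Multi-homing mode (this document)
   RED_MASK = 0x03   # bits 6-7: Multihomed redundancy mode (00 = All-Active)

   def build_flags(sht: int, anycast: bool, red: int) -> int:
       """Pack the 1-octet Flags field of the ESI Label Extended Community."""
       if anycast and red != 0b00:
           # The A flag MUST NOT be set unless RED = 00 (All-Active).
           raise ValueError("A flag requires All-Active redundancy mode")
       return ((sht & 0x03) << 6) | (A_FLAG if anycast else 0) | (red & 0x03)

   def is_anycast_mode(flags: int) -> bool:
       """True if the A-D per ES route signals Anycast Multi-homing mode."""
       return bool(flags & A_FLAG) and (flags & RED_MASK) == 0b00

   assert is_anycast_mode(build_flags(sht=0b00, anycast=True, red=0b00))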

The second extension that this document introduces is the encoding of the "Anycast VTEP" address in the BGP Tunnel Encapsulation Attribute, Tunnel Egress Endpoint Sub-TLV (code point 6) [RFC9012], that is advertised along with the A-D per ES routes for an Ethernet Segment working in Anycast Multi-homing mode. Refer to [RFC9012] for the error handling procedures related to the BGP Tunnel Encapsulation Attribute. For NVO3 tunnel types (e.g., VXLAN, GENEVE), the "Anycast VTEP" MUST be encoded in the BGP Tunnel Encapsulation Attribute and advertised with the A-D per ES routes. However, when using SRv6 encapsulation, the BGP Tunnel Encapsulation Attribute is not applicable. Refer to Section 6 for details about SRv6.

3. Anycast Multi-Homing Solution

This document proposes an optional "EVPN Anycast Multi-homing" procedure that optimizes the behavior in case the challenges described in Section 1.2 become a problem. The description makes use of the terms "Ingress NVE" and "Egress NVE". In this document, Egress NVE refers to an NVE that is attached to a group of Ethernet Segments working in Anycast Multi-homing mode, whereas Ingress NVE refers to the NVE that transmits unicast traffic to a MAC address that is associated to a remote Ethernet Segment working in Anycast Multi-homing mode. In addition, the concepts of Unicast VTEP and Anycast VTEP are used. A Unicast VTEP is a loopback IP address that is unique in the Data Center fabric and is owned by a single NVE terminating VXLAN (or NVO3) traffic. An Anycast VTEP is a loopback IP address that is shared among the NVEs attached to the same group of Ethernet Segments in Anycast Multi-homing mode and is used to terminate VXLAN (or NVO3) traffic on those NVEs. An Anycast VTEP in this document MUST NOT be used as the BGP next hop of any EVPN route NLRI, due to the need for the Multi-Homing procedures to uniquely identify the originator of the EVPN routes via their NLRI next hops.

The solution consists of the following modifications of the [RFC7432] EVPN Aliasing function:

  1. The [RFC8365] Designated Forwarder and Split Horizon filtering procedures remain unmodified. Only the Aliasing procedure is modified by the Anycast Multi-homing mode.

  2. The forwarding of BUM traffic and related procedures are not modified by this document. Only the procedures related to the forwarding of unicast traffic to a remote Ethernet Segment are modified.

  3. Any two or more Egress NVEs attached to the same Ethernet Segment working in Anycast Multi-homing mode MUST use the same VNI or label to identify the Broadcast Domain that makes use of the Ethernet Segment. For non-MPLS NVO3 tunnels, using the same VNI is implicit if global VNIs are used ([RFC8365] section 5.1.1). If locally significant values are used for the VNIs, at least all the Egress NVEs sharing Ethernet Segments MUST use the same VNI for the Broadcast Domain. For MPLS NVO3 tunnels, the Egress NVEs sharing Anycast Multi-homing Ethernet Segments MUST use Domain-wide Common Block labels [RFC9573] so that all can be configured with the same unicast label for the same Broadcast Domain. Note that this rule only affects unicast labels or the labels advertised with the EVPN MAC/IP Advertisement routes and not the Ingress Replication labels for BUM traffic advertised in the EVPN Inclusive Multicast Ethernet Tag routes.

  4. The default behavior for an Egress NVE attached to an Ethernet Segment follows [RFC8365]. The Anycast Multi-homing mode MUST be explicitly configured for a given all-active Ethernet Segment. When configured to follow the Anycast Multi-homing behavior for at least one Ethernet Segment, the Egress NVE:

    1. Is configured with an Anycast VTEP address. A single Anycast VTEP address is allocated for all the Anycast Multi-homing Ethernet Segments shared among the same group of Egress NVEs. That is the only additional address for which reachability needs to be announced in the underlay routing protocol. If "m" Egress NVEs are attached to the same "n" Ethernet Segments, all the "m" Egress NVEs advertise the same Anycast VTEP address in the A-D per ES routes for the "n" Ethernet Segments.

    2. Advertises reachability for the Anycast VTEP in the underlay routing protocol, via an exact match route for the Anycast VTEP (mask /32 for IPv4 and /128 for IPv6) or a prefix of shorter length that covers the Anycast VTEP IP address.

    3. Advertises EVPN A-D per ES routes for each Ethernet Segment with:

      • an "Anycast Multi-homing mode" flag that indicates to the remote NVEs that the EVPN MAC/IP Advertisement routes with matching Ethernet Segment Identifier are resolved by only A-D per ES routes for the Ethernet Segment. In other words, this flag indicates to the ingress NVE that no A-D per EVI routes are advertised for the Ethernet Segment.

      • an Anycast VTEP that identifies the Ethernet Segment and is encoded in a BGP tunnel encapsulation attribute [RFC9012] attached to the route.

    4. Does not modify the procedures for the EVPN MAC/IP Advertisement routes.

    5. Suppresses the advertisement of the A-D per EVI routes for the Ethernet Segment configured in Anycast Multi-homing mode.

    6. In case of a failure on the Ethernet Segment link, the Egress NVE withdraws the A-D per ES route(s), as well as the ES route for the Ethernet Segment. The Egress NVE cannot withdraw the Anycast VTEP address from the underlay routing protocol as long as there is at least one Ethernet Segment left that makes use of the Anycast VTEP. Only in case of a failure on the entire Egress NVE (or all the Ethernet Segments sharing the Anycast VTEP) will the Anycast VTEP be withdrawn from the Egress NVE.

    7. Unicast traffic for a failed local Ethernet Segment may still be attracted by the Egress NVE, given that the Anycast VTEP address is still advertised in the underlay routing protocol. In this case, the Egress NVE SHOULD support the procedures in Section 4 so that unicast traffic can be rerouted to another Egress NVE attached to the Ethernet Segment.

  5. The Ingress NVE that supports this document:

    1. Follows the regular [RFC8365] Aliasing procedures for the Ethernet Segments of received A-D per ES routes that do not have the Anycast Multi-homing mode flag set.

    2. Identifies the imported EVPN A-D per ES routes with the Anycast Multi-homing flag set and processes them for Anycast Multi-homing.

    3. Upon receiving and importing (on a Broadcast Domain) an EVPN MAC/IP Advertisement route for MAC-1 with a non-zero Ethernet Segment Identifier ESI-1, the NVE looks for an A-D per ES route with the same Ethernet Segment Identifier ESI-1 imported in the same Broadcast Domain. If there is at least one A-D per ES route for ESI-1, the NVE checks if the Anycast Multi-homing flag is set. If not, the Ingress NVE follows the procedures in [RFC8365]. If the Anycast Multi-homing flag is set, the Ingress NVE programs MAC-1 associated to destination ESI-1. The ESI-1 destination is resolved to the Ethernet Segment Anycast VTEP that is extracted from the A-D per ES routes, and the VNI, e.g., VNI-1, that was received in the MAC/IP Advertisement route.

    4. When the Ingress NVE receives a frame with destination MAC address MAC-1 on any of the Attachment Circuits of the Broadcast Domain, the destination MAC lookup yields ESI-1 as the destination. The frame is then encapsulated into a VXLAN (or NVO3) packet where the destination VTEP is the Anycast VTEP and the VNI is VNI-1. Since all the Egress NVEs attached to the Ethernet Segment previously announced reachability to the Anycast VTEP, the Ingress NVE has an underlay ECMP-set created for the Anycast VTEP (assuming multiple underlay paths with equal cost) and per-flow load balancing is accomplished.

    5. The Ingress NVE MUST NOT use an Anycast VTEP as the outer source IP address of the VXLAN (or NVO3) tunnel, unless the Ingress NVE is also an Egress NVE that re-encapsulates the traffic into a tunnel for the purpose of Fast Reroute (Section 4).

    6. The reception of one or more MP_UNREACH_NLRI messages for the A-D per ES routes for Ethernet Segment Identifier ESI-1 does not change the programming of the MAC addresses associated to ESI-1 as long as there is at least one valid A-D per ES route for ESI-1 in the Broadcast Domain. The reception of the MP_UNREACH_NLRI message for the last A-D per ES route for ESI-1 triggers the mass withdraw procedures for all MACs pointing at ESI-1. As an OPTIONAL optimization, if an ingress node receives an MP_UNREACH_NLRI message for the A-D per ES route from one of the NVEs on the Ethernet Segment, and only one NVE remains active on that Ethernet Segment, the ingress node may update the Ethernet Segment destination resolution from the Anycast VTEP to the Unicast VTEP, derived from the next hop of the MAC/IP Advertisement route.

  6. The procedures on the Ingress NVE for Anycast Multi-homing assume that all the Egress NVEs attached to the same Ethernet Segment advertise the same Anycast Multi-homing flag value and Anycast VTEP in their A-D per ES routes for the Ethernet Segment. Inconsistency in either of those two received values makes the Ingress NVE fall back to the [RFC8365] behavior, which means that the MAC address will be programmed with the Unicast VTEP derived from the MAC/IP Advertisement route next hop (see the sketch after this list).

Non-upgraded NVEs ignore the Anycast Multi-homing flag value and the BGP tunnel encapsulation attribute.
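The Ingress NVE resolution logic in points 5 and 6 above can be summarized with a short sketch. The Python fragment below is a minimal illustration under assumed data structures (the class and field names are hypothetical); it is not a normative algorithm.

   from dataclasses import dataclass
   from typing import List, Optional

   @dataclass
   class AdPerEsRoute:
       originator: str              # NLRI next hop (Unicast VTEP of the NVE)
       anycast_flag: bool           # "A" flag from the ESI Label Ext. Community
       anycast_vtep: Optional[str]  # Tunnel Egress Endpoint from the attribute

   def resolve_esi_destination(ad_routes: List[AdPerEsRoute],
                               mac_route_next_hop: str) -> Optional[str]:
       """Return the VTEP that resolves an Ethernet Segment destination.

       Routes without the A flag would instead follow regular [RFC8365]
       Aliasing; that path is not shown here.
       """
       if not ad_routes:
           # Last A-D per ES route withdrawn: mass withdraw for the ES MACs.
           return None
       vteps = {r.anycast_vtep for r in ad_routes}
       if all(r.anycast_flag for r in ad_routes) and len(vteps) == 1:
           # Consistent Anycast Multi-homing signaling: use the Anycast VTEP.
           return vteps.pop()
       # Inconsistent flag or Anycast VTEP values: fall back to the Unicast
       # VTEP derived from the MAC/IP Advertisement route next hop.
       return mac_route_next_hop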

3.1. Anycast Multi-Homing Example

Consider the example of Figure 1 where three Leaf routers run EVPN over VXLAN tunnels. Suppose Leaf routers L1, L2 and L3 support Anycast Multi-homing as per Section 3, and Ethernet Segments ES-1 and ES-2 are configured as Anycast Multi-homing Ethernet Segments, all-active mode, with Anycast VTEP IP12. Leaf routers L1 and L2 both advertise an A-D per ES route for ESI-1, and an A-D per ES route for ESI-2 (in addition to the ES routes). Both routes will carry the Anycast Multi-homing flag set and the same Anycast VTEP IP12. Upon receiving MAC/IP Advertisement routes for the two Ethernet Segments, Leaf L3 programs the MAC addresses associated to their corresponding destination Ethernet Segment. Therefore, when sending unicast packets to Tenant Systems TS1 or TS2, L3 uses the Anycast VTEP address as outer IP address. All the A-D per EVI routes for ES-1 and ES-2 are suppressed.

Suppose only Leaf L1 learns TS1's MAC address, hence only L1 advertises a MAC/IP Advertisement route for TS1's MAC with ESI-1. In that case:

  • Leaf L3 has Anycast VTEP IP12 programmed in its route table against an underlay ECMP-set composed of Spine-1 and Spine-2. Tenant System TS1 MAC address is programmed with a destination ESI-1, which is resolved to Anycast VTEP IP12.

  • When Tenant System TS3 sends unicast traffic to Tenant System TS1, Leaf L3 encapsulates the frames into VXLAN packets with destination VTEP being the Anycast VTEP IP12. Leaf L3 can perform per-flow load balancing just by using the underlay ECMP-set, and without the need to create an overlay ECMP-set.

  • Spine-1 and Spine-2 also create underlay ECMP-sets for Anycast VTEP IP12 with next hops L1 and L2. Therefore, in case of:

    • A failure on the link L1-to-Spine-1: Spine-1 immediately removes L1 from the ECMP-set for IP12 and packets are rerouted faster than when regular Aliasing is used.

    • Suppose now that the link TS1-L1 fails. Leaf L1 then sends an MP_UNREACH_NLRI for the A-D per ES route for ESI-1. Upon reception of the message, Leaf L3 does not change the resolution of the ESI-1 destination, since the A-D per ES route for ESI-1 from L2 is still active. Packets sent to TS1 but arriving at Leaf L1 are "fast-rerouted" to Leaf L2 as per Section 4. As per Section 3, point 5.6, Leaf L3 can optionally be configured to change the resolution of the ESI-1 destination to the Unicast VTEP (given by the MAC/IP Advertisement route) upon receiving an MP_UNREACH_NLRI for the A-D per ES route from L1. Still, in-flight packets with destination TS1 arriving at Leaf L1 are "fast-rerouted" to Leaf L2.

4. EVPN Fast Reroute Extensions For Anycast Multi-Homing

The procedures in Section 3 may lead to some situations in which traffic destined to an Anycast VTEP for an Ethernet Segment arrives at an Egress NVE where the Ethernet Segment link is in a failed state. In that case, the Egress NVE SHOULD re-encapsulate the traffic into an NVO3 tunnel following the procedures described in [I-D.burdet-bess-evpn-fast-reroute], section 7.1, with the following modifications:

  1. The Egress NVEs in this document do not advertise A-D per EVI routes, therefore there is no signaling of specific redirect labels or VNIs. The Egress NVE uses the global VNI or Domain-wide Common Block label of the Ethernet Segment NVEs when re-encapsulating the traffic into an NVO3 tunnel (Section 3, point 3).

  2. In addition, when rerouting traffic, the Egress NVE uses the Anycast VTEP of the Ethernet Segment as outer source IP address of the NVO3 tunnel. Note this is the only case in this document where the use of the Anycast VTEP as source IP address is allowed. When an Egress NVE receives NVO3-encapsulated packets where the source VTEP matches a local Anycast VTEP, there are two implicit behaviors on the Egress NVE (illustrated in the sketch after this list):

    1. The packets pass the Local Bias Split Horizon filtering check (which is based on the Unicast VTEP of the Ethernet Segment peers, and not the Anycast VTEP).

    2. Receiving NVO3-encapsulated packets with a local Anycast VTEP is an indication for the NVE that those packets have been "fast-rerouted", hence they MUST NOT be forwarded to another tunnel.
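A minimal sketch of this decision path on the Egress NVE follows, with hypothetical state and names; it is illustrative only.

   def handle_unicast_nvo3_packet(dst_vtep: str, src_vtep: str,
                                  local_anycast_vteps: set,
                                  es_link_up: bool,
                                  peer_unicast_vteps: list) -> str:
       """Egress NVE handling of unicast NVO3 packets to an Anycast VTEP."""
       if src_vtep in local_anycast_vteps:
           # The packet was already fast-rerouted by an ES peer: it passes
           # the Local Bias check and MUST NOT go to another tunnel.
           return "deliver locally only"
       if es_link_up:
           return "forward on the Ethernet Segment link"
       # Local ES link failed but the Anycast VTEP still attracts traffic:
       # re-encapsulate towards an ES peer, using the Anycast VTEP (the
       # received destination VTEP) as outer source IP address so that the
       # peer can detect the reroute.
       peer = peer_unicast_vteps[0]   # peer selection is illustrative only
       return f"re-encapsulate: outer src={dst_vtep}, outer dst={peer}"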

5. Applicability of Anycast Multi-Homing to IP Aliasing

The procedures described in this document are applicable also to IP Aliasing use cases in [I-D.ietf-bess-evpn-ip-aliasing]. Details will be added in future versions of this document.

6. Applicability of Anycast Multi-Homing to SRv6 tunnels

To be added.

7. Operational Considerations

"Underlay convergence", or network convergence processed by the underlay routing protocol in case of a failure, is normally considered to be faster than "overlay convergence" (or network convergence processed by EVPN in case of failures). The use of Anycast Multi-homing is extremely valuable in cases where the operator wants to optimize the convergence, since a node failure on an Ethernet Segment Egress NVE simply means that the underlay routing protocol reroutes the traffic to another Egress NVE that uses the same Anycast VTEP. This underlay rerouting to a different owner of the Anycast VTEP is extremely fast and efficient, especially when used in Data Center designs that make use of BGP in the underlay and the Autonomous System allocation recommended in [RFC7938] for loop protection. To illustrate this statement, suppose a link failure on the link L1-Spine-1 Figure 1, while Spine-1 and Spine-2 are assigned the same Autonomous System Number for their underlay BGP peering sessions, and no "Allowas-in" is configured [RFC7938]. If packets with destination Anycast VTEP IP12 are received on Spine-1, and the link L1-Spine-1 fails, the packets are immediately rerouted to L2. In the same example, if unicast VTEPs are used (as in regular all-active Ethernet Segments) and in-flight packets with destination unicast VTEP L1 get to Spine-1, packets would be dropped if link L1-Spine-1 is not available. This translates into a much faster convergence in the case of Anycast Multi-homing.

Another benefit of Anycast Multi-homing is the reduction of EVPN control plane pressure (due to the suppression of the A-D per EVI routes).

However, an operator must take into account the following operational considerations before deploying this solution:

  1. Troubleshooting Anycast Multi-homing Ethernet Segments is different from troubleshooting regular all-active Ethernet Segments. Operators use an A-D per EVI route withdrawal as an indication that the Ethernet Segment has failed in a particular Broadcast Domain associated with that A-D per EVI route. The suppression of the A-D per EVI routes for the Anycast Multi-homing Ethernet Segment means that logical failures on a subset of Broadcast Domains of the Ethernet Segment (while other Broadcast Domains are still operational) are more challenging to detect.

  2. Anycast Multi-homing Ethernet Segments MUST NOT be used in the following cases:

    1. If the Ethernet Segment multi-homing redundancy mode is different from All-Active mode.

    2. If the Ethernet Segment is used on EVPN VPWS Attachment Circuits [RFC8214].

    3. If the Attachment Circuit Influenced Designated Forwarder (AC-DF) capability is needed in the Ethernet Segment [RFC8584].

    4. If advanced multi-homing features that make use of the signaling in EVPN A-D per EVI routes are needed. An example would be per EVI mass withdraw.

    5. If unequal load balancing is needed [I-D.ietf-bess-evpn-unequal-lb].

    6. If the tunnels used by EVPN in the Broadcast Domains that use the Ethernet Segment are not IP tunnels, i.e., not NVO3 tunnels.

    7. If the NVEs attached to the Ethernet Segment do not use the same VNI or label to identify the same Broadcast Domain.

  3. Using the procedure in Section 3 may mean that packets are permanently fast-rerouted in case of a link failure. To illustrate this, suppose three Egress NVEs are attached to ES-1: L1, L2 and L3. In this case, a failure on ES-1 on L1 does not prevent the network from sending packets destined to the Anycast VTEP to L1. Upon receiving those packets, L1 re-encapsulates the packets and sends them to, e.g., L2. This rerouting persists as long as ES-1 on L1 is in a failed state. In these cases, the operator may consider direct inter-node links on the Egress NVEs to optimize the fast-reroute forwarding. That is, in the previous example, packets are more efficiently rerouted if L1, L2 and L3 are directly connected.

8. Security Considerations

To be added.

9. IANA Considerations

IANA is requested to allocate the flag "A" or "Anycast Multi-homing mode" in bit 2 of the EVPN ESI Multihoming Attributes registry for the 1-octet Flags field in the ESI Label Extended Community.

10. Contributors

In addition to the authors listed on the front page, the following co-authors have also contributed to previous versions of this document:

Nick Morris, Verizon

[email protected]

11. Acknowledgments

12. Annex - Potential Multi Ethernet Segment Anycast Multi-Homing optimizations

This section is here for documentation purposes only, and it will be removed from the document before publication. While these procedures were initially included in the document, they introduce additional complexity and are therefore excluded, as they undermine the primary goal of using anycast VTEPs, which is to simplify EVPN operations. However, the section is included as an annex for completeness.

As described in Section 7, the use of Anycast Multi-Homing may mean that packets are permanently fast rerouted in case of a link failure. Some potential additional extensions on the Ingress NVE may mitigate the permanent "fast rerouting", as follows:

  1. On the Ingress NVEs, an "anycast-aliasing-threshold" and a "collect-timer" can be configured. The "anycast-aliasing-threshold" represents the number of active Egress NVEs per Ethernet Segment under which the Ingress NVE no longer uses the Anycast VTEP address to resolve the Ethernet Segment destination (and uses the Unicast VTEP instead, derived from the MAC/IP Advertisement route next hop). The "collect-timer" is started upon the creation of the Ethernet Segment destination, and it is needed to settle the number of Egress NVEs for the Ethernet Segment against which the "anycast-aliasing-threshold" is compared.

  2. Upon expiration of the "collect-timer", the Ingress NVE computes the number of Egress NVEs for the Ethernet Segment based on the next hop count of the received A-D per ES routes. If the number of Egress NVEs for the Ethernet Segment is greater than or equal to the "anycast-aliasing-threshold" integer, the Ethernet Segment destination is resolved to the Anycast VTEP address. If lower than the threshold, the Ethernet Segment destination is resolved to the Unicast VTEP address.

In most of the use cases in multi-tenant Data Centers, there are two Leaf routers per rack that share all the Ethernet Segments of the Tenant Systems in the rack. In this case, the "anycast-aliasing-threshold" is set to 2 and, in case of a link failure on the Ethernet Segment, this limits the amount of "fast-rerouted" traffic to only the in-flight packets.
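A minimal sketch of the threshold comparison, with illustrative names and values:

   def resolve_with_threshold(active_egress_nves: int, threshold: int,
                              anycast_vtep: str, unicast_vtep: str) -> str:
       """After the collect-timer expires, pick the VTEP that resolves the
       Ethernet Segment destination on the Ingress NVE."""
       if active_egress_nves >= threshold:
           return anycast_vtep   # enough Egress NVEs: underlay-only ECMP
       return unicast_vtep       # below threshold: fall back to Unicast VTEP

   # With anycast-aliasing-threshold = 2 (two Leaf routers per rack):
   assert resolve_with_threshold(2, 2, "IP12", "L1") == "IP12"
   assert resolve_with_threshold(1, 2, "IP12", "L1") == "L1"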

As an example, consider Figure 1. Suppose Leaf router L3 supports these additional extensions. Leaf routers L1 and L2 both advertise an A-D per ES route for ESI-1, and an A-D per ES route for ESI-2. Both routes will carry the Anycast Multi-homing flag set and the same Anycast VTEP IP12. Following the described procedure, Leaf L3 is configured with anycast-aliasing-threshold = 2 and collect-timer = t. Upon receiving MAC/IP Advertisement routes for the two Ethernet Segments and the expiration of "t" seconds, Leaf L3 determines that the number of NVEs for ESI-1 and ESI-2 is equal to the threshold. Therefore, when sending unicast packets to Tenant Systems TS1 or TS2, L3 uses the Anycast VTEP address as outer IP address. Suppose now that the link TS1-L1 fails. Leaf L1 then sends an MP_UNREACH_NLRI for the A-D per ES route for ESI-1. Upon reception of the message, Leaf L3 changes the resolution of the ESI-1 destination from the Anycast VTEP to the Unicast VTEP derived from the MAC/IP Advertisement route next hop. Packets sent to Tenant System TS2 (on ES-2) still use the Anycast VTEP. In-flight packets sent to TS1 but still arriving at Leaf L1 are "fast-rerouted" to Leaf L2 as per Section 4.

Another potential optimization is to use different Anycast VTEPs per ES. The proposal in Section 3 uses a shared VTEP for all the Ethernet Segments in a common Egress NVE group. In case the number of Egress NVEs sharing the group of Ethernet Segments is limited to two, an alternative proposal is to use a different Anycast VTEP per Ethernet Segment, while allocating all those Anycast VTEP addresses from the same subnet. A single IP Prefix for that subnet is announced in the underlay routing protocol by the Egress NVEs. The benefit of this proposal is that, in case of a link failure on one individual Ethernet Segment, e.g., link TS1-L1 in Figure 1, Leaf L2 detects the failure (based on the withdrawal of the A-D per ES and ES routes) and can immediately announce the specific Anycast VTEP address (/32 or /128) into the underlay. Based on a Longest Prefix Match when routing NVO3 packets, Spines can immediately reroute packets (destined to the Anycast VTEP for ESI-1) to Leaf L2. This may reduce the amount of fast-rerouted VXLAN packets and spares the Ingress NVE from having to change the resolution of the Ethernet Segment destination from the Anycast VTEP to the Unicast VTEP.
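The rerouting in this alternative relies only on ordinary Longest Prefix Match forwarding in the underlay, as the following illustrative sketch shows (the addresses, drawn from the documentation range, and the RIB layout are assumptions made for the example):

   import ipaddress

   def underlay_lookup(dst: str, rib: dict) -> str:
       """Longest-prefix-match lookup, as a Spine would perform it."""
       addr = ipaddress.ip_address(dst)
       matches = [(p, nh) for p, nh in rib.items()
                  if addr in ipaddress.ip_network(p)]
       return max(matches,
                  key=lambda m: ipaddress.ip_network(m[0]).prefixlen)[1]

   # Per-ES Anycast VTEPs allocated from a shared subnet, e.g., 192.0.2.0/28.
   rib = {"192.0.2.0/28": "ECMP(L1,L2)"}       # covering prefix from L1 and L2
   print(underlay_lookup("192.0.2.1", rib))    # -> ECMP(L1,L2)
   # After the TS1-L1 failure, L2 announces the /32 for ESI-1's Anycast VTEP:
   rib["192.0.2.1/32"] = "L2"
   print(underlay_lookup("192.0.2.1", rib))    # -> L2 (more specific wins)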

13. References

13.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, <https://www.rfc-editor.org/info/rfc2119>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, <https://www.rfc-editor.org/info/rfc8174>.
[RFC7432]
Sajassi, A., Ed., Aggarwal, R., Bitar, N., Isaac, A., Uttaro, J., Drake, J., and W. Henderickx, "BGP MPLS-Based Ethernet VPN", RFC 7432, DOI 10.17487/RFC7432, <https://www.rfc-editor.org/info/rfc7432>.
[RFC8365]
Sajassi, A., Ed., Drake, J., Ed., Bitar, N., Shekhar, R., Uttaro, J., and W. Henderickx, "A Network Virtualization Overlay Solution Using Ethernet VPN (EVPN)", RFC 8365, DOI 10.17487/RFC8365, <https://www.rfc-editor.org/info/rfc8365>.
[I-D.ietf-bess-rfc7432bis]
Sajassi, A., Burdet, L. A., Drake, J., and J. Rabadan, "BGP MPLS-Based Ethernet VPN", Work in Progress, Internet-Draft, draft-ietf-bess-rfc7432bis-10, <https://datatracker.ietf.org/doc/html/draft-ietf-bess-rfc7432bis-10>.
[RFC9573]
Zhang, Z., Rosen, E., Lin, W., Li, Z., and IJ. Wijnands, "MVPN/EVPN Tunnel Aggregation with Common Labels", RFC 9573, DOI 10.17487/RFC9573, <https://www.rfc-editor.org/info/rfc9573>.
[RFC8584]
Rabadan, J., Ed., Mohanty, S., Ed., Sajassi, A., Drake, J., Nagaraj, K., and S. Sathappan, "Framework for Ethernet VPN Designated Forwarder Election Extensibility", RFC 8584, DOI 10.17487/RFC8584, <https://www.rfc-editor.org/info/rfc8584>.
[RFC9012]
Patel, K., Van de Velde, G., Sangli, S., and J. Scudder, "The BGP Tunnel Encapsulation Attribute", RFC 9012, DOI 10.17487/RFC9012, <https://www.rfc-editor.org/info/rfc9012>.

13.2. Informative References

[RFC7348]
Mahalingam, M., Dutt, D., Duda, K., Agarwal, P., Kreeger, L., Sridhar, T., Bursell, M., and C. Wright, "Virtual eXtensible Local Area Network (VXLAN): A Framework for Overlaying Virtualized Layer 2 Networks over Layer 3 Networks", RFC 7348, DOI 10.17487/RFC7348, <https://www.rfc-editor.org/info/rfc7348>.
[RFC8926]
Gross, J., Ed., Ganga, I., Ed., and T. Sridhar, Ed., "Geneve: Generic Network Virtualization Encapsulation", RFC 8926, DOI 10.17487/RFC8926, <https://www.rfc-editor.org/info/rfc8926>.
[RFC4364]
Rosen, E. and Y. Rekhter, "BGP/MPLS IP Virtual Private Networks (VPNs)", RFC 4364, DOI 10.17487/RFC4364, <https://www.rfc-editor.org/info/rfc4364>.
[RFC7510]
Xu, X., Sheth, N., Yong, L., Callon, R., and D. Black, "Encapsulating MPLS in UDP", RFC 7510, DOI 10.17487/RFC7510, <https://www.rfc-editor.org/info/rfc7510>.
[RFC8986]
Filsfils, C., Ed., Camarillo, P., Ed., Leddy, J., Voyer, D., Matsushima, S., and Z. Li, "Segment Routing over IPv6 (SRv6) Network Programming", RFC 8986, DOI 10.17487/RFC8986, <https://www.rfc-editor.org/info/rfc8986>.
[RFC8214]
Boutros, S., Sajassi, A., Salam, S., Drake, J., and J. Rabadan, "Virtual Private Wire Service Support in Ethernet VPN", RFC 8214, DOI 10.17487/RFC8214, <https://www.rfc-editor.org/info/rfc8214>.
[RFC7938]
Lapukhov, P., Premji, A., and J. Mitchell, Ed., "Use of BGP for Routing in Large-Scale Data Centers", RFC 7938, DOI 10.17487/RFC7938, <https://www.rfc-editor.org/info/rfc7938>.
[RFC9469]
Rabadan, J., Ed., Bocci, M., Boutros, S., and A. Sajassi, "Applicability of Ethernet Virtual Private Network (EVPN) to Network Virtualization over Layer 3 (NVO3) Networks", RFC 9469, DOI 10.17487/RFC9469, <https://www.rfc-editor.org/info/rfc9469>.
[I-D.ietf-bess-evpn-ip-aliasing]
Sajassi, A., Rabadan, J., Pasupula, S., Krattiger, L., and J. Drake, "EVPN Support for L3 Fast Convergence and Aliasing/Backup Path", Work in Progress, Internet-Draft, draft-ietf-bess-evpn-ip-aliasing-02, <https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-ip-aliasing-02>.
[I-D.ietf-bess-evpn-unequal-lb]
Malhotra, N., Sajassi, A., Rabadan, J., Drake, J., Lingala, A. R., and S. Thoria, "Weighted Multi-Path Procedures for EVPN Multi-Homing", Work in Progress, Internet-Draft, draft-ietf-bess-evpn-unequal-lb-24, <https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-unequal-lb-24>.
[I-D.burdet-bess-evpn-fast-reroute]
Burdet, L. A., Brissette, P., Miyasaka, T., and J. Rabadan, "EVPN Fast Reroute", Work in Progress, Internet-Draft, draft-burdet-bess-evpn-fast-reroute-08, <https://datatracker.ietf.org/doc/html/draft-burdet-bess-evpn-fast-reroute-08>.
[CLOS1953]
Clos, C., "A Study of Non-Blocking Switching Networks", Bell System Technical Journal, 1953.

Authors' Addresses

Jorge Rabadan (editor)
Nokia
520 Almanor Avenue
Sunnyvale, CA 94085
United States of America
Kiran Nagaraj
Nokia
520 Almanor Avenue
Sunnyvale, CA 94085
United States of America
Alex Nichol
Arista
Ali Sajassi
Cisco Systems
Wen Lin
Juniper Networks
Jeff Tantsura
Nvidia