Internet-Draft | Cloud Resource Abstraction | October 2024 |
Dunbar, et al. | Expires 17 April 2025 | [Page] |
This document proposes extensions to existing YANG models, as well as new YANG models, to enable the management of cross-domain cloud and network resources. The intent is to provide dynamic resource allocation mechanisms that allow services to scale efficiently across multiple cloud environments and edge computing platforms. By defining unified YANG models for both network and cloud domains, this draft addresses challenges in orchestrating and managing resources in a hybrid environment while maintaining interoperability and dynamic scaling.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 17 April 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
Cloud and edge computing environments are increasingly interconnected with network infrastructure, and modern services require dynamic, cross-domain orchestration to scale efficiently. Services placed in Cloud Data Centers (DCs) change dynamically, often undergoing high-frequency modifications based on evolving service requirements. As a result, the network connecting these services must adapt and reconfigure itself in real time to accommodate these service changes.¶
[Net2Cloud] describes a set of network-related problems that enterprises face when interconnecting their branch offices with dynamic workloads in third-party data centers (Cloud DCs), including the challenges of ensuring reliable, scalable, and efficient network connectivity between enterprise sites and cloud-hosted services. While [Net2Cloud] references mitigation practices, they fall short of addressing the dynamic and rapidly changing nature of services placed in Cloud DCs. More advanced solutions are needed so that the network can adjust in real time to the changes in service workloads, resource allocations, bandwidth requirements, and latency constraints driven by cloud-hosted services.¶
This draft extends existing YANG models or introduces new ones to enable the management of both cloud and network resources in a unified, cross-domain manner. The goal is to optimize dynamic resource allocation, allowing services to scale seamlessly across public clouds, private clouds, and edge computing nodes while ensuring consistency, interoperability, and real-time adaptability of the network to the dynamically changing services placed in Cloud DCs.¶
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
Current management models face several limitations:¶
- Siloed Resource Management: Most current models treat network and cloud resources as separate entities, making cross-domain management inefficient.¶
- Lack of Dynamic Scaling Support: Many models lack the mechanisms needed to dynamically allocate and reallocate resources across domains based on real-time service demands.¶
- Inconsistent Interfaces and Data Models: Inconsistent data models across cloud and network platforms hinder seamless integration.¶
- Limited Support for Edge Environments: Traditional models focus on cloud and core network infrastructure, often overlooking edge computing platforms where latency-sensitive workloads run.¶
This draft proposes a solution by extending YANG models to facilitate cross-domain resource management and efficient scaling.¶
Several existing IETF YANG models, such as ietf-routing [RFC8349], ietf-network-instance [RFC8529], and ietf-l3vpn-svc [RFC8299], offer foundational models for network resource management. However, these models need to be extended to include cloud-specific attributes and edge-related extensions.¶
The primary design objectives for the extended or new YANG models include:¶
- Cross-Domain Resource Orchestrator: Provides the high-level orchestration and policies for managing resources across domains, invoking network reconfiguration actions as needed.¶
- Dynamic Resource Allocation: Handles the overall allocation of resources (compute, storage, network). For 5G networks and beyond, Dynamic Resource Allocation can be used to allocate network resources based on the needs of a federated learning process.¶
- Dynamic Network Reconfiguration: Reflects the network's dynamic adaptation to cloud services, focusing on real-time network reconfiguration based on cloud workload needs. Extends support for multi-cloud VPNs, multi-segment SD-WAN [MULTI-SEG-SDWAN], and service overlays.¶
- Edge Node Resource: Edge nodes are computing resources placed at the edge of the network, closer to the end user or data source, to reduce latency and improve performance for time-sensitive or high-bandwidth applications. Edge nodes can be located in telecom providers' edge data centers, such as Edge DCs for 5G or regional micro DCs. Extends models to manage compute and storage resources on edge platforms.¶
How they work together:¶
- High-Level Orchestration (Cross-Domain Resource Orchestrator): The orchestrator manages the overall allocation of cloud and network resources based on policies and telemetry.¶
- Resource Requests (Resource Allocation): When the orchestrator detects a need for resource changes (e.g., increased compute or bandwidth), it triggers resource requests. Network resource allocation will adapt based on these requests.¶
- Real-Time Adjustments (Dynamic Network Reconfiguration): As resource demands change (due to dynamic cloud services), the network reconfigures in real time. This includes adjusting bandwidth, latency, or other parameters to ensure that the network supports the new service requirements effectively.¶
- Edge Node Integration (Edge Node Resource): The network reconfiguration model can dynamically adjust the network to ensure optimal connectivity between edge nodes and cloud services, allowing latency-sensitive or bandwidth-intensive applications to operate efficiently.¶
Together, these models provide a comprehensive framework for orchestrating, allocating, and dynamically adjusting network and compute resources across cloud, edge, and network domains. The Dynamic Network Reconfiguration model enhances this by ensuring that the network component reacts in real time to the dynamic nature of cloud services.¶
Here is an exemplary structure of a YANG model for a Cross-Domain Resource Orchestrator. This model enables the orchestration of cloud and network resources, allowing efficient dynamic resource allocation and scaling across multiple cloud and network domains.¶
module: cross-domain-orchestrator
  +--rw orchestrator
     +--rw policies
     |  +--rw policy* [policy-id]
     |     +--rw policy-id      string
     |     +--rw policy-name    string
     |     +--rw policy-type    enumeration
     |     +--rw status         enumeration
     |     +--rw conditions
     |        +--rw cpu-utilization-threshold       uint8
     |        +--rw memory-utilization-threshold    uint8
     |        +--rw latency-threshold               uint32
     |        +--rw bandwidth-threshold             uint32
     +--rw telemetry
     |  +--rw domain* [domain-id]
     |     +--rw domain-id      string
     |     +--rw domain-type    enumeration
     |     +--rw resources
     |     |  +--rw cpu          decimal64
     |     |  +--rw memory       uint64
     |     |  +--rw storage      uint64
     |     |  +--rw bandwidth    uint64
     |     +--rw utilization
     |        +--rw cpu-utilization          decimal64
     |        +--rw memory-utilization       decimal64
     |        +--rw storage-utilization      decimal64
     |        +--rw bandwidth-utilization    decimal64
     |        +--rw latency                  uint32
     +--rpc allocate-resources
        +--input
        |  +--rw service-id       string
        |  +--rw resource-type    enumeration
        |  +--rw amount           decimal64
        |  +--rw domain-id        string
        +--output
           +--ro allocation-status    enumeration
           +--ro allocated-amount     decimal64
Explanation of the structure¶
- orchestrator:¶
The top-level container for managing the orchestration of resources across cloud and network domains.¶
- policies:¶
Defines the set of policies that govern resource allocation.¶
Each policy has a policy-id, a policy-name, a policy-type (the purpose of the policy), a status, and conditions (the CPU, memory, latency, and bandwidth thresholds that trigger the policy).¶
- telemetry:¶
Collects real-time telemetry data from different domains (e.g., cloud, edge, network).¶
Each domain contains information about its resources (CPU, memory, storage, bandwidth) and their utilization metrics (percentage of usage, current latency).¶
- allocate-resources (modeled as an RPC or a YANG action):¶
Defines the operation that a service or orchestrator can invoke to request dynamic allocation of resources in real time.¶
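To make the relationship between the policies and telemetry subtrees concrete, the following Python sketch evaluates one policy's conditions against one domain's utilization data. It is illustrative only. The tree does not define units or matching semantics, so the sketch assumes that utilization thresholds are percentages (0 to 100), that latency is expressed in milliseconds, and that a condition is breached when the reported value exceeds the threshold. The function name breached_conditions and all sample values are hypothetical. The bandwidth-threshold leaf is left out because its unit is not specified.

```python
# Illustrative only: leaf names follow the tree above; the units and the
# "value exceeds threshold" comparison are assumptions, not defined here.

# Maps a policy condition leaf to the utilization leaf it is compared with.
CONDITION_TO_METRIC = {
    "cpu-utilization-threshold": "cpu-utilization",
    "memory-utilization-threshold": "memory-utilization",
    "latency-threshold": "latency",
}


def breached_conditions(conditions, utilization):
    """Return the sorted condition leaves whose threshold is exceeded."""
    breached = []
    for condition, metric in CONDITION_TO_METRIC.items():
        threshold = conditions.get(condition)
        value = utilization.get(metric)
        if threshold is None or value is None:
            continue  # leaf not configured or not reported
        if value > threshold:
            breached.append(condition)
    return sorted(breached)


# Hypothetical policy list entry.
policy = {
    "policy-id": "p1",
    "policy-name": "scale-out-on-load",
    "conditions": {
        "cpu-utilization-threshold": 80,
        "memory-utilization-threshold": 90,
        "latency-threshold": 50,
    },
}
# Hypothetical telemetry for one domain list entry.
edge_domain = {
    "domain-id": "edge-1",
    "utilization": {
        "cpu-utilization": 85.5,
        "memory-utilization": 40.0,
        "latency": 20,
    },
}

print(breached_conditions(policy["conditions"], edge_domain["utilization"]))
```

In this sample only the CPU threshold is exceeded, so an orchestrator following this sketch would treat the policy as triggered for the edge-1 domain.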
Example for using the Action:¶
- A cloud-hosted service detects a spike in user traffic and requests an additional 50 Mbps of network bandwidth. The service submits an allocate-resources request.¶
- The orchestration system processes the request based on the current telemetry data (bandwidth utilization, network latency) and any active policies (scaling, SLA compliance, etc.). It checks if the additional bandwidth is available in the requested domain.¶
- If the resources are available, the allocation-status output indicates success. If not, it indicates failure.¶
- On success, the allocated-amount output reports how much bandwidth (e.g., 50 Mbps) was allocated to the service.¶
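The exchange above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the enumeration values ("bandwidth", "success", "failure"), the unit of amount (Mbps), the all-or-nothing behavior, and the handler name allocate_resources are not defined by the tree and are chosen here for illustration. Policy checks are omitted, and only the availability check is shown.

```python
# Hypothetical handler for the allocate-resources operation. Enumeration
# values, units (Mbps), and all-or-nothing semantics are assumptions.

def allocate_resources(request, free_capacity):
    """Grant the full requested amount if the domain has it, else nothing.

    free_capacity maps (domain-id, resource-type) to the unallocated
    amount and is updated in place when the request succeeds.
    """
    key = (request["domain-id"], request["resource-type"])
    available = free_capacity.get(key, 0.0)
    if request["amount"] <= available:
        free_capacity[key] = available - request["amount"]
        return {"allocation-status": "success",
                "allocated-amount": request["amount"]}
    return {"allocation-status": "failure", "allocated-amount": 0.0}


# Hypothetical state: 120 Mbps of unallocated bandwidth in one domain.
free_capacity = {("cloud-dc-1", "bandwidth"): 120.0}
request = {
    "service-id": "video-frontend",
    "resource-type": "bandwidth",
    "amount": 50.0,
    "domain-id": "cloud-dc-1",
}

first = allocate_resources(request, free_capacity)
second = allocate_resources(request, free_capacity)
third = allocate_resources(request, free_capacity)  # only 20 Mbps remain
print(first["allocation-status"], second["allocation-status"],
      third["allocation-status"])
```

A real orchestrator might instead grant a partial amount, which is why the output carries allocated-amount separately from the requested amount.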
The resource needs for federated learning fluctuate depending on the phase of the training process, model complexity, and number of devices involved. Dynamic Resource Allocation for Federated Learning is a specific type or use case of a Cross-Domain Orchestrator.¶
module: dynamic-resource-allocation-federated-learning
  +--rw dynamic-allocation
     +--rw federated-learning
     |  +--rw training-job* [job-id]
     |     +--rw job-id                string
     |     +--rw model-type            string
     |     +--rw device-type           enumeration
     |     +--rw required-cpu          decimal64
     |     +--rw required-memory       uint64
     |     +--rw required-storage      uint64
     |     +--rw required-bandwidth    uint64
     |     +--rw latency-tolerance     uint32
     +--rw policies
     |  +--rw policy* [policy-id]
     |     +--rw policy-id      string
     |     +--rw policy-name    string
     |     +--rw policy-type    enumeration
     |     +--rw conditions
     |        +--rw cpu-utilization-threshold          uint8
     |        +--rw memory-utilization-threshold       uint8
     |        +--rw bandwidth-utilization-threshold    uint8
     |        +--rw latency-threshold                  uint32
     +--rw telemetry
     |  +--rw domain* [domain-id]
     |     +--rw domain-id      string
     |     +--rw domain-type    enumeration
     |     +--rw resources
     |     |  +--rw cpu          decimal64
     |     |  +--rw memory       uint64
     |     |  +--rw storage      uint64
     |     |  +--rw bandwidth    uint64
     |     +--rw utilization
     |        +--rw cpu-utilization          decimal64
     |        +--rw memory-utilization       decimal64
     |        +--rw storage-utilization      decimal64
     |        +--rw bandwidth-utilization    decimal64
     |        +--rw latency                  uint32
     +--rpc allocate-resources
        +--input
        |  +--rw job-id           string
        |  +--rw resource-type    enumeration
        |  +--rw amount           decimal64
        |  +--rw domain-id        string
        +--output
           +--ro allocation-status    enumeration
           +--ro allocated-amount     decimal64
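As an illustration of how a training-job entry could drive the allocate-resources operation, the Python sketch below builds one request per resource that the job requires. It is a sketch under assumptions rather than a defined procedure: the resource-type values ("cpu", "memory", "storage", "bandwidth"), the function name requests_for_job, and the scale parameter used to represent phase-dependent demand are all hypothetical, as are the sample values.

```python
# Illustrative only: resource-type values and the per-phase scale factor
# are assumptions; the tree defines resource-type without listing values.

# Maps a training-job leaf to an assumed resource-type enumeration value.
LEAF_TO_RESOURCE_TYPE = {
    "required-cpu": "cpu",
    "required-memory": "memory",
    "required-storage": "storage",
    "required-bandwidth": "bandwidth",
}


def requests_for_job(job, domain_id, scale=1.0):
    """Build one allocate-resources input per resource the job requires."""
    requests = []
    for leaf, resource_type in LEAF_TO_RESOURCE_TYPE.items():
        if leaf in job:
            requests.append({
                "job-id": job["job-id"],
                "resource-type": resource_type,
                "amount": job[leaf] * scale,
                "domain-id": domain_id,
            })
    return requests


# Hypothetical training-job list entry with only two requirements set.
job = {
    "job-id": "fl-job-7",
    "model-type": "cnn",
    "required-cpu": 4.0,
    "required-bandwidth": 200,
}

# Hypothetical: a lighter phase that needs half of the stated resources.
for req in requests_for_job(job, "edge-1", scale=0.5):
    print(req["resource-type"], req["amount"])
```

Re-issuing the requests with a different scale as the training process moves between phases is one possible way to track the fluctuating demand described above.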
This section describes a YANG structure for Dynamic Network Reconfiguration, which supports the scenario where services placed in Cloud Data Centers (DCs) undergo frequent changes, requiring the network to dynamically adapt and reconfigure itself in real time. This structure enables the dynamic adjustment of network parameters (such as bandwidth, latency, QoS, and paths) based on evolving service requirements.¶
module: dynamic-network-reconfiguration
  +--rw network-reconfiguration
     +--rw telemetry
     |  +--rw bandwidth-utilization    decimal64
     |  +--rw latency                  uint32
     |  +--rw packet-loss-rate         decimal64
     |  +--rw jitter                   decimal64
     |  +--rw qos-level                string
     +--rw policies
     |  +--rw policy* [policy-id]
     |     +--rw policy-id      string
     |     +--rw policy-name    string
     |     +--rw policy-type    enumeration
     |     +--rw conditions
     |        +--rw bandwidth-utilization-threshold    uint8
     |        +--rw latency-threshold                  uint32
     |        +--rw packet-loss-threshold              decimal64
     |        +--rw qos-threshold                      string
     +--rpc reconfigure-network
        +--input
        |  +--rw service-id            string
        |  +--rw target-latency        uint32
        |  +--rw target-bandwidth      uint64
        |  +--rw target-qos            string
        |  +--rw target-packet-loss    decimal64
        |  +--rw target-jitter         decimal64
        +--output
           +--ro reconfiguration-status    enumeration
           +--ro achieved-latency          uint32
           +--ro achieved-bandwidth        uint64
           +--ro achieved-qos              string
           +--ro achieved-packet-loss      decimal64
           +--ro achieved-jitter           decimal64
Explanation of the structure:¶
The telemetry container collects real-time data about the current state of the network, which is used to determine whether network reconfiguration is needed to accommodate changes in cloud services.¶
Policies govern how and when the network should be dynamically reconfigured. Each policy has specific conditions that, when met, trigger network reconfiguration.¶
The reconfigure-network action (or RPC) is the primary mechanism for dynamically reconfiguring the network in real time. When triggered, it adjusts the network settings to meet the new requirements of services running in cloud data centers.¶
How it works together:¶
The system continuously monitors network conditions (bandwidth usage, latency, packet loss, jitter) using telemetry data. As services in cloud data centers evolve, this data helps determine whether the network is performing within acceptable limits.¶
When telemetry data indicates that certain thresholds are being breached (e.g., high latency or packet loss), policies are triggered. For example, if bandwidth usage exceeds 80%, the system may allocate more bandwidth to ensure the services continue to operate smoothly.¶
The reconfigure-network action is called in real time to adjust the network parameters, including bandwidth, latency, packet loss, and QoS, to accommodate changes in cloud services. This action ensures the network can keep up with the frequent modifications to services hosted in the cloud.¶
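The monitor, trigger, and reconfigure sequence described above can be sketched in Python as follows. The sketch is illustrative and rests on assumptions: utilization and packet loss are taken to be percentages, latency is taken to be in milliseconds, a policy is considered triggered when any one threshold is exceeded, and the function name plan_reconfiguration and the service_targets record are hypothetical. The qos-level and qos-threshold leaves are strings with no defined ordering, so they are not compared.

```python
# Illustrative only: the units (percent, milliseconds), the "any threshold
# exceeded" rule, and the service_targets record are assumptions.

def plan_reconfiguration(telemetry, conditions, service_targets):
    """Return a reconfigure-network input if any threshold is exceeded."""
    checks = [
        ("bandwidth-utilization", "bandwidth-utilization-threshold"),
        ("latency", "latency-threshold"),
        ("packet-loss-rate", "packet-loss-threshold"),
    ]
    breached = any(
        telemetry[metric] > conditions[threshold]
        for metric, threshold in checks
    )
    if not breached:
        return None  # network is within policy limits; nothing to do
    return dict(service_targets)  # copy, so the caller's record is untouched


conditions = {
    "bandwidth-utilization-threshold": 80,  # the 80% example from the text
    "latency-threshold": 30,
    "packet-loss-threshold": 1.0,
}
# Hypothetical telemetry snapshot.
telemetry = {
    "bandwidth-utilization": 86.0,
    "latency": 12,
    "packet-loss-rate": 0.1,
    "jitter": 2.5,
}
# Hypothetical targets for the affected service.
service_targets = {
    "service-id": "video-frontend",
    "target-latency": 30,
    "target-bandwidth": 500,
    "target-packet-loss": 1.0,
    "target-jitter": 5.0,
}

rpc_input = plan_reconfiguration(telemetry, conditions, service_targets)
print(rpc_input["target-bandwidth"] if rpc_input else "no change")
```

Here bandwidth utilization (86%) exceeds the 80% threshold, so the sketch produces a reconfigure-network input. With utilization at or below the threshold it would return nothing and leave the network unchanged.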
Below is the YANG tree structure designed to enable resource allocation close to the end-user or device, specifically optimized for latency-sensitive workloads. It includes support for Mobile Edge Computing (MEC) and integration with 5G edge computing. The structure allows for dynamic allocation of compute, storage, and network resources, with real-time adjustments based on the needs of low-latency applications like IoT, AR/VR, and real-time analytics.¶
module: mec-5g-resource-allocation
  +--rw edge-resource-allocation
     +--rw telemetry
     |  +--rw latency                     uint32
     |  +--rw bandwidth-utilization       decimal64
     |  +--rw edge-cpu-utilization        decimal64
     |  +--rw edge-memory-utilization     decimal64
     |  +--rw edge-storage-utilization    decimal64
     +--rw policies
     |  +--rw policy* [policy-id]
     |     +--rw policy-id      string
     |     +--rw policy-name    string
     |     +--rw policy-type    enumeration
     |     +--rw conditions
     |        +--rw latency-threshold                    uint32
     |        +--rw bandwidth-utilization-threshold      uint8
     |        +--rw edge-cpu-utilization-threshold       uint8
     |        +--rw edge-memory-utilization-threshold    uint8
     +--rw resource-allocation
     |  +--rw workload* [workload-id]
     |     +--rw workload-id              string
     |     +--rw workload-type            enumeration
     |     +--rw required-latency         uint32
     |     +--rw required-bandwidth       uint64
     |     +--rw required-edge-cpu        decimal64
     |     +--rw required-edge-memory     uint64
     |     +--rw required-edge-storage    uint64
     +--rpc allocate-edge-resources
        +--input
        |  +--rw workload-id            string
        |  +--rw target-latency         uint32
        |  +--rw target-bandwidth       uint64
        |  +--rw target-edge-cpu        decimal64
        |  +--rw target-edge-memory     uint64
        |  +--rw target-edge-storage    uint64
        +--output
           +--ro allocation-status         enumeration
           +--ro achieved-latency          uint32
           +--ro achieved-bandwidth        uint64
           +--ro allocated-edge-cpu        decimal64
           +--ro allocated-edge-memory     uint64
           +--ro allocated-edge-storage    uint64
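For latency-sensitive workloads, a caller will usually want to confirm that the result of allocate-edge-resources actually satisfies the workload's stated requirements. The Python sketch below compares an output with a workload entry. It is a sketch under assumptions: latency is treated as "lower is better" and every other quantity as "higher is better", units are whatever the two sides share, and the function name unmet_requirements and the sample values are hypothetical.

```python
# Illustrative only: the comparison rules and sample values are assumptions.

# (workload leaf, RPC output leaf, True when a lower result is better)
CHECKS = [
    ("required-latency", "achieved-latency", True),
    ("required-bandwidth", "achieved-bandwidth", False),
    ("required-edge-cpu", "allocated-edge-cpu", False),
    ("required-edge-memory", "allocated-edge-memory", False),
    ("required-edge-storage", "allocated-edge-storage", False),
]


def unmet_requirements(workload, output):
    """Return the workload leaves that the RPC output does not satisfy."""
    unmet = []
    for required_leaf, result_leaf, lower_is_better in CHECKS:
        required = workload[required_leaf]
        result = output[result_leaf]
        if lower_is_better:
            satisfied = result <= required
        else:
            satisfied = result >= required
        if not satisfied:
            unmet.append(required_leaf)
    return unmet


# Hypothetical AR/VR workload list entry.
workload = {
    "workload-id": "ar-session-42",
    "required-latency": 10,
    "required-bandwidth": 100,
    "required-edge-cpu": 2.0,
    "required-edge-memory": 4096,
    "required-edge-storage": 20,
}
# Hypothetical allocate-edge-resources output: latency target was missed.
output = {
    "achieved-latency": 14,
    "achieved-bandwidth": 100,
    "allocated-edge-cpu": 2.0,
    "allocated-edge-memory": 4096,
    "allocated-edge-storage": 20,
}

print(unmet_requirements(workload, output))
```

A non-empty result could prompt the caller to retry or to request placement elsewhere. How that is done is outside what the tree defines.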
Authentication and Authorization: The orchestrator must authenticate requests using secure credentials (e.g., OAuth tokens, X.509 certificates).¶
Data Encryption: All data exchanged between domains, especially telemetry and resource allocation requests, must be encrypted using protocols like TLS.¶
Access Control: Role-Based Access Control (RBAC) must be implemented to ensure that only authorized users can request or allocate resources.¶
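The access control requirement can be illustrated with a minimal role check placed in front of the operations defined in this document. This Python sketch is illustrative only: the role names, the permission table, and the function name is_authorized are hypothetical, and a deployment would rely on the access control mechanism of its management protocol rather than on code like this.

```python
# Illustrative only: role names and the permission table are hypothetical.

ROLE_PERMISSIONS = {
    "orchestrator-admin": {
        "allocate-resources",
        "reconfigure-network",
        "allocate-edge-resources",
    },
    "service-owner": {"allocate-resources"},
    "viewer": set(),  # may read telemetry but may not invoke operations
}


def is_authorized(role, operation):
    """Return True only if the role is known and permits the operation."""
    return operation in ROLE_PERMISSIONS.get(role, set())


print(is_authorized("service-owner", "allocate-resources"))
print(is_authorized("service-owner", "reconfigure-network"))
print(is_authorized("unknown-role", "allocate-resources"))
```

Unknown roles are denied by default, so a missing or misspelled role cannot grant access.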
TBD¶
The authors would like to thank the following people for their discussions and input to this document: xxx.¶