Internet-Draft | mojette-encoding | October 2024 |
Haynes & Evenou | Expires 18 April 2025 | [Page] |
Parallel NFS (pNFS) allows a separation between the metadata (onto a metadata server) and data (onto a storage device) for a file. The flex file layout type version 2 further allows for erasure encoding types to provide data integrity. In this document, a new erasure encoding type for the Mojette Transformation is introduced.¶
This note is to be removed before publishing as an RFC.¶
Discussion of this draft takes place on the NFSv4 working group mailing list ([email protected]), which is archived at https://mailarchive.ietf.org/arch/browse/nfsv4/. Working Group information can be found at https://datatracker.ietf.org/wg/nfsv4/about/.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 18 April 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In Parallel NFS (pNFS) (see Section 12 of [RFC8881]), the metadata server returns layout type structures that describe where file data is located. There are different layout types for different storage systems and methods of arranging data on storage devices. [RFCTBD09] defined the Flexible File Version 2 Layout Type used with file-based data servers that are accessed using the NFS protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC8881], and NFSv4.2 [RFC7862]. This document introduces a new Erasure Encoding Type called Mojette Transformation.¶
Using the process detailed in [RFC8178], the revisions in this document become an extension of NFSv4.2 [RFC7862]. They are built on top of the external data representation (XDR) [RFC4506] generated from [RFC7863].¶
The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'NOT RECOMMENDED', 'MAY', and 'OPTIONAL' in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
The Mojette Transform is an erasure coding technique that provides fault tolerance for data storage systems by enabling the recovery of lost data blocks. This section describes the integration of the systematic Mojette Transform into the NFS protocol, focusing on encoding and decoding file system blocks, typically sized at 4KB or 8KB.¶
The Mojette Transform involves the following steps to encode a data block:¶
Decoding the Mojette Transform is the inverse of encoding, involving the reconstruction of data from projection data to fill an empty grid. This involves solving a system of linear equations defined by the projection differences and the projection directions (p_i, q_i). The algorithm iterates to refine the values of the missing data elements until the original data block is reconstructed.¶
Data reconstruction is possible if Katz's criterion holds, which was extended to any convex shape. The Katz's criterion specifies that reconstruction is valid if for a given set of n projections along n directions (p_i, q_i) either sum_{i=0}^{n} q_i ≥ Q or sum_{i=0}^{n} p_i ≥ P.¶
Adjusting the number of lines Q and the projections set allows setting a desired fault-tolerance threshold.¶
For example a 64x4 grid can be decoded by the projection set {(0, 1), (1, 1), (2, 1), (3, 1)} as sum_{i=1}^{4} q_i = 4.¶
A systematic code is an error-correcting code where the input data is embedded directly in the encoded output. In contrast, a non-systematic code produces an output that does not contain the original input symbols. The Mojette Transform can be implemented in both ways, allowing it to adapt to various use cases.¶
In the context of NFS, a data block corresponds to a file system block, which is a contiguous segment of data, typically 4KB or 8KB in size. The Mojette Transform encodes these blocks to ensure data integrity and availability in distributed storage environments.¶
In the non-systematic version of the Mojette Transform, the original data block is not directly included in the encoded output. Instead, the entire encoded output consists of projections computed from the original data. The number of computed projections n is larger than the number of projections m required to rebuild the initial data.¶
To decode a file system block that has undergone the non-systematic Mojette Transform, the following steps are followed:¶
Assume a file system block of 4KB is divided into a 128x4 matrix of 128-bit elements. Using the non-systematic Mojette Transform, we compute projections along selected directions, such as (-2,1), (-1, -1), (0,1), (1,1), (2,1) and (3,1). The original data is not stored directly; instead, the projections are stored.¶
If a data loss occurs, for instance, if two projections are lost, the missing elements can be recovered by using the remaining projections and solving the inverse problem. Any set of 4 projections among the 6 generated can rebuild exactly the original data block¶
In the systematic version, the original data block (file system block) is part of the encoded output. Additional projections are calculated to provide redundancy. If k is the number of original data blocks and n is the total number of encoded blocks (including projections), the systematic code will have the first k blocks as the original data and the remaining n - k blocks as projections.¶
To decode a file system block that has undergone the Systematic Mojette Transform, the following steps are followed:¶
Assume a file system block of 4KB is divided into a $64\times4$ matrix of 128-bit elements. Using the systematic Mojette Transform, we first compute projections along selected directions, such as (0,1), and (1,1). The original 4 blocks of 64 128-bit elements remains part of the encoded data, and the 2 additional projections are stored for redundancy. If a data loss occurs, the missing elements can be recovered by using the projections and solving the inverse problem.¶
The Mojette Transform provides two implementations for an efficient and effective way to enhance data reliability by encoding file system blocks with additional projections. These methods ensure that data can be reconstructed even in the presence of failures, thereby enhancing the fault tolerance of the file system.¶
In summary, the Mojette Transform offers robust solutions for data reliability in file systems, balancing redundancy, efficiency, and performance to ensure data integrity and quick recovery from failures.¶
By storing only encoded data and having the ability to rebuild the original block from any sufficient subset of projections, the non-systematic encoding implementation offers great flexibility and allows fast failure detection and recovery.¶
The non-systematic decoding algorithm is the most performant of all implementations. Although decoding always occurs, the overhead is low, and unlike systematic encoding, performance remains constant regardless of the number of failures.¶
Systematic Mojette coding reduces redundancy by integrating the original data blocks into the encoded data, unlike non-systematic codes that generate entirely new data from the original.¶
Fewer projections need to be calculated and stored, reducing both computational and storage overhead.¶
Decoding is faster and simpler, especially when some original data blocks are available, enabling quicker data access. However, performance is slightly degraded in the case of failures and depends on the number of failures.¶
ffv2_encoding_type is extended in Figure 3 to introduce two different erasure encoding types: FFV2_ENCODING_MOJETTE_SYSTEMATIC and FFV2_ENCODING_MOJETTE_NON_SYSTEMATIC. They are intoduced at this level instead of at the ffv2_encoding_data (see Figure 5) in order for the client to negotiate the support of one over the other with the NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED error (see Section ZZZ of [RFCTBD09]).¶
The ffv2_mojette_faulty_devices (see Figure 4) can be used in both the layout_hint (see both Section 5.12.4 of [RFC8881] and Section XXX of [RFCTBD09]) and the ffl_encoding_type_data (see Section YYY of [RFCTBD09]) to convey the distribution of FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_SPARE projection blocks (see Section XXX of [RFCTBD09]) in the layouts for the Flexible File Version 2 Layout Type. The name of each of the enum targets ends with 'X_Y', which states that there is a need for X+Y files to compose the projection. X is the number of FFV2_DS_FLAGS_ACTIVE blocks and Y is the number of FFV2_DS_FLAGS_SPARE blocks.¶
This addition of FFV2_ENCODING_MOJETTE_SYSTEMATIC and FFV2_ENCODING_MOJETTE_NON_SYSTEMATIC to the ffv2_encoding_type_data (see Figure 5) allows for the metadata server to inform the client as to the expected distribution of FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_SPARE projection blocks inside the ffm_data_servers arm for the Mojette erasure encoding types.¶
Consider a FFV2_ENCODING_MOJETTE_NON_SYSTEMATIC encoding type which needs 6 active blocks and 2 spare blocks in the payload (Not a valid ffv2_mojette_faulty_devices, but needed to show varied block sizes). As can be seen in Table 1 , 4kB data blocks need blocks about 1kB in size. But not all blocks are the same size.¶
Projection ID | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
---|---|---|---|---|---|---|---|---|
p value | -3 | -2 | -1 | 0 | 1 | 2 | 0 | 0 |
q value | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
size (bytes) | 1048 | 1080 | 1080 | 1048 | 1064 | 1048 | 0 | 0 |
This document contains the external data representation (XDR) [RFC4506] description of the uncacheable attribute. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the new flags:¶
<CODE BEGINS> #!/bin/sh grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??' <CODE ENDS>¶
That is, if the above script is stored in a file called 'extract.sh', and this document is in a file called 'spec.txt', then the reader can do:¶
<CODE BEGINS> sh extract.sh < spec.txt > mojette_encoding_prot.x <CODE ENDS>¶
The effect of the script is to remove leading white space from each line, plus a sentinel sequence of '///'. XDR descriptions with the sentinel sequence are embedded throughout the document.¶
Note that the XDR code contained in this document depends on types from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]), the Flex Files Layout Type flexfiles.x file (generated from [RFC8435]), and the Flex Files Layout Type version 2 flexfilesv2.x (generated from [RFCTBD09]). This includes both nfs types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.¶
While the XDR can be appended to that from [RFC7863], the various code snippets belong in their respective areas of that XDR.¶
This document has the same security considerations as both Flex Files Layout Type version 1 (see Section 15 of [RFC8435]) and NFSv4.2 (see Section 17 of [RFC7862]).¶
This document introduces changes in the 'Flex Files V2 Erasure Encoding Type Registry'. This document defines both the FFV2_ENCODING_MOJETTE_SYSTEMATIC and FFV2_ENCODING_MOJETTE_NON_SYSTEMATIC types for Client-Side Mojette Transformations.¶
Erasure Encoding Type Name | Value | RFC | How | Minor Versions |
---|---|---|---|---|
FFV2_ENCODING_MOJETTE_SYSTEMATIC | 2 | RFCTBD10 | L | 2 |
FFV2_ENCODING_MOJETTE_NON_SYSTEMATIC | 3 | RFCTBD10 | L | 2 |
The following from Hammerspace were instrumental in driving the incorporation of the Mojette Transformation into an encoding type for Flexible Files Version 2 Layout Type: David Flynn, Trond Myklebust, Tom Haynes, Didier Feron, Jean-Pierre Monchanin, Pierre Evenou, and Brian Pawlowski.¶
This section is to be removed before publishing as an RFC.¶
[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the RFC number of this document]¶
[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD09 with RFCyyyy where yyyy is the RFC number of the document: draft-haynes-nfsv4-erasure-encoding.xml]¶