Workgroup:
Network File System Version 4
Internet-Draft:
draft-haynes-nfsv4-erasure-encoding-03
Published:
November 2024
Intended Status:
Standards Track
Expires:
9 May 2025
Author:
T. Haynes
Hammerspace

Erasure Encoding of Files in NFSv4.2

Abstract

Parallel NFS (pNFS) allows a separation between the metadata (onto a metadata server) and data (onto a storage device) for a file. The Flexible File Version 2 Layout Type is defined in this document as an extension to pNFS that allows the use of storage devices that require only a limited degree of interaction with the metadata server and use already-existing protocols. Erasure encoding is also added to provide data integrity.

This note is to be removed before publishing as an RFC.

Discussion of this draft takes place on the NFSv4 working group mailing list ([email protected]), which is archived at https://mailarchive.ietf.org/arch/browse/nfsv4/. Working Group information can be found at https://datatracker.ietf.org/wg/nfsv4/about/.

This note is to be removed before publishing as an RFC.

This draft starts sparse and will be filled in as details are ironed out. For example, WRITE_BLOCK4 in Section 6.5 is presented as being WRITE4 (see Section 18.32 of [RFC8881]) plus some semantic changes. In the first draft, we simply explain the semantic changes. As these are accepted by the knowledgeable reviewers, we will flesh out the WRITE_BLOCK4 section to include sub-sections more akin to 18.32.3 and 18.32.4 of [RFC8881].

Except where called out, all the semantics of the Flexible File Version 1 Layout Type presented in [RFC8435] still apply. This new version extends it and does not replace it.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on 9 May 2025.

1. Introduction

In Parallel NFS (pNFS) (see Section 12 of [RFC8881]), the metadata server returns layout type structures that describe where file data is located. There are different layout types for different storage systems and methods of arranging data on storage devices. [RFC8435] defined the Flexible File Version 1 Layout Type used with file-based data servers that are accessed using the NFS protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC8881], and NFSv4.2 [RFC7862].

The Client Side Mirroring (see Section 8 of [RFC8435]), introduced with the first version of the Flexible File Layout Type, provides for replication of data but does not provide for integrity of data. In the event of an error, a user would be able to repair the file by silvering the mirror contents. I.e., they would pick one of the mirror instances and replicate it to the other instance locations.

However, lacking integrity checks, silent corruptions cannot be detected, and the choice of what constitutes the good copy is difficult. This document updates the Flexible File Layout Type to version 2 by providing data integrity through erasure encoding. It introduces new variants of COMMIT4 (see Section 18.3 of [RFC8881]), READ4 (see Section 18.22 of [RFC8881]), and WRITE4 (see Section 18.32 of [RFC8881]) to allow for the transmission of integrity-checking information.

Using the process detailed in [RFC8178], the revisions in this document become an extension of NFSv4.2 [RFC7862]. They are built on top of the external data representation (XDR) [RFC4506] generated from [RFC7863].

1.1. Definitions

block:
One of the resulting blocks to be exchanged with a data server after a transformation has been applied to a data block. Note that the resulting block may be a different size than the data block.
Client Side Mirroring:
A file based replication method where copies are maintained in parallel.
data block:
A block of data in the client's cache for a file.
Erasure Encoding:
A data protection scheme where a block of data is divided into fragments and additional redundant fragments are added to achieve parity. The new blocks are stored in different locations.
Client Side Erasure Encoding:
A file-based integrity method where erasure-encoded blocks are maintained in parallel.
consistency of payload:
A payload is consistent when all contained blocks have the same owner, i.e., they share the same writing client and transaction id.
integrity of data:
Data integrity refers to the accuracy, consistency, and reliability of data throughout its life cycle.
payload:
The set of metadata header and transformed blocks generated per data block by the erasure encoding type. Note that the resulting blocks might be of type active, parity, spare, or repair.
replication of data:
Data replication is making and storing multiple copies of data in different locations.
write hole:
A write hole is a data corruption scenario where either two clients are trying to write to the same block or one client is overwriting an existing block of data.

1.2. Requirements Language

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.

2. Flexible File Version 2 Layout Type

In order to introduce erasure encoding to pNFS, a new layout type of LAYOUT4_FLEX_FILES_V2 needs to be defined. While we could define a new layout type per erasure encoding type, there exist use cases in which multiple erasure encoding types appear in the same layout.

The original layouttype4 introduced in [RFC8881] is modified as shown in Figure 1.

       enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 1,
           LAYOUT4_OSD2_OBJECTS    = 2,
           LAYOUT4_BLOCK_VOLUME    = 3,
           LAYOUT4_FLEX_FILES      = 4,
           LAYOUT4_FLEX_FILES_V2   = 5
       };

       struct layout_content4 {
           layouttype4             loc_type;
           opaque                  loc_body<>;
       };

       struct layout4 {
           offset4                 lo_offset;
           length4                 lo_length;
           layoutiomode4           lo_iomode;
           layout_content4         lo_content;
       };
Figure 1

This document defines structures associated with the layouttype4 value LAYOUT4_FLEX_FILES_V2. [RFC8881] specifies the loc_body structure as an XDR type 'opaque'. The opaque layout is uninterpreted by the generic pNFS client layers but is interpreted by the Flexible File Version 2 Layout Type implementation. This section defines the structure of this otherwise opaque value, ffv2_layout4.

2.1. ffv2_encoding_type

   /// enum ffv2_encoding_type {
   ///     FFV2_ENCODING_MIRRORED       = 0x1
   /// };
Figure 2

The ffv2_encoding_type (see Figure 2) is governed by a new IANA registry, the 'Flex Files V2 Erasure Encoding Type Registry' (see Section 9.3). I.e., instead of defining a new Layout Type for each Erasure Encoding, we define a new Erasure Encoding Type. Except for FFV2_ENCODING_MIRRORED, each of the types is expected to employ the new operations in this document.

FFV2_ENCODING_MIRRORED offers replication of data and not integrity of data. As such, it does not need operations like WRITE_BLOCK4 (see Section 6.5).

2.2. ff_flags4

   const FF_FLAGS_NO_LAYOUTCOMMIT   = 0x00000001;
   const FF_FLAGS_NO_IO_THRU_MDS    = 0x00000002;
   const FF_FLAGS_NO_READ_IO        = 0x00000004;
   const FF_FLAGS_WRITE_ONE_MIRROR  = 0x00000008;
   typedef uint32_t            ff_flags4;
Figure 3

ff_flags4 is defined as in Section 5.1 of [RFC8435] and is shown in Figure 3 for reference.

2.3. ffv2_file_info4

   /// struct ffv2_file_info4 {
   ///     stateid4                fffi_stateid;
   ///     nfs_fh4                 fffi_fh_vers;
   /// };
Figure 4

The ffv2_file_info4 is a new structure to help with the stateid issue discussed in Section 5.1 of [RFC8435]. I.e., in version 1 of the Flexible File Layout Type, a single ffds_stateid was combined with the ffds_fh_vers array, even though each NFSv4 version may require its own stateid. In Figure 4, each NFSv4 file handle has a one-to-one correspondence with a stateid.

2.4. ffv2_ds_flags4

   /// const FFV2_DS_FLAGS_ACTIVE        = 0x00000001;
   /// const FFV2_DS_FLAGS_SPARE         = 0x00000002;
   /// const FFV2_DS_FLAGS_PARITY        = 0x00000004;
   /// const FFV2_DS_FLAGS_REPAIR        = 0x00000008;
   /// typedef uint32_t            ffv2_ds_flags4;
Figure 5

The ffv2_ds_flags4 (in Figure 5) flags detail the state of the data servers. Erasure Encoding algorithms come in both Systematic and Non-Systematic variants. In the Systematic approach, the bits for integrity are placed amongst the resulting transformed blocks. Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_SPARE data servers. The FFV2_DS_FLAGS_SPARE ones allow the client to repair a payload without engaging the metadata server. I.e., if one of the FFV2_DS_FLAGS_ACTIVE data servers did not respond to a WRITE_BLOCK4, the client could fail the block over to the FFV2_DS_FLAGS_SPARE data server.

With the Non-Systematic approach, the data and integrity live on different data servers. Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_PARITY data servers. If the implementation wanted to allow for local repair, it would also use FFV2_DS_FLAGS_SPARE. Note that with a Non-Systematic approach, it is possible to update parts of the blocks, see Section 6.5.3.2.

See [Plank97] for further reference to storage layouts for encoding.

2.5. ffv2_data_server4

   /// struct ffv2_data_server4 {
   ///     deviceid4               ffds_deviceid;
   ///     uint32_t                ffds_efficiency;
   ///     ffv2_file_info4         ffds_file_info<>;
   ///     fattr4_owner            ffds_user;
   ///     fattr4_owner_group      ffds_group;
   ///     ffv2_ds_flags4          ffds_flags;
   /// };
Figure 6

The ffv2_data_server4 (in Figure 6) describes a data file and how to access it via the different NFS protocols.

2.6. ffv2_encoding_type_data

   /// union ffv2_encoding_type_data switch
   ///         (ffv2_encoding_type fetd_encoding) {
   ///     case FFV2_ENCODING_MIRRORED:
   ///         void;
   /// };
Figure 7

The ffv2_encoding_type_data (in Figure 7) describes erasure encoding type specific fields. I.e., this is how the encoding type can communicate the need for counts of active, spare, parity, and repair types of blocks.

2.7. ffv2_mirror4

   /// struct ffv2_mirror4 {
   ///     ffv2_data_server4       ffm_data_servers<>;
   ///     ffv2_encoding_type_data ffm_encoding_type_data;
   /// };
Figure 8

The ffv2_mirror4 (in Figure 8) describes the Flexible File Layout Version 2 specific fields.

2.8. ffv2_layout4

   /// struct ffv2_layout4 {
   ///     length4                 ffl_stripe_unit;
   ///     ffv2_mirror4            ffl_mirrors<>;
   ///     ff_flags4               ffl_flags;
   ///     uint32_t                ffl_stats_collect_hint;
   /// };
Figure 9

The ffv2_layout4 (in Figure 9) describes the Flexible Files Layout Version 2.

2.9. ffv2_layouthint4

/// union ffv2_mirrors_hint switch (ffv2_encoding_type ffmh_type) {
///     case FFV2_ENCODING_MIRRORED:
///         void;
/// };
///
/// struct ffv2_layouthint4 {
///     ffv2_encoding_type fflh_supported_types<>;
///     ffv2_mirrors_hint fflh_mirrors_hint;
/// };
Figure 10

The ffv2_layouthint4 (in Figure 10) describes the layout_hint (see Section 5.12.4 of [RFC8881]) that the client can provide to the metadata server.

2.10. Mixing of Encoding Types

Note that effectively, multiple encoding types can be present in a Flexible Files Version 2 Layout Type layout. The ffv2_layout4 has an array of ffv2_mirror4, each of which has a ffv2_encoding_type. The main reason to allow for this is to provide for either the assimilation of a non-erasure encoded file to an erasure encoded file or the exporting of an erasure encoded file to a non-erasure encoded file.

Assume there is an additional ffv2_encoding_type of FFV2_ENCODING_REED_SOLOMON and it needs 4 active blocks, 2 parity blocks, and 2 spare blocks. The user wants to actively assimilate a regular file. As such, a layout might be as represented in Figure 11. As this is an assimilation, most of the data reads will be satisfied by READ4 (see Section 18.22 of [RFC8881]) calls to index 0. However, as this is also an active file, there could also be READ_BLOCK4 (see Section 6.3) calls to the other indexes.

         +---------------------------------------------------+
         | ffv2_layout4:                                     |
         +---------------------------------------------------+
         |     ffl_mirrors[0]:                               |
         |         ffm_data_servers:                         |
         |             ffv2_data_server4[0]                  |
         |                 ffds_flags: 0                     |
         |         ffm_encoding: FFV2_ENCODING_MIRRORED      |
         +---------------------------------------------------+
         |     ffl_mirrors[1]:                               |
         |         ffm_data_servers:                         |
         |             ffv2_data_server4[0]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[1]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[2]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[3]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_ACTIVE  |
         |             ffv2_data_server4[4]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
         |             ffv2_data_server4[5]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_PARITY  |
         |             ffv2_data_server4[6]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
         |             ffv2_data_server4[7]                  |
         |                 ffds_flags: FFV2_DS_FLAGS_SPARE   |
         |     ffm_encoding: FFV2_ENCODING_REED_SOLOMON      |
         +---------------------------------------------------+
Figure 11

When performing I/O via a FFV2_ENCODING_MIRRORED encoding type, the non-transformed data will be used, whereas with other encoding types, a metadata header and transformed block will be sent. Further, when reading data from the instance files, the client MUST be prepared to have one of the encoding types supply data and the other type not supply data. I.e., the READ_BLOCK4 call might return rbr_eof set to true (see Figure 37), which indicates that there is no data, while the READ4 call might return eof set to false, which indicates that there is data. The client MUST determine whether there is in fact data.

An example use case is the active assimilation of a file to ensure integrity. As the client is helping to translate the file to the new encoding scheme, it is actively modifying the file. As such, it might be reading the file sequentially in order to translate it. The READ4 call would be returning data and the READ_BLOCK4 would not be returning data. As the client overwrites the file, the WRITE4 call and the WRITE_BLOCK4 call would both have data sent. Finally, if the client read back a section which had been modified earlier, both the READ4 and READ_BLOCK4 calls would return data.

3. Erasure Encoding

Erasure Encoding takes a data block and transforms it into a payload to send to the data servers (see Figure 12). It generates a metadata header and transformed block per data server. The header is metadata information for the transformed block. From now on, the metadata is simply referred to as the header and the transformed block as the block. The payload of a data block is the set of generated headers and blocks for that data block.

The change_id is a unique identifier generated by the client to describe the current write transaction. The client_id is a unique identifier assigned by the metadata server to describe which client is making the current write transaction. The seq_id describes the index of the block across the payload. The eff_len is the length of the data within the block. Finally, the crc32 is the 32-bit CRC calculation of the header (with the crc32 field being 0) and the block. By combining the two parts of the payload, integrity is ensured for both parts.

While the data block might have a length of 4kB, that does not necessarily mean that the length of the block is 4kB. That length is determined by the erasure encoding type algorithm. For example, Reed-Solomon might have 4kB blocks with the data integrity being provided by parity blocks. Another example would be the Mojette Transform, which might have 1kB block lengths.
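As a toy illustration of how the transmitted block length can differ from the data block length, the Python sketch below implements a single-parity systematic scheme. It is far simpler than Reed-Solomon or the Mojette Transform, and the function names are invented for this sketch:

```python
def xor_parity_encode(data_block: bytes, k: int) -> list:
    # Split the data block into k equal fragments (zero-padding the tail),
    # then append one parity fragment that is the XOR of the k fragments.
    # Each transmitted block is roughly len(data_block)/k bytes, not the
    # full data block size.
    frag_len = -(-len(data_block) // k)  # ceiling division
    padded = data_block.ljust(k * frag_len, b"\0")
    frags = [padded[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytearray(frag_len)
    for frag in frags:
        parity = bytearray(a ^ b for a, b in zip(parity, frag))
    return frags + [bytes(parity)]

def xor_parity_recover(frags: list) -> list:
    # Rebuild at most one missing fragment (marked None) by XORing the rest.
    missing = [i for i, f in enumerate(frags) if f is None]
    if len(missing) > 1:
        raise ValueError("single parity can repair only one lost fragment")
    if missing:
        frag_len = len(next(f for f in frags if f is not None))
        acc = bytearray(frag_len)
        for f in frags:
            if f is not None:
                acc = bytearray(a ^ b for a, b in zip(acc, f))
        frags[missing[0]] = bytes(acc)
    return frags
```

A real encoding type would tolerate more losses and define its own fragment geometry; the point here is only that the per-data-server blocks are a transformation of, and differently sized than, the original data block.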

The payload contains redundancy which will allow the erasure encoding type algorithm to repair blocks in the payload as it is transformed back to a data block (see Figure 17). A payload is consistent when all of the contained headers share the same change_id and client_id. It has integrity when it is consistent and the blocks all pass the crc32 checks.
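The consistency and integrity rules can be sketched in Python as follows; the fixed-width header serialization and the class/function names are illustrative assumptions for this sketch, not normative:

```python
import zlib
from dataclasses import dataclass

@dataclass
class Block:
    change_id: int   # client-generated write transaction id
    client_id: int   # metadata-server-assigned client id
    seq_id: int      # index of the block across the payload
    eff_len: int     # length of the valid data within the block
    crc32: int       # CRC computed with this field zeroed
    data: bytes

def header_bytes(b: Block, crc: int) -> bytes:
    # Illustrative fixed-width serialization of the five header fields.
    return b"".join(v.to_bytes(8, "big")
                    for v in (b.change_id, b.client_id, b.seq_id, b.eff_len, crc))

def block_crc(b: Block) -> int:
    # CRC32 over the header (crc32 field set to 0) followed by the block data.
    return zlib.crc32(header_bytes(b, 0) + b.data)

def payload_consistent(blocks) -> bool:
    # Consistent: all blocks share the same change_id and client_id.
    return len({(b.change_id, b.client_id) for b in blocks}) == 1

def payload_has_integrity(blocks) -> bool:
    # Integrity: consistent, and every block passes its CRC check.
    return payload_consistent(blocks) and all(block_crc(b) == b.crc32 for b in blocks)
```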

3.1. Encoding a Data Block

                      +-----------------+
                      |  data block     |
                      +-----------------+
                      |                 |
                      | 3kB data        |
                      |                 |
                      +-----------------+
                      | 1kB empty       |
                      +-------+---------+
                              |
                              |
       +----------------------+-----------------------+
       |      Erasure Encoding (Transform Forward)    |
       +----+-------------------------------------+---+
            |                                     |
            |                                     |
        +---+----------------+         +----------+---------+
        | HEADER             |         | HEADER             |
        +--------------------+         +--------------------+
        | change_id: 3       |         | change_id: 3       |
        | client_id: 6       |         | client_id: 6       |
        | seq_id   : 0       |         | seq_id   : 5       |
        | eff_len  : 3kB     |  ...    | eff_len  : 3kB     |
        | crc32    :         |         | crc32    :         |
        +--------------------+         +--------------------+
        | BLOCK              |         | BLOCK              |
        +--------------------+         +--------------------+
        | data: ....         |         | data: ....         |
        +--------------------+         +--------------------+
             Data Server 1                 Data Server 6
Figure 12

Each data block of the file resident in the client's cache will be encoded into N different blocks to be sent to the data servers as shown in Figure 12. As WRITE_BLOCK4 (see Section 6.5) can encode multiple write_block4 structures into a single transaction, a more accurate description of a WRITE_BLOCK4 might be as in Figure 13.

        +------------------------------------+
        | WRITE_BLOCK4args                   |
        +------------------------------------+
        | wba_stateid: 0                     |
        | wba_offset: 1                      |
        | wba_stable: FILE_SYNC4             |
        | wba_seq_id: 0                      |
        | wba_owner:                         |
        |            bo_change_id: 3         |
        |            bo_client_id: 6         |
        | wba_block[0]:                      |
        |            wb_crc    :  0x32ef89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        | wba_block[1]:                      |
        |            wb_crc    :  0x56fa89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        | wba_block[2]:                      |
        |            wb_crc    :  0x7693af   |
        |            wb_effective_len  : 3kB |
        |            wb_block  :  ......     |
        +------------------------------------+
Figure 13


This describes a 3-block write of data starting at an offset of 1 block in the file. As each block shares the wba_owner, it is only presented once. I.e., the data server will be able to construct the header for each wba_block from the wba_seq_id, wba_owner, wb_effective_len, and wb_crc.
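In sketch form, a data server might expand a WRITE_BLOCK4 request into per-block headers as below. It assumes, as one reading of the draft, that every block in the compound shares the wba_seq_id; all names beyond the XDR fields are invented:

```python
from dataclasses import dataclass

@dataclass
class BlockHeader:
    change_id: int
    client_id: int
    seq_id: int
    eff_len: int
    crc32: int

def expand_headers(wba_seq_id, wba_owner, wba_blocks):
    # The shared wba_owner supplies change_id and client_id for every
    # block; eff_len and crc come from each wba_block entry. The block's
    # position in the file follows from wba_offset plus its index, which
    # is tracked separately from the header itself.
    return [BlockHeader(change_id=wba_owner["bo_change_id"],
                        client_id=wba_owner["bo_client_id"],
                        seq_id=wba_seq_id,
                        eff_len=blk["wb_effective_len"],
                        crc32=blk["wb_crc"])
            for blk in wba_blocks]
```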

Assuming that there were no issues, Figure 14 illustrates the results. The payload sequence id is implicit in the WRITE_BLOCK4args.

        +-------------------------------+
        | WRITE_BLOCK4resok             |
        +-------------------------------+
        | wbr_count: 3                  |
        | wbr_committed: FILE_SYNC4     |
        | wbr_writeverf: 0xf1234abc     |
        | wbr_owners[0]:                |
        |            bo_block_id: 1     |
        |            bo_change_id: 3    |
        |            bo_client_id: 6    |
        |            bo_activated: true |
        | wbr_owners[1]:                |
        |            bo_block_id: 2     |
        |            bo_change_id: 3    |
        |            bo_client_id: 6    |
        |            bo_activated: true |
        | wbr_owners[2]:                |
        |            bo_block_id: 3     |
        |            bo_change_id: 3    |
        |            bo_client_id: 6    |
        |            bo_activated: true |
        +-------------------------------+
Figure 14

3.1.1. Calculating the CRC32

        +---+----------------+
        | HEADER             |
        +--------------------+
        | change_id: 7       |
        | client_id: 6       |
        | seq_id   : 0       |
        | eff_len  : 3kB     |
        | crc32    : 0       |
        +--------------------+
        | BLOCK              |
        +--------------------+
        | data:  ....        |
        +--------------------+
             Data Server 1
Figure 15

Assuming the header and block as in Figure 15, the crc32 needs to be calculated in order to fill in the wb_crc field. In this case, the crc32 is calculated over the 5 fields as shown in the header and the data of the block. In this example, it is calculated to be 0x21de8. The resulting WRITE_BLOCK4 is shown in Figure 16.

        +------------------------------------+
        | WRITE_BLOCK4args                   |
        +------------------------------------+
        | wba_stateid: 0                     |
        | wba_offset: 1                      |
        | wba_stable: FILE_SYNC4             |
        | wba_seq_id: 0                      |
        | wba_owner:                         |
        |            bo_change_id: 7         |
        |            bo_client_id: 6         |
        | wba_block[0]:                      |
        |            wb_crc    :  0x21de8    |
        |            wb_effective_len  : 3kB |
        |            wb_block  :  ......     |
        +------------------------------------+
Figure 16

3.2. Decoding a Data Block

             Data Server 1                 Data Server 6
        +--------------------+         +--------------------+
        | HEADER             |         | HEADER             |
        +--------------------+         +--------------------+
        | change_id: 1       |         | change_id: 1       |
        | client_id: 6       |         | client_id: 6       |
        | seq_id   : 0       |         | seq_id   : 5       |
        | eff_len  : 3kB     |  ...    | eff_len  : 3kB     |
        | crc32    :         |         | crc32    :         |
        +--------------------+         +--------------------+
        | BLOCK              |         | BLOCK              |
        +--------------------+         +--------------------+
        | data:  ....        |         | data:  ....        |
        +---+----------------+         +----------+---------+
            |                                     |
            |                                     |
       +----+-------------------------------------+---+
       |      Erasure Decoding (Transform Reverse)    |
       +----------------------+-----------------------+
                              |
                              |
                      +-------+---------+
                      |  data block     |
                      +-----------------+
                      |                 |
                      | 3kB data        |
                      |                 |
                      +-----------------+
                      | 1kB empty       |
                      +-----------------+
Figure 17

When reading blocks via a READ_BLOCK4 operation, the client will decode the headers and payload into data blocks as shown in Figure 17. If the resulting data block is sized less than a full data block, i.e., the rb_effective_len is less than the data block size, then the inverse transformation MUST fill the remainder of the data block with 0s. It must appear as a freshly written data block which was not completely filled.
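The zero-fill rule can be sketched with a hypothetical helper:

```python
def finalize_data_block(decoded: bytes, eff_len: int, block_size: int) -> bytes:
    # The inverse transform yields eff_len valid bytes; the remainder of
    # the data block must read back as zeros, as if the block had been
    # freshly written and not completely filled.
    if eff_len > block_size:
        raise ValueError("effective length exceeds the data block size")
    return decoded[:eff_len] + b"\0" * (block_size - eff_len)
```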

Note that at this time, the client could detect issues in the integrity of the data. The handling and repair are out of the scope of this document and MUST be addressed in the document describing each erasure encoding type.

3.2.1. Checking the CRC32

        +------------------------------------+
        | READ_BLOCK4resok                   |
        +------------------------------------+
        | rbr_eof: false                     |
        | rbr_blocks[0]:                     |
        |            rb_crc: 0x21de8         |
        |            rb_effective_len  : 3kB |
        |            rb_owner:               |
        |                 bo_block_id: 1     |
        |                 bo_change_id: 7    |
        |                 bo_client_id: 6    |
        |                 bo_activated: true |
        |            rb_block  :  ......     |
        +------------------------------------+
Figure 18

Assuming the READ_BLOCK4 results as in Figure 18, the crc32 needs to be checked in order to ensure data integrity. Conceptually, a header and block can be built as shown in Figure 19. The crc32 is calculated over the 5 fields as shown in the header and the 3kB of block data. In this example, it is calculated to be 0x21de8. Thus this payload for the data server has data integrity.

        +---+----------------+
        | HEADER             |
        +--------------------+
        | change_id: 7       |
        | client_id: 6       |
        | seq_id   : 0       |
        | eff_len  : 3kB     |
        | crc32    : 0       |
        +--------------------+
        | BLOCK              |
        +--------------------+
        | data:  ....        |
        +--------------------+
             Data Server 1
Figure 19

4. Blocks and Activating

Unlike the regular NFSv4.2 I/O operations, the base unit of I/O in this document is the block. The raw data stream is encoded/decoded into blocks as described in Section 3. Each block has the concept of whether it is activated or pending activation. This is crucial in detecting write holes. A write hole occurs either when two different clients write to the same block concurrently or when a client overwrites existing data. In the first scenario, the order of writes is not deterministic and can result in a mixture of blocks in the payload. In the last scenario, network partitions or client restarts can result in partial writes. In both cases, the blocks have to be repaired, either by abandoning the new I/O or by sorting out the winner. Note that unlike the case of the encoding type detecting data integrity issues (see Section 3.2), the case of write holes is in the scope of this document.

What is out of scope of this document is the manner in which the data servers implement the semantics of the new operations. I.e., the data servers might be able to leverage the native file system to achieve the semantics or it might completely implement a multi-file approach to stage WRITE_BLOCK4 results and then shuffle blocks when the ACTIVATE_BLOCK4 or ROLLBACK_BLOCK4 operations activate the data.

4.1. Dead or Partitioned Client

Consider a client which was in the middle of sending WRITE_BLOCK4 operations to a set of data servers when it crashed. Regardless of whether it comes back online, the metadata server can detect that the client restarted while it had an outstanding LAYOUTIOMODE4_RW on the file. The metadata server can assign the file to a repair program, which would scan the entire file with READ_BLOCK_STATUS4. When it determines that it does not have enough payload blocks to rebuild a data block, it can conclude that the I/O for that data block was not complete and throw away the blocks.

Note that the repair process can throw away the blocks by using the ROLLBACK_BLOCK4 operation to unstage the pending written blocks.
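The repair scan might be structured as below; read_block_status and rollback_block are hypothetical stand-ins for the READ_BLOCK_STATUS4 and ROLLBACK_BLOCK4 operations, and the threshold of payload blocks needed to rebuild a data block is encoding-type specific:

```python
def repair_scan(data_servers, file_len_blocks, min_blocks_needed,
                read_block_status, rollback_block):
    # Scan every block index in the file; where too few payload blocks
    # survive to rebuild the data block, the incomplete I/O is abandoned
    # by rolling back any pending (not yet activated) blocks.
    for block_id in range(file_len_blocks):
        statuses = [read_block_status(ds, block_id) for ds in data_servers]
        present = [s for s in statuses if s is not None]
        if len(present) < min_blocks_needed:
            for ds, status in zip(data_servers, statuses):
                if status is not None and not status["bo_activated"]:
                    rollback_block(ds, block_id)
```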

4.2. Client Overwrite

Consider a client which gets back conflicting information in the WRITE_BLOCK4 results. Assume that we had written to 6 data servers with WRITE_BLOCK4s as in Figure 20. And we get the results as in Figure 21.

        +------------------------------------+
        | WRITE_BLOCK4args                   |
        +------------------------------------+
        | wba_stateid: 0                     |
        | wba_offset: 1                      |
        | wba_stable: FILE_SYNC4             |
        | wba_seq_id: 0                      |
        | wba_owner:                         |
        |            bo_change_id: 3         |
        |            bo_client_id: 6         |
        | wba_block[0]:                      |
        |            wb_crc    :  0x32ef89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        | wba_block[1]:                      |
        |            wb_crc    :  0x56fa89   |
        |            wb_effective_len  : 4kB |
        |            wb_block  :  ......     |
        +------------------------------------+
Figure 20

Figure 21 shows that the first block was an overwrite and an activation has to be done in order for the newly written block to be returned in a READ_BLOCK4. Assume that the next four data servers had the same type of response.

                Data Server 1
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 2     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
Figure 21
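The overwrite detection can be sketched as follows; the field names follow Figure 21, but the grouping logic is illustrative:

```python
def needs_activation(wbr_owners):
    # Collect the block ids that carry a pending (not yet activated)
    # change; those blocks must be activated before the newly written
    # data is visible to READ_BLOCK4.
    pending = {o["bo_block_id"] for o in wbr_owners if not o["bo_activated"]}
    return sorted(pending)
```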

But assume that data server 4 does not respond to the WRITE_BLOCK4 operation. While the client can detect this and send the WRITE_BLOCK4 to any data server marked as FFV2_DS_FLAGS_SPARE, it might decide to see if the data server did in fact do the transaction. It might also be the case that there are no data servers marked as FFV2_DS_FLAGS_SPARE. The client issues a READ_BLOCK_STATUS4 (see Figure 22) and gets the results in Figure 23. This indicates that data server 4 did not get the WRITE_BLOCK4 request.

In general, the client can either resend the WRITE_BLOCK4 request, determine from the erasure encoding type that there are sufficient payload blocks present to decode the data block, or use ROLLBACK_BLOCK4 on the existing blocks to back out the change.

                Data Server 4
        +--------------------------------+
        | READ_BLOCK_STATUS4args         |
        +--------------------------------+
        | rbsa_stateid: 0                |
        | rbsa_offset: 1                 |
        | rbsa_count: 3                  |
        +--------------------------------+
Figure 22
                Data Server 4
        +--------------------------------+
        | READ_BLOCK_STATUS4resok        |
        +--------------------------------+
        | rbsr_eof: true                 |
        | rbsr_blocks[0]:                |
        |            bo_block_id: 1      |
        |            bo_change_id: 2     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
Figure 23
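The recovery choice described above (resend to a spare, decode from the remaining payload blocks, or roll back) can be sketched as follows. This is an illustrative Python sketch, not part of the protocol; the function and parameter names are invented for this example.

```python
# Illustrative sketch (not part of the protocol): choosing a recovery
# action after some data servers fail to acknowledge a WRITE_BLOCK4.

def recovery_action(acked, needed, have_spare):
    """Pick a recovery action.

    acked:      number of data servers that recorded the WRITE_BLOCK4
    needed:     minimum payload blocks the erasure encoding type
                requires to decode the data block
    have_spare: a data server marked FFV2_DS_FLAGS_SPARE is available
    """
    if acked >= needed:
        # Sufficient payload blocks are present to decode the data.
        return "decode"
    if have_spare:
        # Resend the WRITE_BLOCK4 to a spare data server.
        return "resend_to_spare"
    # Otherwise back out the change with ROLLBACK_BLOCK4.
    return "rollback"
```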

4.3. Racing Clients

Assume that the client has written to 6 data servers with WRITE_BLOCK4 operations as in Figure 20. But now it gets back the conflicting results in Figure 24 and Figure 25. From this, it can detect that there was a race with another client. Note that even though both clients present the same bo_change_id, nothing can be inferred as to the ordering of the two transactions. In some cases, bo_client_id 10 won the race, and in some cases, bo_client_id 6 won the race.

As a subsequent READ_BLOCK4 will produce garbage, the clients need to agree on how to fix this issue without any communication. A simplistic approach is for each client to retry the WRITE_BLOCK4 until the payload is consistent. Note that this does not mean that both clients win; it means that exactly one of them wins.

Another option is for the clients to report a LAYOUTERROR4 (see Section 15.6 of [RFC7862]) to the metadata server with an error of NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT. That would then allow the metadata server to assign the repairing of the file.

                Data Server 1
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 10    |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
Figure 24
                Data Server 2
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 10    |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
Figure 25

4.3.1. Multiple Writers

Note that nothing prevents pending blocks from accumulating or more than two writers from trying to write the same payload. An example of such a WRITE_BLOCK4resok in response to the example in Figure 20 is shown in Figure 26. Not only has client 6 tried to update block 1, but clients 6, 7, and 10 are all attempting to update it.

                Data Server 2
        +--------------------------------+
        | WRITE_BLOCK4resok              |
        +--------------------------------+
        | wbr_count: 2                   |
        | wbr_committed: FILE_SYNC4      |
        | wbr_writeverf: 0xf1234abc      |
        | wbr_owners[0]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        | wbr_owners[1]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 4     |
        |            bo_client_id: 6     |
        |            bo_activated: false |
        | wbr_owners[2]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 20    |
        |            bo_client_id: 7     |
        |            bo_activated: false |
        | wbr_owners[3]:                 |
        |            bo_block_id: 1      |
        |            bo_change_id: 3     |
        |            bo_client_id: 10    |
        |            bo_activated: false |
        | wbr_owners[4]:                 |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        +--------------------------------+
Figure 26

4.4. Reader and Writer Racing

In addition to the above write hole scenarios, a further complication is a racing reader and writer. If the client reads a block and determines that the payload is not consistent (i.e., not all of the payload blocks share the same client_id and change_id), then it can assume that it has encountered a race with another client writing to the file. It SHOULD retry the READ_BLOCK4 operation until payload consistency is achieved. It MAY instead send a LAYOUTERROR4 to the metadata server with an error of NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT. And should it hang forever? Perhaps a new layout error that the client can send the MDS? Or should it probe with READ_BLOCK_STATUS4 to try to repair? TH Perhaps a LAYOUTERROR_BLOCK4 to send an encoding type specific location? TH
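The payload consistency test described above, together with a bounded retry, can be sketched in Python. This is illustrative only, not part of the protocol; the names are invented for this example.

```python
# Illustrative sketch (not part of the protocol): the payload
# consistency test and a bounded READ_BLOCK4 retry loop.

def payload_consistent(owners):
    """A payload is consistent when every block shares the same
    (bo_client_id, bo_change_id) pair, i.e. all blocks come from a
    single WRITE_BLOCK4 transaction."""
    pairs = {(o["bo_client_id"], o["bo_change_id"]) for o in owners}
    return len(pairs) <= 1

def read_with_retry(read_block, max_tries=3):
    """Retry the read until the payload is consistent; otherwise
    report the inconsistency (via LAYOUTERROR4 in the protocol)."""
    for _ in range(max_tries):
        owners, data = read_block()
        if payload_consistent(owners):
            return data
    return "NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT"
```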

5. New Infrastructure

5.1. Errors

5.1.1. Error 10097 - NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT

The client encountered a payload in which the blocks were inconsistent and remained inconsistent. As the client cannot tell whether another client is actively writing, it informs the metadata server of this error via LAYOUTERROR4. The metadata server can then arrange for repair of the file.

Note that due to the opaqueness of the clientid4, the client cannot differentiate between boot instances of the metadata server or of another client, but the metadata server can make that differentiation. I.e., it can tell whether the inconsistency is from the same client and whether that client is actively writing to the file (i.e., whether the client holds the file open with a LAYOUTIOMODE4_RW layout).

5.1.2. Error 10098 - NFS4ERR_ERASURE_ENCODING_NOT_SUPPORTED

The client requested a ffv2_encoding_type that the metadata server does not support. I.e., if the client sends a layout_hint requesting an erasure encoding type that the metadata server does not support, this error code can be returned. The client might have to send the layout_hint several times to determine the overlapping set of supported erasure encoding types.

5.1.3. Error 10099 - NFS4ERR_ERASURE_ENCODING_BLOCK_MISMATCH

The client requested that the data server update only the header, and the data server cannot find a matching block at that offset.

5.2. EXCHGID4_FLAG_USE_PNFS_DS

/// const EXCHGID4_FLAG_USE_ERASURE_DS      = 0x00100000;
Figure 27

When a data server connects to a metadata server, it can state its pNFS role via EXCHANGE_ID (see Section 18.35 of [RFC8881]). The data server can use EXCHGID4_FLAG_USE_ERASURE_DS (see Figure 27) to indicate that it supports the new NFSv4.2 operations introduced in this document. Section 13.1 of [RFC8881] describes the interaction of the various pNFS roles masked by EXCHGID4_FLAG_MASK_PNFS. However, that mask does not cover EXCHGID4_FLAG_USE_ERASURE_DS. I.e., EXCHGID4_FLAG_USE_ERASURE_DS can be used in combination with any of the pNFS flags.

If the data server sets EXCHGID4_FLAG_USE_ERASURE_DS during the EXCHANGE_ID operation, then it MUST support: ACTIVATE_BLOCK4, READ_BLOCK_STATUS4, READ_BLOCK4, ROLLBACK_BLOCK4, and WRITE_BLOCK4. Further, note that this support is orthogonal to the Erasure Encoding Type selected. The data server is unaware of which type is driving the I/O. It is also unaware of the payload layout or what type of block it is serving.

5.3. Block Owner

/// struct block_owner4 {
///     uint32_t    bo_block_id;
///     changeid4   bo_change_id;
///     clientid4   bo_client_id;
///     bool        bo_activated;
/// };
Figure 28

The block_owner4 (see Figure 28) is used to determine when and by whom a block was written. The bo_block_id is used to identify the block and MUST be the index of the block within the file. I.e., it is the offset of the start of the block divided by the block length. The bo_client_id MUST be the client id handed out by the metadata server to the client as the eir_clientid during the EXCHANGE_ID results (see Section 18.35 of [RFC8881]) and MUST NOT be the client id supplied by the data server to the client. I.e., across all data files, the bo_client_id uniquely identifies one and only one client.
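The bo_block_id rule above can be shown with a one-line Python sketch (illustrative only; the function name is invented for this example):

```python
# Illustrative sketch (not part of the protocol): bo_block_id is the
# byte offset of the start of the block divided by the block length.

def block_id_for_offset(byte_offset, block_len):
    return byte_offset // block_len

# With 4kB blocks, the block starting at byte offset 40960 is block 10:
# block_id_for_offset(40960, 4096) == 10
```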

The bo_change_id is like the change attribute (see Section 5.8.1.4 of [RFC8881]) in that each block write by a given client has to have a unique bo_change_id. I.e., it can be determined to which transaction, across all data files, a block corresponds.

The bo_activated field is used by the data server to indicate whether the block is activated or pending activation. The first WRITE_BLOCK4 to a location is automatically activated if WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY is set. Subsequent WRITE_BLOCK4 modifications to that block location are not automatically activated. The client has to ACTIVATE_BLOCK4 the block in order to get it activated.

The concept of automatically activating is dependent on the wba_stable field of the WRITE_BLOCK4args.

6. New NFSv4.2 Operations

6.1. Operation 77: ACTIVATE_BLOCK4 - Activate Cached Block Data

6.1.1. ARGUMENTS

/// struct ACTIVATE_BLOCK4args {
///     /* CURRENT_FH: file */
///     offset4         aba_offset;
///     count4          aba_count;
///     block_owner4    aba_blocks<>;
/// };
Figure 29

6.1.2. RESULTS

/// struct ACTIVATE_BLOCK4resok {
///     verifier4       abr_writeverf;
/// };
Figure 30
/// union ACTIVATE_BLOCK4res switch (nfsstat4 abr_status) {
///     case NFS4_OK:
///         ACTIVATE_BLOCK4resok   abr_resok4;
///     default:
///         void;
/// };
Figure 31

6.1.3. DESCRIPTION

ACTIVATE_BLOCK4 is COMMIT4 (see Section 18.3 of [RFC8881]) with additional semantics over the block_owner for activating the blocks. As such, all of the normal semantics of COMMIT4 directly apply.

The main difference between the two operations is that ACTIVATE_BLOCK4 works on blocks and not a raw data stream. As such aba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, aba_count is a count of blocks to activate and not bytes to activate.

Further, while it may appear that the combination of aba_offset and aba_count is redundant with aba_blocks, the purpose of aba_blocks is to allow the data server to differentiate between potentially multiple pending blocks.

6.2. Operation 78: READ_BLOCK_STATUS4 - Read Block Commit Status from File

6.2.1. ARGUMENTS

/// struct READ_BLOCK_STATUS4args {
///     /* CURRENT_FH: file */
///     stateid4    rbsa_stateid;
///     offset4     rbsa_offset;
///     count4      rbsa_count;
/// };
Figure 32

6.2.2. RESULTS

/// struct READ_BLOCK_STATUS4resok {
///     bool            rbsr_eof;
///     block_owner4    rbsr_blocks<>;
/// };
Figure 33
/// union READ_BLOCK_STATUS4res switch (nfsstat4 rbsr_status) {
///     case NFS4_OK:
///         READ_BLOCK_STATUS4resok  rbsr_resok4;
///     default:
///         void;
/// };
Figure 34

6.2.3. DESCRIPTION

READ_BLOCK_STATUS4 differs from READ_BLOCK4 in that it only reads active and pending headers in the desired data range.

6.3. Operation 79: READ_BLOCK4 - Read Blocks from File

6.3.1. ARGUMENTS

/// struct READ_BLOCK4args {
///     /* CURRENT_FH: file */
///     stateid4    rba_stateid;
///     offset4     rba_offset;
///     count4      rba_count;
/// };
Figure 35

6.3.2. RESULTS

/// struct read_block4 {
///     uint32_t        rb_crc;
///     uint32_t        rb_effective_len;
///     block_owner4    rb_owner;
///     uint32_t        rb_seq_id;
///     opaque          rb_block<>;
/// };
Figure 36
/// struct READ_BLOCK4resok {
///     bool        rbr_eof;
///     read_block4 rbr_blocks<>;
/// };
Figure 37
/// union READ_BLOCK4res switch (nfsstat4 rbr_status) {
///     case NFS4_OK:
///          READ_BLOCK4resok     rbr_resok4;
///     default:
///          void;
/// };
Figure 38

6.3.3. DESCRIPTION

READ_BLOCK4 is READ4 (see Section 18.22 of [RFC8881]) with additional semantics over the block_owner and the activation of blocks. As such, all of the normal semantics of READ4 directly apply.

The main difference between the two operations is that READ_BLOCK4 works on blocks and not a raw data stream. As such, rba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, rba_count is a count of blocks to read and not bytes to read.

READ_BLOCK4 also returns only the activated block at the location. I.e., if a client overwrites a block at offset 10 and then tries to read the block without activating it, the original block is returned.

When reading a set of blocks across the data servers, it can be the case that some data servers do not have any data at a given location. In that case, the data server either sets rbr_eof, if the rba_offset exceeds the number of blocks of which it is aware, or returns an empty block for that location.

For example, in Figure 39, the client asks for 4 blocks starting with the 3rd block in the file. The second data server responds as in Figure 40. The client would read this as follows: there is valid data for blocks 2 and 4, there is a hole at block 3, and there is no data for block 5. Note that the data server MUST calculate a valid rb_crc for block 3 based on the generated fields.

                Data Server 2
        +--------------------------------+
        | READ_BLOCK4args                |
        +--------------------------------+
        | rba_stateid: 0                 |
        | rba_offset: 2                  |
        | rba_count: 4                   |
        +--------------------------------+
Figure 39
                Data Server 2
        +--------------------------------+
        | READ_BLOCK4resok               |
        +--------------------------------+
        | rbr_eof: true                  |
        | rbr_blocks[0]:                 |
        |     rb_crc: 0x3faddace         |
        |     rb_effective_len: 4kB      |
        |     rb_owner:                  |
        |            bo_block_id: 2      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        |     rb_seq_id: 1               |
        |     rb_block: ....             |
        | rbr_blocks[1]:                 |
        |     rb_crc: 0xdeade4e5         |
        |     rb_effective_len: 4kB      |
        |     rb_owner:                  |
        |            bo_block_id: 3      |
        |            bo_change_id: 0     |
        |            bo_client_id: 0     |
        |            bo_activated: false |
        |     rb_seq_id: 1               |
        |     rb_block: 0000...00000     |
        | rbr_blocks[2]:                 |
        |     rb_crc: 0x7778abcd         |
        |     rb_effective_len: 2kB      |
        |     rb_owner:                  |
        |            bo_block_id: 4      |
        |            bo_change_id: 3     |
        |            bo_client_id: 6     |
        |            bo_activated: true  |
        |     rb_seq_id: 1               |
        |     rb_block: ....             |
        +--------------------------------+
Figure 40
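The interpretation of a response such as Figure 40 can be sketched in Python. This is illustrative only, not part of the protocol; the hole criterion (an owner with bo_change_id and bo_client_id of 0) is inferred from Figure 40, and the names are invented for this example.

```python
# Illustrative sketch (not part of the protocol): labeling each
# requested block in a READ_BLOCK4resok as data, hole, or no data.

def classify_blocks(resok, first_block, count):
    # Index the returned blocks by their bo_block_id.
    returned = {b["bo_block_id"]: b for b in resok["rbr_blocks"]}
    labels = {}
    for block_id in range(first_block, first_block + count):
        b = returned.get(block_id)
        if b is None:
            # Past the blocks the data server is aware of.
            labels[block_id] = "no data" if resok["rbr_eof"] else "unknown"
        elif b["bo_change_id"] == 0 and b["bo_client_id"] == 0:
            # Generated (empty) block: a hole in the data file.
            labels[block_id] = "hole"
        else:
            labels[block_id] = "data"
    return labels
```

Applied to the response in Figure 40 (rba_offset 2, rba_count 4), this yields data for blocks 2 and 4, a hole at block 3, and no data for block 5.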

6.4. Operation 80: ROLLBACK_BLOCK4 - Rollback Cached Block Data

6.4.1. ARGUMENTS

/// struct ROLLBACK_BLOCK4args {
///     /* CURRENT_FH: file */
///     offset4         rba_offset;
///     count4          rba_count;
///     block_owner4    rba_blocks<>;
/// };
Figure 41

6.4.2. RESULTS

/// struct ROLLBACK_BLOCK4resok {
///     verifier4       rbr_writeverf;
/// };
Figure 42
/// union ROLLBACK_BLOCK4res switch (nfsstat4 rbr_status) {
///     case NFS4_OK:
///         ROLLBACK_BLOCK4resok   rbr_resok4;
///     default:
///         void;
/// };
Figure 43

6.4.3. DESCRIPTION

ROLLBACK_BLOCK4 is like COMMIT4 (see Section 18.3 of [RFC8881]) with additional semantics over the block_owner for rolling back the writing of blocks. As such, all of the normal semantics of COMMIT4 directly apply.

The main difference between the two operations is that ROLLBACK_BLOCK4 works on blocks and not a raw data stream. As such rba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, rba_count is a count of blocks to rollback and not bytes to rollback.

Further, while it may appear that the combination of rba_offset and rba_count is redundant with rba_blocks, the purpose of rba_blocks is to allow the data server to differentiate between potentially multiple pending blocks.

ROLLBACK_BLOCK4 deletes prior WRITE_BLOCK4 transactions. In the case of write holes, it allows the client to undo transactions to repair the file.

6.5. Operation 81: WRITE_BLOCK4 - Write Blocks to File

6.5.1. ARGUMENTS

/// const WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY  = 0x00000001;
/// const WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY   = 0x00000002;
Figure 44
/// struct write_block4 {
///     uint32_t        wb_crc;
///     uint32_t        wb_effective_len;
///     uint32_t        wb_flags;
///     opaque          wb_block<>;
/// };
Figure 45
/// struct guard_block_owner4 {
///     changeid4   gbo_change_id;
///     clientid4   gbo_client_id;
/// };
Figure 46
/// union write_block_guard4 switch (bool wbg_check) {
///     case TRUE:
///         guard_block_owner4   wbg_block_owner;
///     case FALSE:
///         void;
/// };
Figure 47
/// struct WRITE_BLOCK4args {
///     /* CURRENT_FH: file */
///     stateid4           wba_stateid;
///     offset4            wba_offset;
///     stable_how4        wba_stable;
///     block_owner4       wba_owner;
///     uint32_t           wba_seq_id;
///     write_block_guard4 wba_guard;
///     write_block4       wba_data<>;
/// };
Figure 48

6.5.2. RESULTS

/// struct WRITE_BLOCK4resok {
///     count4          wbr_count;
///     stable_how4     wbr_committed;
///     verifier4       wbr_writeverf;
///     block_owner4    wbr_owners<>;
/// };
Figure 49
/// union WRITE_BLOCK4res switch (nfsstat4 wbr_status) {
///     case NFS4_OK:
///         WRITE_BLOCK4resok    wbr_resok4;
///     default:
///         void;
/// };
Figure 50

6.5.3. DESCRIPTION

WRITE_BLOCK4 is WRITE4 (see Section 18.32 of [RFC8881]) with additional semantics over the block_owner and the activation of blocks. As such, all of the normal semantics of WRITE4 directly apply.

The main difference between the two operations is that WRITE_BLOCK4 works on blocks and not a raw data stream. As such wba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, wbr_count is a count of written blocks and not written bytes.

If wba_stable is FILE_SYNC4, the data server MUST commit the written header and block data plus all file system metadata to stable storage before returning results. This corresponds to the NFSv2 protocol semantics. Any other behavior constitutes a protocol violation.

If wba_stable is DATA_SYNC4, then the data server MUST commit all of the header and block data to stable storage and enough of the metadata to retrieve the data before returning. The data server implementer is free to implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a possible performance drop.

If wba_stable is UNSTABLE4, the data server is free to commit any part of the header and block data and the metadata to stable storage, including all or none, before returning a reply to the client. There is no guarantee whether or when any uncommitted data will subsequently be committed to stable storage. The only guarantees made by the data server are that it will not destroy any data without changing the value of writeverf and that it will not commit the data and metadata at a level less than that requested by the client.

The activation of header and block data interacts with the bo_activated field for each of the written blocks. If the data is not committed to stable storage, then the bo_activated field MUST NOT be set to true. Once the data is committed to stable storage, the data server can set the block's bo_activated field if one of these conditions applies:

  • it is the first write to that block and the WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY flag is set
  • the ACTIVATE_BLOCK4 is issued later for that block.
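The activation rules above can be condensed into a short Python sketch (illustrative only, not part of the protocol; the names are invented for this example):

```python
# Illustrative sketch (not part of the protocol): whether a data
# server may report bo_activated as true in a WRITE_BLOCK4 response.

def activated_on_write(committed, first_write, activate_if_empty):
    # Data not yet committed to stable storage MUST NOT be activated.
    if not committed:
        return False
    # Only the first write to a block is automatically activated, and
    # only when WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY is set; later
    # overwrites wait for an explicit ACTIVATE_BLOCK4.
    return first_write and activate_if_empty
```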

There are subtle interactions with write holes caused by racing clients. One client could win the race in each case, but because it used a wba_stable of UNSTABLE4, the subsequent writes from the second client with a wba_stable of FILE_SYNC4 can end up with bo_activated set to true for each of the blocks in the payload.

Finally, the interaction of wba_stable can cause a client to mistakenly believe that, because it got a response with bo_activated of false, the blocks are not activated. A subsequent READ_BLOCK4 or READ_BLOCK_STATUS4 might show that bo_activated is true without any interaction by the client via ACTIVATE_BLOCK4. Automatically setting bo_activated to true on the first write should be a performance boost. But it can lead to the client having incorrect information (as above) and trying to ACTIVATE_BLOCK4 a payload that has lost the race. But is that bad? If you have racing clients, there is no guarantee at all as to the contents of the file. TH

6.5.3.1. Guarding the Write

A WRITE_BLOCK4 is guarded when wba_guard.wbg_check is set; the writing of a block MUST then fail if the target block does not have both the same change_id as the gbo_change_id and the same client_id as the gbo_client_id. This is useful in read-update-write scenarios: the client reads a block, updates it, and is prepared to write it back. It guards the write such that if another writer has modified the block, the data server will reject the modification.

Note that as the guard_block_owner4 (see Figure 46) does not have a block_id and the WRITE_BLOCK4 applies to all blocks in the range of wba_offset to the length of wba_data, each of the target blocks MUST have the same change_id and client_id. The client SHOULD present the smallest possible set of blocks to meet this requirement.
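The guard check can be sketched in Python (illustrative only, not part of the protocol; which blocks participate in the check, e.g. whether pending blocks also cause rejection, is still an open issue in this draft, and the names are invented for this example):

```python
# Illustrative sketch (not part of the protocol): a guarded
# WRITE_BLOCK4 succeeds only if every target block matches the
# guard's (gbo_change_id, gbo_client_id) pair.

def guard_permits_write(wbg_check, guard, target_blocks):
    if not wbg_check:
        return True  # unguarded write
    return all(b["bo_change_id"] == guard["gbo_change_id"] and
               b["bo_client_id"] == guard["gbo_client_id"]
               for b in target_blocks)
```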

And the complexity goes up here. Does the DS reject only based on active blocks? Or can inactive ones also cause rejection? TH

Is the DS supposed to vet all blocks first or proceed to the first error? Or do all blocks and return an array of errors? (This last one is a no-go.) Also, if we do the vet first, what happens if a WRITE_BLOCK4 comes in after the vetting? Are we to lock the file during this process. Even if we do that, we still have the issue of multiple DSes. TH

6.5.3.2. Updating the Header Only

Some erasure encoding types keep their blocks in plain text and have parity blocks in order to provide integrity. A common configuration for Reed Solomon is 4 active blocks, 2 parity blocks, and 2 spares. Assuming 4kB data blocks, then each payload delivers 16kB of data and 8kB of parity. If the application modifies the first data block, then all that needs to change is the first active block and the two parity blocks in the payload.

With such an approach, only 12kB of the total 24kB has to be written to storage. If that is attempted with ordinary writes in the Flexible Files Version 2 Layout Type, the payload will be deemed inconsistent. The reason is that the change_id of the unmodified blocks will not match that of the modified blocks.

The WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY flag in wb_flags can be used to avoid transmitting the unmodified blocks. If it is set, then wb_block is ignored and MUST be empty. Note that the client MUST modify only the wb_crc and the wba_owner.bo_change_id fields in this case. The wb_crc MUST change because the wba_owner.bo_change_id has been modified (see Section 3.1.1).

For the purpose of computing the activation state of the block, the data server MUST treat this as an overwrite. Thus, in the response, bo_activated MUST be false.
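The savings from the header-only update can be sketched for the 4-data/2-parity configuration above. This is an illustrative Python sketch, not part of the protocol; it counts only the wb_block payload bytes and ignores the (small) header bytes, and the names are invented for this example.

```python
# Illustrative sketch (not part of the protocol): wb_block payload
# bytes sent when a client modifies some of the data blocks in an
# erasure-encoded payload.

def payload_bytes(data_blocks, parity_blocks, block_len,
                  modified, header_only):
    if header_only:
        # Only the modified data blocks and the parity blocks carry
        # wb_block data; the unmodified blocks ship headers only.
        return (modified + parity_blocks) * block_len
    # Without WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY, every block must
    # be rewritten to keep the payload's change_id consistent.
    return (data_blocks + parity_blocks) * block_len
```

For one modified 4kB block in a 4+2 payload, this gives 12kB with header-only updates versus 24kB without, matching the figures above.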

7. Extraction of XDR

This document contains the External Data Representation (XDR) [RFC4506] description of the Flexible Files Version 2 Layout Type. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine-readable XDR description of the new flags:

#!/bin/sh
grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'

That is, if the above script is stored in a file called 'extract.sh', and this document is in a file called 'spec.txt', then the reader can do:

sh extract.sh < spec.txt > erasure_coding_prot.x

The effect of the script is to remove leading white space from each line, plus a sentinel sequence of '///'. XDR descriptions with the sentinel sequence are embedded throughout the document.

Note that the XDR code contained in this document depends on types from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]) and the Flexible Files Layout Type flexfiles.x file (generated from [RFC8435]). This includes both NFS types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.

While the XDR can be appended to that from [RFC7863], the various code snippets belong in their respective areas of that XDR.

8. Security Considerations

This document has the same security considerations as both Flex Files Layout Type version 1 (see Section 15 of [RFC8435]) and NFSv4.2 (see Section 17 of [RFC7862]).

9. IANA Considerations

9.1. pNFS Layout Types Registry

[RFC8881] introduced the 'pNFS Layout Types Registry'; new layout type numbers in this registry need to be assigned by IANA. This document defines the protocol associated with an existing layout type number: LAYOUT4_FLEX_FILES_V2 (see Table 1).

   +-----------------------+-------+----------+-----+----------------+
   | Layout Type Name      | Value | RFC      | How | Minor Versions |
   +-----------------------+-------+----------+-----+----------------+
   | LAYOUT4_FLEX_FILES_V2 | 0x6   | RFCTBD10 | L   | 1              |
   +-----------------------+-------+----------+-----+----------------+

                    Table 1: Layout Type Assignments

9.2. NFSv4 Recallable Object Types Registry

[RFC8881] also introduced the 'NFSv4 Recallable Object Types Registry'. This document defines new recallable objects for RCA4_TYPE_MASK_FFV2_LAYOUT_MIN and RCA4_TYPE_MASK_FFV2_LAYOUT_MAX (see Table 2).

   +--------------------------------+-------+----------+-----+----------------+
   | Recallable Object Type Name    | Value | RFC      | How | Minor Versions |
   +--------------------------------+-------+----------+-----+----------------+
   | RCA4_TYPE_MASK_FFV2_LAYOUT_MIN | 20    | RFCTBD10 | L   | 1              |
   | RCA4_TYPE_MASK_FFV2_LAYOUT_MAX | 21    | RFCTBD10 | L   | 1              |
   +--------------------------------+-------+----------+-----+----------------+

               Table 2: Recallable Object Type Assignments

9.3. Flexible Files Version 2 Layout Type Erasure Encoding Type Registry

This document introduces the 'Flexible Files Version 2 Layout Type Erasure Encoding Type Registry'. This document defines the FFV2_ENCODING_MIRRORED type for Client-Side Mirroring (see Table 3).

   +----------------------------+-------+----------+-----+----------------+
   | Erasure Encoding Type Name | Value | RFC      | How | Minor Versions |
   +----------------------------+-------+----------+-----+----------------+
   | FFV2_ENCODING_MIRRORED     | 1     | RFCTBD10 | L   | 2              |
   +----------------------------+-------+----------+-----+----------------+

   Table 3: Flexible Files Version 2 Layout Type Erasure Encoding Type
                               Assignments

10. References

10.1. Normative References

[RFC2119]
Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, DOI 10.17487/RFC2119, March 1997, <https://www.rfc-editor.org/info/rfc2119>.
[RFC4506]
Eisler, M., Ed., "XDR: External Data Representation Standard", STD 67, RFC 4506, DOI 10.17487/RFC4506, May 2006, <https://www.rfc-editor.org/info/rfc4506>.
[RFC7530]
Haynes, T., Ed. and D. Noveck, Ed., "Network File System (NFS) Version 4 Protocol", RFC 7530, DOI 10.17487/RFC7530, March 2015, <https://www.rfc-editor.org/info/rfc7530>.
[RFC7862]
Haynes, T., "Network File System (NFS) Version 4 Minor Version 2 Protocol", RFC 7862, DOI 10.17487/RFC7862, November 2016, <https://www.rfc-editor.org/info/rfc7862>.
[RFC7863]
Haynes, T., "Network File System (NFS) Version 4 Minor Version 2 External Data Representation Standard (XDR) Description", RFC 7863, DOI 10.17487/RFC7863, November 2016, <https://www.rfc-editor.org/info/rfc7863>.
[RFC8174]
Leiba, B., "Ambiguity of Uppercase vs Lowercase in RFC 2119 Key Words", BCP 14, RFC 8174, DOI 10.17487/RFC8174, May 2017, <https://www.rfc-editor.org/info/rfc8174>.
[RFC8178]
Noveck, D., "Rules for NFSv4 Extensions and Minor Versions", RFC 8178, DOI 10.17487/RFC8178, July 2017, <https://www.rfc-editor.org/info/rfc8178>.
[RFC8435]
Halevy, B. and T. Haynes, "Parallel NFS (pNFS) Flexible File Layout", RFC 8435, DOI 10.17487/RFC8435, August 2018, <https://www.rfc-editor.org/info/rfc8435>.
[RFC8881]
Noveck, D., Ed. and C. Lever, Ed., "Network File System (NFS) Version 4 Minor Version 1 Protocol", RFC 8881, DOI 10.17487/RFC8881, August 2020, <https://www.rfc-editor.org/info/rfc8881>.

10.2. Informative References

[Plank97]
Plank, J., "A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems", September 1997, <http://web.eecs.utk.edu/~jplank/plank/papers/CS-96-332.html>.
[RFC1813]
Callaghan, B., Pawlowski, B., and P. Staubach, "NFS Version 3 Protocol Specification", RFC 1813, DOI 10.17487/RFC1813, June 1995, <https://www.rfc-editor.org/info/rfc1813>.

Appendix A. Acknowledgments

The following from Hammerspace were instrumental in driving Flex Files v2: David Flynn, Trond Myklebust, Tom Haynes, Didier Feron, Jean-Pierre Monchanin, Pierre Evenou, and Brian Pawlowski.

Christoph Hellwig was instrumental in making sure the Flexible Files Version 2 Layout Type was applicable to more than one Erasure Encoding Type.

Appendix B. RFC Editor Notes

This section is to be removed before publishing as an RFC.

[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the RFC number of this document]

Author's Address

Thomas Haynes
Hammerspace