Internet-Draft | erasure encoding | November 2024 |
Haynes | Expires 9 May 2025 | [Page] |
Parallel NFS (pNFS) allows a separation between the metadata (onto a metadata server) and data (onto a storage device) for a file. The Flexible File Version 2 Layout Type is defined in this document as an extension to pNFS that allows the use of storage devices that require only a limited degree of interaction with the metadata server and use already-existing protocols. Data replication is also added to provide integrity.¶
This note is to be removed before publishing as an RFC.¶
Discussion of this draft takes place on the NFSv4 working group mailing list ([email protected]), which is archived at https://mailarchive.ietf.org/arch/browse/nfsv4/. Working Group information can be found at https://datatracker.ietf.org/wg/nfsv4/about/.¶
This note is to be removed before publishing as an RFC.¶
This draft starts sparse and will be filled in as details are ironed out. For example, WRITE_BLOCK4 in Section 6.5 is presented as being WRITE4 (see Section 18.32 of [RFC8881]) plus some semantic changes. In the first draft, we simply explain the semantics changes. As these are accepted by the knowledgeable reviewers, we will flesh out the WRITE_BLOCK4 section to include sub-sections more akin to 18.32.3 and 18.32.4 of [RFC8881].¶
Except where called out, all the semantics of the Flexible File Version 1 Layout Type presented in [RFC8435] still apply. This new version extends it and does not replace it.¶
This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.¶
Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at https://datatracker.ietf.org/drafts/current/.¶
Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."¶
This Internet-Draft will expire on 9 May 2025.¶
Copyright (c) 2024 IETF Trust and the persons identified as the document authors. All rights reserved.¶
This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents (https://trustee.ietf.org/license-info) in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Revised BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Revised BSD License.¶
In Parallel NFS (pNFS) (see Section 12 of [RFC8881]), the metadata server returns layout type structures that describe where file data is located. There are different layout types for different storage systems and methods of arranging data on storage devices. [RFC8435] defined the Flexible File Version 1 Layout Type used with file-based data servers that are accessed using the NFS protocols: NFSv3 [RFC1813], NFSv4.0 [RFC7530], NFSv4.1 [RFC8881], and NFSv4.2 [RFC7862].¶
The Client Side Mirroring (see Section 8 of [RFC8435]), introduced with the first version of the Flexible File Layout Type, provides for replication of data but does not provide for integrity of data. In the event of an error, a user would be able to repair the file by silvering the mirror contents. I.e., they would pick one of the mirror instances and replicate it to the other instance locations.¶
However, lacking integrity checks, silent corruptions cannot be detected and the choice of what constitutes the good copy is difficult. This document updates the Flexible File Layout Type to version 2 by providing data integrity for erasure encoding. It introduces new variants of COMMIT4 (see Section 18.3 of [RFC8881]), READ4 (see Section 18.22 of [RFC8881]), and WRITE4 (see Section 18.32 of [RFC8881]) to allow for the transmission of integrity checking.¶
Using the process detailed in [RFC8178], the revisions in this document become an extension of NFSv4.2 [RFC7862]. They are built on top of the external data representation (XDR) [RFC4506] generated from [RFC7863].¶
The key words 'MUST', 'MUST NOT', 'REQUIRED', 'SHALL', 'SHALL NOT', 'SHOULD', 'SHOULD NOT', 'RECOMMENDED', 'NOT RECOMMENDED', 'MAY', and 'OPTIONAL' in this document are to be interpreted as described in BCP 14 [RFC2119] [RFC8174] when, and only when, they appear in all capitals, as shown here.¶
In order to introduce erasure encoding to pNFS, a new layout type of LAYOUT4_FLEX_FILES_V2 needs to be defined. While we could define a new layout type per erasure encoding type, there exist use cases where multiple erasure encoding types exist in the same layout.¶
The original layouttype4 introduced in [RFC8881] is modified as shown in Figure 1.¶
This document defines structures associated with the layouttype4 value LAYOUT4_FLEX_FILES_V2. [RFC8881] specifies the loc_body structure as an XDR type 'opaque'. The opaque layout is uninterpreted by the generic pNFS client layers but is interpreted by the Flexible File Version 2 Layout Type implementation. This section defines the structure of this otherwise opaque value, ffv2_layout4.¶
The ffv2_encoding_type (see Figure 2) encompasses a new IANA registry for 'Flex Files V2 Erasure Encoding Type Registry' (see Section 9.3). I.e., instead of defining a new Layout Type for each Erasure Encoding, we define a new Erasure Encoding Type. Except for FFV2_ENCODING_MIRRORED, each of the types is expected to employ the new operations in this document.¶
FFV2_ENCODING_MIRRORED offers replication of data and not integrity of data. As such, it does not need operations like WRITE_BLOCK4 (see Section 6.5).¶
ff_flags4 is defined as in Section 5.1 of [RFC8435] and is shown in Figure 3 for reference.¶
The ffv2_file_info4 is a new structure to help with the stateid issue discussed in Section 5.1 of [RFC8435]. I.e., in version 1 of the Flexible File Layout Type, there was the singleton ffds_stateid combined with the ffds_fh_vers array. I.e., each NFSv4 version has its own stateid. In Figure 4, each NFSv4 file handle has a one-to-one correspondence to a stateid.¶
The ffv2_layout4 (in Figure 5) flags detail the state of the data servers. With Erasure Encoding algorithms, there are both Systematic and Non-Systematic approaches. In the Systematic approach, the bits for integrity are placed amongst the resulting transformed block. Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_SPARE data servers. The FFV2_DS_FLAGS_SPARE ones allow the client to repair a payload without engaging the metadata server. I.e., if one of the FFV2_DS_FLAGS_ACTIVE data servers did not respond to a WRITE_BLOCK4, the client could fail the block over to the FFV2_DS_FLAGS_SPARE data server.¶
With the Non-Systematic approach, the data and integrity live on different data servers. Such an implementation would typically see FFV2_DS_FLAGS_ACTIVE and FFV2_DS_FLAGS_PARITY data servers. If the implementation wanted to allow for local repair, it would also use FFV2_DS_FLAGS_SPARE. Note that with a Non-Systematic approach, it is possible to update parts of the blocks, see Section 6.5.3.2.¶
See [Plank97] for further reference to storage layouts for encoding.¶
The ffv2_data_server4 (in Figure 6) describes a data file and how to access it via the different NFS protocols.¶
The ffv2_encoding_type_data (in Figure 7) describes erasure encoding type specific fields. I.e., this is how the encoding type can communicate the need for counts of active, spare, parity, and repair types of blocks.¶
The ffv2_mirror4 (in Figure 8) describes the Flexible File Layout Version 2 specific fields.¶
The ffv2_layout4 (in Figure 9) describes the Flexible Files Layout Version 2.¶
The ffv2_layouthint4 (in Figure 10) describes the layout_hint (see Section 5.12.4 of [RFC8881]) that the client can provide to the metadata server.¶
Note that effectively, multiple encoding types can be present in a Flexible Files Version 2 Layout Type layout. The ffv2_layout4 has an array of ffv2_mirror4, each of which has a ffv2_encoding_type. The main reason to allow for this is to provide for either the assimilation of a non-erasure encoded file to an erasure encoded file or the exporting of an erasure encoded file to a non-erasure encoded file.¶
Assume there is an additional ffv2_encoding_type of FFV2_ENCODING_REED_SOLOMON and it needs 4 active blocks, 2 parity blocks, and 2 spare blocks. The user wants to actively assimilate a regular file. As such, a layout might be as represented in Figure 11. As this is an assimilation, most of the data reads will be satisfied by READ4 (see Section 18.22 of [RFC8881]) calls to index 0. However, as this is also an active file, there could also be READ_BLOCK4 (see Section 6.3) calls to the other indexes.¶
When performing I/O via a FFV2_ENCODING_MIRRORED encoding type, the non-transformed data will be used, whereas with other encoding types, a metadata header and transformed block will be sent. Further, when reading data from the instance files, the client MUST be prepared to have one of the encoding types supply data and the other type not supply data. I.e., the READ_BLOCK4 call might return rlr_eof set to true (see Figure 37), which indicates that there is no data, whereas the READ4 call might return eof set to false, which indicates that there is data. The client MUST determine that there is in fact data.¶
An example use case is the active assimilation of a file to ensure integrity. As the client is helping to translate the file to the new encoding scheme, it is actively modifying the file. As such, it might be sequentially reading the file in order to translate it. The READ4 call would be returning data and the READ_BLOCK4 would not be returning data. As the client overwrites the file, the WRITE4 call and the WRITE_BLOCK4 call would both have data sent. Finally, if the client read back a section which had been modified earlier, both the READ4 and READ_BLOCK4 calls would return data.¶
Erasure Encoding takes a data block and transforms it to a payload to send to the data servers (see Figure 12). It generates a metadata header and transformed block per data server. The header is metadata information for the transformed block. From now on, the metadata is simply referred to as the header and the transformed block as the block. The payload of a data block is the set of generated headers and blocks for that data block.¶
The change_id is a unique identifier generated by the client to describe the current write transaction. The client_id is a unique identifier assigned by the metadata server to describe which client is making the current write transaction. The seq_id describes the index across the payload. The eff_len is the length of the data within the block. Finally, the crc32 is the 32-bit CRC calculated over the header (with the crc32 field set to 0) and the block. By combining the two parts of the payload, integrity is ensured for both parts.¶
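The checksum rule above can be sketched as follows. This is only an illustration under stated assumptions: the field widths, the big-endian packing, and the use of the CRC-32 polynomial implemented by Python's zlib module are not specified in the text above, and the helper names are hypothetical.

```python
import struct
import zlib

# Assumed header layout (hypothetical widths, big-endian as in XDR):
# change_id (64 bit), client_id (64 bit), seq_id, eff_len, crc32 (32 bit each).
HEADER_FMT = ">QQIII"


def pack_header(change_id, client_id, seq_id, eff_len, crc32=0):
    """Serialize the five header fields into bytes."""
    return struct.pack(HEADER_FMT, change_id, client_id, seq_id, eff_len, crc32)


def payload_crc(change_id, client_id, seq_id, eff_len, block):
    """CRC over the header (crc32 field zeroed) followed by the block."""
    header = pack_header(change_id, client_id, seq_id, eff_len, 0)
    return zlib.crc32(header + block) & 0xFFFFFFFF


def verify(change_id, client_id, seq_id, eff_len, crc32, block):
    """Recompute the CRC and compare it with the value carried in the header."""
    return payload_crc(change_id, client_id, seq_id, eff_len, block) == crc32
```

Because the header fields are covered by the same checksum as the block, a change to either part (for example, a different change_id) invalidates the stored crc32.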
While the data block might have a length of 4kB, that does not necessarily mean that the length of the block is 4kB. That length is determined by the erasure encoding type algorithm. For example, Reed Solomon might have 4kB blocks with the data integrity being provided by parity blocks. Another example would be the Mojette Transformation, which might have 1kB block lengths.¶
The payload contains redundancy which will allow the erasure encoding type algorithm to repair blocks in the payload as it is transformed back to a data block (see Figure 17). A payload is consistent when all of the contained headers share the same change_id and client_id. It has integrity when it is consistent and the blocks all pass the crc32 checks.¶
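The two definitions above (consistency and integrity) can be expressed as a small check. This is a minimal sketch: the Header tuple and the function names are hypothetical, and the per-block CRC results are assumed to have been computed elsewhere.

```python
from collections import namedtuple

# Hypothetical in-memory form of a decoded header.
Header = namedtuple("Header", "change_id client_id seq_id eff_len crc32")


def is_consistent(headers):
    """True when every header carries the same (change_id, client_id)."""
    owners = {(h.change_id, h.client_id) for h in headers}
    return len(owners) == 1


def has_integrity(headers, crc_ok):
    """True when the payload is consistent and every block passed its CRC.

    crc_ok is a list of booleans, one per block in the payload.
    """
    return is_consistent(headers) and all(crc_ok)
```

An empty payload is treated as not consistent in this sketch, which is a choice made here and not something the text prescribes.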
Each data block of the file resident in the client's cache of the file will be encoded into N different payloads to be sent to the data servers as shown in Figure 12. As WRITE_BLOCK4 (see Section 6.5) can encode multiple write_block4 into a single transaction, a more accurate description of a WRITE_BLOCK4 might be as in Figure 13.¶
Pay attention to the 128-bit alignment for wb_block_val. DF¶
This describes a 3 block write of data from an offset of 1 block in the file. As each block shares the wba_owner, it is only presented once. I.e., the data server will be able to construct the header for each wba_block from the wba_seq_id, wba_owner, wb_effective_len, and wb_crc.¶
Assuming that there were no issues, Figure 14 illustrates the results. The payload sequence id is implicit in the WRITE_BLOCK4args.¶
Assuming the header and payload as in Figure 15, the crc32 needs to be calculated in order to fill in the wb_crc field. In this case, the crc32 is calculated over the 5 fields as shown in the header and the data of the block. In this example, it is calculated to be 0x21de8. The resulting WRITE_BLOCK4 is shown in Figure 16.¶
When reading blocks via a READ_BLOCK4 operation, the client will decode the headers and payload into data blocks as shown in Figure 17. If the resulting data block is to be sized less than a data block, i.e., the rb_effective_len is less than the data block size, then the inverse transformation MUST fill the remainder of the data block with 0s. It must appear as a freshly written data block which was not completely filled.¶
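The zero-fill requirement above can be illustrated with a short helper. This is a sketch only: the function name and arguments are hypothetical, and the actual inverse transformation is specific to each erasure encoding type.

```python
def pad_data_block(decoded, effective_len, data_block_size):
    """Return a full-size data block, zero-filled past effective_len.

    decoded is the output of the (encoding-specific) inverse transformation.
    """
    if effective_len > data_block_size:
        raise ValueError("effective length exceeds the data block size")
    if len(decoded) < effective_len:
        raise ValueError("decoded data is shorter than the effective length")
    return decoded[:effective_len] + bytes(data_block_size - effective_len)
```

The result always has the full data block size, so it is indistinguishable from a freshly written block that was only partially filled.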
Note that at this time, the client could detect issues in the integrity of the data. The handling and repair are out of the scope of this document and MUST be addressed in the document describing each erasure encoding type.¶
Assuming the READ_BLOCK4 results as in Figure 18, the crc32 needs to be checked in order to ensure data integrity. Conceptually, a header and payload can be built as shown in Figure 19. The crc32 is calculated over the 5 fields as shown in the header and the 3kB of data block. In this example, it is calculated to be 0x21de8. Thus this payload for the data server has data integrity.¶
Unlike the regular NFSv4.2 I/O operations, the base unit of I/O in this document is the block. The raw data stream is encoded/decoded into blocks as described in Section 3. Each block has the concept of whether it is activated or pending activation. This is crucial in detecting write holes. A write hole occurs either when two different clients write to the same block concurrently or when a client overwrites existing data. In the first scenario, the order of writes is not deterministic and can result in a mixture of blocks in the payload. In the second scenario, network partitions or client restarts can result in partial writes. In both cases, the blocks have to be repaired, either by abandoning the new I/O or by sorting out the winner. Note that unlike the case of the encoding type detecting data integrity issues (see Section 3.2), the case of write holes is in the scope of this document.¶
What is out of scope of this document is the manner in which the data servers implement the semantics of the new operations. I.e., a data server might be able to leverage the native file system to achieve the semantics, or it might completely implement a multi-file approach to stage WRITE_BLOCK4 results and then shuffle blocks when the ACTIVATE_BLOCK4 or ROLLBACK_BLOCK4 operations arrive.¶
Consider a client which was in the middle of sending WRITE_BLOCK4 to a set of data servers and it crashes. Regardless of whether it comes back online or not, the metadata server can detect that the client had restarted when it had an outstanding LAYOUTIOMODE4_RW on the file. The metadata server can assign the file to a repair program, which would basically scan the entire file with READ_BLOCK_STATUS4. When it determines that it does not have enough payload blocks to rebuild the data block, it can determine that the I/O for that data block was not complete and throw away the blocks.¶
Note that the repair process can throw away the blocks by using the ROLLBACK_BLOCK4 operation to unstage the pending written blocks.¶
Consider a client which gets back conflicting information in the WRITE_BLOCK4 results. Assume that we had written to 6 data servers with WRITE_BLOCK4s as in Figure 20. And we get the results as in Figure 21.¶
Figure 21 shows that the first block was an overwrite and an activation has to be done in order for the newly written block to be returned in a READ_BLOCK4. Assume that the next four data servers had the same type of response.¶
But assume that data server 4 does not respond to the WRITE_BLOCK4 operation. While the client can detect this and send the WRITE_BLOCK4 to any data server marked as FFV2_DS_FLAGS_SPARE, it might decide to see if the data server did in fact do the transaction. It might also be the case that there are no data servers marked as FFV2_DS_FLAGS_SPARE. The client issues a READ_BLOCK_STATUS4 (see Figure 22) and gets the results in Figure 23. This indicates that data server 4 did not get the WRITE_BLOCK4 request.¶
In general, the client can resend the WRITE_BLOCK4 request, determine by the erasure encoding type that there are sufficient payload blocks present to decode the data block, or ROLLBACK_BLOCK4 the existing blocks to back out the change.¶
Assume that the client has written to 6 data servers with WRITE_BLOCK4s as in Figure 20. But now it gets back the conflicting results in Figure 24 and Figure 25. From this, it can detect that there was a race with another client. Note, even though both clients present the same bo_change_id, nothing can be inferred as to the ordering of the two transactions. In some cases, bo_client_id 10 won the race and in some cases, bo_client_id 6 won the race.¶
As a subsequent READ_BLOCK4 will produce garbage, the clients need to agree on how to fix this issue without any communication. A simplistic approach is for each client to retry the WRITE_BLOCK4 until such time as the payload is consistent. Note, this does not mean that both clients win, it just means that one of them wins.¶
Another option is for the clients to report a LAYOUTERROR4 (see Section 15.6 of [RFC7862]) to the metadata server with an error of NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT. That would then allow the metadata server to assign the repairing of the file.¶
Note that nothing prevents pending blocks from accumulating or more than 2 writers from trying to write the same payload. An example of such a WRITE_BLOCK4resok in response to the example of Figure 20 is shown in Figure 26. Not only has client 6 tried to update block 1, but clients 6, 7, and 20 are all attempting to update it.¶
In addition to the above write hole scenarios, a further complication is a racing reader and writer. If the client reads a block and determines that the payload is not consistent (i.e., not all of the payload blocks share the same client_id and change_id), then it can assume that it has encountered a race with another client writing to the file. It SHOULD retry the READ_BLOCK4 operation until payload consistency is achieved. It may determine to send a LAYOUTERROR4 to the metadata server with an error of NFS4ERR_ERASURE_ENCODING_NOT_CONSISTENT. And should it hang forever? Perhaps a new layout error that the client can send the MDS? Or should it probe with READ_BLOCK_STATUS4 to try to repair? TH Perhaps a LAYOUTERROR_BLOCK4 to send an encoding type specific location? TH¶
The client encountered a payload in which the blocks were inconsistent and stay inconsistent. As the client cannot tell if another client is actively writing, it informs the metadata server of this error via LAYOUTERROR4. The metadata server can then arrange for repair of the file.¶
Note that due to the opaqueness of the clientid4, the client cannot differentiate between boot instances of the metadata server or client, but the metadata server can do that differentiation. I.e., it can tell if the inconsistency is from the same client and whether that client is active and actively writing to the file (i.e., does the client have the file open and with a LAYOUTIOMODE4_RW layout?).¶
The client requested a ffv2_encoding_type which the metadata server does not support. I.e., if the client sends a layout_hint requesting an erasure encoding type that the metadata server does not support, this error code can be returned. The client might have to send the layout_hint several times to determine the overlapping set of supported erasure encoding types.¶
The client requested that the data server update the header only, and the data server cannot find a matching block at that offset.¶
When a data server connects to a metadata server, it can state its pNFS role via EXCHANGE_ID (see Section 18.35 of [RFC8881]). The data server can use EXCHGID4_FLAG_USE_ERASURE_DS (see Figure 27) to indicate that it supports the new NFSv4.2 operations introduced in this document. Section 13.1 of [RFC8881] describes the interaction of the various pNFS roles masked by EXCHGID4_FLAG_MASK_PNFS. However, that does not mask out EXCHGID4_FLAG_USE_ERASURE_DS. I.e., EXCHGID4_FLAG_USE_ERASURE_DS can be used in combination with all of the pNFS flags.¶
If the data server sets EXCHGID4_FLAG_USE_ERASURE_DS during the EXCHANGE_ID operation, then it MUST support: ACTIVATE_BLOCK4, READ_BLOCK_STATUS4, READ_BLOCK4, ROLLBACK_BLOCK4, and WRITE_BLOCK4. Further, note that this support is orthogonal to the Erasure Encoding Type selected. The data server is unaware of which type is driving the I/O. It is also unaware of the payload layout or what type of block it is serving.¶
The block_owner4 (see Figure 28) is used to determine when and by whom a block was written. The bo_block_id is used to identify the block and MUST be the index of the block within the file. I.e., it is the offset of the start of the block divided by the block len. The bo_client_id MUST be the client id handed out by the metadata server to the client as the eir_clientid during the EXCHANGE_ID results (see Section 18.35 of [RFC8881]) and MUST NOT be the client id supplied by the data server to the client. I.e., across all data files, the bo_client_id uniquely describes one and only one client.¶
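The bo_block_id rule above is simple integer arithmetic. The following sketch uses a hypothetical helper name and assumes that every block in the data file has the same length, which might not hold for erasure encoding types with differing block sizes.

```python
def block_id_for_offset(byte_offset, block_len):
    """Index of the block that starts at byte_offset.

    Assumes a fixed block length; the offset must be block aligned.
    """
    if block_len <= 0:
        raise ValueError("block length must be positive")
    if byte_offset % block_len != 0:
        raise ValueError("offset is not the start of a block")
    return byte_offset // block_len
```

For example, with 4096-byte blocks, the block starting at byte offset 40960 has a bo_block_id of 10.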
The bo_change_id is like the change attribute (see Section 5.8.1.4 of [RFC8881]) in that each block write by a given client has to have a unique bo_change_id. I.e., it can be determined to which transaction, across all data files, a block corresponds.¶
The bo_activated is used by the data server to indicate whether the block I/O was activated or pending activation. The first WRITE_BLOCK4 to a location is automatically activated if the WRITE_BLOCK_FLAGS_ACTIVATE_IF_EMPTY is set. Subsequent WRITE_BLOCK4 modifications to that block location are not automatically activated. The client has to ACTIVATE_BLOCK4 the block in order to get it activated.¶
The concept of automatically activating is dependent on the wba_stable field of the WRITE_BLOCK4args.¶
ACTIVATE_BLOCK4 is COMMIT4 (see Section 18.3 of [RFC8881]) with additional semantics over the block_owner activating the blocks. As such, all of the normal semantics of COMMIT4 directly apply.¶
The main difference between the two operations is that ACTIVATE_BLOCK4 works on blocks and not a raw data stream. As such aba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, aba_count is a count of blocks to activate and not bytes to activate.¶
Further, while it may appear that the combination of aba_offset and aba_count are redundant to aba_blocks, the purpose of aba_blocks is to allow the data server to differentiate between potentially multiple pending blocks.¶
READ_BLOCK_STATUS4 differs from READ_BLOCK4 in that it only reads active and pending headers in the desired data range.¶
READ_BLOCK4 is READ4 (see Section 18.22 of [RFC8881]) with additional semantics over the block_owner and the activation of blocks. As such, all of the normal semantics of READ4 directly apply.¶
The main difference between the two operations is that READ_BLOCK4 works on blocks and not a raw data stream. As such rba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, rba_count is a count of blocks to read and not bytes to read.¶
READ_BLOCK4 also only returns the activated block at the location. I.e., if a client overwrites a block at offset 10 and then tries to read the block without activating it, the original block is returned.¶
When reading a set of blocks across the data servers, it can be the case that some data servers do not have any data at that location. In that case, the server either returns rbr_eof if the rba_offset exceeds the number of blocks that the data server is aware of or it returns an empty block for that block.¶
For example, in Figure 39, the client asks for 4 blocks starting with the 3rd block in the file. The second data server responds as in Figure 40. The client would read this as there is valid data for blocks 2 and 4, there is a hole at block 3, and there is no data for block 5. Note that the data server MUST calculate a valid rb_crc for block 3 based on the generated fields.¶
ROLLBACK_BLOCK4 is a new form of COMMIT4 (see Section 18.3 of [RFC8881]) with additional semantics over the block_owner and the rolling back of written blocks. As such, all of the normal semantics of COMMIT4 directly apply.¶
The main difference between the two operations is that ROLLBACK_BLOCK4 works on blocks and not a raw data stream. As such rba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, rba_count is a count of blocks to rollback and not bytes to rollback.¶
Further, while it may appear that the combination of rba_offset and rba_count are redundant to rba_blocks, the purpose of rba_blocks is to allow the data server to differentiate between potentially multiple pending blocks.¶
ROLLBACK_BLOCK4 deletes prior WRITE_BLOCK4 transactions. In case of write holes, it allows the client to undo transactions to repair the file.¶
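The interplay of activated and pending blocks described for bo_activated, ACTIVATE_BLOCK4, READ_BLOCK4, and ROLLBACK_BLOCK4 can be modelled with a toy in-memory data server. This is a sketch under stated assumptions and not a proposed implementation: the class and method names are hypothetical, stability (wba_stable) and stateids are ignored, an owner is modelled as a (client_id, change_id) tuple, and a location is assumed to be empty only when it has neither an activated nor a pending block.

```python
class ToyDataServer:
    """In-memory model of the block states of a single data file."""

    def __init__(self):
        self.active = {}   # block_id -> (owner, data)
        self.pending = {}  # block_id -> {owner: data}

    def write_block(self, block_id, owner, data, activate_if_empty=False):
        """Stage a block; return the resulting bo_activated value."""
        empty = block_id not in self.active and not self.pending.get(block_id)
        if activate_if_empty and empty:
            self.active[block_id] = (owner, data)
            return True
        self.pending.setdefault(block_id, {})[owner] = data
        return False

    def read_block(self, block_id):
        """Return only the activated block, never a pending one."""
        entry = self.active.get(block_id)
        return entry[1] if entry else None

    def activate_block(self, block_id, owner):
        """Promote the pending block written by owner to activated."""
        data = self.pending.get(block_id, {}).pop(owner)
        self.active[block_id] = (owner, data)

    def rollback_block(self, block_id, owner):
        """Discard the pending block written by owner, if any."""
        self.pending.get(block_id, {}).pop(owner, None)
```

In this model an overwrite stays invisible to read_block until activate_block is called for the matching owner, which mirrors the behavior described for a client that overwrites the block at offset 10 and reads it back without activating it.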
WRITE_BLOCK4 is WRITE4 (see Section 18.32 of [RFC8881]) with additional semantics over the block_owner and the activation of blocks. As such, all of the normal semantics of WRITE4 directly apply.¶
The main difference between the two operations is that WRITE_BLOCK4 works on blocks and not a raw data stream. As such wba_offset is the starting block offset in the file and not the byte offset in the file. Some erasure encoding types can have different block sizes depending on the block type. Further, wbr_count is a count of written blocks and not written bytes.¶
If wba_stable is FILE_SYNC4, the data server MUST commit the written header and block data plus all file system metadata to stable storage before returning results. This corresponds to the NFSv2 protocol semantics. Any other behavior constitutes a protocol violation. If wba_stable is DATA_SYNC4, then the data server MUST commit all of the header and block data to stable storage and enough of the metadata to retrieve the data before returning. The data server implementer is free to implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a possible performance drop. If wba_stable is UNSTABLE4, the data server is free to commit any part of the header and block data and the metadata to stable storage, including all or none, before returning a reply to the client. There is no guarantee whether or when any uncommitted data will subsequently be committed to stable storage. The only guarantees made by the data server are that it will not destroy any data without changing the value of writeverf and that it will not commit the data and metadata at a level less than that requested by the client.¶
The activation of header and block data interacts with the bo_activated for each of the written blocks. If the data is not committed to stable storage then the bo_activated field MUST NOT be set to true. Once the data is committed to stable storage, then the data server can set the block's bo_activated if one of these conditions apply:¶
There are subtle interactions with write holes caused by racing clients. One client could win the race in each case, but because it used a wba_stable of UNSTABLE4, the subsequent writes from the second client with a wba_stable of FILE_SYNC4 can be awarded the bo_activated being set to true for each of the blocks in the payload.¶
Finally, the interaction of wba_stable can cause a client to mistakenly believe that by the time it gets the response of bo_activated of false, that the blocks are not activated. A subsequent READ_BLOCK4 or READ_BLOCK_STATUS4 might show that the bo_activated is true without any interaction by the client via ACTIVATE_BLOCK4. Automatic setting of bo_activated to true if it is the first write should be a performance boost. But it can lead to the client having incorrect information (as above) and trying to ACTIVATE_BLOCK4 a payload that has lost the race. But is that bad? If you have racing clients, there is no guarantee at all as to the contents of the file. TH¶
A guarded WRITE_BLOCK4 is when the writing of a block MUST fail if wba_guard.wbg_check is set and the target block does not have both the same change_id as the gbo_change_id and the same client_id as the gbo_client_id. This is useful in read-update-write scenarios. The client reads a block, updates it, and is prepared to write it back. It guards the write such that if another writer has modified the block, the data server will reject the modification.¶
Note that as the guard_block_owner4 (see Figure 46) does not have a block_id and the WRITE_BLOCK4 applies to all blocks in the range of wba_offset to the length of wba_data, each of the target blocks MUST have the same change_id and client_id. The client SHOULD present the smallest set of blocks possible to meet this requirement.¶
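The guard test can be sketched as a single predicate over the target blocks. The function and parameter names are hypothetical, and the sketch assumes that only one owner per target block is compared; whether pending blocks also take part in the comparison is not settled by the text.

```python
def guard_allows(target_owners, gbo_change_id, gbo_client_id, wbg_check=True):
    """Decide whether a guarded WRITE_BLOCK4 may proceed.

    target_owners is an iterable of (change_id, client_id) tuples, one per
    block in the range covered by the write.
    """
    if not wbg_check:
        return True  # the write is not guarded
    guard = (gbo_change_id, gbo_client_id)
    return all(owner == guard for owner in target_owners)
```

A single target block with a different change_id or client_id causes the whole write to be rejected, which is why presenting the smallest possible set of blocks reduces spurious failures.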
And the complexity goes up here. Does the DS reject only based on active blocks? Or can inactive ones also cause rejection? TH¶
Is the DS supposed to vet all blocks first or proceed to the first error? Or do all blocks and return an array of errors? (This last one is a no-go.) Also, if we do the vet first, what happens if a WRITE_BLOCK4 comes in after the vetting? Are we to lock the file during this process? Even if we do that, we still have the issue of multiple DSes. TH¶
Some erasure encoding types keep their blocks in plain text and have parity blocks in order to provide integrity. A common configuration for Reed Solomon is 4 active blocks, 2 parity blocks, and 2 spares. Assuming 4kB data blocks, then each payload delivers 16kB of data and 8kB of parity. If the application modifies the first data block, then all that needs to change is the first active block and the two parity blocks in the payload.¶
In any other approach, only 12kB of the total 24kB has to be written to storage. If that is attempted in the Flexible Files Version 2 Layout Type, then the payload will be deemed as inconsistent. The reason for this is that the change_id for the unmodified blocks will not match those of the modified blocks.¶
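The arithmetic behind the 12kB and 24kB figures can be checked with a few lines. The function names are hypothetical and the sketch assumes that all active and parity blocks have the same size.

```python
def payload_bytes(active, parity, block_size):
    """Total bytes in one payload: all active plus all parity blocks."""
    return (active + parity) * block_size


def partial_update_bytes(modified_active, parity, block_size):
    """Bytes rewritten when only some active blocks change.

    Every parity block is assumed to be recomputed and rewritten.
    """
    return (modified_active + parity) * block_size
```

With 4 active blocks, 2 parity blocks, and 4kB blocks, payload_bytes(4, 2, 4096) is 24576 bytes (24kB) and partial_update_bytes(1, 2, 4096) is 12288 bytes (12kB).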
The WRITE_BLOCK_FLAGS_UPDATE_HEADER_ONLY flag in wb_flags can be used to save the transmission of the blocks. If it is set, then the wb_block is ignored. It MUST be empty. Note that the client MUST modify only the wb_crc and the wba_owner.bo_change_id fields in this case. The wb_crc MUST change as the wba_owner.bo_change_id has been modified (see Section 3.1.1).¶
For the purpose of computing the activation state of the block, the data server MUST treat this as an overwrite. Thus, in the response, bo_activated MUST be false.¶
This document contains the external data representation (XDR) [RFC4506] description of the Flexible Files Version 2 Layout Type. The XDR description is embedded in this document in a way that makes it simple for the reader to extract into a ready-to-compile form. The reader can feed this document into the following shell script to produce the machine readable XDR description of the new flags:¶
    #!/bin/sh
    grep '^ *///' $* | sed 's?^ */// ??' | sed 's?^ *///$??'
That is, if the above script is stored in a file called 'extract.sh', and this document is in a file called 'spec.txt', then the reader can do:¶
sh extract.sh < spec.txt > erasure_coding_prot.x¶
The effect of the script is to remove leading white space from each line, plus a sentinel sequence of '///'. XDR descriptions with the sentinel sequence are embedded throughout the document.¶
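For readers without a POSIX shell, the same extraction can be approximated in Python. This is a sketch that mirrors the grep and sed steps of the script; the function name extract_xdr is hypothetical.

```python
import re


def extract_xdr(text):
    """Keep lines carrying the '///' sentinel and strip the sentinel."""
    out = []
    for line in text.splitlines():
        # Equivalent of: grep '^ *///'
        if not re.match(r" *///", line):
            continue
        # Equivalent of: sed 's?^ */// ??'
        line = re.sub(r"^ */// ", "", line, count=1)
        # Equivalent of: sed 's?^ *///$??'
        line = re.sub(r"^ *///$", "", line, count=1)
        out.append(line)
    return "\n".join(out) + ("\n" if out else "")
```

Lines that consist only of the sentinel become empty lines, so blank lines inside the embedded XDR are preserved.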
Note that the XDR code contained in this document depends on types from the NFSv4.2 nfs4_prot.x file (generated from [RFC7863]) and the Flexible Files Layout Type flexfiles.x file (generated from [RFC8435]). This includes both nfs types that end with a 4, such as offset4, length4, etc., as well as more generic types such as uint32_t and uint64_t.¶
While the XDR can be appended to that from [RFC7863], the various code snippets belong in their respective areas of that XDR.¶
This document has the same security considerations as both Flex Files Layout Type version 1 (see Section 15 of [RFC8435]) and NFSv4.2 (see Section 17 of [RFC7862]).¶
[RFC8881] introduced the 'pNFS Layout Types Registry'; new layout type numbers in this registry need to be assigned by IANA. This document defines the protocol associated with an existing layout type number: LAYOUT4_FLEX_FILES_V2 (see Table 1).¶
Layout Type Name | Value | RFC | How | Minor Versions |
---|---|---|---|---|
LAYOUT4_FLEX_FILES_V2 | 0x6 | RFCTBD10 | L | 1 |
[RFC8881] also introduced the 'NFSv4 Recallable Object Types Registry'. This document defines new recallable objects for RCA4_TYPE_MASK_FFV2_LAYOUT_MIN and RCA4_TYPE_MASK_FFV2_LAYOUT_MAX (see Table 2).¶
Recallable Object Type Name | Value | RFC | How | Minor Versions |
---|---|---|---|---|
RCA4_TYPE_MASK_FFV2_LAYOUT_MIN | 20 | RFCTBD10 | L | 1 |
RCA4_TYPE_MASK_FFV2_LAYOUT_MAX | 21 | RFCTBD10 | L | 1 |
This document introduces the 'Flexible Files Version 2 Layout Type Erasure Encoding Type Registry'. This document defines the FFV2_ENCODING_MIRRORED type for Client-Side Mirroring (see Table 3).¶
Erasure Encoding Type Name | Value | RFC | How | Minor Versions |
---|---|---|---|---|
FFV2_ENCODING_MIRRORED | 1 | RFCTBD10 | L | 2 |
The following from Hammerspace were instrumental in driving Flex Files v2: David Flynn, Trond Myklebust, Tom Haynes, Didier Feron, Jean-Pierre Monchanin, Pierre Evenou, and Brian Pawlowski.¶
Christoph Hellwig was instrumental in making sure Flexible Files Version 2 Layout Type was applicable to more than one Erasure-Encoding Type.¶
This section is to be removed before publishing as an RFC.¶
[RFC Editor: prior to publishing this document as an RFC, please replace all occurrences of RFCTBD10 with RFCxxxx where xxxx is the RFC number of this document]¶