Background
The current asuran repository format is roughly an 'improved' copy of the one borg uses.
This format works reasonably well for most purposes, and it is quite
possible to squeeze great performance out of it.
One weakness it does have, though, is that it fails to hide chunk lengths. Since both borg and asuran use content defined chunking, both formats are (theoretically) open to fingerprinting attacks based on stored chunk sizes.
Borg uses a mitigation approach based on randomizing the lookup table used inside BuzHash for the content defined chunking. Asuran currently employs a similar defense when operating with BuzHash CDC, but has no such mitigation when operating with the default FastCDC implementation.
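For context, that style of mitigation amounts to deriving the chunker's lookup table from a secret, so an attacker cannot reproduce the chunk boundaries for a known plaintext. A minimal sketch of the general idea (not borg's or asuran's actual code; the PRNG and the function names here are my own) might look like:

fn splitmix64(state: &mut u64) -> u64 {
    // Standard splitmix64 step, standing in for whatever keyed PRNG a real
    // implementation would use.
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical sketch: build the 256-entry BuzHash table from a secret key,
/// so the table (and therefore the chunk boundaries) cannot be predicted
/// without that key.
fn keyed_buzhash_table(key: u64) -> [u32; 256] {
    let mut state = key;
    let mut table = [0u32; 256];
    for entry in table.iter_mut() {
        *entry = splitmix64(&mut state) as u32;
    }
    table
}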
Such an approach is theoretically also possible to implement with FastCDC, as it also uses a 'random' lookup table as part of the algorithm. This 'table randomization' technique does provide a countermeasure against stored chunk length fingerprinting, by making it hard or impossible for an attacker to guess what size chunks a known plaintext will be broken up into. However, I do not believe it to be a good general approach to fixing this concern with content defined chunking.
Asuran is designed to be the archiver for new/future technologies. While
it may be possible to implement a similar kludge for FastCDC, I don't
believe it is a good general approach going forward. New algorithms may
exist in the future that provide even better performance than FastCDC,
and may not be amenable to such a countermeasure. Users should not be
expected to pay a performance cost for security when an alternative
approach exists.
Other Issues with the borg approach
Repository syncing
The borg approach, as a fundamental element of its design, does not produce the same set of chunks for a file when it is inserted into two separate repositories. Under this model, syncing an object directly from one repository to another requires either manual intervention at the time the repositories are initialized, or reconstruction and re-chunking of the object. The former is an error prone and not user friendly solution, and the latter imposes an unnecessary compute overhead, especially in cases where the remote repository already has most of the chunks.
Proposed Solution
Steps asuran already takes in this direction
The asuran format specification already does not require a priori
knowledge of the length of chunks to be able to pull them out of the
backend. Currently, when the repository wants to request a chunk from
the backend, it uses the following struct to tell the backend where the
chunk it wants is located:
pub struct SegmentDescriptor {
    pub segment_id: u64,
    pub start: u64,
}
Right now, the start value is the offset of the first byte of the chunk within the segment, and a format that includes length information in its encoding is used. This does not entirely avoid the issue, as the lengths of the chunks are still encoded in plaintext. However, the asuran API makes no assumptions about what the start value actually is, and we shall exploit this to change it to a table index that does not provide any direct information about chunk location or size.
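To make the current behavior concrete, a lookup today conceptually amounts to seeking to that byte offset and letting the self-describing MessagePack framing bound the chunk. A rough sketch of that idea, assuming Chunk implements Deserialize (the signature and I/O handling here are simplified assumptions, not the real asuran backend API):

use std::io::{Read, Seek, SeekFrom};

/// Rough sketch of the current lookup: `start` is a byte offset, and the
/// chunk's own MessagePack encoding tells the reader where it ends, which is
/// exactly why the chunk lengths are recoverable from the segment file.
fn read_chunk<F: Read + Seek>(
    segment: &mut F,
    descriptor: &SegmentDescriptor,
) -> Result<Chunk, rmp_serde::decode::Error> {
    segment
        .seek(SeekFrom::Start(descriptor.start))
        .map_err(rmp_serde::decode::Error::InvalidDataRead)?;
    rmp_serde::from_read(segment)
}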
Splitting Chunks
Currently, chunks are described with the following struct:
pub struct Chunk {
    /// The data of the chunk, stored as a vec of raw bytes
    #[serde(with = "serde_bytes")]
    data: Vec<u8>,
    /// Compression algorithm used
    compression: Compression,
    /// Encryption algorithm used, also stores IV
    encryption: Encryption,
    /// HMAC algorithm used
    hmac: HMAC,
    /// HMAC tag of the ciphertext bytes of this chunk
    #[serde(with = "serde_bytes")]
    mac: Vec<u8>,
    /// `ChunkID`, used for indexing in the repository and deduplication
    id: ChunkID,
}
Chunks are currently written to disk whole, and segment files are simple concatenations of MessagePack-encoded chunks.
As a step in resolving the issue, I will be splitting the chunk's on-disk representation into two new structs, a header containing the metadata:
pub struct ChunkHeader {
    compression: Compression,
    encryption: Encryption,
    hmac: HMAC,
    mac: Vec<u8>,
    id: ChunkID,
}
and a body, containing the actual data:
pub struct ChunkBody(Vec<u8>);
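A minimal sketch of how a chunk might be split into these two pieces, assuming the fields shown above (the split method name is my own, not necessarily what asuran will use):

impl Chunk {
    /// Hypothetical helper: separate a chunk into its metadata header and its
    /// raw (already compressed and encrypted) payload bytes.
    fn split(self) -> (ChunkHeader, ChunkBody) {
        let header = ChunkHeader {
            compression: self.compression,
            encryption: self.encryption,
            hmac: self.hmac,
            mac: self.mac,
            id: self.id,
        };
        (header, ChunkBody(self.data))
    }
}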
Splitting segment files
Currently, each segment consists of only one file, containing a
concatenation of serialized chunks.
Each segment will now be split into two files, N.data and N.header, where N is some non-negative base 10 integer.
The data file N.data will now contain a raw concatenation of the chunks' payload/body bytes, with no attached metadata. As these payloads will be encrypted, the data file will be effectively a stream of random bytes, and will provide no insight into where a chunk starts or ends.
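As a sketch of the intended write path (the names here are hypothetical), appending a chunk body to the data file and recording its extent might look like:

use std::io::{Seek, SeekFrom, Write};

/// Hypothetical sketch: append a chunk body to N.data and return the
/// (start, end) byte offsets that will later be stored alongside the chunk's
/// header in N.header.
fn append_body<F: Write + Seek>(
    data_file: &mut F,
    body: &ChunkBody,
) -> std::io::Result<(usize, usize)> {
    let start = data_file.seek(SeekFrom::End(0))? as usize;
    data_file.write_all(&body.0)?;
    Ok((start, start + body.0.len()))
}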
The N.header file will now contain a serialized Vec<(ChunkHeader, usize, usize)>, with the two usizes being the start and end offsets of the chunk body within the data file. Before being written to disk, this Vec will be serialized, packed into a Chunk, and that Chunk will be written out instead. As the entire thing will be serialized as one encrypted Chunk, the ChunkHeaders will leak no information, except for a rough guess of how many chunks are in a given segment. Even that can be counteracted by zero-padding the serialized Vec<(ChunkHeader, usize, usize)> up to a length that would allow storing the expected maximum number of chunks in the file.
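A rough sketch of that padding step, assuming MessagePack serialization through rmp_serde and a ChunkHeader that implements Serialize (the function name and the exact point where padding happens are my own assumptions):

/// Hypothetical sketch: serialize the header table and zero-pad it to a fixed
/// size so the resulting ciphertext does not reveal how many chunks the
/// segment actually holds. The real implementation would then wrap these
/// bytes in an encrypted Chunk before writing them to N.header.
fn pack_header_table(
    headers: &[(ChunkHeader, usize, usize)],
    padded_len: usize,
) -> Result<Vec<u8>, rmp_serde::encode::Error> {
    let mut bytes = rmp_serde::to_vec(headers)?;
    if bytes.len() < padded_len {
        bytes.resize(padded_len, 0);
    }
    Ok(bytes)
}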
This header file will contain a relatively small amount of information, and with the speed of modern encryption algorithms, it can afford to be rewritten in its entirety each time it changes. In-memory caching, and only rewriting the header file when a segment is closed with modifications, can further reduce this overhead.
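One way that caching might be structured, purely as a sketch (the SegmentWriter type and its methods are hypothetical):

/// Hypothetical sketch: keep the header table in memory while a segment is
/// open, and only serialize, encrypt, and rewrite N.header when the segment
/// is closed with modifications.
struct SegmentWriter {
    headers: Vec<(ChunkHeader, usize, usize)>,
    dirty: bool,
}

impl SegmentWriter {
    fn record(&mut self, header: ChunkHeader, start: usize, end: usize) {
        self.headers.push((header, start, end));
        self.dirty = true;
    }

    fn close(self) {
        if self.dirty {
            // Serialize the header table, pack it into an encrypted Chunk,
            // and rewrite N.header in full (omitted in this sketch).
        }
    }
}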
Change in the meaning of SegmentDescriptors
The start field in the SegmentDescriptor struct will be renamed to index, and instead of being the offset of the start of the chunk, it will now be the index of the chunk's header inside the array in the header file.
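Concretely, the descriptor would then look something like the following sketch (not necessarily the final definition), with lookups resolving through the header table rather than through a raw byte offset:

pub struct SegmentDescriptor {
    pub segment_id: u64,
    /// Index into the Vec<(ChunkHeader, usize, usize)> stored, encrypted, in
    /// N.header, rather than a byte offset into N.data.
    pub index: u64,
}

To read a chunk, the backend would first decrypt and deserialize the header table, look up the entry at index to recover the (start, end) offsets, and only then read those bytes back out of N.data.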