Background
The current asuran repository format is roughly an 'improved' copy of the one borg uses.
This format works reasonably well for most purposes, and it is quite
possible to squeeze great performance out of it.
One weakness it does have, though, is that it fails to hide chunk lengths. Since both borg and asuran use content defined chunking, both formats are (theoretically) open to fingerprinting attacks based on stored chunk sizes.
Borg uses a mitigation approach based on randomizing the lookup table used inside BuzHash for the content defined chunking. Asuran currently employs a similar defense when operating with BuzHash CDC, but has no such mitigation when operating with the default FastCDC implementation.
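For context, that style of mitigation amounts to deriving the chunker's lookup table from a secret, so an attacker cannot reproduce the chunk boundaries for a known plaintext. A minimal sketch of the general idea (not borg's or asuran's actual code; the PRNG and the function names here are my own) might look like:

fn splitmix64(state: &mut u64) -> u64 {
    // Standard splitmix64 step, standing in for whatever keyed PRNG a real
    // implementation would use.
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Hypothetical sketch: build the 256-entry BuzHash table from a secret key,
/// so the table (and therefore the chunk boundaries) cannot be predicted
/// without that key.
fn keyed_buzhash_table(key: u64) -> [u32; 256] {
    let mut state = key;
    let mut table = [0u32; 256];
    for entry in table.iter_mut() {
        *entry = splitmix64(&mut state) as u32;
    }
    table
}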
Such an approach is theoretically also possible to implement with FastCDC, as it also uses a 'random' lookup table as part of the algorithm. This 'table randomization' technique does provide a countermeasure against stored chunk length fingerprinting, by making it hard or impossible for an attacker to guess what size chunks a known plaintext will be broken up into. However, I do not believe it to be a good general approach to fixing this concern with content defined chunking.
Asuran is designed to be the archiver for new/future technologies. While
it may be possible to implement a similar kludge for FastCDC, I don't
believe it is a good general approach going forward. New algorithms may
exist in the future that provide even better performance than FastCDC,
and may not be amenable to such a countermeasure. Users should not be
expected to pay a performance cost for security when an alternative
approach exists.
Other Issues with the borg approach
Repository syncing
The borg approach, as a fundamental element of its design, does not produce the same set of chunks for a file when it is inserted into two separate repositories. Under this model, syncing an object directly from one repository to another requires either manual intervention at the time the repositories are initialized, or reconstruction and re-chunking of the object. The former is an error prone and not user friendly solution, and the latter imposes an unnecessary compute overhead, especially in cases where the remote repository already has most of the chunks.
Proposed Solution
Steps asuran already takes in this direction
The asuran format specification already does not require a priori
knowledge of the length of chunks to be able to pull them out of the
backend. Currently, when the repository wants to request a chunk from
the backend, it uses the following struct to tell the backend where the
chunk it wants is located:
pub struct SegmentDescriptor {
    pub segment_id: u64,
    pub start: u64,
}
Right now, the start value is the offset of the first byte of the chunk within the segment, and a format that includes length information in its encoding is used. This does not entirely avoid the issue, as the lengths of the chunks are still encoded in plaintext. However, the asuran API makes no assumptions about what the start value actually is, and we shall exploit this to change it to a table index that does not provide any direct information about chunk location or size.
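To make the current behavior concrete, a lookup today conceptually amounts to seeking to that byte offset and letting the self-describing MessagePack framing bound the chunk. A rough sketch of that idea, assuming Chunk implements Deserialize (the signature and I/O handling here are simplified assumptions, not the real asuran backend API):

use std::io::{Read, Seek, SeekFrom};

/// Rough sketch of the current lookup: `start` is a byte offset, and the
/// chunk's own MessagePack encoding tells the reader where it ends, which is
/// exactly why the chunk lengths are recoverable from the segment file.
fn read_chunk<F: Read + Seek>(
    segment: &mut F,
    descriptor: &SegmentDescriptor,
) -> Result<Chunk, rmp_serde::decode::Error> {
    segment
        .seek(SeekFrom::Start(descriptor.start))
        .map_err(rmp_serde::decode::Error::InvalidDataRead)?;
    rmp_serde::from_read(segment)
}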
Splitting Chunks
Currently, chunks are described with the following struct:
pub struct Chunk {
    /// The data of the chunk, stored as a vec of raw bytes
    #[serde(with = "serde_bytes")]
    data: Vec<u8>,
    /// Compression algorithm used
    compression: Compression,
    /// Encryption algorithm used, also stores IV
    encryption: Encryption,
    /// HMAC algorithm used
    hmac: HMAC,
    /// HMAC tag of the ciphertext bytes of this chunk
    #[serde(with = "serde_bytes")]
    mac: Vec<u8>,
    /// `ChunkID`, used for indexing in the repository and deduplication
    id: ChunkID,
}
Chunks are currently written to disk whole, and segment files are simple concatenations of MessagePack-encoded chunks.
As a step in resolving the issue, I will be splitting the chunk's on-disk representation into two new structs, a header containing the metadata:
pub struct ChunkHeader {
    compression: Compression,
    encryption: Encryption,
    hmac: HMAC,
    mac: Vec<u8>,
    id: ChunkID,
}
and a body, containing the actual data:
pub struct ChunkBody(Vec<u8>);
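A minimal sketch of how a chunk might be split into these two pieces, assuming the fields shown above (the split method name is my own, not necessarily what asuran will use):

impl Chunk {
    /// Hypothetical helper: separate a chunk into its metadata header and its
    /// raw (already compressed and encrypted) payload bytes.
    fn split(self) -> (ChunkHeader, ChunkBody) {
        let header = ChunkHeader {
            compression: self.compression,
            encryption: self.encryption,
            hmac: self.hmac,
            mac: self.mac,
            id: self.id,
        };
        (header, ChunkBody(self.data))
    }
}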
Splitting segment files
Currently, each segment consists of only one file, containing a
concatenation of serialized chunks.
Each segment will now be split into two files, N.data and N.header, where N is some non-negative base 10 integer.
The data file N.data will now contain a raw concatenation of the chunks' payload/body bytes, with no attached metadata. As these payloads will be encrypted, the data file will be effectively a stream of random bytes, and will provide no insight into where a chunk starts or ends.
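As a sketch of the intended write path (the names here are hypothetical), appending a chunk body to the data file and recording its extent might look like:

use std::io::{Seek, SeekFrom, Write};

/// Hypothetical sketch: append a chunk body to N.data and return the
/// (start, end) byte offsets that will later be stored alongside the chunk's
/// header in N.header.
fn append_body<F: Write + Seek>(
    data_file: &mut F,
    body: &ChunkBody,
) -> std::io::Result<(usize, usize)> {
    let start = data_file.seek(SeekFrom::End(0))? as usize;
    data_file.write_all(&body.0)?;
    Ok((start, start + body.0.len()))
}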
The N.header file will now contain a serialized Vec<(ChunkHeader, usize, usize)>, with the two usizes being the start and end offsets of the chunk body within the data file. Before being written to disk, this Vec will be serialized, packed into a Chunk, and that Chunk will be written out instead. As the entire thing will be serialized as one encrypted Chunk, the ChunkHeaders will leak no information, except for a rough guess of how many chunks are in a given segment. Even that can be counteracted by zero-padding the serialized Vec<(ChunkHeader, usize, usize)> up to a length that would allow storing the expected maximum number of chunks in the file.
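A rough sketch of that padding step, assuming MessagePack serialization through rmp_serde and a ChunkHeader that implements Serialize (the function name and the exact point where padding happens are my own assumptions):

/// Hypothetical sketch: serialize the header table and zero-pad it to a fixed
/// size so the resulting ciphertext does not reveal how many chunks the
/// segment actually holds. The real implementation would then wrap these
/// bytes in an encrypted Chunk before writing them to N.header.
fn pack_header_table(
    headers: &[(ChunkHeader, usize, usize)],
    padded_len: usize,
) -> Result<Vec<u8>, rmp_serde::encode::Error> {
    let mut bytes = rmp_serde::to_vec(headers)?;
    if bytes.len() < padded_len {
        bytes.resize(padded_len, 0);
    }
    Ok(bytes)
}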
This header file will contain a relatively small amount of information, and with the speed of modern encryption algorithms, it can afford to be rewritten in its entirety each time it changes. In-memory caching, and only rewriting the header file when a segment is closed with modifications, can further reduce this overhead.
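One way that caching might be structured, purely as a sketch (the SegmentWriter type and its methods are hypothetical):

/// Hypothetical sketch: keep the header table in memory while a segment is
/// open, and only serialize, encrypt, and rewrite N.header when the segment
/// is closed with modifications.
struct SegmentWriter {
    headers: Vec<(ChunkHeader, usize, usize)>,
    dirty: bool,
}

impl SegmentWriter {
    fn record(&mut self, header: ChunkHeader, start: usize, end: usize) {
        self.headers.push((header, start, end));
        self.dirty = true;
    }

    fn close(self) {
        if self.dirty {
            // Serialize the header table, pack it into an encrypted Chunk,
            // and rewrite N.header in full (omitted in this sketch).
        }
    }
}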
Change in the meaning of SegmentDescriptors
The start field in the SegmentDescriptor struct will be renamed to index, and instead of being the offset of the start of the chunk, it will now be the index of the chunk's header inside the array in the header file.
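Concretely, the descriptor would then look something like the following sketch (not necessarily the final definition), with lookups resolving through the header table rather than through a raw byte offset:

pub struct SegmentDescriptor {
    pub segment_id: u64,
    /// Index into the Vec<(ChunkHeader, usize, usize)> stored, encrypted, in
    /// N.header, rather than a byte offset into N.data.
    pub index: u64,
}

To read a chunk, the backend would first decrypt and deserialize the header table, look up the entry at index to recover the (start, end) offsets, and only then read those bytes back out of N.data.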