This module contains plain tree indexing code. The index itself is a
CACHE: you should only ever use it as an optimisation and never as a primary
storage. In practice, this means that when we change index format, the
application is expected to throw the old index away and build a fresh
index. Please note that tracking index validity is out of scope for this
library: this is responsibility of your application. It is advisable that in
your validity tracking code, you also check for format validity (see
indexFormatValid) and scrap and re-create index when needed.
The index is a binary file that overlays a hashed tree over the working
copy. This means that every working file and directory has an entry in the
index, that contains its path and hash and validity data. The validity data
is a timestamp plus the file size. The file hashes are sha256's of the
file's content.
There are two entry types, a file entry and a directory entry. Both have a
common binary format (see Item). The on-disk format is best described by
peekItem.
For each file, the index has a copy of the file's last modification
timestamp taken at the instant when the hash has been computed. This means
that when file size and timestamp of a file in working copy matches those in
the index, we assume that the hash stored in the index for given file is
valid. These hashes are then exposed in the resulting Tree object, and can
be leveraged by eg. diffTrees to compare many files quickly.
You may have noticed that we also keep hashes of directories. These are
assumed to be valid whenever the complete subtree has been valid. At any
point, as soon as a size or timestamp mismatch is found, the working file in
question is opened, its hash (and timestamp and size) is recomputed and
updated in-place in the index file (everything lives at a fixed offset and
is fixed size, so this isn't an issue). This is also true of directories:
when a file in a directory changes hash, this triggers recomputation of all
of its parent directory hashes; moreover this is done efficiently -- each
directory is updated at most once during an update run.
|