shithub: furgit

ref: df1f2fb3daa1acd25c88510f259d5535fb482126
dir: /research/packfile_bloom.txt/

Packfile bloom filter RFC
=========================

Problem
-------

Repacking is extremely expensive, especially for server-side usage, and
creating multi-pack indexes (MIDX) is still rather expensive. An incremental
MIDX partially solves this, but having too many layers would defeat the
purpose of the MIDX, as Git would still have to walk the MIDX layers in order
while performing expensive indexing queries.

Idea
----

Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and
ordinary .idx files are still traversed in their usual order, but the first
step when traversing each of them is to check, via its bloom filter, whether
that index could possibly contain the desired object.

We want the filters to be mmapped, and we want the lookup cost to be dominated
by a single cache-line read rather than many scattered reads. A blocked bloom
filter is therefore likely the right direction here. The steps are as follows:

1. Split the filter into 64-octet buckets, since 64 octets is the most common
   cache-line size.
2. Use some bits of the object ID to choose the bucket.
3. Use the rest of the key to choose several bit positions inside that bucket.
4. A lookup thus reads one 64-octet bucket and checks whether all required bits
   are set.
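Step 4 can be sketched as follows. This is an illustrative helper (the name
and interface are hypothetical, not part of this proposal); the exact rules
for choosing the bucket and the bit positions are specified in the sections
below.

```python
def bucket_has_all(filter_bytes, bucket, positions):
    """Check that every bit position (0..511, where 0 is the MSB of the
    bucket) is set within the given 64-octet bucket.

    filter_bytes -- the bucket array as bytes/bytearray
    bucket       -- index of the 64-octet bucket to read
    positions    -- iterable of in-bucket bit positions to test
    """
    block = filter_bytes[64 * bucket : 64 * (bucket + 1)]
    for p in positions:
        # Within each octet, bit 0 is the most significant bit.
        if not (block[p // 8] >> (7 - p % 8)) & 1:
            return False
    return True
```

Note that the whole loop touches only one 64-octet slice, which is the point
of the blocked layout.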

Note on Object IDs
------------------

Git object IDs are cryptographic hashes (currently either SHA-1 or SHA-256),
and are thus uniformly distributed in non-pathological scenarios.
See also the "Security considerations" section.

Definitions
-----------

Let:

     B := number of buckets
     K := number of bits set and tested per object ID

* All integers here are big endian.
* The OID is to be interpreted as a big-endian bitstring, where bit offset 0
  is the most significant bit of octet 0.
* log2(B) + 9K <= hash length in bits.

File layout
-----------

* 4-octet signature: {'I', 'D', 'B', 'L'}
* 4-octet version identifier (= 1)
* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256)
* 4-octet B (number of buckets)
* 2-octet K (number of bits set and tested per object ID)
* 46-octet padding (must be all zeros)
* B buckets of 64 octets each.
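The 64-octet header can be serialized with Python's struct module. This is a
sketch (helper names are illustrative); field order and widths are exactly as
listed above, all big endian.

```python
import struct

# 4 + 4 + 4 + 4 + 2 + 46 = 64 octets of header, all fields big-endian.
HEADER = struct.Struct(">4sIIIH46s")

def encode_header(hash_algo, B, K):
    """hash_algo: 1 for SHA-1, 2 for SHA-256."""
    return HEADER.pack(b"IDBL", 1, hash_algo, B, K, b"\x00" * 46)

def decode_header(data):
    """Return (signature, version, hash_algo, B, K, padding)."""
    return HEADER.unpack(data[:HEADER.size])
```

The B buckets then follow immediately after these 64 octets.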

Validation
----------

* Matching signature
* Supported version (the rest of the rules are for this version)
* Hash function identifier must be recognized
* B must be nonzero and a power of two
* K must be nonzero
* log2(B) + 9K <= hash length in bits
* Padding must be all zero
* File size must be 64 + 64 * B octets
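The rules above can be sketched as a single validation routine. This is
illustrative (the function name and the hash-length table are assumptions:
SHA-1 is 160 bits, SHA-256 is 256 bits).

```python
import struct

HASH_BITS = {1: 160, 2: 256}  # identifier -> hash length in bits

def validate(data):
    """Return True iff `data` is a well-formed version-1 filter file."""
    if len(data) < 64:
        return False
    sig, version, algo, B, K, pad = struct.unpack(">4sIIIH46s", data[:64])
    if sig != b"IDBL" or version != 1:
        return False
    if algo not in HASH_BITS:
        return False
    if B == 0 or B & (B - 1):  # must be a nonzero power of two
        return False
    if K == 0:
        return False
    if (B.bit_length() - 1) + 9 * K > HASH_BITS[algo]:  # log2(B) + 9K
        return False
    if pad != b"\x00" * 46:
        return False
    return len(data) == 64 + 64 * B
```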

Lookup procedure
----------------

1. Let b be the unsigned integer encoded by the most significant log2(B) bits
   of the OID. Since B is a power of two, 0 <= b < B.
2. Select and read bucket b.
3. For each 0 <= i < K:
   1. Starting immediately after the most significant log2(B) bits of the
      OID, let the i-th 9-bit field be the bits at offsets 9 * i through
      9 * i + 8 within the next 9 * K bits of the OID.
   2. Let pi be the unsigned integer encoded by that 9-bit field.
      Then, 0 <= pi < 512.
   3. Compute wi := pi >> 6, and bi := pi & 63.
      Thus, wi identifies one of the 8 64-bit words in bucket b, and bi
      identifies one bit within that word.
   4. Test whether bit bi is set in word wi of bucket b. (Within each 64-bit
      word, bit index 0 denotes the most significant bit, and bit index 63
      denotes the least significant bit.)

If any test fails, the OID is definitely not in the relevant idx.
If all tests succeed, the OID may be in the relevant idx.

Note that two of the K 9-bit fields can decode to the same pi, which means an
insertion may set fewer than K distinct bits.
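The lookup procedure can be sketched as follows, together with a matching
insertion helper. How creation/insertion actually works is still an open
question in the TODOs; the insertion here is merely one plausible reading
that sets exactly the bits the lookup tests. Names are illustrative.

```python
def bloom_lookup(buckets, oid, B, K):
    """Return False if `oid` is definitely absent, True if it may be present.

    buckets -- bytes-like of length 64 * B (bucket array, header excluded)
    oid     -- the object ID as bytes, interpreted as a big-endian bitstring
    """
    log_b = B.bit_length() - 1              # B is a power of two
    n = int.from_bytes(oid, "big")
    total = len(oid) * 8
    b = n >> (total - log_b)                # most significant log2(B) bits
    bucket = buckets[64 * b : 64 * (b + 1)]
    for i in range(K):
        # i-th 9-bit field, starting right after the bucket-selector bits
        shift = total - log_b - 9 * (i + 1)
        pi = (n >> shift) & 0x1FF
        wi, bi = pi >> 6, pi & 63
        word = int.from_bytes(bucket[8 * wi : 8 * (wi + 1)], "big")
        if not (word >> (63 - bi)) & 1:     # bit index 0 is the word's MSB
            return False
    return True

def bloom_insert(buckets, oid, B, K):
    """Set the K (not necessarily distinct) bits for `oid`.

    buckets must be a mutable bytearray of length 64 * B.
    """
    log_b = B.bit_length() - 1
    n = int.from_bytes(oid, "big")
    total = len(oid) * 8
    b = n >> (total - log_b)
    for i in range(K):
        pi = (n >> (total - log_b - 9 * (i + 1))) & 0x1FF
        wi, bi = pi >> 6, pi & 63
        buckets[64 * b + 8 * wi + bi // 8] |= 0x80 >> (bi % 8)
```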

Worked example
--------------

Let:

    B = 1 << 15 = 32768
    K = 8

Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket
and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of
87 bits taken from the object ID.

1. Read the first 15 bits of OID and interpret them as b, where
   0 <= b < 32768.
2. Read bucket b.
3. For each 0 <= i < 8:
   1. Read the i-th 9-bit field from the next 72 bits of OID and interpret it
      as pi, where 0 <= pi < 512.
   2. Compute: wi = pi >> 6, bi = pi & 63.
   3. Test whether bit bi is set in the word wi of bucket b.
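The bit extraction in this example can be sketched as follows (an
illustrative helper, not normative), splitting a 20-octet OID into the
15-bit bucket selector and the eight 9-bit position fields:

```python
B, K = 1 << 15, 8
LOG_B = 15  # log2(B)

def split_oid(oid):
    """Return (b, [p0, ..., p7]) for a 20-octet OID under B=32768, K=8."""
    n = int.from_bytes(oid, "big")
    total = len(oid) * 8                    # 160 for SHA-1
    b = n >> (total - LOG_B)                # most significant 15 bits
    fields = []
    for i in range(K):
        shift = total - LOG_B - 9 * (i + 1)
        fields.append((n >> shift) & 0x1FF) # the i-th 9-bit field
    return b, fields
```

This consumes 15 + 72 = 87 of the OID's 160 bits, matching the totals above.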

Security considerations
-----------------------

An adversarial packfile whose objects are constructed to share the same hash
prefix for the relevant object format (computationally intensive, even for
SHA-1, vulnerable as it is) could be used to fill up the bloom filters,
rendering some buckets useless. In the worst case, if an adversary somehow
filled all filters, this proposal's optimizations would become useless, but
this would not be a significant DoS vector.

TODOs
-----

* Consider dropping mmap (page read vs. cache-line read)
* How should B and K be chosen?
* How does creation/insert work? Note that packfiles and `.idx`es are immutable.
* What are the sizes?
* What are the false positive rates?
* What do benchmarks show?

* Switch to XOR filters since those are immutable sets anyway.