shithub: furgit

--- /dev/null

+++ b/research/packfile_bloom.txt

@@ -1,0 +1,130 @@

+Packfile bloom filter RFC

+=========================

+Problem

+-------

+Especially for server-side usages, repacking is extremely expensive, and

+creating multi-pack-indexes is still rather expensive. Incremental MIDX

+partially solves this, but would defeat the purpose of MIDX when there are too

+many of them, as Git would still have to walk the MIDXes in order while

+performing expensive indexing queries.

+Idea

+----

+Each MIDX layer, and each non-MIDX index, comes with a bloom filter. MIDXes and

+ordinary .idx files are still traversed in their usual order, but the first

+step when traversing them, is to check whether that index could possibly have

+the desired object, through a bloom filter.

+We will want the filters to be mmaped, and we want the lookup cost to be

+dominated by one cache-line read rather than using many scattered reads.

+Therefore, a blocked bloom filter is likely the right direction here. The steps

+are as follows:

+1. Split the filter into 64-octet buckets, since 64 octets is the most common

+   cache-line size.

+2. Use some bits of the object ID to choose the bucket.

+3. Use the rest of the key to choose several bit positions inside that bucket.

+4. A lookup thus reads one 64-octet bucket and checks whether all required bits

+   are set.

+Note on Object IDs

+------------------

+Git object IDs are cryptographic hashes (e.g., currently either SHA-256

+or SHA-1), and are thus uniformly distributed in non-pathological scenarios.

+See also the "Security considerations" section.

+Definitions

+-----------

+Let:

+     B := number of buckets

+     K := number of bits set and tested per object ID

+* All integers here are big endian.

+* The OID is to be interpreted as a big-endian bitstring, where bit offset 0

+  is the most significant bit of octet 0.

+* log2(B) + 9K <= hash length in bits.

+File layout

+-----------

+* 4-octet signature: {'I', 'D', 'B', 'L'}

+* 4-octet version identifier (= 1)

+* 4-octet object hash algorithm identifier (= 1 for SHA-1, 2 for SHA-256)

+* 4-octet B (number of buckets)

+* 2-octet K (number of bits set and tested per object ID)

+* 6-octet padding (must be all zeros)

+* B buckets of 64 octets each.

+Validation

+----------

+* Matching signature

+* Supported version (the rest of the rules are for this version)

+* Hash function identifier must be recognized

+* B must be nonzero and a power of two

+* K must be nonzero

+* log2(B) + 9K <= hash length in bits

+* Padding must be all zero

+* File size must be 24 + 64 * B octets

+Lookup procedure

+----------------

+1. Let b be the unsigned integer encoded by the most significant log2(B) bits

+   of OID. B is a power of two, and 0 <= b < B.

+2. Select and read bucket b.

+3. For each 0 <= i < K:

+   1. Start immediately after the most significant log2(B) bits of OID, let the

+      i-th 9-bit field be the bits at offset 9 * i through 9 * i + 8 within the

+      next 9 * K bits of the OID.

+   2. Let pi be the unsigned integer encoded by that 9-bit field.

+      Then, 0 <= pi < 512.

+   3. Compute wi := pi >> 6, and bi := pi & 63.

+      Thus, wi identifies one of the 8 64-bit words in bucket b, and bi

+      identifies one bit within that word.

+   4. Test whether bi is set in the word wi of bucket b. (Within each 64-bit

+      word, bit index 0 denotes the most significant bit, and bit index 63

+      denotes the least significant bit.)

+If any test fails, the OID is definitely not in the relevant idx.

+If all tests succeed, the OID may be in the relevant idx.

+Note that two of the K 9-bit fields can decode to the same pi, which means an

+insertion may set fewer than K distinct bits.

+Worked example

+--------------

+Let:

+    B = 1 << 15 = 32768

+    K = 8

+Then, log2(B) = 15. Each lookup thus uses 15 bits to choose the bucket

+and 8 * 9 = 72 bits to choose the in-bucket positions, for a total of

+87 bits taken from the object ID.

+1. Read the first 15 bits of OID and interpret them as b, where

+   0 <= b < 32768.

+2. Read bucket b.

+3. For each 0 <= i < 8:

+   1. Read the i-th 9-bit field from the next 72 bits of OID and interpret it

+      as pi, where 0 <= pi < 512.

+   2. Compute: wi = pi >> 6, bi = pi & 63.

+   3. Test whether bit bi is set in the word wi of bucket b.

+Security considerations

+------------------------

+An adversarial packfile where objects are (computationally intensive, even for

+SHA-1 as vulnerable as it is) constructed to have the same prefix for the

+relevant object format hash algorithm could be used to fill up the bloom

+filters, rendering some buckets useless. In the worst case, if they somehow

+fill all filters, this proposal's optimizations become useless, but would not

+be a significant DoS vector.

--

⑨