Hamming Distance for Hybrid Search in SQLite Hamming Distance for Hybrid Search in SQLite Published on 5 February 2026 This article shows how I implemented semantic search in SQLite using binary embeddings and Hamming distance, enabling hybrid search without external vector databases. SQLite’s FTS5 extension provides excellent text search with BM25 ranking. However, it doesn’t support semantic search, which means you can’t combine keyword matching with meaning-based retrieval, a technique known as hybrid search. Binary embeddings and Hamming distance Semantic search works by converting text into numerical vectors (embeddings) that capture meaning (in my case, I do that via an API). Similar texts (i.e., talking about the same things) produce similar vectors. Embeddings generally use float32 values, that is, a 1024-dimensional embedding requires 4KiB per document. Binary embeddings quantize each dimension to a single bit (0 or 1). This reduces storage dramatically, because 1024 dimensions become only 128 bytes. The similarity metric changes from cosine distance to Hamming distance, so it also allows using fast simple bit operations vs more expensive floating-point arithmetic. The tradeoff is accuracy: binary quantization loses information. For many applications (including mine), this is acceptable given the storage and speed benefits, especially if it’s combined with another classic text search algorithm like BM25. Hamming Distance Hamming distance counts the number of bit positions where two binary vectors differ. Example with 8-bit vectors: Position: 0 1 2 3 4 5 6 7 Vector A: 1 0 1 1 0 1 1 0 Vector B: 1 0 0 1 1 0 1 0 ^ ^ ^ Bits differ at positions 2, 4, and 5. Hamming distance = 3. For semantic search, lower Hamming distance means higher similarity . To compute Hamming distance efficiently, we XOR the two vectors (1 where bits differ, 0 where same), then count the 1s (population count or “popcount”). Modern CPUs have dedicated popcount instructions, making this very fast. Example: Vector A: 1 0 1 1 0 1 1 0 Vector B: 1 0 0 1 1 0 1 0 A XOR B: 0 0 1 0 1 1 0 0 popcount(A XOR B) = 3 ones = distance of 3 Implementation as a SQLite extension SQLite extensions are dynamically loaded shared libraries ( .so on Linux, .dylib on macOS) that can register custom SQL functions. This is simpler than forking SQLite or using external tools. The extension needs to: Register the function name with SQLite, implement the computation logic, and finally handle BLOB inputs and return an integer result. Registration code: #include
Source: Hacker News | Original Link