sql – 在数据库中进行汉明距离/相似性搜索

我有一个过程,类似于生成感知哈希的tineye,这些是32位整数.

我打算将来存储在一个sql数据库(也许是一个nosql db)中

但是,我很难理解如何根据哈希的相似性检索记录.

有任何想法吗?

一种常见的方法(至少对我来说很常见)是将哈希位串分成几个块并在这些块上查询以获得完全匹配.这是一个“预过滤”步骤.然后,您可以对返回的结果执行按位汉明距离计算,该结果应该只是整个数据集的较小子集.这可以使用数据文件或SQL表来完成.

因此,简单来说:假设您在数据库中有一堆32位哈希,并且您希望找到位于“查询”哈希的4位汉明距离内的每个哈希:

>创建一个包含四列的表:每列包含32位哈希的8位(作为字符串或整数)切片,islice 1到4.
>在qslice 1到4中以相同的方式对查询哈希进行切片.
>查询此表,使得qslice1 = islice1或qslice2 = islice2或qslice3 = islice3或qslice4 = islice4中的任何一个.这为您提供了查询哈希的3位(4 – 1)内的每个数据库哈希.
>对于每个返回的哈希,使用查询哈希成对计算精确的汉明距离(从四个切片重建索引侧哈希)

步骤4中的操作次数应远小于整个表的完全成对汉明计算.

这种方法首先在Moses Charikar的“simhash”开创性paper和相应的Google patent中被描述为:

  1. APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

[…]

Given bit vectors consisting of d bits each, we choose
N = O(n 1/(1+ ) ) random permutations of the bits. For each
random permutation σ, we maintain a sorted order O σ of
the bit vectors, in lexicographic order of the bits permuted
by σ. Given a query bit vector q, we find the approximate
nearest neighbor by doing the following:

For each permutation σ, we perform a binary search on O σ to locate the
two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of
the sorted orders O σ examining elements above and below
the position returned by the binary search in order of the
length of the longest prefix that matches q.

Monika Henziger在她的论文“Finding near-duplicate web pages: a large-scale evaluation of algorithms”中对此进行了扩展:

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non-
overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one
piece in common. This approach is guaranteed to find all
pairs of pages with difference up to 11, i.e., C-similarity 373,
but might miss some for larger differences.

Gurmeet Singh Manku,Arvind Jain和Anish Das Sarma在论文Detecting Near-Duplicates for Web Crawling中对此进行了解释:

  1. THE HAMMING DISTANCE PROBLEM

Definition: Given a collection of f -bit fingerprints and a
query fingerprint F, identify whether an existing fingerprint
differs from F in at most k bits. (In the batch-mode version
of the above problem, we have a set of query fingerprints
instead of a single query fingerprint)

[…]

Intuition: Consider a sorted table of 2 d f -bit truly random fingerprints. Focus on just the most significant d bits
in the table. A listing of these d-bit numbers amounts to
“almost a counter” in the sense that (a) quite a few 2 d bit-
combinations exist, and (b) very few d-bit combinations are
duplicated. On the other hand, the least significant f − d
bits are “almost random”.

Now choose d such that |d − d| is a small integer. Since
the table is sorted, a single probe suffices to identify all fingerprints which match F in d most significant bit-positions.
Since |d − d| is small, the number of such matches is also
expected to be small. For each matching fingerprint, we can
easily figure out if it differs from F in at most k bit-positions
or not (these differences would naturally be restricted to the
f − d least-significant bit-positions).

The procedure described above helps us locate an existing
fingerprint that differs from F in k bit-positions, all of which
are restricted to be among the least significant f − d bits of
F. This takes care of a fair number of cases. To cover all
the cases, it suffices to build a small number of additional
sorted tables, as formally outlined in the next Section.

PS:大多数这些优秀的大脑在某种程度上与谷歌有关,或者在某些时候与这些相关联,FWIW.

相关文章
相关标签/搜索