Approximate Nearest Neighbor Oh Yeah (Annoy)

Introduction
Welcome back to Vector Database 101.
In the previous tutorial, we deep-dived into Hierarchical Navigable Small Worlds (HNSW). HNSW is a graph-based indexing algorithm that today is one of the most popular indexing strategies used in vector databases.
In this tutorial, we'll switch gears and talk about tree-based vector indexes. Specifically, we'll talk about Approximate Nearest Neighbor Oh Yeah (Annoy), an algorithm that uses a forest of trees to conduct the nearest neighbor search. For those familiar with random forests or gradient-boosted decision trees, Annoy can seem like a natural extension of these algorithms, only for the nearest neighbor search rather than machine learning. As with our HNSW tutorial, we'll first walk through how Annoy works from a high level before developing our own simple Python implementation.

Annoy, visualized (from https://github.com/spotify/annoy).
Annoy basics
While HNSW is built upon connected graphs and skip lists, Annoy uses binary search trees as its core data structure. The key idea behind Annoy (and other tree-based indexes) is to repeatedly partition our vector space and search only a subset of the partitions for nearest neighbors. If this sounds like IVF, you're right; the idea is the same, but the execution is slightly different.
The best way to understand Annoy is to visualize how a single tree is built. However, remember that high-dimensional hyperspaces are very different from 2D/3D Euclidean spaces from an intuitive perspective, so the images below are only for reference.
Let's start with indexing. For Annoy, this is a recursive process where the maximum size of the call stack is the depth of the tree. In the first iteration, two random dataset vectors, a and b, are selected, and the full hyperspace is split along a hyperplane equidistant from both a and b. Then, vectors in the "left" half of the hyperspace get assigned to the left half of the tree, while vectors in the "right" half of the subspace are given to the right half of the tree. Note that this can be done without actually computing the hyperplane itself - for every dataset vector, we need to determine whether a (left) or b (right) is closer.
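To make that last point concrete, here's a tiny NumPy sketch of a single split (the names vectors, a, and b are placeholders for illustration, not part of the original code):
import numpy as np

# toy data: 1000 random 128-dimensional vectors
vectors = np.random.normal(size=(1000, 128))

# pick two distinct dataset vectors at random; the implicit splitting
# hyperplane sits halfway between them
a, b = vectors[np.random.choice(len(vectors), 2, replace=False)]

# assign every vector to the "left" or "right" half by simply comparing
# its distance to a and b - the hyperplane itself is never computed
left = [v for v in vectors if np.linalg.norm(v - a) < np.linalg.norm(v - b)]
right = [v for v in vectors if np.linalg.norm(v - a) >= np.linalg.norm(v - b)]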

After the first, second, and Nth iteration, respectively. Source.
The second iteration repeats this process for both the left and right subtrees generated by the first iteration, resulting in a tree with a depth of two and four leaf nodes. This process continues for the third, fourth, and subsequent iterations until a leaf node has fewer than a pre-defined number of elements K. In the original Annoy implementation, K is a value that the user can set.
With the index fully built, we can now move on to querying. Given some query vector q, we can search by traversing the tree. A hyperplane splits each intermediate node, and we can determine which side of the hyperplane the query vector falls on by computing its distance to the left and right vectors. We repeat this process until we reach a leaf node containing an array of, at most, K vectors. We can then rank and return these vectors to the user.
Implementing Annoy
Now that we know how Annoy works, let's start with the implementation. As usual, we'll first create a dataset of 128-dimensional vectors:
>>> import numpy as np
>>> dataset = np.random.normal(size=(1000, 128))
Let's first define a Node class containing left and right subtrees:
class Node(object):
    def __init__(self, vecs=None):
        # avoid the shared mutable default-argument pitfall
        self._vecs = vecs if vecs is not None else []
        self._left = None
        self._right = None

    @property
    def vecs(self):
        return self._vecs

    @property
    def left(self):
        return self._left

    @left.setter
    def left(self, node):
        self._left = node

    @property
    def right(self):
        return self._right

    @right.setter
    def right(self, node):
        self._right = node
The vecs variable contains a list of all vectors within the node. If the length of this list is less than some value K, then they will remain as-is; otherwise, these vectors will get propagated to left and right, with vecs[0] and vecs[1] remaining as the two randomly selected vectors used to split the hyperplane.
Let's now move on to indexing. First, recall that every node in the tree is split by a hyperplane orthogonal to the line connecting two randomly selected dataset vectors. Conveniently, we can determine which side of the hyperplane a query vector lies on by comparing its distance to each of the two vectors. As usual, we'll use NumPy's vectorized math for this:
def _is_query_in_left_half(q, node):
    # returns `True` if query vector resides in left half
    dist_l = np.linalg.norm(q - node.vecs[0])
    dist_r = np.linalg.norm(q - node.vecs[1])
    return dist_l < dist_r
Now let's move to building the actual tree.
import random

def _split_node(node, K=64, imb=0.95):
    # stopping condition: maximum # of vectors for a leaf node
    if len(node.vecs) <= K:
        return

    # attempt up to five times to find a reasonably balanced split
    for n in range(5):
        node.left = Node()
        node.right = Node()

        # take two random indexes and swap them to [0] and [1]
        idxs = random.sample(range(len(node.vecs)), 2)
        (node.vecs[0], node.vecs[idxs[0]]) = (node.vecs[idxs[0]], node.vecs[0])
        (node.vecs[1], node.vecs[idxs[1]]) = (node.vecs[idxs[1]], node.vecs[1])

        # split vectors into halves
        for vec in node.vecs:
            if _is_query_in_left_half(vec, node):
                node.left.vecs.append(vec)
            else:
                node.right.vecs.append(vec)

        # redo the split process if the imbalance is too high
        rat = len(node.left.vecs) / len(node.vecs)
        if rat > imb or rat < (1 - imb):
            continue

        # we're done; remove vectors from the input-level node
        # the first two vectors correspond to `left` and `right`, respectively
        del node.vecs[2:]
        return

    # no balanced split found after five attempts; keep the last one anyway
    del node.vecs[2:]


def _build_tree(node, K, imb):
    # recursively split nodes until every leaf holds at most K vectors
    _split_node(node, K=K, imb=imb)
    if node.left and node.right:
        _build_tree(node.left, K=K, imb=imb)
        _build_tree(node.right, K=K, imb=imb)


def build_tree(vecs, K=64, imb=0.95):
    # copy the input so building multiple trees doesn't mutate the caller's list
    root = Node(list(vecs))
    _build_tree(root, K=K, imb=imb)
    return root
This is a denser code block, so let's walk through it step by step. First, given an already-initialized Node, we randomly select two of its vectors and split the dataset into left and right halves. We then use the function we defined earlier to determine which of the two halves each vector belongs to. Note that we've added an imb parameter to maintain tree balance - if one side of the split contains more than 95% of all the vectors, we redo the split process.
With node splitting in place, the _build_tree function will recursively call itself on all nodes. Leaf nodes are defined as those which contain no more than K vectors.
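As a quick sanity check (a hypothetical usage example, reusing the dataset we generated at the beginning of this tutorial):
# build a single tree over the 1000 random 128-dimensional vectors from earlier
tree = build_tree(dataset, K=64, imb=0.95)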
Great, so we've built a binary tree that lets us significantly reduce the scope of our search. Now let's implement querying as well. Querying is pretty straightforward: we traverse the tree, continually moving along the left or right branches until we arrive at the leaf node we're interested in:
def query_tree(q, root):
    node = root
    # traverse the tree until we reach a leaf node
    while node.left and node.right:
        # iteratively determine whether the right or left node is closer
        if _is_query_in_left_half(q, node):
            node = node.left
        else:
            node = node.right

    # find the nearest neighbor in the leaf node
    (nn, m_dist) = (None, float("inf"))
    for v in node.vecs:
        dist = np.linalg.norm(v - q)
        if dist < m_dist:
            (nn, m_dist) = (v, dist)

    return nn
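A quick sanity check, reusing the tree variable from the build example above (both the tree and the random query here are just for illustration):
# hypothetical usage: query the tree built above with a random 128-dim vector
query = np.random.normal(size=(128,))
nearest = query_tree(query, tree)
print(np.linalg.norm(nearest - query))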
The query_tree function above greedily traverses the tree, returning a single nearest neighbor (nq = 1). However, recall that we're often interested in finding multiple nearest neighbors. Additionally, multiple nearest neighbors can live in other leaf nodes as well. So how can we solve these issues?
Run, forest, run
(Yes, I do realize that the main character's name is spelled "Forrest" in the American classic.)
In a previous tutorial on IVF, recall that we often expanded our search beyond the Voronoi cell closest to the query vector. The reason is due to cell edges - if a query vector is close to a cell edge, it's very likely that some of its nearest neighbors may be in a neighboring cell. These "edges" are much more common in high-dimensional spaces, so a large-ish value of nprobe is often used when a high recall is needed.
We face the same problem for tree-based indexes - some of our nearest neighbors may be outside the nearest leaf node/polygon. Annoy solves this by 1) allowing searches on both sides of a split and 2) creating a forest of trees.
Let's first expand on our implementation from the previous section to search both sides of a split. For reference, here's how Annoy's C++ implementation handles this: it keeps a priority queue of tree nodes, pushing both children of every split with a priority based on the query's margin from the splitting hyperplane, and pops nodes until enough candidates (search_k) have been gathered:
std::vector<S> nns;
while (nns.size() < (size_t)search_k && !q.empty()) {
  const pair<T, S>& top = q.top();
  T d = top.first;
  S i = top.second;
  Node* nd = _get(i);
  q.pop();
  if (nd->n_descendants == 1 && i < _n_items) {
    nns.push_back(i);
  } else if (nd->n_descendants <= _K) {
    const S* dst = nd->children;
    nns.insert(nns.end(), dst, &dst[nd->n_descendants]);
  } else {
    T margin = D::margin(nd, v, _f);
    q.push(make_pair(D::pq_distance(d, margin, 1), static_cast<S>(nd->children[1])));
    q.push(make_pair(D::pq_distance(d, margin, 0), static_cast<S>(nd->children[0])));
  }
}

// Get distances for all items
// To avoid calculating distance multiple times for any items, sort by id
std::sort(nns.begin(), nns.end());
vector<pair<T, S> > nns_dist;
S last = -1;
for (size_t i = 0; i < nns.size(); i++) {
  S j = nns[i];
  if (j == last)
    continue;
  last = j;
  if (_get(j)->n_descendants == 1)  // This is only to guard a really obscure case, #284
    nns_dist.push_back(make_pair(D::distance(v_node, _get(j), _f), j));
}

size_t m = nns_dist.size();
size_t p = n < m ? n : m; // Return this many items
std::partial_sort(nns_dist.begin(), nns_dist.begin() + p, nns_dist.end());
for (size_t i = 0; i < p; i++) {
  if (distances)
    distances->push_back(D::normalized_distance(nns_dist[i].first));
  result->push_back(nns_dist[i].second);
}
We can mimic this in our simplified Python implementation. First, let's replace _is_query_in_left_half with a helper that can return both halves whenever the query lies within thresh of the splitting boundary:
def _select_nearby(q, node, thresh=0):
    # functions identically to _is_query_in_left_half, but can return both halves
    dist_l = np.linalg.norm(q - node.vecs[0])
    dist_r = np.linalg.norm(q - node.vecs[1])
    if abs(dist_l - dist_r) < thresh:
        return (node.left, node.right)
    if dist_l < dist_r:
        return (node.left,)
    return (node.right,)
We can then update query_tree to keep a list of subtrees that still need to be searched, gathering candidates from every leaf node it reaches:
def query_tree(q, root, thresh=0.5):
    # `nodes` holds the subtrees that still need to be searched
    nodes = [root]
    nns = []
    while nodes:
        node = nodes.pop()
        if node.left and node.right:
            # internal node: descend into one or both halves
            nodes.extend(_select_nearby(q, node, thresh=thresh))
        else:
            # leaf node: add all of its vectors to the candidate list
            nns.extend(node.vecs)

    # rank the candidates by distance to the query and return them
    nns.sort(key=lambda v: np.linalg.norm(v - q))
    return nns
Next, we'll add a function to allow us to build the full index as a forest of trees:
def build_index(vecs, nt=8, K=64, imb=0.95):
    return [build_tree(vecs, K=K, imb=imb) for _ in range(nt)]
With everything implemented, let's now put it all together, as we've done for IVF, SQ, PQ, and HNSW:
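The original combined snippet isn't reproduced here, but a minimal sketch under the same assumptions might look like the following. The query_forest helper and its k parameter are names made up for illustration; it simply pools candidates from every tree in the forest and returns the k closest ones:
def query_forest(q, forest, k=10):
    # pool candidate nearest neighbors from every tree in the forest
    candidates = []
    for root in forest:
        candidates.extend(query_tree(q, root))
    # rank all candidates by distance to the query and keep the top k
    # (duplicates across trees are not removed in this simple sketch)
    candidates.sort(key=lambda v: np.linalg.norm(v - q))
    return candidates[:k]

# build a forest of 8 trees and query it with a random vector
index = build_index(dataset, nt=8, K=64, imb=0.95)
query = np.random.normal(size=(128,))
neighbors = query_forest(query, index, k=10)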
And that's it for Annoy!
Wrapping up
In this tutorial, we did a deep dive into Annoy, a tree-based indexing strategy with a playful name. There are better languages than Python for implementing vector search data structures due to interpreter overhead, but we've used as much numpy-based array math as possible. There are also many optimizations we could make to avoid copying memory back and forth, but I'll leave those as an exercise for the reader.
In the following tutorial, we'll continue our deep dive into indexing strategies with a rundown of the Vamana algorithm - more commonly known as DiskANN - a unique graph-based indexing algorithm tailored specifically toward querying directly from solid-state drives.
All code for this tutorial is freely available on my GitHub.