Accelerating Similarity Search on Really Big Data with Vector Indexing (Part II)
Discover how indexes dramatically accelerate vector similarity search, different index types, and how to choose the correct index for your AI application.
Join the DZone community and get the full member experience.Join For Free
This article is supplementary to the previous blog, "Accelerating Similarity Search on Really Big Data with Vector Indexing," covering the role indexing plays in making vector similarity search efficient and different indexes, including FLAT, IVF_FLAT, IVF_SQ8, and IVF_SQ8H. This article also provides the performance test results of the four indexes. We recommend reading this blog first.
This article provides an overview of the four main types of indexes and continues to introduce four different indexes: IVF_PQ, HNSW, ANNOY, and E2LSH.
Approximate nearest neighbor search (ANNS) is an essential concept in indexing. To better understand this concept and its advantage over KNN (K nearest neighbor), read the blog article written by Marie Stephen Leo.
What Are the Different Types of Common Indexes?
There are numerous indexes designed for high-dimensional vector similarity search, and each one comes with tradeoffs in performance, accuracy, and storage requirements. Common index types are quantization-based, graph-based, hash-based, and tree-based. In addition, disk-based ANNS is a product of recent advances in approximate nearest neighbor search (ANNS). An introduction to this index type will be available in a subsequent article. (Stay tuned!)
Quantization-based indexes work by dividing a large set of points into a desired number of clusters and ensuring that the sum of distances between points and their centroids is the smallest minimum. The centroid is calculated by taking the arithmetic average of all points in the cluster.
Dividing points into groups are represented by a centroid.
Graphs, per se, contains nearest neighbor information. Take the image below as an example; the pink point is our query vector. We need to find its nearest vertex. The entry point is selected on a random basis. Sometimes the entry point could be the first vector inserted into the database. We randomly select A as the entry point, for instance. B, C, and D are the neighbors of A, and among them, D is the closest to the pink point. Therefore, we continue to start from D, and among its neighbors, E, F, and J, the nearest to the pink point is E. However, all neighbors of E are farther compared to the pink point. Thus, we can say that E is the nearest to the pink point.
Finding the nearest neighbor of the pink point using graph-based indexes. Image source: CSDN blog.
However, there are some issues with the above approach:
- First, some vertices like K are not queried.
- Vertices E and L are close to each other but not connected. So, during ANNS, L is not considered the neighbor of E, which significantly hampers query efficiency.
- The more number of neighboring vertices to query, the more significant consumption of storage. However, insufficient connection lines lead to lower query efficiency. These issues are easy to solve, as long as:
- When building a graph, all vertices need to have a neighboring vertex. In this case, vertex K can be queried.
- All neighboring vertices must be connected. (For a more detailed explanation of the connecting strategies, read Proximity Graph-based Approximate Nearest Neighbor Search.)
- The number of neighboring vertices to query can be specified via configuration.
With the issues addressed, query efficiency is still not enhanced. Therefore, we need to introduce the "expressway" mechanism here.
Since we may start the query from an entry point far from the target query vector, we probably have to travel a long way and go through many vertices to find its nearest neighbor. By introducing the expressway mechanism, we can skip some neighboring vertices to speed up the search. As shown in the illustration below, we can follow the arrows in red (the expressways) to find the nearest neighbor of the green query point. It goes through only 3 vertices.
Expressway mechanism to accelerate the search.
Hash-based indexes use a series of hash functions. The probability of a collision among multiple hash functions represents the similarity of the vectors. Therefore, each vector in the database is placed into multiple buckets. During the nearest neighbor search, the search radius is continuously increased to have enough buckets. Then the distance between the query vector and vectors in the buckets is further calculated.
Splitting vectors into several buckets.
Most tree-based indexes divide the entire high-dimensional space from top to bottom with specific rules. For example, KD-tree selects the dimension with the most significant variance and divides the vectors in the space into two subspaces based on the median of that dimension. Meanwhile, ANNOY generates a hyperplane by random projection and uses the hyperplane to divide the vector space into two.
Which Scenarios Are the Indexes Best Suited For?
The table below is an overview of several indexes and their categorization.
The first four types of indexes and their performance test results are covered in the previous article, "Accelerating Similarity Search on Really Big Data with Vector Indexing." This post will mainly focus on IVF_PQ, HNSW, ANNOY, and E2LSH.
Note: Though in this post, we categorize IVF indexes as quantization-based indexes, this is still controversial in academia as some researchers believe IVF indexes should not be treated as quantization-based.
IVF_PQ: The Fastest Type of IVF Index With Substantial Compromise in Recall Rate
PQ (Product Quantization) uniformly decomposes the original high-dimensional vector space into Cartesian products of
m low-dimensional vector spaces and then quantizes the decomposed low-dimensional vector spaces. Instead of calculating the distances between the target vector and the database vectors, IVF_PQ encodes the residual vectors with a product quantizer - the difference between the vector and its corresponding coarse centroid vector. So the vectors in the database are represented by their centroid and the quantized residual vectors.
When calculating the distance between the target query vector and vectors from selected buckets, IVF_PQ uses a lookup table to quickly decode the residual vector's distance.
IVF_PQ performs IVF index clustering before quantizing the product of vectors. Its index file is even smaller than IVF_SQ8, but it also causes a loss of accuracy during searching vectors.
HNSW: A High-Speed, High-Recall Index
HNSW (Hierarchical Navigable Small World Graph) is a graph-based indexing algorithm that incrementally builds a multi-layer structure consisting of a hierarchical set of proximity graphs (layers) for nested subsets of the stored elements. The upper layers are selected randomly with an exponentially decaying probability distribution from the lower layers. The search starts from the uppermost layer, finds the node closest to the target in this layer, and then goes down to the next layer to begin another search. After multiple iterations, it can quickly approach the target.
HNSW limits the maximum degree of nodes on each layer of the graph to
M improve performance. Furthermore, a search range can be specified by using
efConstruction when building an index or
ef when searching targets.
Efficient and robust ANNS using HNSW graphs.
ANNOY: A Tree-Based Index on Low-Dimensional Vectors
ANNOY (Approximate Nearest Neighbors Oh Yeah) is an index that uses a hyperplane to divide a high-dimensional space into multiple subspaces and then stores them in a tree structure.
n_tree represents the number of trees in the forest) distinct binary trees, each node on the tree is a hyperplane subspace of the database. A random hyperplane is chosen at every intermediate node on the tree, which divides the space into two subspaces. The leaves are the hyperplane subspaces with less than
As shown in the image below, ANNOY cuts the space into several subspaces. First, the space is cut into 2 subspaces.
Then each of the 2 subspaces will be cut into halves, generating 4 smaller subspaces.
If we keep dividing the whole space, we will ultimately obtain something like the illustration below.
This whole process of partitioning the space resembles a binary tree.
A binary tree.
When searching for vectors, ANNOY follows the tree structure to find subspaces closer to the target vector and then compares all the vectors in these subspaces to obtain the final result. ANNOY inspects up to
search_k nodes, the value of which is
n By default, if not provided. Annoy uses a priority queue to search all trees at the same time for ensuring enough nodes are traversed. Therefore, when the target vector is close to the edge of a certain subspace, it is necessary to increase significantly the number of searched subspaces to obtain a high recall rate.
ANNOY supports multiple distance metrics, including Euclidean, Manhattan, cosine, and Hamming distance. Furthermore, ANNOY index can be shared between multiple processes. However, its shortcoming is that it is only suitable for low-dimensional and small datasets. In addition, ANNOY only supports integers as data ID, meaning extra maintenance is necessary if the vector IDs are not integers. Finally, GPU processing is also not supported by ANNOY.
E2LSH: Highly Efficient Index for Extremely Large Volumes of Data
E2LSH constructs LSH functions based on p-stable distributions. E2LSH uses a set of LSH functions to increase the conflict probability of similar vectors and reduce the conflict probability of dissimilar vectors. The LSH function partitions an object into a bucket by first projecting the object along the random line, then giving the projection an unexpected shift, and finally using the floor function to locate the interval of width
w in which the shifted projection falls.
During search and query, E2LSH first finds the bucket where the vector object is located according to the hash function group and then uses the hash function group
h2 to locate the position in the bucket. The secondary hash structure of E2LSH can not only reduce the high storage overhead caused by storing the hash projection values of all vectors in the database but also quickly locate the hash bucket where the query vector is.
E2LSH is especially suitable for scenarios where the dataset volume is significant as the algorithm can significantly enhance search speed.
Many thanks to Xiaomeng Yi, Qianya Cheng, and Weizhi Xu, who helped review and provided valuable suggestions to this article.
Published at DZone with permission of Angela Ni. See the original article here.
Opinions expressed by DZone contributors are their own.