Vantage-point tree
A vantage-point tree is a special case of metric tree, a space-partitioning data structure used by efficient nearest neighbor search algorithms.
If you are in Mexico, looking for the nearest sushi bar in the giant World Sushi Bar Database, you can ignore those that are known to be less than 1000 km away from Kuala Lumpur. That's what the vp-tree does : group together items that are close to a particular one (here Kuala Lumpur is the vantage point), so that the whole group can easily be skipped in the search phase.
Advantages
To construct the vp-tree, a particular item (the vantage-point) is chosen in the initial dataset. The other items are then divided into two near
and far
subsets, which are recursively subdivided the same way.
The search algorithm exploits the triangle inequality to exclude large portions of the tree
that are too far from the query object.
One of the great things about this method is that it applies to any kind of objects. Not only sushi bars but also DNA sequences, images, faces. You only need a function to measure the similarity between two objects, and this is not limited to the standard euclidean distance. Any metric, i.e. any distance function which satisfies these conditions can be used. Here are a few examples of distances.
Demos
data:image/s3,"s3://crabby-images/cdff7/cdff7d54561667d94b7e75d1ebb7a42abfe7bfcd" alt="Search cities"
Search nearest cities by coordinates
Enter geographic coordinates, and find the k-nearest cities in a list of more than 50 thousands of cities, instantly.
data:image/s3,"s3://crabby-images/ab0ee/ab0ee9ace3499e9315561cf84248a8578e854b0d" alt="Worldmap"
Search nearest cities by hovering the globe
Enter geographic coordinates, and find the k-nearest cities in a list of more than 50 thousands of cities, instantly.
About fuzzy string matching
Some string metrics may not be very good candidates to use with vp-trees. Common ones like the Levenshtein distance or the Damerau-Levenshtein distance only take small integer values. Even if they are metrics per se and can in theory be used to build vp-trees, the resulting trees won't be very efficient. To get the most of a vp-tree you need the distances between items to spread over a wide range of different value.