Vector Similarity Search Hides in Plain View
Discover what vector similarity search is, its various applications, and the public resources making artificial intelligence more accessible than ever.
Imagine a room with a wall of screens displaying closed-circuit video feeds from dozens of cameras, like a security office in a film. In the movies, there is often a guard responsible for watching the screens who inevitably falls asleep, allowing something bad to happen. Although intuition and other distinctly “people skills” are useful in security, most would agree that the human attention span isn’t well suited to always-on, 24/7 video monitoring. Of course, footage can always be reviewed after something happens, but it’s easy to see the security value of detecting something out of the ordinary as it unfolds.
Now imagine a video artificial intelligence (AI) application capable of processing thousands of camera feeds in real time. The AI constantly compares new footage to historical footage, then classifies anomalous events by their threat level. Humans are still involved, both to manage the system and to review and respond to potential threats, but AI takes over where we fall short. This isn’t a hypothetical situation: from smart police drones to intelligent doorbells sold by Amazon and Google, AI-powered surveillance solutions are becoming increasingly sophisticated, affordable, and ubiquitous.
Video AI is just one of many applications for vector similarity search, a process that uses artificial intelligence to analyze massive, trillion-scale unstructured datasets. This article provides an overview of vector search technology, including what it is, how it can be used, and the open-source software and resources making it more accessible than ever before.
What Is Vector Similarity Search?
Video data is incredibly detailed and increasingly common, so logically it seems like it would be a great unsupervised learning signal for building video AI. In reality, this is not the case. Processing and analyzing video data, especially in large volumes, remains a challenge for artificial intelligence. Recent progress in this field, like much of the progress made in unstructured data analytics, is owed in large part to vector similarity search.
The problem with video, like all unstructured data, is that it doesn’t follow a predefined model or organizational structure, making it difficult to process and analyze at scale. Unstructured data includes things like images, audio, social media behavior, and documents, which together account for an estimated 80–90% or more of all data. Companies are increasingly aware of the business-critical insights buried in massive, enigmatic unstructured datasets, driving demand for AI applications that can tap into this unrealized potential.
Using neural networks such as CNNs, RNNs, and BERT, unstructured data can be converted into feature vectors (also known as embeddings), a machine-readable numerical data format. Algorithms are then used to calculate the similarity between vectors using measures like cosine similarity or Euclidean distance. Vector embedding and similarity search make it possible to analyze and build machine learning applications using previously indiscernible datasets.
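To make the two similarity measures concrete, here is a minimal sketch in plain Python. The four-dimensional vectors and the names `query`, `catalog`, `item_a`, and `item_b` are made up for illustration; real embeddings come from a trained model and typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors (0.0 = identical)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical embeddings, invented for this example.
query = [0.9, 0.1, 0.0, 0.3]
catalog = {
    "item_a": [0.8, 0.2, 0.1, 0.4],
    "item_b": [0.0, 0.9, 0.7, 0.1],
}

# Rank catalog entries from most to least similar to the query.
ranked = sorted(
    catalog,
    key=lambda name: cosine_similarity(query, catalog[name]),
    reverse=True,
)
print(ranked)
```

Higher cosine similarity means more alike, while lower Euclidean distance means more alike, so the sort direction flips depending on which measure is used.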
Vector similarity is calculated using established algorithms; however, unstructured datasets are typically massive, so an exhaustive search that is both efficient and accurate requires vast storage and compute power. To accelerate similarity search and reduce resource requirements, approximate nearest neighbor (ANN) search algorithms are used. By clustering similar vectors together, many ANN algorithms make it possible to send a query only to the clusters most likely to contain similar vectors, rather than searching the entire dataset. Although this approach is faster, it sacrifices some degree of accuracy. Leveraging ANN algorithms allows vector search to comb through billions of vectors in milliseconds.
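The cluster-then-probe idea can be sketched in a few dozen lines of plain Python. This is an illustrative toy, not how any particular library implements it: the clustering is a few rounds of basic k-means, the data is random, and the names `build_index`, `search`, and `nprobe` are assumptions chosen for the example.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(vectors, centroids):
    """Place each vector's index in the bucket of its nearest centroid."""
    buckets = [[] for _ in centroids]
    for idx, v in enumerate(vectors):
        nearest = min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))
        buckets[nearest].append(idx)
    return buckets

def build_index(vectors, n_clusters, iterations=5, seed=0):
    """Cluster the vectors with a few rounds of plain k-means."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, n_clusters)]
    for _ in range(iterations):
        buckets = assign(vectors, centroids)
        for c, members in enumerate(buckets):
            if members:  # an empty bucket keeps its previous centroid
                columns = zip(*(vectors[i] for i in members))
                centroids[c] = [sum(col) / len(members) for col in columns]
    return centroids, assign(vectors, centroids)

def search(query, vectors, centroids, buckets, k=3, nprobe=1):
    """Scan only the nprobe clusters whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: dist(query, centroids[c]))
    candidates = [i for c in order[:nprobe] for i in buckets[c]]
    return sorted(candidates, key=lambda i: dist(query, vectors[i]))[:k]

# Random stand-in data; real vectors would be model embeddings.
rng = random.Random(42)
vectors = [[rng.random() for _ in range(8)] for _ in range(1000)]
query = [rng.random() for _ in range(8)]

centroids, buckets = build_index(vectors, n_clusters=10)
approx = search(query, vectors, centroids, buckets, k=3, nprobe=2)
exact = sorted(range(len(vectors)), key=lambda i: dist(query, vectors[i]))[:3]
print("approximate:", approx)
print("exact:      ", exact)
```

Raising `nprobe` scans more clusters, which costs more time but recovers more of the true nearest neighbors. When `nprobe` equals the number of clusters, the search becomes exhaustive and matches the exact result.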
What Are Some Applications of Vector Similarity Search?
Vector similarity search has applications spanning a wide variety of artificial intelligence, deep learning, and traditional vector calculation scenarios. The following provides a high-level overview of various vector similarity search applications:
E-commerce: Vector similarity search has broad applicability in e-commerce, including reverse image search engines that allow shoppers to search for products using an image captured on their smartphone or found online. Additionally, personalized recommendations based on user behavior, interests, purchase history, and more can be served by specialized recommender systems that rely on vector search.
Physical and Cyber Security: Video AI is just one of many applications for vector similarity search in the security field. Other scenarios include facial recognition, behavior tracking, identity authentication, intelligent access control, and more. Additionally, vector similarity search plays an important role in thwarting increasingly common and sophisticated cyberattacks. For example, code similarity search can be used to identify security risks by comparing a piece of software to a database of known vulnerabilities or malware.
Recommendation Engines: Recommendation engines are systems that use machine learning and data analysis to suggest products, services, content, and information to users. User behavior, the behavior of similar users, and other data are processed using deep learning methods to generate recommendations. With enough data, algorithms can be trained to understand relationships between entities and invent ways to represent them autonomously. Recommendation systems have broad applicability and are something people already interact with every day, including content recommendations on Netflix, shopping recommendations on Amazon, and news feeds on Facebook.
Chatbots: Traditionally, chatbots are built using a regular knowledge graph that requires a large training dataset. However, chatbots built using deep learning models don’t need to preprocess data; instead, a map between frequent questions and answers is created. Using a pre-trained natural language processing (NLP) model, feature vectors can be extracted from the questions and then stored and queried using a vector data management platform.
Image or Video Search: Deep learning networks have been used to recognize visual patterns since the late 1970s, and modern technology trends have made image and video search more powerful and accessible than ever before.
Chemical Similarity Search: Chemical similarity is key to predicting the properties of chemical compounds and finding chemicals with specific attributes, making it indispensable to the development of new drugs. Fingerprints represented by feature vectors are created for each molecule, and then the distances between vectors are used to measure similarity. Using AI for new drug discovery is gaining momentum in the tech industry, with ByteDance (TikTok’s Chinese parent company) starting to hire talent in the field.
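As an illustration of the chemical similarity item above, the sketch below compares molecular fingerprints with the Tanimoto coefficient, one commonly used similarity measure for binary fingerprints. The article does not say which measure is used, so that choice is an assumption, and the fingerprints and names (`mol_1`, `mol_2`, `mol_3`) are invented. Each fingerprint is represented as the set of bit positions that are switched on.

```python
def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by all distinct on-bits (1.0 = identical fingerprints)."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        # Two empty fingerprints carry no information; treated as dissimilar here.
        return 0.0
    return len(a & b) / len(union)

# Hypothetical fingerprints, invented for this example.
library = {
    "mol_1": {1, 4, 7, 9, 12},
    "mol_2": {1, 4, 7, 10},
    "mol_3": {2, 3, 15},
}
query = {1, 4, 7, 9}

# Rank library molecules from most to least similar to the query.
ranked = sorted(library, key=lambda name: tanimoto(query, library[name]), reverse=True)
for name in ranked:
    print(name, round(tanimoto(query, library[name]), 2))
```

In practice, fingerprints are generated by a cheminformatics toolkit and contain hundreds or thousands of bits, but the ranking logic is the same.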
Open-Source Vector Similarity Search Software and Resources
Moore’s law, cloud computing, and declining resource costs are macro trends that have made artificial intelligence more accessible than ever. Thanks to open-source software and other publicly available resources, building AI/ML applications isn’t just for big tech companies. Below we provide a brief overview of Milvus, an open-source vector data management platform, and also highlight some publicly available datasets that help put AI within everyone’s reach.
Milvus, an Open-Source Vector Data Management Platform
Milvus is an open-source vector data management platform built specifically for massive-scale vector data. Powered by Facebook AI Similarity Search (Faiss), Non-Metric Space Library (NMSLIB), and Annoy, Milvus brings a variety of powerful tools together under a single platform while extending their standalone functionality. The system was purpose-built for storing, processing, and analyzing large vector datasets, and can be used to build all the AI applications (and more) mentioned above.
More information about Milvus can be found on its website. Tutorials, instructions for setting up Milvus, benchmark testing, and information on building a variety of different applications are available in the Milvus boot camp. Developers interested in making contributions to the project can join Milvus’ open-source community on GitHub.
Public Datasets for Artificial Intelligence and Machine Learning
It is no secret that technology giants like Google and Facebook have a data advantage over the little guys, with some pundits even advocating for a “progressive data-sharing mandate” that would force companies that exceed a certain size to share some anonymized data with smaller rivals. Fortunately, there are thousands of publicly available datasets that can be used for AI/ML projects:
- The People’s Speech Dataset: Published by MLCommons, this is billed as the largest speech dataset in the world, with over 87,000 hours of transcribed speech in 59 different languages.
- UC Irvine Machine Learning Repository: The University of California at Irvine maintains hundreds of public datasets in an effort to help the machine learning community.
- Data.gov: The U.S. government offers hundreds of thousands of open datasets that span education, climate, COVID-19, and more.
- Eurostat: The European Union’s statistical office provides open datasets spanning a variety of industries from the economy and finance to population and social conditions.
Although this list is by no means exhaustive, it is a good starting point for discovering the surprisingly wide variety of open datasets. For more information on public datasets as well as choosing the right data for your next ML or data science project, check out this blog post.
Last but not least, you are welcome to join the Milvus community to learn more about vector similarity search and how it can help you with your next AI application!
Opinions expressed by DZone contributors are their own.