
How To Implement Cosine Similarity in Python

Cosine similarity is an indispensable tool that has a wide range of applications, from simplifying searches in large datasets to understanding natural language.

By Phil Miesle · Nov. 23, 23 · Tutorial

Cosine similarity has many real-world applications, and with embedding vectors, we can compare meanings programmatically. Python is one of the most popular languages for data science, and it offers several libraries that calculate cosine similarity with ease. In this article, we’ll discuss how to implement cosine similarity in Python with the help of the scikit-learn and NumPy libraries.

What Is Cosine Similarity?

Cosine similarity is a measure of similarity between two non-zero vectors in an n-dimensional space. It is used in various applications, such as text analysis and recommendation systems, to determine how similar two vectors are in terms of their direction in the vector space.

Cosine Similarity Formula

The cosine similarity between two vectors, A and B, is calculated using the following formula:

Cosine Similarity (A, B) = (A · B) / (||A|| * ||B||)

In this formula, A · B represents the dot product of vectors A and B. This is calculated by multiplying the corresponding components of the two vectors and summing up the results. ||A|| represents the Euclidean norm (magnitude) of vector A, which is the square root of the sum of the squares of its components. It's calculated as ||A|| = √(A₁² + A₂² + ... + Aₙ²). ||B|| represents the Euclidean norm (magnitude) of vector B, calculated in the same way as ||A||.
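
As a worked example, take two illustrative three-dimensional vectors, A = [5, 3, 4] and B = [4, 2, 4]. The values are chosen only to show the arithmetic; substituting them into the formula gives a similarity of roughly 0.99:

```latex
\begin{aligned}
A \cdot B &= (5)(4) + (3)(2) + (4)(4) = 20 + 6 + 16 = 42 \\
\|A\| &= \sqrt{5^2 + 3^2 + 4^2} = \sqrt{50} \approx 7.0711 \\
\|B\| &= \sqrt{4^2 + 2^2 + 4^2} = \sqrt{36} = 6 \\
\text{Cosine Similarity}(A, B) &= \frac{42}{7.0711 \times 6} \approx 0.9899
\end{aligned}
```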

How To Calculate Cosine Similarity

To calculate cosine similarity, first compute the dot product of the two vectors, then divide it by the product of their magnitudes. The resulting value will be in the range of -1 to 1, where:

  • If the cosine similarity is 1, it means the vectors have the same direction and are perfectly similar.
  • If the cosine similarity is 0, it means the vectors are perpendicular to each other and have no similarity.
  • If the cosine similarity is -1, it means the vectors have opposite directions and are perfectly dissimilar.
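
The three cases above can be checked with a small sketch in plain Python. The helper name and the sample vectors are illustrative choices, and the results may differ from exactly 1, 0, and -1 by a tiny floating-point rounding error:

```python
import math

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (magnitude_a * magnitude_b)

# Illustrative 2-D vectors for each case
same_direction = cosine_similarity([1, 2], [2, 4])    # B is A scaled by 2
perpendicular = cosine_similarity([1, 0], [0, 1])     # vectors at a right angle
opposite = cosine_similarity([1, 2], [-1, -2])        # B is A reversed

print(f"Same direction: {same_direction:.2f}")
print(f"Perpendicular:  {perpendicular:.2f}")
print(f"Opposite:       {opposite:.2f}")
```

Note that scaling a vector does not change the result: only direction matters, not magnitude.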

In text analysis, cosine similarity is used to measure the similarity between document vectors, where each document is represented as a vector in a high-dimensional space, with each dimension corresponding to a term or word in the corpus. By calculating the cosine similarity between document vectors, you can determine how similar or dissimilar two documents are to each other.
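
A minimal sketch of this idea, assuming a simple word-count representation in which each dimension is one term of the vocabulary. The three sample sentences and the helper names are hypothetical, and real systems usually apply better tokenization and weighting (for example TF-IDF or learned embeddings):

```python
import math
from collections import Counter

def term_vector(text, vocabulary):
    """Count how often each vocabulary term occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length, non-zero vectors."""
    dot_product = sum(x * y for x, y in zip(a, b))
    magnitude_a = math.sqrt(sum(x * x for x in a))
    magnitude_b = math.sqrt(sum(y * y for y in b))
    return dot_product / (magnitude_a * magnitude_b)

# Hypothetical toy corpus
doc_1 = "the cat sat on the mat"
doc_2 = "the cat lay on the rug"
doc_3 = "stock prices fell sharply today"

# One dimension per distinct term in the corpus
vocabulary = sorted(set(f"{doc_1} {doc_2} {doc_3}".lower().split()))

vec_1, vec_2, vec_3 = (term_vector(d, vocabulary) for d in (doc_1, doc_2, doc_3))

similar = cosine_similarity(vec_1, vec_2)    # documents share several terms
unrelated = cosine_similarity(vec_1, vec_3)  # documents share no terms

print(f"doc_1 vs doc_2: {similar:.2f}")
print(f"doc_1 vs doc_3: {unrelated:.2f}")
```

The first pair shares the terms "the", "cat", and "on", so it scores well above the second pair, which has no terms in common and scores 0.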

Libraries for Cosine Similarity Calculation

  • NumPy: Great for numerical operations, and it's optimized for speed.
  • scikit-learn: Offers various machine learning algorithms and includes a cosine_similarity function in its sklearn.metrics.pairwise module.

The following examples show how cosine similarity can be calculated using Python. We’ll use two example book review vectors, [5, 3, 4] and [4, 2, 4].

Straight Python

Here is how you can compute cosine similarity using Python with no additional libraries:

Python

A = [5, 3, 4]
B = [4, 2, 4]

# Calculate dot product
dot_product = sum(a*b for a, b in zip(A, B))

# Calculate the magnitude of each vector
magnitude_A = sum(a*a for a in A)**0.5
magnitude_B = sum(b*b for b in B)**0.5

# Compute cosine similarity
cosine_similarity = dot_product / (magnitude_A * magnitude_B)

print(f"Cosine Similarity using standard Python: {cosine_similarity}")


NumPy

Embedding vectors will typically have many dimensions — hundreds, thousands, even millions, or more! With NumPy, you can calculate cosine similarity using array operations, which are highly optimized. 

Python

import numpy as np

A = np.array([5, 3, 4])
B = np.array([4, 2, 4])

# Dot product of the two vectors
dot_product = np.dot(A, B)

# Euclidean norm (magnitude) of each vector
magnitude_A = np.linalg.norm(A)
magnitude_B = np.linalg.norm(B)

# Compute cosine similarity
cosine_similarity = dot_product / (magnitude_A * magnitude_B)

print(f"Cosine Similarity using NumPy: {cosine_similarity}")
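
Cosine similarity is only defined for non-zero vectors, so a reusable helper may want to check for that case before dividing. The following is one possible sketch; the function name safe_cosine_similarity and the decision to raise ValueError are illustrative assumptions, not part of NumPy:

```python
import numpy as np

def safe_cosine_similarity(a, b):
    """Cosine similarity of two 1-D vectors; raises ValueError on invalid input."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("Vectors must have the same shape")
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    if norm_product == 0.0:
        # At least one vector has zero magnitude, so the ratio is undefined
        raise ValueError("Cosine similarity is undefined for zero vectors")
    return float(np.dot(a, b) / norm_product)

result = safe_cosine_similarity([5, 3, 4], [4, 2, 4])
print(f"Cosine similarity: {result:.4f}")

try:
    safe_cosine_similarity([0, 0, 0], [4, 2, 4])
except ValueError as error:
    print(f"Rejected: {error}")
```

Depending on the application, returning 0.0 or nan for a zero vector may be a better fit than raising an exception.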


Scikit-Learn

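Scikit-learn's cosine_similarity function makes this even easier. It expects 2-D arrays with one vector per row and returns a matrix of pairwise similarities, which is why the vectors below are wrapped in an extra pair of brackets and the result is read from position [0][0]: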

Python

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Each input is a 2-D array containing a single row vector
A = np.array([[5, 3, 4]])
B = np.array([[4, 2, 4]])

cosine_similarity_result = cosine_similarity(A, B)

print(f"Cosine Similarity using scikit-learn: {cosine_similarity_result[0][0]}")


Tips for Optimizing Cosine Similarity Calculations in Python

If you are going to use Python to directly compute cosine similarity, there are some things to consider:

  • Use optimized libraries like NumPy or scikit-learn: These libraries are optimized for performance and are generally faster than vanilla Python.
  • Use Numba: Numba is an open-source JIT compiler for Python and NumPy code, built specifically to optimize scientific computing functions. 
  • Use GPUs: If you have access to a GPU, use Python libraries such as TensorFlow that have been optimized for use on a GPU.
  • Parallelize Computations: If you have the hardware capabilities, consider parallelizing your computations to speed them up.
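
As a rough sketch of the first tip, comparing one query vector against many stored vectors can be written as a single matrix-vector product in NumPy instead of a Python loop. The array names, sizes, and random data below are hypothetical placeholders for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical data: 10,000 stored embeddings with 128 dimensions, plus one query
vectors = rng.normal(size=(10_000, 128))
query = rng.normal(size=128)

# Scale every vector to unit length so that a dot product equals cosine similarity
unit_vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
unit_query = query / np.linalg.norm(query)

# One matrix-vector product yields all 10,000 similarities at once
similarities = unit_vectors @ unit_query

# Indices of the five most similar stored vectors, best match first
top_5 = np.argsort(similarities)[::-1][:5]

print(f"Top 5 indices: {top_5}")
print(f"Top 5 scores:  {similarities[top_5]}")
```

If the stored vectors do not change between queries, the normalized matrix can be computed once and reused.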

Search Large Numbers of Vectors With Vector Search on Astra DB

If you need to search large numbers of vectors, you may find it more efficient and scalable to use a vector database such as DataStax Astra’s Vector Search capability. Vector Search on Astra DB offers a powerful platform to help you execute vector searches with built-in cosine similarity calculations so you can get more insights from your data.
