DZone
Thanks for visiting DZone today,
Edit Profile
  • Manage Email Subscriptions
  • How to Post to DZone
  • Article Submission Guidelines
Sign Out View Profile
  • Post an Article
  • Manage My Drafts
Over 2 million developers have joined DZone.
Log In / Join
Refcards Trend Reports
Events Video Library
Refcards
Trend Reports

Events

View Events Video Library

Related

  • Query Federation in Data Virtualization and Best Practices
  • Which Tool Is Better for Code Completion — Azure Data Studio or dbForge SQL Complete?
  • Cloud Database Services Compared: AWS, Microsoft, Google, and Oracle
  • Part 2 - How to Hive on GCP using Google DataProc and Cloud Storage

Trending

  • Run Gemma 4 on Your Laptop: A Hands-On Guide to Google's Latest Open Multimodal LLM
  • A Hands-On ABAP RESTful Programming Model Guide
  • A Deep Dive into Tracing Agentic Workflows (Part 1)
  • Ujorm3: A New Lightweight ORM for JavaBeans and Records
  1. DZone
  2. Data Engineering
  3. Databases
  4. Dot Product, Euclidean Distance, and Cosine Similarity in SingleStoreDB

Dot Product, Euclidean Distance, and Cosine Similarity in SingleStoreDB

In the below article, learn examples of using Dot Product, Euclidean Distance, and Cosine Similarity in SingleStoreDB to assess vector relationships.

By 
Akmal Chaudhri user avatar
Akmal Chaudhri
DZone Core CORE ·
Jan. 08, 24 · Analysis
Likes (2)
Comment
Save
Tweet
Share
3.2K Views

Join the DZone community and get the full member experience.

Join For Free

SingleStoreDB has supported vector functions since 2017. In this article, we'll see examples of using Dot Product, Euclidean Distance, and Cosine Similarity in SingleStoreDB to assess vector relationships. New vector capabilities will be available in the next major release of SingleStoreDB - stay tuned.

The Dot Product

The Dot Product is a mathematical procedure combining two numbers lists into a single value. This is achieved by multiplying corresponding numbers from each list and then adding up the products. This operation is useful for tasks like measuring the commonality between two sets of values or assessing the influence of one set on another.

Euclidean Distance

Euclidean Distance is a mathematical metric that measures the straight-line distance between two points in a Cartesian coordinate system. To compute it, square the differences between the x and y coordinates of the two points, sum these squared differences, and then take the square root of the result. The final output is a single numerical representation of the spatial separation between the points, like the direct walking distance between them on a graph.

Cosine Similarity

Cosine Similarity is a mathematical measure that gauges the similarity between two sets, such as two lists of words. Instead of focusing on the specific words, it considers the angle between the sets, irrespective of their sizes. Visualize these sets as vectors in a multi-dimensional space. Cosine Similarity essentially assesses how well the directions of these vectors align. High similarity arises when the vectors point in the same direction, while low similarity occurs when they are perpendicular. This method is commonly employed in comparing documents or texts, providing insights into whether they discuss similar topics, regardless of the specific wording.

SingleStoreDB

SingleStoreDB provides direct support for Dot Product and Euclidean Distance using the vector functions DOT_PRODUCT and EUCLIDEAN_DISTANCE, respectively. Cosine Similarity is supported by combining the DOT_PRODUCT and SQRT functions. In this article, we'll see some simple examples of each function.

Create a SingleStoreDB Cloud account

A previous article showed the steps to create a free SingleStoreDB Cloud account. We'll use the following settings:

  • Workspace Group Name: Iris Demo Group
  • Cloud Provider: AWS
  • Region: US East 1 (N. Virginia)
  • Workspace Name: iris-demo
  • Size: S-00

Create a Database and Table

In our SingleStore Cloud account, let's use the SQL Editor to create a new database. Call this iris_db, as follows:

SQL
 
CREATE DATABASE IF NOT EXISTS iris_db;


We'll also create the iris table, as follows:

SQL
 
USE iris_db;

CREATE TABLE IF NOT EXISTS iris (
     vector BLOB,
     species VARCHAR(20)
);


The Iris data set contains 150 rows of data for three different species of flowers. Each flower has four columns of data: sepal_length, sepal_width, petal_length and petal_width. We can store these four column values together using JSON_ARRAY_PACK in a BLOB format, as follows:

SQL
 
INSERT INTO iris VALUES
(JSON_ARRAY_PACK('[5.1,3.5,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.9,3,1.4,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.7,3.2,1.3,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[4.6,3.1,1.5,0.2]'),'Iris-setosa'),
(JSON_ARRAY_PACK('[5,3.6,1.4,0.2]'),'Iris-setosa'),
...
(JSON_ARRAY_PACK('[6.7,3,5.2,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.3,2.5,5,1.9]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.5,3,5.2,2]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[6.2,3.4,5.4,2.3]'),'Iris-virginica'),
(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'),'Iris-virginica');


Only the first five and last five rows are shown above. The complete INSERT code listing is available in a GitHub Gist.

Query 1

First, let's try a query where we want to find the name of the flower species using an exact match for the sepal_length, sepal_width, petal_length and petal_width. We'll use the values [5.9,3,5.1,1.8] from the last row of the iris table, shown above.

Dot Product

SQL
 
SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = DOT_PRODUCT(vector, vector);


This SQL query retrieves the species from the iris table for rows where the Dot Product of the vector in the vector column with the specified vector [5.9,3,5.1,1.8] is equal to the Dot Product of the vector a column with itself. Since there's only one row with the specified values in the vector column, the query essentially identifies the species for that particular set of values in the iris table.

Euclidean Distance

SQL
 
SELECT species
FROM iris
WHERE EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) = 0;


This SQL query retrieves the species from the iris table for rows where the Euclidean Distance between the vector in the vector column and the specified vector [5.9,3,5.1,1.8] is equal to 0. In other words, they are the same.

Cosine Similarity

SQL
 
SELECT species
FROM iris
WHERE DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.9,3,5.1,1.8]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'), JSON_ARRAY_PACK('[5.9,3,5.1,1.8]'))) = 1;


This SQL query retrieves the species values from the iris table for rows where the normalized Dot Product of the vector in the vector column with the specified vector [5.9,3,5.1,1.8] equals 1. The condition checks for parallel vectors, indicating a high similarity or identical direction between the two vectors.

The result in each case should be:

Plain Text
 
+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+

Query 2

Now, let's use some fictitious data values [5.2,3.6,1.5,0.3] to make a prediction.

Dot Product

SQL
 
SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) DESC
LIMIT 1;


This SQL query retrieves the species from the iris table and orders the results in descending order based on the Dot Product between the vectors in the vector column and the specified vector [5.2,3.6,1.5,0.3]. The LIMIT 1 ensures that only the top result is returned.

The result should be:

Plain Text
 
+----------------+
| species        |
+----------------+
| Iris-virginica |
+----------------+

Euclidean Distance

SQL
 
SELECT species
FROM iris
ORDER BY EUCLIDEAN_DISTANCE(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))
LIMIT 1;


This SQL query retrieves the species from the iris table and orders the results based on the Euclidean Distance between the vectors in the vector column and the specified vector [5.2,3.6,1.5,0.3]. LIMIT 1 ensures that only the top result, closest in Euclidean Distance, is returned.

The output should be:

Plain Text
 
+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+

Cosine Similarity

If we want to use DOT_PRODUCT and achieve a similar result EUCLIDEAN_DISTANCE, we can use the following approach. The DOT_PRODUCT doesn't directly represent distance, so we must modify the query. One possible adjustment is to consider the angle between vectors.

SQL
 
SELECT species
FROM iris
ORDER BY DOT_PRODUCT(vector, JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]')) / SQRT(DOT_PRODUCT(vector, vector) * DOT_PRODUCT(JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'), JSON_ARRAY_PACK('[5.2,3.6,1.5,0.3]'))) DESC
LIMIT 1;


This SQL query normalizes the DOT_PRODUCT by dividing it by the product of the magnitudes of the vectors. This adjustment helps to account for differences in vector magnitudes and makes it DOT_PRODUCT more comparable to a Cosine Similarity, which considers the angle between vectors. The ordering is done in descending order to prioritize higher similarity. The output should be:

Plain Text
 
+-------------+
| species     |
+-------------+
| Iris-setosa |
+-------------+

Summary

In this short article, we've seen several examples of how to use SingleStoreDB's built-in DOT_PRODUCT and EUCLIDEAN_DISTANCE vector functions. We have also seen how we can easily represent Cosine Similarity by combining the DOT_PRODUCT and SQRT functions.

We've used the functions to find an exact match and made predictions based on new, previously unknown, values.

Database Cloud IRIS (transportation software) sql

Published at DZone with permission of Akmal Chaudhri. See the original article here.

Opinions expressed by DZone contributors are their own.

Related

  • Query Federation in Data Virtualization and Best Practices
  • Which Tool Is Better for Code Completion — Azure Data Studio or dbForge SQL Complete?
  • Cloud Database Services Compared: AWS, Microsoft, Google, and Oracle
  • Part 2 - How to Hive on GCP using Google DataProc and Cloud Storage

Partner Resources

×

Comments

The likes didn't load as expected. Please refresh the page and try again.

  • RSS
  • X
  • Facebook

ABOUT US

  • About DZone
  • Support and feedback
  • Community research

ADVERTISE

  • Advertise with DZone

CONTRIBUTE ON DZONE

  • Article Submission Guidelines
  • Become a Contributor
  • Core Program
  • Visit the Writers' Zone

LEGAL

  • Terms of Service
  • Privacy Policy

CONTACT US

  • 3343 Perimeter Hill Drive
  • Suite 215
  • Nashville, TN 37211
  • [email protected]

Let's be friends:

  • RSS
  • X
  • Facebook