Build Vector Embeddings for Video via Python Notebook and OpenAI CLIP

Delve into AI's capabilities to analyze video data and how vector embeddings, created with Python and OpenAI CLIP, can help interpret and analyze video content.

By Akmal Chaudhri · Sep. 23, 24 · Tutorial

As AI continues to reshape many types of data processing, vector embeddings have emerged as a powerful tool for video analysis. This article explores how vector embeddings, created using Python and OpenAI CLIP, can be used to interpret and analyze video content. We'll discuss the significance of vector embeddings in video analysis and offer a step-by-step guide to building them using a simple example.

The notebook file used in this article is available on GitHub.

Tutorial

1. Create a SingleStore Cloud Account

A previous article showed the steps to create a free SingleStore Cloud account. We'll use the Free Shared Tier and take the default names for the Workspace and Database.

2. Import the Notebook

We'll download the notebook from GitHub (linked in the article introduction).

From the left navigation pane in the SingleStore cloud portal, we'll select DEVELOP > Data Studio.

In the top right of the web page, we'll select New Notebook > Import From File. We'll use the wizard to locate and import the notebook we downloaded from GitHub.

3. Run the Notebook

After checking that we are connected to our SingleStore workspace, we'll run the cells one by one.
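
The notebook imports a number of Python libraries before the main steps. A representative import cell, inferred from the code shown in this article rather than copied from the notebook, might look like this:

Python
 
# Imports inferred from the code used throughout this tutorial
# (clip is OpenAI's CLIP package, installable from github.com/openai/CLIP)
import clip
import cv2
import numpy as np
import pandas as pd
import requests
import torch
import matplotlib.pyplot as plt

from io import BytesIO
from IPython.display import Image, display
from PIL import Image as PILImage
from tqdm import tqdm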

We'll start by downloading an example video from GitHub and then playing the short video directly in the notebook. The example video is 142 seconds long.
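
A minimal sketch of this step, assuming the notebook fetches the file with requests and plays it with IPython's Video display (the URL and filename below are placeholders, not the notebook's actual values):

Python
 
import requests
from IPython.display import Video

# Placeholder URL and filename: the notebook points at the actual example video on GitHub
video_url = "https://github.com/<user>/<repo>/raw/main/example_video.mp4"
video_path = "example_video.mp4"

# Download the video and save it locally
with open(video_path, "wb") as f:
    f.write(requests.get(video_url).content)

# Play the video inline in the notebook
Video(video_path, width = 640)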

Contrastive Language-Image Pretraining (CLIP) is a model by OpenAI that understands both images and text by associating them in a shared embedding space. We'll load it as follows:

Python
 
# Use a GPU if one is available, then load the ViT-B/32 CLIP model and its preprocessing transform
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device = device)


We'll break the video down into individual frames, sampling approximately one frame per second, as follows:

Python
 
def extract_frames(video_path):
    frames = []
    cap = cv2.VideoCapture(video_path)
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    total_seconds = total_frames / frame_rate
    # Sample roughly one frame per second of video
    target_frame_count = int(total_seconds)
    target_frame_index = 0
    for i in range(target_frame_count):
        # Jump to the next one-second mark and read a single frame
        cap.set(cv2.CAP_PROP_POS_FRAMES, target_frame_index)
        ret, frame = cap.read()
        if not ret:
            break
        frames.append(frame)
        target_frame_index += int(frame_rate)
    cap.release()
    return frames


Next, we'll convert each frame into a CLIP embedding, a compact numerical representation of its visual content:

Python
 
def generate_embedding(frame):
    frame_tensor = preprocess(PILImage.fromarray(frame)).unsqueeze(0).to(device)
    with torch.no_grad():
        embedding = model.encode_image(frame_tensor).cpu().numpy()
    return embedding[0]


We'll now process the whole video, storing each frame's number, embedding, and raw image data in a Pandas DataFrame for further analysis:

Python
 
def store_frame_embedding_and_image(video_path):
    frames = extract_frames(video_path)
    data = [
        (i+1, generate_embedding(frame), frame)
        for i, frame in enumerate(tqdm(
            frames,
            desc = "Processing frames",
            bar_format = "{l_bar}{bar}| {n_fmt}/{total_fmt} [{elapsed}<{remaining}, {rate_fmt}{postfix}]")
        )
    ]
    return pd.DataFrame(data, columns = ["frame_number", "embedding_data", "frame_data"])
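
The rest of the notebook works with a DataFrame named df built from these helpers. A minimal sketch of that call, assuming the local video file saved in the download step (video_path is a placeholder for the notebook's actual filename):

Python
 
# Build the DataFrame of frame numbers, CLIP embeddings, and raw frame data
df = store_frame_embedding_and_image(video_path)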


Let's examine the size characteristics of the data stored in the DataFrame:

Python
 
embedding_lengths = df["embedding_data"].str.len()
frame_lengths = df["frame_data"].str.len()

# Calculate min and max lengths for embeddings and frames
min_embedding_length, max_embedding_length = embedding_lengths.min(), embedding_lengths.max()
min_frame_length, max_frame_length = frame_lengths.min(), frame_lengths.max()

# Print results
print(f"Min length of embedding vectors: {min_embedding_length}")
print(f"Max length of embedding vectors: {max_embedding_length}")
print(f"Min length of frame data vectors: {min_frame_length}")
print(f"Max length of frame data vectors: {max_frame_length}")


Example output:

Plain Text
 
Min length of embedding vectors: 512
Max length of embedding vectors: 512
Min length of frame data vectors: 1080
Max length of frame data vectors: 1080


Next, we'll define a function that measures how similar a query embedding is to each frame's embedding in the DataFrame, using a dot product:

Python
 
def calculate_similarity(query_embedding, df):
    # Convert the query embedding to a tensor
    query_tensor = torch.tensor(query_embedding, dtype = torch.float32).to(device)

    # Convert the list of embeddings to a numpy array
    embeddings_np = np.array(df["embedding_data"].tolist())

    # Create a tensor from the numpy array
    embeddings_tensor = torch.tensor(embeddings_np, dtype = torch.float32).to(device)

    # Compute similarities using matrix multiplication
    similarities = torch.mm(embeddings_tensor, query_tensor.unsqueeze(1)).squeeze().tolist()
    return similarities


Now, we'll encode a text query into the same embedding space using CLIP's text encoder:

Python
 
def encode_text_query(query):
    # Tokenize the query text
    tokens = clip.tokenize([query]).to(device)
    
    # Compute text features using the pretrained model
    with torch.no_grad():
        text_features = model.encode_text(tokens)
    
    # Convert the tensor to a NumPy array and return it
    return text_features.cpu().numpy().flatten()


We'll prompt for a query and enter the string "Ultra-Fast Ingestion":

Python
 
query = input("Enter your query: ")
text_query_embedding = encode_text_query(query)
text_similarities = calculate_similarity(text_query_embedding, df)
df["text_similarity"] = text_similarities


We'll now retrieve the top five text matches:

Python
 
# Retrieve the top 5 text matches based on similarity
top_text_matches = df.nlargest(5, "text_similarity")

print("Top 5 best matches:")
print(top_text_matches[["frame_number", "text_similarity"]].to_string(index = False))


Example output:

Plain Text
 
Top 5 best matches:
 frame_number  text_similarity
           40        36.456184
           39        36.081161
           43        33.295975
           42        32.423229
           45        31.931164


We can also plot the frames:

Python
 
def plot_frames(frames, frame_numbers):
    num_frames = len(frames)
    fig, axes = plt.subplots(1, num_frames, figsize = (15, 5))
    
    for ax, frame_data, frame_number in zip(axes, frames, frame_numbers):
        ax.imshow(frame_data)
        ax.set_title(f"Frame {frame_number}")
        ax.axis("off")
    
    plt.tight_layout()
    plt.show()

# Collect frame data and numbers for the top text matches
top_text_matches_indices = top_text_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_text_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_text_matches_indices]

# Plot the frames
plot_frames(frames, frame_numbers)


Similarly, we'll encode an image query into the same embedding space using CLIP's image encoder:

Python
 
def encode_image_query(image):
    # Preprocess the image and add batch dimension
    image_tensor = preprocess(image).unsqueeze(0).to(device)
    
    # Extract features using the model
    with torch.no_grad():
        image_features = model.encode_image(image_tensor)
    
    # Convert features to NumPy array and flatten
    return image_features.cpu().numpy().flatten()


Next, we'll download an example image to use as a query:

Python
 
image_url = "https://github.com/VeryFatBoy/clip-demo/raw/main/thumbnails/1_what_makes_singlestore_unique.png"

response = requests.get(image_url)

if response.status_code == 200:
    display(Image(url = image_url))
    image_file = PILImage.open(BytesIO(response.content))

    image_query_embedding = encode_image_query(image_file)
    image_similarities = calculate_similarity(image_query_embedding, df)
    df["image_similarity"] = image_similarities
else:
    print("Failed to download the image, status code:", response.status_code)


We'll now retrieve the top five image matches:

Python
 
top_image_matches = df.nlargest(5, "image_similarity")

print("Top 5 best matches:")
print(top_image_matches[["frame_number", "image_similarity"]].to_string(index = False))


Example output:

Plain Text
 
Top 5 best matches:
 frame_number  image_similarity
            7         57.674603
            9         43.669739
            6         42.573799
           15         40.296551
           93         40.201733


We can also plot the frames:

Python
 
# Collect frame data and numbers for the top image matches
top_image_matches_indices = top_image_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_image_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_image_matches_indices]

# Plot the frames
plot_frames(frames, frame_numbers)


Now, let's combine the text and image queries by normalizing each embedding and averaging them element-wise:

Python
 
# Normalise
text_query_embedding /= np.linalg.norm(
    text_query_embedding,
    axis = -1,
    keepdims = True
)
image_query_embedding /= np.linalg.norm(
    image_query_embedding,
    axis = -1,
    keepdims = True
)

combined_query_embedding = (text_query_embedding + image_query_embedding) / 2
combined_similarities = calculate_similarity(combined_query_embedding, df)
df["combined_similarity"] = combined_similarities


We'll now retrieve the top five combined matches:

Python
 
top_combined_matches = df.nlargest(5, "combined_similarity")

print("Top 5 best matches:")
print(top_combined_matches[["frame_number", "combined_similarity"]].to_string(index = False))


Example output:

Plain Text
 
Top 5 best matches:
 frame_number  combined_similarity
            7             4.304160
            6             3.673842
            5             3.613622
           93             3.595592
           94             3.559316


We can also plot the frames:

Python
 
# Collect frame data and numbers for the top combined matches
top_combined_matches_indices = top_combined_matches.index.tolist()
frames = [df.at[index, "frame_data"] for index in top_combined_matches_indices]
frame_numbers = [df.at[index, "frame_number"] for index in top_combined_matches_indices]

# Plot the frames
plot_frames(frames, frame_numbers)


Next, we'll store the data in SingleStore. First, we'll prepare the data:

Python
 
frames_df = df.copy()
frames_df.drop(
    columns = ["text_similarity", "image_similarity", "combined_similarity"],
    inplace = True
)

query_string = combined_query_embedding.copy()


We'll also need to perform a little data cleanup, converting the NumPy arrays to strings so they can be stored in the database:

Python
 
def process_data(arr):
    return np.array2string(arr, separator = ",").replace("\n", "")

frames_df["embedding_data"] = frames_df["embedding_data"].apply(process_data)
frames_df["frame_data"] = frames_df["frame_data"].apply(process_data)
query_string = process_data(query_string)


We'll check whether we are running on the Free Shared Tier:

Python
 
shared_tier_check = %sql SHOW VARIABLES LIKE "is_shared_tier"
if not shared_tier_check or shared_tier_check[0][1] == "OFF":
    %sql DROP DATABASE IF EXISTS video_db;
    %sql CREATE DATABASE IF NOT EXISTS video_db;


Then we'll get a connection to the database:

Python
 
from sqlalchemy import create_engine

# connection_url is provided by the SingleStore notebook environment
db_connection = create_engine(connection_url)


We'll ensure a table is available to store the data:

SQL
 
DROP TABLE IF EXISTS frames;

CREATE TABLE IF NOT EXISTS frames (
    frame_number INT(10) UNSIGNED NOT NULL,
    embedding_data VECTOR(512) NOT NULL,
    frame_data TEXT,
    KEY(frame_number)
);


Then write the DataFrame to SingleStore:

Python
 
frames_df.to_sql(
    "frames",
    con = db_connection,
    if_exists = "append",
    index = False,
    chunksize = 1000
)


We can read some data back from SingleStore:

SQL
 
SELECT frame_number,
    SUBSTRING(embedding_data, 1, 50) AS embedding_data,
    SUBSTRING(frame_data, 1, 50) AS frame_data
FROM frames
LIMIT 1;


We can also create an ANN index:

SQL
 
ALTER TABLE frames ADD VECTOR INDEX (embedding_data)
     INDEX_OPTIONS '{
          "index_type":"AUTO",
          "metric_type":"DOT_PRODUCT"
     }';


First, let's run a query without using the ANN index:

SQL
 
SELECT frame_number,
    embedding_data <*> :query_string AS similarity
FROM frames
ORDER BY similarity USE INDEX () DESC
LIMIT 5;


Example output:

Plain Text
 
frame_number         similarity
           7  4.304159641265869
           6  3.673842668533325
           5 3.6136221885681152
          93 3.5955920219421387
          94 3.5593154430389404


Now, we'll run a query using the ANN index:

SQL
 
SELECT frame_number,
    embedding_data <*> :query_string AS similarity
FROM frames
ORDER BY similarity DESC
LIMIT 5;


Example output:

Plain Text
 
frame_number         similarity
           7  4.304159641265869
           6  3.673842668533325
           5 3.6136221885681152
          93 3.5955920219421387
          94 3.5593154430389404


We can also use Python as an alternative:

Python
 
sql_query = """
SELECT frame_number, embedding_data, frame_data
FROM frames
ORDER BY embedding_data <*> %s DESC
LIMIT 5;
"""

new_frames_df = pd.read_sql(
    sql_query,
    con = db_connection,
    params = (query_string,)
)

new_frames_df.head()


Since we are only storing a small quantity of data (142 rows), the results are identical whether we use the ANN index or not. Our results from querying the database agree with our earlier results for the combined query.
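
One quick, informal way to confirm that agreement (not part of the original notebook) is to compare the frame numbers from the database query with the earlier in-memory results:

Python
 
# Compare the frame numbers returned by SingleStore with the earlier in-memory top matches
db_frames = new_frames_df["frame_number"].tolist()
in_memory_frames = top_combined_matches["frame_number"].tolist()

print("Database results: ", db_frames)
print("In-memory results:", in_memory_frames)
print("Identical order:  ", db_frames == in_memory_frames)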

Summary

In this article, we applied vector embeddings for video analysis using Python and OpenAI's CLIP model. We saw how to extract frames from a video, generate embeddings for each frame, and use these embeddings to perform similarity searches based on text and image queries. This allowed us to retrieve relevant video segments, making it a useful tool for video content analysis.

Today, many modern LLMs offer multimodal capabilities and extensive support for audio, images, and video. However, the example in this article showed that freely available software can achieve some of the same capabilities.


