
Performing Advanced Facebook Event Data Analysis With a Vector Database

Learn how MyScale enhances event data analysis with Apify to scrape and analyze Facebook event data, offering semantic search capabilities for better insights.

By Usama Jamil · Sep. 18, 24 · Tutorial

In today's digital age, professionals across all industries must stay updated with upcoming events, conferences, and workshops. However, efficiently finding events that align with one's interests amidst the vast ocean of online information presents a significant challenge.

This blog introduces an innovative solution to this challenge: an application designed to scrape event data from Facebook and analyze it using MyScale. While MyScale is commonly associated with the RAG tech stack or used as a vector database, its capabilities extend beyond these realms. We will use it for data analysis, leveraging its vector search functionality to find events that are semantically similar, thus providing better results and insights.

You may have noticed that Grok AI used the Qdrant vector database as the search engine to retrieve real-time information from X (formerly Twitter) data. You can assess the power of vector databases in the same way with MyScale by integrating it with platforms like Apify to enhance daily-life tasks through simple, personalized applications.

So in this blog, let’s develop an application that takes only the name of a city as input and scrapes all related events from Facebook. Subsequently, we will conduct data analysis and semantic search using the advanced SQL vector capabilities of MyScale.

Tools and Technologies

We’ll use several tools, including Apify, MyScale, and OpenAI, to develop this application.

  • Apify: A popular web scraping and automation platform that significantly streamlines data collection. It can scrape data and subsequently feed it to LLMs, which lets us build applications on top of real-time data.
  • MyScale: A SQL vector database that we use to store and process both structured and unstructured data in an optimized way.
  • OpenAI: We will use the text-embedding-3-small model from OpenAI to get embeddings of the event descriptions and then save those embeddings in MyScale for data analysis and semantic search (see the short sketch after this list).
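
To make the embeddings step concrete before we wire everything together, here is a minimal sketch of a single call to the OpenAI embeddings endpoint (the API key is a placeholder and the sample sentence is made up; the full helper used in this tutorial is defined later):

Python
 
from openai import OpenAI

# Placeholder key; replace with your own OpenAI API key
openai_client = OpenAI(api_key="your_openai_api_key_here")

# Embed one example sentence with text-embedding-3-small
response = openai_client.embeddings.create(
    input=["Live jazz night in Brooklyn"],
    model="text-embedding-3-small"
)
print(len(response.data[0].embedding))  # text-embedding-3-small returns 1536 dimensions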

How To Set Up MyScale and Apify

To start setting up MyScale and Apify, you'll need to create a new directory and a Python notebook file. You can do this by opening your terminal or command line and entering the following commands:

Shell
 
mkdir MyScale
cd MyScale
touch main.ipynb


Let's install the packages. Copy the command below, and paste it into your terminal. These packages will provide the tools and libraries we need to develop our application.

Shell
 
pip install openai apify-client clickhouse-connect pandas numpy


This should install all the dependencies in your system. To confirm that everything is installed properly, you can enter the following command in your terminal.

Shell
 
pip freeze | egrep '(openai|apify-client|clickhouse-connect|pandas|numpy)'


This should include all the installed dependencies with their versions. If you spot any missing dependencies, you may need to re-run the installation command for that specific package. Now, we're ready to write our code after the installations.

Note: We will be working in a Python notebook. Consider every code block a notebook cell.

How To Scrape Data With Apify

Now, we will use the Apify API to scrape event data for New York City using the Facebook Events scraper.

Python
 
import pandas as pd
from apify_client import ApifyClient
# Initialize the ApifyClient with your API token
client = ApifyClient("Enter_your_apify_key_here")

# Prepare the Actor input
run_input = {
    "searchQueries": ["Sport New York"],
    "startUrls": [],
    "maxEvents": 50,
}

# Run the Actor and wait for it to finish
run = client.actor("UZBnerCFBo5FgGouO").call(run_input=run_input)

df_columns = ['Name', 'Datetime', 'Description', 'Users_Going', 'Users_Interested', 'Users_Responded', 'City', 'Organized_By', 'Street_Address']
rows = []

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    row = {
        'Name': item.get('name', ''),
        'Datetime': item.get('dateTimeSentence', ''),
        'Description': item.get('description', ''),
        'Users_Going': item.get('usersGoing', ''),
        'Users_Interested': item.get('usersInterested', ''),
        'Users_Responded': item.get('usersResponded', ''),
        'City': item.get('location', {}).get('city', '') if item.get('location') else '',
        'Organized_By': item.get('organizedBy', ''),
        'Street_Address': item.get('location', {}).get('streetAddress', '') if item.get('location') else ''
    }
    # Use a dictionary comprehension to replace any remaining None values with an empty string
    row = {k: (v if v is not None else '') for k, v in row.items()}
    rows.append(row)

# Build the DataFrame in one go instead of appending row by row (DataFrame.append was removed in pandas 2.x)
dataframe1 = pd.DataFrame(rows, columns=df_columns)

# Cleaning the data
dataframe1['Description'] = dataframe1['Description'].replace('\\n', '', regex=True)


This script gives us the upcoming event details in the form of a pandas DataFrame. 
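
To get a quick feel for what the scraper returned, you can peek at the first few rows (a minimal check; the column contents depend on what the actor finds for your search query):

Python
 
# Inspect the shape and the first few scraped events
print(dataframe1.shape)
print(dataframe1.head(3))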

Note: Don’t forget to add your Apify API key in the script above. You can find your API token on the Integrations page in the Apify Console.
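
If you prefer not to hardcode the token, a minimal alternative is to read it from an environment variable (this sketch assumes you have exported APIFY_API_KEY in your shell beforehand):

Python
 
import os
from apify_client import ApifyClient

# Read the Apify token from the environment instead of hardcoding it in the notebook
client = ApifyClient(os.environ["APIFY_API_KEY"])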

Data Pre-Processing

Raw scraped data comes in various formats. In this script, we bring the event dates into a single format so that filtering by date can be done more efficiently.

Python
 
# Import necessary libraries for data manipulation and date parsing
import pandas as pd
import numpy as np
from datetime import datetime
from dateutil import parser

# Function to parse date strings that may represent a range or a single date
def parse_dates(date_str):
    # Check if the date string contains a dash, indicating a range
    if '-' in date_str:
        parts = date_str.split('-')
        # If the string splits into two parts, it's a valid range
        if len(parts) == 2:
            try:
                # Parse start and end dates, formatting them to a readable format
                start_date = parser.parse(parts[0], fuzzy=True).strftime('%a, %b %d')
                end_date = parser.parse(parts[1], fuzzy=True).strftime('%a, %b %d')
                return start_date, end_date
            except ValueError:
                # In case of parsing error, do nothing (will handle below)
                pass  
    # If not a range or if parsing the range failed, try parsing as a single date
    try:
        parsed_date = parser.parse(date_str, fuzzy=True)
        # Format the single date for start_date and format differently for end_date
        start_date = parsed_date.strftime('%a, %b %d AT %I:%M %p EDT')
        end_date = parsed_date.strftime('%a, %b %d')  # Omitting time for end_date
        return start_date, end_date
    except ValueError:
        # Return NaN for both dates if parsing fails
        return np.nan, np.nan  

# Function to extract detailed date, time, and day from a date string
def extract_date_time_day(date_str):
    try:
        # Parse the date string, allowing for some flexibility in input format
        parsed_date = parser.parse(date_str, fuzzy=True)
        # Extract and format the date, time, and day parts
        date = parsed_date.strftime('%Y-%m-%d')
        day = parsed_date.strftime('%a')
        # time_missing is True when the formatted time does not appear in the original string
        time_missing = parsed_date.strftime('%I:%M %p') not in date_str
        time = parsed_date.strftime('%H:%M:%S') if not time_missing else np.nan
    except ValueError:
        # If parsing fails, set date, time, and day to NaN
        date, time, day = np.nan, np.nan, np.nan
    
    return date, time, day

# Apply parse_dates function across the dataframe, creating new columns for start and end dates
dataframe1[['Start_Date', 'End_Date']] = dataframe1.apply(lambda row: pd.Series(parse_dates(row['Datetime'])), axis=1)

# Drop rows where Start_Date is NaN, indicating parsing was unsuccessful
dataframe = dataframe1.dropna(subset=['Start_Date']).copy()  # .copy() avoids SettingWithCopyWarning on the assignments below

# Apply extract_date_time_day to split the start and end dates into separate date, time, and day columns
dataframe['Start_Date'], dataframe['Start_Time'], dataframe['Start_Day'] = zip(*dataframe['Start_Date'].apply(extract_date_time_day))
dataframe['End_Date'], _, dataframe['End_Day'] = zip(*dataframe['End_Date'].apply(extract_date_time_day))

# Remove the original 'Datetime' column as it's no longer needed
dataframe=dataframe.drop(['Datetime'], axis=1)

# Convert 'Start_Date' and 'End_Date' to datetime format, extracting just the date part
dataframe['Start_Date'] = pd.to_datetime(dataframe['Start_Date']).dt.date
dataframe['End_Date'] = pd.to_datetime(dataframe['End_Date']).dt.date

# Convert 'Start_Time' to datetime format, retaining the time information
dataframe['Start_Time'] = pd.to_datetime(dataframe['Start_Time'])


This code snippet uses pandas together with Python's datetime and dateutil packages to normalize the event date formats.
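
As a quick sanity check, you can call parse_dates on a made-up date string and confirm the two formats it returns (the exact weekday printed depends on the year dateutil assumes for the fuzzy parse):

Python
 
# Example input invented for illustration; not taken from the scraped data
print(parse_dates("Jul 20 at 7:00 PM - Jul 21"))
# Prints something like ('Sat, Jul 20', 'Sun, Jul 21')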

Generating Embeddings

To deeply understand and search events, we will generate embeddings from their descriptions using OpenAI's text-embedding-3-small model. These embeddings capture the semantic essence of each event, helping the application return better results.

Python
 
# Import OpenAI library for API access.
from openai import OpenAI 
# Initialize OpenAI client with an API key.
openai_client = OpenAI(api_key="your_openai_api_key_here")
# Function to get text embeddings 
def get_embedding(text, model="text-embedding-3-small"):
    return openai_client.embeddings.create(input=text, model=model).data
embeddings = get_embedding(dataframe["Description"].tolist())
# Extract embedding vectors from the embeddings object
vectors = [embedding.embedding for embedding in embeddings]
# Add embeddings as a new column in the DataFrame.
# Assigning a plain list keeps the values aligned positionally, even though the earlier dropna left gaps in the index.
dataframe['Description_Embeddings'] = vectors
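
Before writing anything to the database, it is worth a quick check that every vector has the 1,536 dimensions that text-embedding-3-small produces, since the table we create later enforces that length:

Python
 
# Sanity check: every embedding must be 1536-dimensional to satisfy the table constraint defined later
assert all(len(v) == 1536 for v in dataframe['Description_Embeddings']), "Unexpected embedding dimension"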


Now, we will insert the new DataFrame with embeddings to MyScale.

Connecting With MyScale

As discussed at the start, we will use MyScale as the vector database for storing and managing the data. Here, we connect to MyScale in preparation for data storage.

Python
 
import clickhouse_connect
client = clickhouse_connect.get_client(
    host='host_name_here',
    port=443,
    username='username_here',
    password='passwd_here'
)


This connection setup ensures our application can communicate with MyScale and use SQL’s power for data manipulation and analysis.

Note: See Connection Details for more information on how to connect to a MyScale cluster.

Create Tables and Indexes Using MyScale

We now create a table according to our DataFrame. All the data will be stored in this table, including the embeddings.

Python
 
client.command("""
    CREATE TABLE default.Events (
    Name String,
    Description String,
    Users_Going Int64,
    Users_Interested Int64,
    Users_Responded Int64,
    City String,
    Organized_By String,
    Street_Address String,
    Start_Date Date32,
    End_Date Nullable(Date32),
    Start_Time Nullable(DateTime64),
    Start_Day String,
    End_Day String,
    Description_Embeddings Array(Float32),
    CONSTRAINT check_data_length CHECK length(Description_Embeddings) = 1536
    ) ENGINE = MergeTree()
    ORDER BY (Name);
    """)


The above SQL statement creates a table named Events on the cluster. The CONSTRAINT makes sure that all the embedding vectors have the same length of 1536.
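
If you want to confirm the table landed with the expected schema, a standard ClickHouse DESCRIBE works against MyScale as well (a minimal sketch; the output column names follow ClickHouse conventions):

Python
 
# Inspect the schema of the newly created table
schema = client.query("DESCRIBE TABLE default.Events")
for row in schema.named_results():
    print(row["name"], row["type"])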

Storing the Data and Creating an Index in MyScale

In this step, we insert the processed data into MyScale. This involves batch-inserting the data to ensure efficient storage and retrieval.

Python
 
batch_size = 10  # Adjust based on your needs

# Use ceiling division so the final partial batch is not dropped
num_batches = (len(dataframe) + batch_size - 1) // batch_size

for i in range(num_batches):
    start_idx = i * batch_size
    end_idx = start_idx + batch_size
    batch_data = dataframe[start_idx:end_idx]
    client.insert("default.Events", batch_data.to_records(index=False).tolist(), column_names=batch_data.columns.tolist())
    print(f"Batch {i+1}/{num_batches} inserted.")

client.command("""
ALTER TABLE default.Events
    ADD VECTOR INDEX vector_index Description_Embeddings
    TYPE MSTG
""")


Using pandas, the above code transfers our prepared dataset into MyScale in batches and then adds an MSTG vector index on the Description_Embeddings column to speed up similarity search.
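
A quick way to verify the inserts is to compare the row count in MyScale against the DataFrame (a minimal sanity check, assuming all batches were inserted without errors):

Python
 
# Compare the number of stored rows with the DataFrame we inserted
result = client.query("SELECT count(*) AS total FROM default.Events")
for row in result.named_results():
    print("Rows in MyScale:", row["total"], "| Rows in DataFrame:", len(dataframe))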

Data Analysis Using MyScale

Finally, we use MyScale’s analytical capabilities to perform analysis and enable semantic search. By executing SQL queries, we can analyze events based on topics, locations, and dates. So, let’s try to write some queries.

Simple SQL Query

Let’s first try to get the top 10 results from the table.

Python
 
results=client.query("""
        SELECT Name,Description FROM default.Events LIMIT 10
    """)
for row in results.named_results():
        print(row["Name"])
        print(row['Description'])


This query simply returns the first 10 rows from the events table (no particular ordering is applied).

Discover Events by Semantic Relevance

Let’s try to find the top 10 upcoming events with a vibe similar to a reference event such as “One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady”. This is achieved by comparing the semantic embeddings of event descriptions, ensuring a match in themes and emotions.

Python
 
embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embedding=embeddings[0].embedding
results = client.query(f"""
        SELECT Name, Description,
        distance(Description_Embeddings, {embedding}) as dist FROM default.Events ORDER BY dist LIMIT 10
    """)
for row in results.named_results():
        print("Title of the event  ", row["Name"])
        print("Description of the event  ", row['Description'])
        print("Distance : ", row["dist"])


Trending Events by Popularity

This query ranks the top 10 events by the number of attendees and interested users, highlighting popular events from large city festivals to major conferences. It is ideal for those seeking to join large, energetic gatherings.

Python
 
results = client.query(f"""
        SELECT Name, City, Users_Going, Users_Interested, Users_Responded
        FROM default.Events
        ORDER BY Users_Going DESC, Users_Interested DESC
        LIMIT 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Name"])
        print("City ", row["City"])
        print("Users Going ", row["Users_Going"])
        print("Interested Users ", row["Users_Interested"])


Popular Local Events in New York

Combining relevance and popularity, this query identifies similar events in New York City related to a specific event and ranks them by attendance, offering a curated list of events that reflect the city's vibrant culture and attract local interest.

Python
 
embeddings=get_embedding(["One of the Longest Running Shows in the Country - Operating since 1974 ...NOW our 50th YEAR !!!Our Schenectady"])
embeddi=embeddings[0].embedding
results = client.query(f"""
        SELECT Name,City, Description, Users_Going,distance(Description_Embeddings, {embeddi}) as dist
        FROM default.Events
        WHERE City LIKE '%New York%' and dist < 1.5
        ORDER BY Users_Going DESC,dist
        LIMIT 10
    """)
for row in results.named_results():
        print("Event Name  ", row["Name"])
        print("Description ", row["Description"])
        print("Users Going ", row["Users_Going"])


Leading Event Organizers

This query ranks the top 10 event organizers by the total number of attendees and interested users, highlighting those who excel in creating compelling events and attracting large audiences. It provides insights for event planners and attendees interested in top-tier events.

Python
 
# Which organizer has attracted the most users
results = client.query(f"""
        SELECT Organized_By, SUM(Users_Going + Users_Interested) AS Total_Users
        FROM default.Events
        GROUP BY Organized_By
        ORDER BY Total_Users DESC
        LIMIT 10
    """)
for row in results.named_results():
        print("Organizer  ", row["Organized_By"])
        print("Total Users ", row["Total_Users"])


Implement RAG

So far, we have explored MyScale for data analysis and highlighted how it can enhance data workflows. Next, we'll go one step further and implement Retrieval-Augmented Generation (RAG), a framework that combines an external knowledge base with LLMs. This step will help you understand your data better and find more detailed insights. Below, you'll see how to use RAG with MyScale, which makes working with the data more interesting and productive.

Python
 
from langchain.chat_models import ChatOpenAI
from langchain.prompts import ChatPromptTemplate

query="Can you please suggest me some events related to basketball"
# Use the get_embedding method defined above, it accepts a list of sentences
embeddings=get_embedding([query])
embeddings=embeddings[0].embedding
results = client.query(f"""
        SELECT Name, City, Users_Going, Description, distance(Description_Embeddings, {embeddings}) as dist
        FROM default.Events
        ORDER BY Users_Going DESC,dist
        LIMIT 5
    """)
PROMPT_TEMPLATE = """
Your objective is to formulate a response to a question using only the information provided below:
{context}
---
Your task is to carefully analyze the provided context and provide an answer to the following question based solely on the information given:
{question}
"""
# Combine the descriptions of the top results. 
descriptions = [row["Description"] for row in results.named_results()]
context_text = "\n\n---\n\n".join(descriptions)
prompt_template = ChatPromptTemplate.from_template(PROMPT_TEMPLATE)
prompt = prompt_template.format(context=context_text, question=query)
model = ChatOpenAI(openai_api_key="your_api_key_here")
response_text = model.predict(prompt)
formatted_response = f"Response: {response_text}\n"
print(formatted_response)
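
Note that the LangChain imports above reflect older releases. On newer versions, the equivalent (assuming the langchain-openai package is installed) is roughly:

Python
 
from langchain_openai import ChatOpenAI

model = ChatOpenAI(openai_api_key="your_api_key_here")
# invoke() replaces the deprecated predict() call
response_text = model.invoke(prompt).content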


Throughout this blog, we have seen that MyScale can serve as a plain SQL database or power advanced AI applications, covering most of the development domain. You can give it a try and explore the advanced features by signing up for the free tier, which includes storage for 5 million vectors.

Conclusion

We have explored the capabilities of MyScale together with the Apify scraper by developing an event analytics application. MyScale delivers high-performance vector search while retaining all the functionality of a SQL database, letting developers run semantic searches using familiar SQL syntax with improved speed and accuracy.


Published at DZone with permission of Usama Jamil. See the original article here.

Opinions expressed by DZone contributors are their own.
