
Apache SeaTunnel, Milvus, and OpenAI Improve Accuracy and Efficiency of Book Title Similarity Search

Using Apache SeaTunnel, Milvus, and OpenAI, we can achieve more accurate book title similarity searches through large language models.

By Debra Chen · Oct. 10, 23 · Tutorial

Existing book search solutions, such as those used in public libraries, rely heavily on keyword matching rather than a semantic understanding of what a book title means. As a result, search results may miss what we need or differ vastly from what we expect: keyword matching alone cannot capture meaning and, therefore, cannot capture the searcher’s true intent.
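
To make the limitation concrete, here is a minimal sketch of plain keyword matching. The keyword_search helper is hypothetical and written only for illustration; the titles are taken from the search results shown later in this article.

```python
# Hypothetical helper for illustration only; not part of SeaTunnel, Milvus, or OpenAI.
TITLES = [
    "The Dance of Intimacy: A Woman's Guide to Courageous Acts of Change in Key Relationships",
    "Nicomachean Ethics",
    "The Lay of the Land",
    "Cloud Atlas",
]


def keyword_search(query, titles):
    """Return titles that contain every word of the query (case-insensitive)."""
    words = query.lower().split()
    return [t for t in titles if all(w in t.lower() for w in words)]


print(keyword_search("self-improvement", TITLES))  # → []
print(keyword_search("ethics", TITLES))            # → ['Nicomachean Ethics']
```

A title is returned only when it literally contains the query words, so a search for "self-improvement" finds nothing here even though some of these titles are related to the topic.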

So, is there a more accurate and efficient way to search for books? The answer is yes! In this article, I will show how to combine Apache SeaTunnel, Milvus, and OpenAI to run a similarity search that captures the meaning of the entire book title and makes search results more accurate.

Semantic search uses trained models to represent input data as vectors and compares those vectors instead of raw keywords. The approach extends to various text-based use cases, including anomaly detection and document search, so the technique introduced in this article can meaningfully improve book search.

First, I will briefly introduce the concepts, tools, and platforms this article relies on.

What Is Apache SeaTunnel?

Apache SeaTunnel is an open-source, high-performance, distributed data integration platform. It is a top-level project of the Apache Software Foundation, capable of handling massive data volumes, synchronizing data in batch and in real time, and supporting multiple data sources and formats. The goal of SeaTunnel is to provide a scalable, enterprise-level data integration platform to meet various large-scale data processing needs.

What Is Milvus?

Milvus is an open-source vector similarity search engine that supports the storage, retrieval, and similarity search of massive numbers of vectors. It is a high-performance, low-cost solution for large-scale vector data. Milvus can be used in various scenarios, such as recommendation systems, image search, and music recommendation.
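
As a rough illustration of what a vector similarity search computes, the sketch below does a brute-force nearest-neighbor lookup by Euclidean (L2) distance. The top_k_l2 function and the toy 2-D vectors are hypothetical; Milvus relies on indexes rather than an exhaustive scan, and its exact scoring may differ.

```python
import math

# Toy 2-D vectors standing in for real embeddings (hypothetical data).
VECTORS = {
    "a": [0.0, 0.0],
    "b": [3.0, 4.0],
    "c": [1.0, 0.0],
}


def top_k_l2(query, vectors, k=2):
    """Return the k (key, distance) pairs closest to query by Euclidean distance."""
    scored = sorted((math.dist(query, vec), key) for key, vec in vectors.items())
    return [(key, dist) for dist, key in scored[:k]]


print(top_k_l2([0.0, 0.0], VECTORS))  # → [('a', 0.0), ('c', 1.0)]
```

The smaller the distance, the more similar the stored vector is to the query vector.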

What Is OpenAI?

OpenAI is the AI research company that develops the Generative Pre-trained Transformer (GPT) family of models. Its best-known product, ChatGPT, is a conversational AI system based on GPT that uses natural language processing and deep learning to generate text similar to human conversation. ChatGPT has a wide range of applications, including intelligent customer service, chatbots, intelligent assistants, and language model research and development. This tutorial does not use ChatGPT itself; it uses OpenAI’s Embedding API with the text-embedding-ada-002 model to turn book titles into vectors.

What Is LLM?

A Large Language Model (LLM) is a natural language processing model based on deep learning that can analyze a given text and generate related text. Large language models typically use deep neural networks to learn the grammar and semantics of natural language and convert text into vector representations in a continuous vector space. During training, they learn language patterns and statistical regularities from a large amount of text, which enables them to generate high-quality content such as articles, news, and conversations. Large language models have a wide range of applications, including machine translation, text generation, and question-answering systems. Open-source deep learning frameworks such as TensorFlow and PyTorch can be used to build and run large language models.

Tutorial

Here we go! I will show you how to combine Apache SeaTunnel and OpenAI’s Embedding API with the Milvus vector database to perform a semantic search over entire book titles.

Preparation

Before the experiment, we need to obtain an OpenAI API key from the official OpenAI website and deploy a Milvus test environment. We also need to prepare the data used in this example. You can download the data from here.
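
Rather than pasting the key into source files, one option is to read it from an environment variable. The sketch below is only a suggestion: the load_api_key helper is hypothetical, and the variable name OPENAI_API_KEY is an assumption that you can change to suit your setup.

```python
import os


def load_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key stored under `name`, or raise if it is missing or blank."""
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable before running the example")
    return key


# Possible usage in the search script later in this article:
#   openai.api_key = load_api_key()
```

This keeps the key out of version control; the SeaTunnel job configuration still needs the key supplied in its own openai_api_key option.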

Importing Data Into Milvus Through SeaTunnel

First, place book.csv under /tmp/milvus_test/book. Then save the job configuration below as milvus.conf and place it under the config directory. For details, please refer to the Quick Start Guide.

env {
  # You can set engine configuration here
  execution.parallelism = 1
  job.mode = "BATCH"
  checkpoint.interval = 5000
  #execution.checkpoint.data-uri = "hdfs://localhost:9000/checkpoint"
}
source {
  # Read book.csv from the local file system
  LocalFile {
    schema {
      fields {
        bookID = string
        title_1 = string
        title_2 = string
      }
    }
    path = "/tmp/milvus_test/book"
    file_format_type = "csv"
  }
}

transform {
}

sink {
  Milvus {
    milvus_host = localhost
    milvus_port = 19530
    username = root
    password = Milvus
    collection_name = title_db
    openai_engine = text-embedding-ada-002
    openai_api_key = sk-xxxx
    embeddings_fields = title_2
  }
}


Execute the following command:

./bin/seatunnel.sh --config ./config/milvus.conf -e local


If you inspect the database, you can see that the data has been written to it.

[Image: the title_db collection in Milvus]

Then, use the following code to perform a semantic search on book titles:

import openai
from pymilvus import connections, Collection

COLLECTION_NAME = 'title_db'  # Collection name
DIMENSION = 1536  # Embedding size of text-embedding-ada-002 (for reference)
MILVUS_HOST = 'localhost'  # Milvus server host
MILVUS_PORT = '19530'
OPENAI_ENGINE = 'text-embedding-ada-002'  # Which engine to use
openai.api_key = 'sk-******'  # Use your own OpenAI API key here

# Connect to Milvus and load the collection that SeaTunnel populated
connections.connect(host=MILVUS_HOST, port=MILVUS_PORT)
collection = Collection(name=COLLECTION_NAME)
collection.load()


def embed(text):
    # Uses the pre-1.0 interface of the openai Python package
    return openai.Embedding.create(
        input=text,
        engine=OPENAI_ENGINE)["data"][0]["embedding"]


def search(text):
    # Search parameters for the index
    search_params = {
        "metric_type": "L2"
    }

    results = collection.search(
        data=[embed(text)],  # Embedded search value
        anns_field="title_2",  # Search across embeddings
        param=search_params,
        limit=5,  # Limit to five results per search
        output_fields=['title_1']  # Include title field in result
    )

    ret = []
    for hit in results[0]:
        # Get the id, distance, and title for each result
        ret.append([hit.id, hit.score, hit.entity.get('title_1')])
    return ret


search_terms = ['self-improvement', 'landscape']

for x in search_terms:
    print('Search term:', x)
    for result in search(x):
        print(result)
    print()


Here is the result:

Search term: self-improvement
[96, 0.4079835116863251, "The Dance of Intimacy: A Woman's Guide to Courageous Acts of Change in Key Relationships"]
[56, 0.41880303621292114, 'Nicomachean Ethics']
[76, 0.4309804439544678, 'Possession']
[19, 0.43588975071907043, 'Vanity Fair']
[7, 0.4423919916152954, 'Knowledge Is Power (The Amazing Days of Abby Hayes: #15)']
Search term: landscape
[9, 0.3023473024368286, 'The Lay of the Land']
[1, 0.3906732499599457, 'The Angry Hills']
[78, 0.392495334148407, 'Cloud Atlas']
[95, 0.39346450567245483, 'Alien']
[94, 0.399422287940979, 'The Known World']


With traditional keyword search, a book title must contain keywords such as “self-improvement” or “improvement” to be returned. By using large language models for semantic understanding, we can instead retrieve book titles that are more relevant to our needs. In the example above, when we searched for “self-improvement,” the returned titles, such as “The Dance of Intimacy: A Woman’s Guide to Courageous Acts of Change in Key Relationships” and “Nicomachean Ethics,” contain none of those keywords yet are related to the topic.
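
The second value in each result row is the L2 distance between the query embedding and the title embedding, so smaller values mean closer matches. As a minimal sketch, the snippet below applies a distance cutoff to the “landscape” rows printed above. The filter_hits helper and the 0.35 threshold are hypothetical and chosen only for illustration; a sensible cutoff depends on your data.

```python
# Rows copied from the "landscape" results above: [id, distance, title]
LANDSCAPE_HITS = [
    [9, 0.3023473024368286, 'The Lay of the Land'],
    [1, 0.3906732499599457, 'The Angry Hills'],
    [78, 0.392495334148407, 'Cloud Atlas'],
    [95, 0.39346450567245483, 'Alien'],
    [94, 0.399422287940979, 'The Known World'],
]

MAX_DISTANCE = 0.35  # hypothetical cutoff, chosen only for this illustration


def filter_hits(hits, max_distance):
    """Keep only rows whose L2 distance is at or below max_distance."""
    return [row for row in hits if row[1] <= max_distance]


for row in filter_hits(LANDSCAPE_HITS, MAX_DISTANCE):
    print(row)  # only 'The Lay of the Land' passes the 0.35 cutoff
```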

Therefore, by using Apache SeaTunnel, Milvus, and OpenAI, we can achieve more accurate book title similarity searches through large language models. The same approach can serve as a useful reference for other semantic search tasks. I hope this provides some inspiration.


Published at DZone with permission of Debra Chen. See the original article here.
