Similarity Search for Embedding: A Game Changer in Data Analysis

Oracle has added generative AI functionality to its Cloud data analysis service, to ingest, store, and retrieve documents based on their meaning.

By Frederic Jacquet · Oct. 02, 23 · Opinion

Since OpenAI's meteoric rise to the forefront of innovation, a number of technology heavyweights, including AWS, Google, IBM, Microsoft, Databricks, Meta, and Oracle, to name but a few, have integrated their own approach to generative AI into their research and development programs.

It was in this context that Oracle announced, at its annual CloudWorld conference, that it is adding generative AI capabilities to its Cloud data analysis service.

“Generative AI. Is it the most important technology ever? Probably” — Larry Ellison, Oracle CTO and co-founder.

Oracle has added generative AI functionality to its Cloud data analysis service. The aim is to ingest documents in a wide variety of formats, store them, and retrieve them based on their meaning. To achieve this, Oracle represents documents in the form of embeddings.

"Vector similarity search uses machine learning to translate the similarity of text, images, or audio into a vector space, making search faster, more accurate, and more scalable". — Martin Heller — Ph.D., Physics — Brown University

Embedding

In the context of text analysis, "similarity search for embeddings" is used to find text documents or passages whose meaning is most similar to that of a given query or input text. 

Embedding means representing words, or longer pieces of text, as numerical vectors. In NLP and LLMs, these representations allow systems to use (some might say "comprehend") textual content more effectively.

A vector database doesn’t keep track of words, but instead, it works with the numerical vectors that encode the very meaning of the text. In the same way, user queries are also transformed into numerical vectors. This is how the database can be searched to find relevant articles or passages, whether or not they contain the same terms.
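To make the idea of "closeness in meaning" concrete, here is a minimal sketch of cosine similarity, one common way to compare two vectors. The three-dimensional vectors below are hand-written and purely hypothetical; a real embedding model would produce them automatically, usually with hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written "embeddings" used only for illustration.
reset_password = [0.9, 0.1, 0.0]    # "How do I reset my password?"
recover_account = [0.8, 0.2, 0.1]   # "Steps to recover account access"
revenue_report = [0.0, 0.1, 0.9]    # "Quarterly revenue grew by ten percent"

similar = cosine_similarity(reset_password, recover_account)
unrelated = cosine_similarity(reset_password, revenue_report)
print(f"related texts:   {similar:.2f}")
print(f"unrelated texts: {unrelated:.2f}")
```

The first two texts share almost no words, yet their vectors point in nearly the same direction, so their score is close to 1. The third text scores close to 0.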

Text Vectorization and Similarity Search 

Converting text into numerical vectors and running similarity searches over them plays a pivotal role in natural language processing. Here’s an overview of the fundamental concepts and techniques behind vector representation and the retrieval of relevant documents.

  1. Vector representation: Text documents must be converted into numerical vectors using techniques such as word embedding, or more advanced methods such as transformer-based embedding. Each word or document is represented as a vector in a high-dimensional space. In a way, word embedding is a form of word representation that tends to bridge the gap between human understanding of language and that of a machine.
  2. Query vector: The input query text is also transformed into a vector using the same embedding technique. This query vector represents the meaning or content of the query. Vector databases are engineered for high-speed similarity searches within massive datasets. They rely on specialized indexing and querying techniques that significantly reduce the search space, which speeds up retrieval, and they are designed to manage high-dimensional vector data effectively.
  3. Similarity search: The system then searches the other text documents, themselves represented as vectors, for those most similar to the query vector. In the context of Large Language Models (LLMs) and generative AI, the role of vector similarity search is to identify similar items or data points within large and complex datasets, which is particularly important when dealing with high-dimensional spaces. Where conventional keyword search can struggle, transforming text and data into numerical vectors and applying specialized algorithms streamlines the process of finding related information.
  4. Retrieval of relevant documents: Documents or passages whose vectors are closest to the query vector are considered the most relevant. They are retrieved as search results. This approach enables text analysis systems to find documents or passages which do not contain exactly the same words as the query, but which have a similar semantic meaning. It's a powerful tool for information retrieval and natural language understanding.
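The four steps above can be sketched in a few lines of Python. This is an illustration under loud assumptions, not Oracle's implementation: the `LEXICON` concept table is a hypothetical stand-in for a trained embedding model, and the search is a brute-force scan rather than the indexed search a vector database would perform.

```python
import math
from collections import Counter

# Hypothetical concept lexicon standing in for a trained embedding model:
# words with related meanings map to the same vector dimension.
CONCEPTS = ["vehicle", "repair", "food", "finance"]
LEXICON = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "fix": "repair", "maintenance": "repair", "mechanic": "repair",
    "recipe": "food", "pasta": "food", "cook": "food",
    "invoice": "finance", "budget": "finance", "payment": "finance",
}

def embed(text):
    """Steps 1 and 2: turn a document or a query into a vector."""
    counts = Counter(LEXICON[w] for w in text.lower().split() if w in LEXICON)
    return [float(counts[c]) for c in CONCEPTS]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query, documents, top_k=2):
    """Steps 3 and 4: score every document against the query, keep the best."""
    query_vec = embed(query)
    scored = [(cosine_similarity(query_vec, embed(doc)), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

DOCUMENTS = [
    "automobile maintenance schedule from the mechanic",
    "pasta recipe for beginners",
    "budget and invoice template",
]

results = search("fix my car", DOCUMENTS, top_k=1)
for score, doc in results:
    print(f"{score:.2f}  {doc}")
```

The query "fix my car" shares no word with the top result, yet that document ranks first because both map to the same concepts. A real system would replace `embed` with a learned model and the linear scan with an approximate nearest-neighbor index.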

Why Is This Important Beyond the Performance Aspect?

It's certainly worth remembering that the use of generative AI technologies must be accompanied by ongoing monitoring and a commitment to responsible use and ethical reflection. These technologies must be used with care to avoid potential problems and errors.

Data Quality

The quality of training data can significantly affect the effectiveness of embedding and similarity search: noisy or biased data can lead to inaccurate results. It is essential to be able to guarantee the quality of information before sharing it, particularly in areas such as health, finance, or security.

Privacy

Avoid disclosing sensitive personal or corporate information when using LLMs, as this can compromise the privacy of individuals or corporations. This happened at Samsung, where employees shared confidential information on three occasions. First, one person copied source code into ChatGPT for a problem-solving request. Then, someone shared code optimization details. Lastly, another person gave ChatGPT a meeting report to turn into a presentation.

Scalability

Scaling these techniques to extremely large datasets, and providing the computational resources this requires, can be a real limitation, whether you consider the cost or the carbon footprint.
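A rough back-of-the-envelope calculation gives a feel for the orders of magnitude involved. The workload below (100 million passages, 768-dimensional vectors stored as 32-bit floats) is a hypothetical assumption chosen for illustration, and the estimate ignores index overhead and metadata.

```python
def raw_vector_size_gib(num_vectors, dimensions, bytes_per_value=4):
    """Raw storage for dense vectors, ignoring index overhead and metadata."""
    return num_vectors * dimensions * bytes_per_value / 2**30

def brute_force_multiply_adds(num_vectors, dimensions):
    """Multiply-add operations needed to compare one query with every vector."""
    return num_vectors * dimensions

# Hypothetical workload, not a figure from any vendor.
NUM_VECTORS = 100_000_000
DIMENSIONS = 768

size = raw_vector_size_gib(NUM_VECTORS, DIMENSIONS)
ops = brute_force_multiply_adds(NUM_VECTORS, DIMENSIONS)
print(f"{size:.1f} GiB of raw vectors")
print(f"{ops:.2e} multiply-adds per exhaustive query")
```

Under these assumptions the raw vectors alone take roughly 286 GiB, and an exhaustive comparison costs tens of billions of operations per query, which is why reducing the search space matters so much at scale.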

Semantic Understanding

While embedding captures semantic meaning to some extent, it may not always fully capture the context or nuances of human language.

Privacy and Ethics

The use of embedding and similarity search in AI also raises ethical considerations, such as privacy concerns and potential biases in search results.

"It is possible to differentiate between chicken eggs and cow eggs by observing their size and color; cow eggs are generally larger than chicken eggs". - ChatGPT

Limiting the Dissemination of Incorrect Information (AKA Hallucinations)

Generative AI systems can produce incorrect or misleading information. It's essential to check the veracity of information before sharing it. The term "hallucinations" in fact covers the whole range of LLM inaccuracies: providing fanciful references or quotes, holding forth confidently on wacky subjects such as "cow eggs," inventing facts or historical figures outright, mixing concepts or information inappropriately, and so on.

I cannot recommend blindly accepting generated information that has not been verified, especially when it is used in important contexts such as health, finance, security, or decision-making in general.

Although Yann LeCun argues that the problem cannot be solved without a complete redesign of the underlying models, a blend of techniques and methods can reduce the impact of these issues and make them acceptable for many use cases. But that will be the subject of a separate article.

Conclusion

Embedding is a technique in text analysis that transforms words into numerical vectors, enabling efficient similarity searches for documents with similar meaning to a given query. This method plays a vital role in LLMs and generative AI, allowing them to find related data points in high-dimensional datasets, enhancing information retrieval and natural language understanding.

Oracle has implemented this innovative approach to improve document search in its Cloud data analytics service. 

Now, finding relevant data is easier than telling a chicken egg from a cow egg ;-)
