How I Turned My Company’s Docs Into a Searchable Database With OpenAI

In this article, the reader will learn how to make their documents semantically searchable with vector search and OpenAI.

By Jacob Marks · May 17, 2023 · Tutorial

For the past six months, I’ve been working at Voxel51, a Series A startup and creator of the open-source computer vision toolkit FiftyOne. As a machine learning engineer and developer evangelist, my job is to listen to our open-source community and bring them what they need: new features, integrations, tutorials, workshops, you name it.

A few weeks ago, we added native support for vector search engines and text similarity queries to FiftyOne, so that users can find the most relevant images in their (often massive — containing millions or tens of millions of samples) datasets via simple natural language queries.

This put us in a curious position: it was now possible for people using open-source FiftyOne to readily search datasets with natural language queries, but using our documentation still required traditional keyword search.

We have a lot of documentation, which has its pros and cons. As a user myself, I sometimes find that given the sheer quantity of documentation, finding precisely what I’m looking for requires more time than I’d like.

I was not going to let this fly… so I built this in my spare time:

Semantically search your company’s docs from the command line. Image courtesy of the author.

So, here’s how I turned our docs into a semantically searchable vector database:

  • Converted all of the docs to a unified format
  • Split docs into blocks and added some automated cleanup
  • Computed embeddings for each block
  • Generated a vector index from these embeddings
  • Defined the index query
  • Wrapped it all in a user-friendly command line interface and Python API

You can find all the code for this post in the voxel51/fiftyone-docs-search repo, and it’s easy to install the package locally in edit mode with pip install -e . from the repo root.

Better yet, if you want to implement semantic search for your own website using this method, you can follow along! Here are the ingredients you’ll need:

  • Install the openai Python package and create an account: you will use this account to send your docs and queries to an inference endpoint, which will return an embedding vector for each piece of text.
  • Install the qdrant-client Python package and launch a Qdrant server via Docker: you will use Qdrant to create a locally hosted vector index for the docs, against which queries will be run. The Qdrant service will run inside a Docker container.

Converting the Docs to a Unified Format

My company’s docs are all hosted as HTML documents at https://docs.voxel51.com. A natural starting point would have been to download these docs with Python’s requests library and parse them with Beautiful Soup.

As a developer (and author of many of our docs), however, I thought I could do better. I already had a working clone of the GitHub repository on my local computer that contained all of the raw files used to generate the HTML docs. Some of our docs are written in Sphinx ReStructured Text (RST), whereas others, like tutorials, are converted to HTML from Jupyter Notebooks.

I figured (mistakenly) that the closer I could get to the raw text of the RST and Jupyter files, the simpler things would be.

RST

In RST documents, sections are delimited by lines consisting only of strings of =, -, or _ characters. For example, here’s a document from the FiftyOne User Guide which contains all three delimiters:

RST document from open source FiftyOne Docs (docs.voxel51.com). Image courtesy of author.

I could then remove all of the RST keywords, such as toctree, code-block, and button_link (there were many more), as well as the :, ::, and .. that accompanied a keyword, the start of a new block, or block descriptors.
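
For instance, here’s a hedged sketch of that kind of keyword cleanup (the directive list and regex here are illustrative; the repo’s actual cleanup is more thorough):

Python
 
import re

# Illustrative subset of RST directive keywords to strip (the real list is longer)
RST_KEYWORDS = ["toctree", "code-block", "button_link"]

def remove_rst_keywords(section):
    for keyword in RST_KEYWORDS:
        # Remove ".. keyword::" directive markers wherever they appear
        section = re.sub(rf"\.\. {keyword}::", "", section)
    return section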

Links were easy to take care of too:

Python
 
import re

# Strip inline RST link targets like "<target>`_" from a section
no_links_section = re.sub(r"<[^>]+>_?", "", section)


Things started to get dicey when I wanted to extract the section anchors from RST files. Many of our sections had anchors specified explicitly, whereas others were left to be inferred during the conversion to HTML.

Here is an example:

reStructuredText
 
.. _brain-embeddings-visualization:

Visualizing embeddings
______________________

The FiftyOne Brain provides a powerful
:meth:`compute_visualization() <fiftyone.brain.compute_visualization>` method
that you can use to generate low-dimensional representations of the samples
and/or individual objects in your datasets.

These representations can be visualized natively in the App's
:ref:`Embeddings panel <app-embeddings-panel>`, where you can interactively
select points of interest and view the corresponding samples/labels of interest
in the :ref:`Samples panel <app-samples-panel>`, and vice versa.

.. image:: /images/brain/brain-mnist.png
   :alt: mnist
   :align: center

There are two primary components to an embedding visualization: the method used
to generate the embeddings, and the dimensionality reduction method used to
compute a low-dimensional representation of the embeddings.

Embedding methods
-----------------

The `embeddings` and `model` parameters of
:meth:`compute_visualization() <fiftyone.brain.compute_visualization>`
support a variety of ways to generate embeddings for your data:


In the brain.rst file in our User Guide docs (a portion of which is reproduced above), the Visualizing embeddings section has an anchor #brain-embeddings-visualization specified by .. _brain-embeddings-visualization:. The Embedding methods subsection, which immediately follows, however, is given an auto-generated anchor.

Another difficulty that soon reared its head was how to deal with tables in RST. List tables were fairly straightforward. For instance, here’s a list table from our View Stages cheat sheet:

reStructuredText
 
.. list-table::

   * - :meth:`match() <fiftyone.core.collections.SampleCollection.match>`
   * - :meth:`match_frames() <fiftyone.core.collections.SampleCollection.match_frames>`
   * - :meth:`match_labels() <fiftyone.core.collections.SampleCollection.match_labels>`
   * - :meth:`match_tags() <fiftyone.core.collections.SampleCollection.match_tags>`


Grid tables, on the other hand, can get messy fast. They give docs writers great flexibility, but this same flexibility makes parsing them a pain. Take this table from our Filtering cheat sheet:

Plain Text
 
+-----------------------------------------+-----------------------------------------------------------------------+
| Operation                               | Command                                                               |
+=========================================+=======================================================================+
| Filepath starts with "/Users"           |  .. code-block::                                                      |
|                                         |                                                                       |
|                                         |     ds.match(F("filepath").starts_with("/Users"))                     |
+-----------------------------------------+-----------------------------------------------------------------------+
| Filepath ends with "10.jpg" or "10.png" |  .. code-block::                                                      |
|                                         |                                                                       |
|                                         |     ds.match(F("filepath").ends_with(("10.jpg", "10.png")))           |
+-----------------------------------------+-----------------------------------------------------------------------+
| Label contains string "be"              |  .. code-block::                                                      |
|                                         |                                                                       |
|                                         |     ds.filter_labels(                                                 |
|                                         |         "predictions",                                                |
|                                         |         F("label").contains_str("be"),                                |
|                                         |     )                                                                 |
+-----------------------------------------+-----------------------------------------------------------------------+
| Filepath contains "088" and is JPEG     |  .. code-block::                                                      |
|                                         |                                                                       |
|                                         |     ds.match(F("filepath").re_match("088*.jpg"))                      |
+-----------------------------------------+-----------------------------------------------------------------------+


Within a table, rows can take up arbitrary numbers of lines, and columns can vary in width. Code blocks within grid table cells are also difficult to parse, as they occupy space on multiple lines, so their content is interspersed with content from other columns. This means that code blocks in these tables need to be effectively reconstructed during the parsing process.

Not the end of the world. But also not ideal.

Jupyter

Jupyter notebooks turned out to be relatively simple to parse. I was able to read the contents of a Jupyter notebook into a list of strings, with one string per cell:

Python
 
import json

ifile = "my_notebook.ipynb"
with open(ifile, "r") as f:
    contents = f.read()

# Keep a (cell text, cell type) pair for each notebook cell
contents = json.loads(contents)["cells"]
contents = [(" ".join(c["source"]), c["cell_type"]) for c in contents]


Furthermore, the sections were delineated by Markdown cells starting with #.

Nevertheless, given the challenges posed by RST, I decided to turn to HTML and treat all of our docs on equal footing.

HTML

I built the HTML docs from my local install with bash generate_docs.bash, and began parsing them with Beautiful Soup. However, I soon realized that when RST code blocks and tables with inline code were being converted to HTML, although they were rendering correctly, the HTML itself was incredibly unwieldy. Take our filtering cheat sheet, for example.

When rendered in a browser, the code block preceding the Dates and times section of our filtering cheat sheet looks like this:

Screenshot from cheat sheet in open source FiftyOne Docs. Image courtesy of author.

The raw HTML, however, looks like this:

RST cheat sheet converted to HTML. Image courtesy of author.

This is not impossible to parse, but it is also far from ideal.

Markdown

Fortunately, I was able to overcome these issues by converting all of the HTML files to Markdown with markdownify. Markdown had a few key advantages that made it the best fit for this job.

  1. Cleaner than HTML: code formatting was simplified from the spaghetti strings of span elements to inline code snippets marked with a single backtick (`) before and after, and blocks of code were marked by triple backticks (```) before and after. This also made it easy to split the content into text and code.
  2. Still contained anchors: unlike raw RST, this Markdown included section heading anchors, as the implicit anchors had already been generated. This way, I could link not just to the page containing the result but to the specific section or subsection of that page.
  3. Standardization: Markdown provided mostly uniform formatting for the initial RST and Jupyter documents, allowing us to give their content consistent treatment in the vector search application.
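
The conversion itself can be a one-liner per file. Here’s a hedged sketch using markdownify (the file name is a stand-in; heading_style="ATX" produces the #-prefixed headings that the splitting step below relies on):

Python
 
from markdownify import markdownify as md

# Convert one built HTML doc page to Markdown with "#"-style (ATX) headings
with open("cheat_sheet.html", "r") as f:
    html = f.read()

markdown = md(html, heading_style="ATX")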

Note on LangChain

Some of you may know about the open-source library LangChain for building applications with LLMs and may be wondering why I didn’t just use LangChain’s Document Loaders and Text Splitters. The answer: I needed more control!

Processing the Documents

Once the documents had been converted to Markdown, I proceeded to clean the contents and split them into smaller segments.

Cleaning

Cleaning consisted of removing unnecessary elements, including:

  • Headers and footers
  • Table row and column scaffolding — e.g. the |’s in |select()| select_by()|
  • Extra newlines
  • Links
  • Images
  • Unicode characters
  • Bolding — i.e. **text** → text

I also removed the escape characters that were escaping characters with special meaning in our docs: _ and *. The former is used in many method names, and the latter is used in multiplication, regex patterns, and many other places:

Python
 
# Un-escape underscores and asterisks, which the Markdown conversion had escaped
document = document.replace("\\_", "_").replace("\\*", "*")


Splitting Documents Into Semantic Blocks

With the contents of our docs cleaned, I proceeded to split the docs into bite-sized blocks.

First, I split each document into sections. At first glance, it seems like this can be done by finding any line that starts with a # character. In my application, I did not differentiate between h1, h2, and h3 headings (#, ##, ###), so checking the first character is sufficient. However, this logic gets us into trouble: # is also used for comments in Python code.

To bypass this problem, I split the document into text blocks and code blocks:

Python
 
# Alternating split: even indices are text, odd indices are fenced code blocks
text_and_code = page_md.split('```')
text = text_and_code[::2]
code = text_and_code[1::2]


Then I treated a # at the start of a line in a text block as the start of a new section, and extracted the section title and anchor from this line:

Python
 
def extract_title_and_anchor(header):
    # Drop the leading #'s from the heading line
    header = " ".join(header.split(" ")[1:])
    # Converted headings look like 'Title[¶](#anchor ...)': take the title before
    # the bracketed link and the anchor inside the parentheses
    title = header.split("[")[0]
    anchor = header.split("(")[1].split(" ")[0]
    return title, anchor


And assigned each block of text or code to the appropriate section.
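
Putting these pieces together, a hedged sketch of the assignment step might look like the following (simplified relative to the repo; it assumes every heading line carries a Markdown anchor link):

Python
 
# Walk the text blocks, starting a new section at each "#" heading line
subsections = {}
current_anchor = "_top"  # catch-all for content before the first heading
subsections[current_anchor] = []

for block in text:
    for line in block.split("\n"):
        if line.startswith("#"):
            _, current_anchor = extract_title_and_anchor(line)
            subsections[current_anchor] = []
        else:
            subsections[current_anchor].append(line)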

Initially, I also tried splitting the text blocks into paragraphs, hypothesizing that because a section may contain information about many different topics, the embedding for that entire section may not be similar to an embedding for a text prompt concerned with only one of those topics. This approach, however, resulted in top matches for most search queries disproportionately being single-line paragraphs, which turned out to not be terribly informative as search results.

Check out the accompanying GitHub repo for the implementation of these methods that you can try out on your own docs!

Embedding Text and Code Blocks With OpenAI

With documents converted, processed, and split into strings, I generated an embedding vector for each of these blocks. Because large language models are flexible and generally capable by nature, I decided to treat both text blocks and code blocks on the same footing as pieces of text and embed them with the same model.

I used OpenAI’s text-embedding-ada-002 model because it is easy to work with, achieves the highest performance out of all of OpenAI’s embedding models (on the BEIR benchmark), and is also the cheapest. It’s so cheap, in fact ($0.0004/1K tokens), that generating all of the embeddings for the FiftyOne docs only costs a few cents! As OpenAI themselves put it, “We recommend using text-embedding-ada-002 for nearly all use cases. It’s better, cheaper, and simpler to use.”

With this embedding model, you can generate a 1536-dimensional vector representing any input prompt, up to 8,191 tokens (approximately 30,000 characters).
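
If you want to check that a block fits under that limit before sending it off, the tiktoken library counts tokens the same way OpenAI’s models do (a sketch; how you chunk oversized blocks is up to you):

Python
 
import tiktoken

# Tokenizer matching the embedding model
enc = tiktoken.encoding_for_model("text-embedding-ada-002")

def fits_in_context(block_text, max_tokens=8191):
    return len(enc.encode(block_text)) <= max_tokens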

To get started, you need to create an OpenAI account, generate an API key, and export this API key as an environment variable with the following:

Shell
 
export OPENAI_API_KEY="<MY_API_KEY>"


You will also need to install the openai Python library:

pip install openai


I wrote a wrapper around OpenAI’s API that takes in a text prompt and returns an embedding vector:

Python
 
import openai

MODEL = "text-embedding-ada-002"

def embed_text(text):
    # Returns the 1536-dimensional embedding vector for the input text
    response = openai.Embedding.create(
        input=text,
        model=MODEL,
    )
    embeddings = response["data"][0]["embedding"]
    return embeddings


To generate embeddings for all of our docs, we just apply this function to each of the subsections — text and code blocks — across all of our docs.
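
As a rough sketch of what that looks like (here, pages is an assumed dict mapping each page URL to its lists of text and code blocks):

Python
 
# Assumed structure: pages = {page_url: {"text": [...], "code": [...]}}
doc_embeddings = {
    page_url: {
        block_type: [embed_text(block) for block in blocks]
        for block_type, blocks in page_blocks.items()
    }
    for page_url, page_blocks in pages.items()
}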

Creating a Qdrant Vector Index

With embeddings in hand, I created a vector index to search against. I chose to use Qdrant for the same reasons we chose to add native Qdrant support to FiftyOne: it’s open source, free, and easy to use.

To get started with Qdrant, you can pull a pre-built Docker image and run the container:

Shell
 
docker pull qdrant/qdrant
docker run -d -p 6333:6333 qdrant/qdrant


Additionally, you will need to install the Qdrant Python client:

pip install qdrant-client


I created the Qdrant collection:

Python
 
import qdrant_client as qc
import qdrant_client.http.models as qmodels

client = qc.QdrantClient(url="localhost")
METRIC = qmodels.Distance.DOT
DIMENSION = 1536
COLLECTION_NAME = "fiftyone_docs"

def create_index():
    # Create (or overwrite) a collection of 1536-dim vectors indexed by dot product
    client.recreate_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=qmodels.VectorParams(
            size=DIMENSION,
            distance=METRIC,
        ),
    )


I then created a vector for each subsection (text or code block):

Python
 
import uuid

def create_subsection_vector(
    subsection_content,
    section_anchor,
    page_url,
    doc_type,
    block_type,
):
    vector = embed_text(subsection_content)
    # Qdrant point ID: a 32-digit slice of a UUID integer
    id = str(uuid.uuid1().int)[:32]
    payload = {
        "text": subsection_content,
        "url": page_url,
        "section_anchor": section_anchor,
        "doc_type": doc_type,
        "block_type": block_type,
    }
    return id, vector, payload


For each vector, you can provide additional context as part of the payload. In this case, I included the URL (with section anchor) where the result can be found, the type of document (so the user can restrict a search to all docs or just certain types), and the contents of the string that generated the embedding vector. I also added the block type (text or code), so if the user is looking for a code snippet, they can tailor their search to that purpose.

Then I added these vectors to the index, one page at a time:

Python
 
def add_doc_to_index(subsections, page_url, doc_type, block_type):
    ids = []
    vectors = []
    payloads = []
    
    for section_anchor, section_content in subsections.items():
        for subsection in section_content:
            id, vector, payload = create_subsection_vector(
                subsection,
                section_anchor,
                page_url,
                doc_type,
                block_type
            )
            ids.append(id)
            vectors.append(vector)
            payloads.append(payload)
    
    ## Add vectors to collection
    client.upsert(
        collection_name=COLLECTION_NAME,
        points=qmodels.Batch(
            ids=ids,
            vectors=vectors,
            payloads=payloads,
        ),
    )


Querying the Index

Once the index has been created, running a search on the indexed documents can be accomplished by embedding the query text with the same embedding model and then searching the index for similar embedding vectors. With a Qdrant vector index, a basic query can be performed with the Qdrant client’s search() command.
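
For example, a minimal unfiltered query looks something like this (using the client, collection, and embed_text() defined above):

Python
 
# Embed the prompt, then retrieve the ten nearest vectors in the collection
query_vector = embed_text("How to load a dataset?")
results = client.search(
    collection_name=COLLECTION_NAME,
    query_vector=query_vector,
    limit=10,
    with_payload=True,
)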

To make my company’s docs searchable, I wanted to allow users to filter by section of the docs, as well as by the type of block that was encoded. In the parlance of vector search, filtering results while still ensuring that a predetermined number of results (specified by the top_k argument) will be returned is referred to as pre-filtering.

To achieve this, I wrote a programmatic filter:

Python
 
def _generate_query_filter(query, doc_types, block_types):
    """Generates a filter for the query.
    Args:
        query: A string containing the query.
        doc_types: A list of document types to search.
        block_types: A list of block types to search.
    Returns:
        A filter for the query.
    """
    doc_types = _parse_doc_types(doc_types)
    block_types = _parse_block_types(block_types)

    # Require a match on at least one allowed doc type AND one allowed block type
    _filter = qmodels.Filter(
        must=[
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="doc_type",
                        match=qmodels.MatchValue(value=dt),
                    )
                    for dt in doc_types
                ],
            ),
            qmodels.Filter(
                should=[
                    qmodels.FieldCondition(
                        key="block_type",
                        match=qmodels.MatchValue(value=bt),
                    )
                    for bt in block_types
                ],
            ),
        ]
    )

    return _filter


The internal _parse_doc_types() and _parse_block_types() functions handle cases where the argument is string or list-valued or is None.
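
A minimal sketch of one of these parsers might look like the following (the tuple of valid doc types here is a placeholder, not the repo’s actual list):

Python
 
ALL_DOC_TYPES = ("user_guide", "tutorials", "recipes", "cheat_sheets")  # placeholder values

def _parse_doc_types(doc_types):
    # Normalize None / str / list inputs into a list of doc types
    if doc_types is None:
        return list(ALL_DOC_TYPES)
    if isinstance(doc_types, str):
        return [doc_types]
    return list(doc_types)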

Then I wrote a function query_index() that takes the user’s text query, applies the pre-filter, searches the index, and extracts relevant information from the payload. The function returns a list of tuples of the form (url, contents, score), where the score indicates how well the result matches the query text.

Python
 
def query_index(query, top_k=10, doc_types=None, block_types=None):
    vector = embed_text(query)
    _filter = _generate_query_filter(query, doc_types, block_types)

    results = client.search(
        collection_name=COLLECTION_NAME,
        query_vector=vector,
        query_filter=_filter,
        limit=top_k,
        with_payload=True,
    )

    # Each result becomes a (url#anchor, matched text, similarity score) tuple
    results = [
        (
            f"{res.payload['url']}#{res.payload['section_anchor']}",
            res.payload["text"],
            res.score,
        )
        for res in results
    ]

    return results
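
For example, to pull the top three code snippets relevant to loading datasets (the block type value follows the payload fields defined earlier):

Python
 
results = query_index(
    "How to load a dataset",
    top_k=3,
    block_types=["code"],
)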


Writing the Search Wrapper

The final step was providing a clean interface for the user to semantically search against these “vectorized” docs.

I wrote a function print_results(), which takes the query, results from query_index(), and a score argument (whether or not to print the similarity score), and prints the results in an easy-to-interpret way. I used the rich Python package to format hyperlinks in the terminal so that when working in a terminal that supports hyperlinks, clicking on the hyperlink will open the page in your default browser. I also used webbrowser to automatically open the link for the top result, if desired.

Display search results with rich hyperlinks. Image courtesy of author.
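
A hedged sketch of that display logic (the repo’s version is more polished; rich’s [link=...] markup is what makes the terminal hyperlinks clickable):

Python
 
from rich.console import Console

console = Console()

def print_results(query, results, score=True):
    console.print(f"Results for: [bold]{query}[/bold]")
    for url, text, similarity in results:
        # One-line snippet of the matched text, hyperlinked to url#anchor
        snippet = " ".join(text.split())[:100]
        line = f"[link={url}]{snippet}[/link]"
        if score:
            line += f" (score: {similarity:.3f})"
        console.print(line)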

For Python-based searches, I created a class, FiftyOneDocsSearch, to encapsulate the document search behavior. Once a FiftyOneDocsSearch object has been instantiated (potentially with default settings for search arguments):

Python
 
from fiftyone.docs_search import FiftyOneDocsSearch
fosearch = FiftyOneDocsSearch(open_url=False, top_k=3, score=True)


You can search within Python by calling this object. To query the docs for “How to load a dataset,” for instance, you just need to run the following:

Python
 
fosearch("How to load a dataset")


Semantically search your company’s docs within a Python process. Image courtesy of author.

I also used argparse to make this docs search functionality available via the command line. When the package is installed, the docs are CLI searchable with the following:

fiftyone-docs-search query "<my-query>" <args>
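
Under the hood, the CLI entry point can be a thin argparse wrapper around query_index() and print_results(); here’s a sketch (flag names are assumptions, not the package’s exact interface):

Python
 
import argparse

def main():
    parser = argparse.ArgumentParser(prog="fiftyone-docs-search")
    subparsers = parser.add_subparsers(dest="command", required=True)

    query_parser = subparsers.add_parser("query", help="Semantically search the docs")
    query_parser.add_argument("query_text")
    query_parser.add_argument("--top-k", type=int, default=10)

    args = parser.parse_args()
    if args.command == "query":
        results = query_index(args.query_text, top_k=args.top_k)
        print_results(args.query_text, results)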


Just for fun, because fiftyone-docs-search query is a bit cumbersome, I added an alias to my .zshrc file:

Shell
 
alias fosearch='fiftyone-docs-search query'


With this alias, the docs are searchable from the command line with:

fosearch "<my-query>" <args>


Conclusion

Coming into this, I already considered myself a power user of my company’s open-source Python library, FiftyOne. I had written many of the docs, and I had used (and continue to use) the library on a daily basis. But the process of turning our docs into a searchable database forced me to understand our docs on an even deeper level. It’s always great when you’re building something for others, and it ends up helping you as well!

Here’s what I learned:

  • Sphinx RST is cumbersome: it makes beautiful docs, but it is a bit of a pain to parse
  • Don’t go crazy with preprocessing: OpenAI’s text-embedding-ada-002 model is great at understanding the meaning behind a text string, even if it has slightly atypical formatting. Gone are the days of stemming and painstakingly removing stop words and miscellaneous characters.
  • Small semantically meaningful snippets are best: break your documents up into the smallest meaningful segments that still retain context. For longer pieces of text, it is more likely that the overlap between a search query and a part of the text in your index will be obscured by less relevant text in the segment. If you break the document up into segments that are too small, you run the risk that many entries in the index will contain very little semantic information.
  • Vector search is powerful: with minimal lift and without any fine-tuning, I was able to dramatically enhance the searchability of our docs. From initial estimates, it appears that this improved docs search is more than twice as likely to return relevant results as the old keyword search approach. Furthermore, the semantic nature of this vector search approach means that users can now search with arbitrarily phrased, arbitrarily complex queries and are guaranteed to get the specified number of results.

If you find yourself (or others) constantly digging or sifting through treasure troves of documentation for specific kernels of information, I encourage you to adapt this process for your own use case. You can modify this to work for your personal documents or your company’s archives. And if you do, I guarantee you’ll walk away from the experience seeing your documents in a new light!

Here are a few ways you could extend this for your own docs!

  • Hybrid search: combine vector search with traditional keyword search
  • Go global: Use Qdrant Cloud to store and query the collection in the cloud
  • Incorporate web data: use requests to download HTML directly from the web
  • Automate updates: use GitHub Actions to trigger the recomputation of embeddings whenever the underlying docs change
  • Embed: wrap this in a JavaScript element and drop it in as a replacement for a traditional search bar

All code used to build the package is open source and can be found in the voxel51/fiftyone-docs-search repo.


Published at DZone with permission of Jacob Marks. See the original article here.

Opinions expressed by DZone contributors are their own.
