Topic Tagging Using Large Language Models
Explore various techniques for organizing large amounts of content into topics using Large Language Models.
Topic Tagging
Topic tagging is an important and widely applicable problem in Natural Language Processing, which involves tagging a piece of content (such as a webpage, book, blog post, or video) with its topic. Despite the availability of ML models such as topic models, most notably Latent Dirichlet Allocation (LDA) [1], topic tagging has historically been a labor-intensive task, especially when there are many fine-grained topics. Applications of topic tagging include:
- Content organization, helping users of websites, libraries, and other large content collections navigate that content
- Recommender systems, where suggestions for products to buy, articles to read, or videos to watch are generated wholly or in part from their topics or topic tags
- Data analysis and social media management, to understand the popularity of topics and decide which subjects to prioritize
Large Language Models (LLMs) have greatly simplified topic tagging by leveraging their multimodal and long-context capabilities to process large documents effectively. However, LLMs are computationally expensive and require the user to understand the trade-offs between the quality of the LLM and the computational or dollar cost of using them.
LLMs for Topic Tagging
There are various ways of casting the topic tagging problem for use with an LLM.
- Zero-shot/few-shot prompting
- Prompting with options
- Dual encoder
We illustrate the above techniques using the example of tagging Wikipedia articles.
1. Zero-Shot/Few-Shot Prompting
Prompting is the simplest method for using an LLM, but the quality of the results depends on the size of the LLM.
Zero-shot prompting [2] involves directly instructing the LLM to perform the task. For instance:
<wikipedia webpage text>
What are the 3 topics the above text is talking about?
Zero-shot output is completely unconstrained: the LLM is free to answer in any format, which makes the results hard to parse and compare. To alleviate this, we need to add constraints, either in the prompt itself or through examples.
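For concreteness, here is a minimal zero-shot sketch using the OpenAI Python client. The model name and the trailing format instruction (one simple way to add the constraints just mentioned) are illustrative assumptions, not prescriptions:

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

def zero_shot_topics(text: str, n_topics: int = 3) -> str:
    # Ask for the topics directly; the final sentence constrains the
    # otherwise free-form output to a parseable format.
    prompt = (
        f"{text}\n\n"
        f"What are the {n_topics} topics the above text is talking about? "
        "Answer with a comma-separated list of topic names only."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```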
Few-shot prompting provides the LLM with examples to guide its output. In particular, we can give the LLM a few examples of content along with their topics, and then ask it for the topics of new content.
<wikipedia page of physics>
Topics: Physics, Science, Modern Physics
<wikipedia page of baseball>
Topics: Baseball, Sport
<wikipedia page you want to tag with topics>
Topics:
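Assembling such a prompt programmatically might look like the following sketch; the helper name and the page variables are placeholders, and the resulting string can be sent through the same chat call as the zero-shot example:

```python
def build_few_shot_prompt(examples: list[tuple[str, list[str]]], new_text: str) -> str:
    # Each example is a (content, topics) pair shown to the LLM as a demonstration.
    parts = []
    for content, topics in examples:
        parts.append(content)
        parts.append("Topics: " + ", ".join(topics))
    parts.append(new_text)
    parts.append("Topics:")  # the LLM completes this line
    return "\n\n".join(parts)

# Usage (page texts are placeholders):
# prompt = build_few_shot_prompt(
#     [(physics_page, ["Physics", "Science", "Modern Physics"]),
#      (baseball_page, ["Baseball", "Sport"])],
#     page_to_tag,
# )
```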
Advantages
- Simplicity: The technique is straightforward and easy to understand.
- Ease of comparison: It is simple to compare the results of multiple LLMs.
Disadvantages
- Less control: There is limited control over the LLM's output, which can lead to issues like duplicate topics (e.g., "Science" and "Sciences").
- Possible high cost: Few-shot prompting can be expensive, especially with large content like entire Wikipedia pages. More examples increase the LLM's input length, thus raising costs.
2. Prompting With Options
This technique is beneficial when you have a small, predefined set of topics, or a method of narrowing a larger set down to a manageable size, and want the LLM to select from these options.
Since this is still prompting, both zero-shot and few-shot prompting work. In practice, because selecting from a small set of topics is much simpler than coming up with topics from scratch, zero-shot prompting is often preferred for its simplicity and lower computational cost.
An example prompt is:
<wikipedia page of physics>
Possible topics: Physics, Biology, Science, Computing, Baseball …
Which of the above possible topics is relevant to the above text? Select up to 3 topics.
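A sketch of this pattern follows; the comma-separated answer convention, and therefore the parsing logic, is an assumption for illustration:

```python
def options_prompt(text: str, options: list[str], max_topics: int = 3) -> str:
    return (
        f"{text}\n\n"
        f"Possible topics: {', '.join(options)}\n"
        f"Which of the above possible topics is relevant to the above text? "
        f"Select up to {max_topics} topics. Answer with a comma-separated list."
    )

def parse_topics(response_text: str) -> list[str]:
    # Split the comma-separated answer back into individual topic strings.
    return [t.strip() for t in response_text.split(",") if t.strip()]
```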
Advantages of Prompting With Options
- Higher control: The LLM selects from provided options, ensuring more consistent outputs.
- Lower computational cost: Simpler task allows the use of a smaller LLM, reducing costs.
- Alignment with existing structures: Useful when adhering to pre-existing content organization, such as library systems or structured webpages.
Disadvantages of Prompting With Options
- Need to narrow down topics: Requires a mechanism to accurately reduce the topic options to a small set.
- Validation requirement: Additional validation is needed to ensure the LLM does not output topics outside the provided set, particularly if using smaller models (a simple set-membership check, sketched below, often suffices).
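A minimal sketch of that validation, assuming the model's answer has already been parsed into a list of topic strings:

```python
def validate_topics(predicted: list[str], allowed: list[str]) -> tuple[list[str], list[str]]:
    # Case-insensitive lookup mapping back to the canonical topic names.
    canonical = {t.lower(): t for t in allowed}
    valid, rejected = [], []
    for topic in predicted:
        match = canonical.get(topic.lower())
        if match:
            valid.append(match)
        else:
            rejected.append(topic)
    return valid, rejected

# validate_topics(["Physics", "Astronomy"], ["Physics", "Biology"])
# -> (["Physics"], ["Astronomy"]); rejected topics can be dropped or retried
```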
3. Dual Encoder
A dual encoder uses an embedding model, such as the encoder of an encoder-decoder LLM, to convert text into embeddings, and performs topic tagging through similarity measurements. This is in contrast to prompting, which works with both encoder-decoder and decoder-only LLMs.
Process
- Convert topics to embeddings: Generate embeddings for each topic, possibly including detailed descriptions. This step can be done offline.
- Convert content to embeddings: Use an LLM to convert the content into embeddings.
- Similarity measurement: Use cosine similarity to find the closest matching topics.
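A minimal sketch of these three steps, assuming the sentence-transformers library and the all-MiniLM-L6-v2 embedding model (both illustrative choices); in production, step 1 would be precomputed and cached:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Step 1: embed the topics (done once, offline, and cached).
topics = ["Physics", "Biology", "Science", "Computing", "Baseball"]
topic_embeddings = model.encode(topics, normalize_embeddings=True)

def top_k_topics(content: str, k: int = 3) -> list[tuple[str, float]]:
    # Step 2: embed the content.
    content_embedding = model.encode([content], normalize_embeddings=True)[0]
    # Step 3: with normalized vectors, cosine similarity is a dot product.
    similarities = topic_embeddings @ content_embedding
    best = np.argsort(similarities)[::-1][:k]
    return [(topics[i], float(similarities[i])) for i in best]
```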
Advantages of Dual Encoder
- Cost-effective: When already using embeddings, this method avoids reprocessing documents through the LLM.
- Pipeline integration: This can be combined with prompting techniques for a more robust tagging system.
Disadvantage of Dual Encoder
- Model constraint: Requires a model that can produce embeddings, such as an encoder-decoder LLM, which can be a limiting factor since many newer LLMs are decoder-only.
Hybrid Approach
A hybrid approach can leverage the strengths of both prompting with options and the dual encoder method (a sketch follows the steps below):
- Narrow down topics using the dual encoder: Convert the content and topics to embeddings and narrow the topics based on similarity.
- Final topic selection using prompting with options: Use a smaller LLM to refine the topic selection from the narrowed set.
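Putting the two stages together might look like the following sketch, which reuses the top_k_topics helper from the dual encoder sketch above; the final call to a small LLM is elided:

```python
def hybrid_tag_prompt(content: str, shortlist_size: int = 10, max_topics: int = 3) -> str:
    # Stage 1: the dual encoder narrows a large topic set to a shortlist.
    shortlist = [topic for topic, _ in top_k_topics(content, k=shortlist_size)]
    # Stage 2: a small LLM selects the final topics from the shortlist.
    return (
        f"{content}\n\n"
        f"Possible topics: {', '.join(shortlist)}\n"
        f"Which of the above possible topics is relevant to the above text? "
        f"Select up to {max_topics} topics."
    )
```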
Conclusion
Topic tagging with LLMs offers significant advantages over traditional methods, providing greater efficiency and accuracy. By understanding and leveraging different techniques — zero-shot/few-shot prompting, prompting with options, and dual encoder — one can tailor the approach to specific needs and constraints. Each method has unique strengths and trade-offs, and combining them appropriately can yield the most effective results for organizing and analyzing large volumes of content using topics.
References
[1] Blei, D. M., Ng, A. Y., and Jordan, M. I. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.