Topic Tagging Using Large Language Models

Explore various techniques for organizing large amounts of content into topics using Large Language Models.

By Vikram Rao Sudarshan · Jul. 19, 24 · Tutorial

Topic Tagging

Topic tagging is an important and widely applicable problem in Natural Language Processing: labeling a piece of content, such as a webpage, book, blog post, or video, with its topic. Despite the availability of ML models like topic models and Latent Dirichlet Allocation (LDA) [1], topic tagging has historically been a labor-intensive task, especially when there are many fine-grained topics. There are numerous applications of topic tagging, including:

  • Content organization, to help users of websites, libraries, and other large content collections navigate the material they contain
  • Recommender systems, where suggestions for products to buy, articles to read, or videos to watch are generated wholly or in part using their topics or topic tags
  • Data analysis and social media management, to understand which topics are popular and which subjects to prioritize

Large Language Models (LLMs) have greatly simplified topic tagging by leveraging their multimodal and long-context capabilities to process large documents effectively. However, LLMs are computationally expensive and require the user to understand the trade-offs between the quality of the LLM and the computational or dollar cost of using them.

LLMs for Topic Tagging

There are various ways of casting the topic tagging problem for use with an LLM.

  1. Zero-shot/few-shot prompting
  2. Prompting with options
  3. Dual encoder

We illustrate the above techniques using the example of tagging Wikipedia articles.

1. Zero-Shot/Few-Shot Prompting

Prompting is the simplest method for using an LLM, but the quality of the results depends on the size of the LLM.

Zero-shot prompting [2] involves directly instructing the LLM to perform the task. For instance:

Plain Text
 
<wikipedia webpage text>
What are the 3 topics the above text is talking about?


Zero-shot prompting is completely unconstrained, and the LLM is free to output text in any format. To alleviate this issue, we need to add constraints on the LLM's output.
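
In practice, the zero-shot variant reduces to concatenating the content with the instruction and sending it to a model. Below is a minimal sketch, assuming a generic call_llm(prompt) function as a stand-in for whatever completion client is available (a hosted API or a local model); it is not tied to any specific provider.

Python
 
def zero_shot_topics(page_text: str, call_llm, num_topics: int = 3) -> str:
    # `call_llm` is a placeholder for any LLM client (an assumption, not from the article).
    prompt = (
        f"{page_text}\n\n"
        f"What are the {num_topics} topics the above text is talking about?"
    )
    # The completion is unconstrained free text; the caller still has to parse it.
    return call_llm(prompt)
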
Few-shot prompting provides the LLM examples to guide its output. In particular, we can give the LLM a few examples of content along with their topics, and ask the LLM for the topics of new content.

Plain Text
 
<wikipedia page of physics>
Topics: Physics, Science, Modern Physics

<wikipedia page of baseball>
Topics: Baseball, Sport

<wikipedia page you want to tag with topics>
Topics:


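The few-shot version simply prepends labeled examples so that the model imitates the "Topics: ..." format. The sketch below is illustrative only: the examples argument and the call_llm placeholder are assumptions, not something prescribed by the article.

Python
 
def few_shot_topics(page_text: str, examples: list[tuple[str, list[str]]], call_llm) -> list[str]:
    # `examples` pairs raw text with its known topics; `call_llm` is the same placeholder client.
    parts = []
    for example_text, topics in examples:
        parts.append(f"{example_text}\nTopics: {', '.join(topics)}\n")
    parts.append(f"{page_text}\nTopics:")
    completion = call_llm("\n".join(parts))
    # Expect a comma-separated topic list on the first line of the completion.
    first_line = completion.strip().splitlines()[0] if completion.strip() else ""
    return [t.strip() for t in first_line.split(",") if t.strip()]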

Advantages

  • Simplicity: The technique is straightforward and easy to understand.
  • Ease of comparison: It is simple to compare the results of multiple LLMs.

Disadvantages

  • Less control: There is limited control over the LLM's output, which can lead to issues like duplicate topics (e.g., "Science" and "Sciences").
  • Possible high cost: Few-shot prompting can be expensive, especially with large content like entire Wikipedia pages. More examples increase the LLM's input length, thus raising costs.

2. Prompting With Options

This technique is useful when you have a small, predefined set of topics, or a way to narrow a larger set down to a manageable size, and want the LLM to select from those options.

Since this is still prompting, both zero-shot and few-shot prompting could work. In practice, because selecting from a small set of topics is much simpler than coming up with the topics, zero-shot prompting is often preferred for its simplicity and lower computational cost.

An example prompt is:

Plain Text
 
<wikipedia page of physics>

Possible topics: Physics, Biology, Science, Computing, Baseball …

Which of the above possible topics is relevant to the above text? Select up to 3 topics.


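As a rough sketch, prompting with options can be combined with the validation step noted under the disadvantages below, so that anything the model returns outside the allowed set is dropped. The call_llm placeholder is again an assumption rather than a specific API.

Python
 
def tag_with_options(page_text: str, allowed_topics: list[str], call_llm, max_topics: int = 3) -> list[str]:
    # `call_llm` is a placeholder LLM client (an assumption, not from the article).
    prompt = (
        f"{page_text}\n\n"
        f"Possible topics: {', '.join(allowed_topics)}\n\n"
        f"Which of the above possible topics is relevant to the above text? "
        f"Select up to {max_topics} topics."
    )
    completion = call_llm(prompt)
    allowed = {t.lower(): t for t in allowed_topics}
    # Validation: keep only topics that actually appear in the allowed set (case-insensitive).
    picked = [allowed[t.strip().lower()] for t in completion.replace("\n", ",").split(",")
              if t.strip().lower() in allowed]
    return picked[:max_topics]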

Advantages of Prompting With Options

  • Higher control: The LLM selects from provided options, ensuring more consistent outputs.
  • Lower computational cost: Simpler task allows the use of a smaller LLM, reducing costs.
  • Alignment with existing structures: Useful when adhering to pre-existing content organization, such as library systems or structured webpages.

Disadvantages of Prompting With Options

  • Need to narrow down topics: Requires a mechanism to accurately reduce the topic options to a small set.
  • Validation requirement: Additional validation is needed to ensure the LLM does not output topics outside the provided set, particularly if using smaller models.

3. Dual Encoder

A dual encoder leverages encoder-decoder LLMs to convert text into embeddings, facilitating topic tagging through similarity measurements. This is in contrast to prompting, which works with both encoder-decoder and decoder-only LLMs.

Process

  1. Convert topics to embeddings: Generate embeddings for each topic, possibly including detailed descriptions. This step can be done offline.
  2. Convert content to embeddings: Use an LLM to convert the content into embeddings.
  3. Similarity measurement: Use cosine similarity to find the closest matching topics.
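
The following is a minimal sketch of these three steps, assuming a placeholder embed(text) function that returns a vector from whatever embedding model is available; only NumPy is used for the similarity computation.

Python
 
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity between two vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dual_encoder_topics(page_text: str, topic_descriptions: dict[str, str], embed, top_k: int = 3) -> list[str]:
    # `embed` is a placeholder for any text-embedding model (an assumption, not from the article).
    # Step 1 (can be precomputed offline): embed each topic description.
    topic_vecs = {name: np.asarray(embed(desc)) for name, desc in topic_descriptions.items()}
    # Step 2: embed the content itself.
    page_vec = np.asarray(embed(page_text))
    # Step 3: rank topics by cosine similarity and keep the closest matches.
    ranked = sorted(topic_vecs, key=lambda name: cosine(page_vec, topic_vecs[name]), reverse=True)
    return ranked[:top_k]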

Advantages of Dual Encoder

  • Cost-effective: When already using embeddings, this method avoids reprocessing documents through the LLM.
  • Pipeline integration: This can be combined with prompting techniques for a more robust tagging system.

Disadvantage of Dual Encoder

  • Model constraint: Requires an encoder-decoder LLM, which can be a limiting factor since many newer LLMs are decoder-only.

Hybrid Approach

A hybrid approach can leverage the strengths of both prompting with options and the dual encoder method:

  1. Narrow down topics using the dual encoder: Convert the content and topics to embeddings and narrow the topics based on similarity.
  2. Final topic selection using prompting with options: Use a smaller LLM to refine the topic selection from the narrowed set.
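
Putting the two steps together, a hybrid pipeline can be sketched by chaining the placeholder functions from the earlier sketches (dual_encoder_topics and tag_with_options); the embed and call_llm stand-ins remain assumptions.

Python
 
def hybrid_topics(page_text: str, topic_descriptions: dict[str, str], embed, call_llm,
                  shortlist_size: int = 10, max_topics: int = 3) -> list[str]:
    # Step 1: narrow the topic set cheaply with the dual encoder.
    shortlist = dual_encoder_topics(page_text, topic_descriptions, embed, top_k=shortlist_size)
    # Step 2: let a (smaller) LLM make the final selection from the shortlist.
    return tag_with_options(page_text, shortlist, call_llm, max_topics=max_topics)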


Conclusion

Topic tagging with LLMs offers significant advantages over traditional methods, providing greater efficiency and accuracy. By understanding and leveraging different techniques — zero-shot/few-shot prompting, prompting with options, and dual encoder — one can tailor the approach to specific needs and constraints. Each method has unique strengths and trade-offs, and combining them appropriately can yield the most effective results for organizing and analyzing large volumes of content using topics.

References

[1] Latent Dirichlet Allocation (Blei, Ng, and Jordan, 2003)

[2] Fine-tuned Language Models Are Zero-Shot Learners (Wei et al., 2021)
