Challenges of Using LLMs in Production: Constraints, Hallucinations, and Guardrails
By integrating post-processing validation, RAG, and customizable guardrails, developers can bridge the gap between prototyping and production in LLM applications.
Large language models (LLMs) have risen in popularity since the release of ChatGPT. These pre-trained foundation models enable rapid prototyping, and companies want to put this technology to work. However, their probabilistic nature and lack of built-in constraints often lead to challenges once applications move beyond the prototyping stage.
Current LLMs have issues such as non-adherence to instructions, hallucinations, and occasionally producing output you don't want. This article explores these challenges through a running example of classifying news articles into categories based on their content, and offers actionable strategies to mitigate them.
Challenge 1: Constraint Adherence in Outputs
Problem: Uncontrolled Category Generation
When classifying news articles, an unconstrained LLM may invent a different category for nearly every article, making the classification ineffective. It may label one article as "Sports" and a similar sports-related article as "Entertainment," and over time this produces a long, inconsistent list of categories.
Initial Solution: Predefined Labels and "Others" Buckets
A common solution is to restrict outputs to a predefined list such as "Sports" or "Entertainment," with an "Others" category for articles that do not fit any predefined category. This restriction is typically applied through prompt engineering, the process of designing and refining inputs to guide LLMs toward the desired output.
In this example, the prompt can be updated to instruct the model to choose a value from the predefined list of categories (a minimal prompt sketch follows the list below). While this may work in small tests, at scale the model intermittently ignores those instructions: it may categorize articles as "Political Science" despite being told explicitly to choose from the predefined categories. This undermines consistency, especially in systems relying on fixed taxonomies. In addition, the "Others" bucket often balloons due to:
- Ambiguity. Articles may overlap multiple categories.
- Model uncertainty. When the model has low confidence in any predefined category, it tends to fall back to "Others" rather than commit to a choice.
- Edge cases. Some novel topics may not be covered by existing categories.
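As a concrete starting point, here is a minimal sketch of a prompt constrained to a predefined category list. The category names and the `call_llm` function are placeholders for whatever taxonomy and client library you actually use, not a fixed recommendation.

```python
# Sketch: constrain classification output to a fixed label set via the prompt.
# CATEGORIES and `call_llm` are illustrative placeholders.

CATEGORIES = ["Sports", "Entertainment", "Politics", "Business", "Technology", "Others"]

PROMPT_TEMPLATE = """Classify the news article below into exactly one category.
Choose only from this list: {categories}.
If none of them fit, answer "Others". Respond with the category name only.

Article:
{article}
"""

def classify(article: str, call_llm) -> str:
    """call_llm is any function that takes a prompt string and returns the model's text."""
    prompt = PROMPT_TEMPLATE.format(categories=", ".join(CATEGORIES), article=article)
    return call_llm(prompt).strip()
```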
Improved Approach: Validation Layers in Post-Processing Steps
Instead of relying solely on prompts, implement a two-tiered validation system:
Use a combination of deterministic and probabilistic post-processing. First, check the generated output against a lookup table of allowed values. If the response does not honor the constraint, resend the same request to the LLM; if the second attempt also violates the constraint, discard the result. With good prompt engineering plus this two-tiered post-processing, the number of results that violate the constraints drops significantly.
This reduces over-reliance on prompt engineering to enforce constraints and ensures higher accuracy.
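A minimal sketch of that two-tiered check, assuming a classifier callable such as the `classify` sketch above, a lookup set of allowed labels, and a single retry before discarding:

```python
# Deterministic lookup validation with one retry, as described above.
# The allowed set and retry count are illustrative assumptions.

ALLOWED = {"sports", "entertainment", "politics", "business", "technology", "others"}

def classify_with_validation(article: str, classify_fn, max_attempts: int = 2):
    for _ in range(max_attempts):
        label = classify_fn(article)
        if label.lower() in ALLOWED:   # deterministic lookup-table check
            return label
    return None                        # discard: constraint violated on every attempt
```

In practice, the `None` case should be logged and counted, since a rising discard rate is an early signal that the prompt or the model has drifted.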
Challenge 2: Grounding Outputs in Truth
Problem: Hallucinations and Fabrications
LLMs lack intrinsic knowledge of ground truth, so they may fabricate answers instead of acknowledging that they don't know. For instance, when classifying scientific articles, models might mislabel speculative content as peer-reviewed based on linguistic patterns alone.
Solution: Augment With Retrieval-Augmented Generation (RAG)
Retrieval-augmented generation (RAG) is the process of combining a user’s prompt with relevant external information to form a new, expanded prompt for an LLM. Giving the LLM all the information it needs to answer a question enables it to provide answers about topics it was not trained on and reduces the likelihood of hallucinations.
An effective RAG solution must be able to find information relevant to the user’s prompt and supply it to the LLM. Vector search is the most commonly used approach for finding relevant data to be provided in the prompt to the model.
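A minimal vector-search sketch follows, assuming an `embed` function from any embedding model and using cosine similarity over NumPy arrays; a production system would typically use a vector database instead.

```python
import numpy as np

def top_k_similar(query: str, docs: list[str], embed, k: int = 3) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity.
    `embed` is any function mapping text to a fixed-length vector."""
    doc_vecs = np.array([embed(d) for d in docs])
    query_vec = np.array(embed(query))
    sims = doc_vecs @ query_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9
    )
    return [docs[i] for i in np.argsort(sims)[::-1][:k]]
```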
Integrate RAG to anchor outputs in verified data:
- Step 1: Retrieve relevant context (e.g., a database of known peer-reviewed journal names or author credentials).
- Step 2: Prompt the LLM to cross-reference classifications with retrieved data.
This forces the model to align outputs with trusted sources, reducing hallucinations.
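Here is a sketch of those two steps wired together, reusing the placeholder `top_k_similar` retriever and `call_llm` client from the earlier sketches; the prompt wording is an assumption for illustration.

```python
# Step 1: retrieve trusted reference data; Step 2: ask the model to
# cross-reference its classification against that data.

RAG_PROMPT = """Classify the article using ONLY the reference data below.
If the reference data does not support a confident classification, answer "Unknown".

Reference data:
{context}

Article:
{article}
"""

def classify_with_rag(article, reference_docs, call_llm, embed):
    context = "\n".join(top_k_similar(article, reference_docs, embed, k=3))  # Step 1: retrieve
    return call_llm(RAG_PROMPT.format(context=context, article=article)).strip()  # Step 2: grounded prompt
```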
Challenge 3: Filtering Undesirable Content
Problem: Toxic or Sensitive Outputs
Even "safe" LLMs can generate harmful content or leak sensitive data from inputs (e.g., personal identifiers in healthcare articles). LLMs have in-built controls to prevent this, and these controls vary from model to model. Having guardrails outside of the model will help address gaps that models may have, and these guardrails can be used with any LLM.
Solution: Layered Guardrails
- Input sanitization. Anonymize or scrub sensitive data (e.g., credit card numbers) in inputs before providing it to the model.
- Output sanitization. Sanitize the output from the model to remove toxic phrases or sensitive information.
- Audit trails. Log all inputs/outputs for compliance reviews.
Most hyperscalers provide services that can be used for data sanitization. For example, Amazon Bedrock Guardrails can implement safeguards for your generative AI applications based on your use cases and responsible AI policies.
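As a simple illustration of the input-sanitization layer, here is a regex-based scrubber; the patterns are examples only, and a managed service such as Amazon Bedrock Guardrails covers far more cases.

```python
import re

# Illustrative patterns only; a real deployment needs a much broader PII catalog.
PATTERNS = {
    r"\b(?:\d[ -]?){13,16}\b": "[CARD]",        # credit-card-like digit runs
    r"\b\d{3}-\d{2}-\d{4}\b": "[SSN]",          # US Social Security number format
    r"\b[\w.+-]+@[\w-]+\.[\w.]+\b": "[EMAIL]",  # email addresses
}

def sanitize(text: str) -> str:
    """Replace matched sensitive values before the text is sent to the model."""
    for pattern, replacement in PATTERNS.items():
        text = re.sub(pattern, replacement, text)
    return text
```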
Score the Results
Let the model be its own critic. For this example, require the model to provide reasoning for each label it assigns. Feed that reasoning into the same model, or a different one, and ask for a score on a predefined scale. Monitor this metric to track the consistency and accuracy of the model. If the score falls below a predefined threshold, discard the assigned label and re-run the analysis to generate a new one. The score can also be used to run A/B tests across multiple prompts.
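A minimal sketch of that scoring loop, assuming a hypothetical `classify_with_reasoning` helper that returns a label plus its reasoning, an illustrative 1-5 scale, and an example threshold of 4:

```python
# LLM-as-critic scoring: grade the label and its reasoning, re-run once if the
# score is below the threshold. The scale, threshold, and helpers are assumptions.

SCORING_PROMPT = """On a scale of 1 (poor) to 5 (excellent), rate how well the
reasoning below supports the assigned category. Respond with a single digit.

Category: {label}
Reasoning: {reasoning}
Article: {article}
"""

def score_label(article, label, reasoning, call_llm) -> int:
    raw = call_llm(SCORING_PROMPT.format(label=label, reasoning=reasoning, article=article))
    try:
        return int(raw.strip()[0])
    except (ValueError, IndexError):
        return 0  # treat unparseable responses as failing scores

def label_with_rescore(article, classify_with_reasoning, call_llm, threshold: int = 4):
    label, reasoning = classify_with_reasoning(article)          # hypothetical helper
    if score_label(article, label, reasoning, call_llm) < threshold:
        label, reasoning = classify_with_reasoning(article)      # one re-run on a low score
    return label
```

The same score, logged over time, is also what feeds the A/B comparisons between prompts mentioned above.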
Best Practices for Production-Grade LLM Systems
- Multi-layered validation. Combine prompt engineering, post-processing, and result scoring to validate generated outputs.
- Domain-specific grounding. Use RAG for factual accuracy to reduce the hallucination frequency of the model.
- Guardrails and continuous monitoring. Track metrics such as the "Others" rate (in this example) and the result-quality score, and use guardrail services to keep the system production-ready.
Conclusion
Developers can move LLMs from prototype to production by implementing post-processing validation, RAG, and monitoring to manage constraints, hallucinations, and safety.