Engineering Closed-Loop Graph-RAG Systems, Part 2: From Prompts to Rules
LLMs generate convincing recommendations, but convincing isn't the same as compliant. Here's how rule-augmented generation fixes that.
This article is part 2 of a 4-part series on 'Engineering Closed-Loop Graph-RAG Systems.'
One of the easiest mistakes to make with LLM systems is to trust a recommendation because it sounds logical.
- The model writes confidently.
- The answer looks polished.
- The writing is calm.
- The recommendation may even be helpful.
In many workflows, that is not enough.
- A recommendation can be fluent and still wrong.
- A recommendation can be relevant yet violate policy.
- A recommendation can be helpful in general and still be inappropriate for the user's role, level of experience, context, or constraints.
That is why recommendation systems built upon LLMs require more than retrieval and prompting. They require rules.
By rules, I do not mean replacing the LLM with rules. I mean adding a layer around generation that checks constraints. The LLM is good at synthesis. The rule layer is good at checking constraints. Used together, they produce something much more reliable than either one alone.
Why "Just Make the Prompt Better" Is Often the First Response After Failure
After an LLM recommendation fails, the most common initial response is to improve the prompt:
Be accurate.
Follow policy.
Only use provided evidence.
Do not make unsupported claims.
Generate a personalized recommendation.
Prompt improvements are beneficial. However, prompts are only requests. Rules are checks.
To ensure a system meets all required policies, eligibility constraints, safety requirements, or workflow rules, those requirements must be checked outside the model. Otherwise, the model is both generating output (the recommendation) and judging the validity of its own output. That is risky.
Here is an example of how that risk plays out. Suppose an AI assistant recommends training after reviewing a professional interaction between two users. The recommendation should meet several requirements:
- It should address the identified performance gap.
- It should match the person's role and experience level.
- It should cite the evidence that triggered the recommendation.
- It should avoid any unsupported claims.
- It should include a measurable next step.
- It should not recommend content that has been completed recently.
All of these requirements can be included in a prompt. However, if the output violates any of those rules, then how will you know?
Rule-Augmented Generation
Rule-augmented generation separates the creative parts from the control parts.
A simple flow would look like this:
1. Retrieve relevant context.
2. Generate a draft recommendation.
3. Check the recommendation against rules.
4. If any check fails, revise only the failed elements.
5. Save the recommendation along with the results of the rule checks.
The LLM does what it is best at: summarizing evidence, explaining reasoning, and writing in a helpful tone. The rules do what they are best at: checking whether the output satisfies explicit constraints.
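The five steps above can be sketched as a small control loop. This is only an illustration under stated assumptions: `retrieve`, `generate`, `validate`, and `revise` are hypothetical callables that you would supply (an LLM client, a retriever, a rule layer), and `max_revisions` is an assumed safety cap that the flow above does not specify.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class LoopResult:
    draft: str
    passed: bool
    failures: List[str]
    history: List[Dict] = field(default_factory=list)  # one entry per attempt


def run_rule_augmented_generation(
    query: str,
    retrieve: Callable[[str], List[str]],
    generate: Callable[[str, List[str]], str],
    validate: Callable[[str, List[str]], List[str]],
    revise: Callable[[str, List[str], List[str]], str],
    max_revisions: int = 2,  # assumed cap so the loop always terminates
) -> LoopResult:
    if max_revisions < 0:
        raise ValueError("max_revisions must be >= 0")
    context = retrieve(query)            # 1. retrieve relevant context
    draft = generate(query, context)     # 2. generate a draft recommendation
    history: List[Dict] = []
    failures: List[str] = []
    for attempt in range(max_revisions + 1):
        failures = validate(draft, context)  # 3. check the draft against rules
        history.append(
            {"attempt": attempt, "draft": draft, "failures": list(failures)}
        )
        if not failures or attempt == max_revisions:
            break
        draft = revise(draft, failures, context)  # 4. revise failed elements only
    # 5. the caller persists the result together with the rule-check history
    return LoopResult(
        draft=draft, passed=not failures, failures=failures, history=history
    )
```

Because the model and the rule layer are passed in as separate callables, the checks can be tested and replaced without touching the generation code.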
A Small Example
Assume the system finds the following gap:
{
  "gap": "missed escalation criterion",
  "evidence": [
    "customer reported repeated account lockouts",
    "agent continued troubleshooting without escalation"
  ],
  "role": "support_agent_level_1",
  "completed_resources": ["password_reset_basics"]
}
A poor recommendation might be:
You should review advanced account recovery architecture and lead a team session on escalation failures.
This reads as if it were written for a senior manager, not a level 1 support agent. It also cites none of the available evidence and gives the agent no measurable step to take.
A better recommendation would be:
Practice the escalation trigger scenario for repeated account lockouts. In the reviewed interaction, the customer reported multiple lockouts in the same week, but the case stayed in basic troubleshooting. Complete one escalation simulation and pass the follow-up assessment by identifying the escalation trigger in three sample cases.
This is not just better writing. It satisfies more constraints.
Converting Rules Into Checks
Each rule needs to be explicit, verifiable, and logged.
Below is a simple Python-style checker:
from dataclasses import dataclass
from typing import List


@dataclass
class Recommendation:
    text: str
    target_gap: str
    cited_evidence: List[str]
    resource_id: str
    role_level: str
    measurable_next_step: bool


@dataclass
class ValidationResult:
    passed: bool
    failures: List[str]


def validate_recommendation(rec: Recommendation, context: dict) -> ValidationResult:
    """Run every check and collect all failures rather than stopping at the first."""
    failures = []

    # The recommendation must address the gap that was actually detected.
    if rec.target_gap != context["gap"]:
        failures.append("Recommendation does not match the detected gap.")

    # At least one piece of evidence must be cited.
    if not rec.cited_evidence:
        failures.append("Recommendation does not cite supporting evidence.")

    # Do not repeat a resource the user has already completed.
    if rec.resource_id in context.get("completed_resources", []):
        failures.append("Recommendation repeats a recently completed resource.")

    # Fails closed: if the context lists no allowed levels, this check fails.
    allowed_levels = context.get("allowed_role_levels", [])
    if rec.role_level not in allowed_levels:
        failures.append("Recommendation is not appropriate for this role level.")

    # The recommendation must include a measurable next step.
    if not rec.measurable_next_step:
        failures.append("Recommendation lacks a measurable next step.")

    return ValidationResult(passed=len(failures) == 0, failures=failures)
This is intentionally simple. In practice, validation will likely involve policy engines, graph lookups, workflow state, or even human review. However, the idea is the same: do not rely on the LLM alone to quietly enforce all your rules.
The Revision Loop
When a recommendation fails validation, the system does not have to discard it. Instead, it can ask the model to revise only the parts that failed.
The prompt for revising the recommendation should be narrow:
The recommendation failed these checks:
- Recommendation lacks a measurable next step.
- Recommendation does not cite supporting evidence.
Revise the recommendation using only the provided evidence.
Do not introduce new claims.
Keep the recommendation appropriate for a Level 1 support agent.
This is better than regenerating from scratch. The model receives focused feedback, and the system retains a record of which checks failed.
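A narrow revision prompt like the one above can be assembled directly from the list of failed checks. The sketch below is one possible way to do it; `build_revision_prompt` and its parameters are hypothetical names, and including the current draft and the evidence list in the prompt is an assumption on top of the example.

```python
from typing import List


def build_revision_prompt(
    draft: str, failures: List[str], evidence: List[str], audience: str
) -> str:
    """Build a prompt that asks the model to fix only the failed checks."""
    if not failures:
        raise ValueError("No failed checks; nothing to revise.")
    failure_lines = "\n".join(f"- {f}" for f in failures)
    evidence_lines = "\n".join(f"- {e}" for e in evidence)
    return (
        "The recommendation failed these checks:\n"
        f"{failure_lines}\n\n"
        "Current recommendation:\n"
        f"{draft}\n\n"
        "Provided evidence:\n"
        f"{evidence_lines}\n\n"
        "Revise the recommendation using only the provided evidence.\n"
        "Do not introduce new claims.\n"
        f"Keep the recommendation appropriate for {audience}."
    )


prompt = build_revision_prompt(
    draft="Review escalation basics.",
    failures=[
        "Recommendation lacks a measurable next step.",
        "Recommendation does not cite supporting evidence.",
    ],
    evidence=["customer reported repeated account lockouts"],
    audience="a Level 1 support agent",
)
```

Generating the prompt from the failure list keeps the feedback to the model identical to what the rule layer logged.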
Why Rules Should Exist Outside of the Prompt
There are three reasons why rules should remain outside of a given prompt:
First, rules must be testable. If a requirement matters, you should be able to determine whether or not it has been satisfied.
Second, rules change. Policies, training requirements, escalation paths, and compliance constraints all change over time. Modifying a rule engine or validation layer tends to be less risky than editing a large prompt and risking unintended changes elsewhere.
Third, rules create audit trails. When someone challenges a recommendation, the system needs to show not only what was recommended but also which of the required checks were satisfied and which evidence supported each one.
Reconstructing that trail from a single response by an LLM is difficult at best.
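To make the audit-trail point concrete, here is a minimal sketch of what a logged entry could look like. The `CheckRecord` type, the `build_audit_entry` function, and the field names are all assumptions chosen for illustration; a real system would write the entry to whatever store it already uses.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import List


@dataclass
class CheckRecord:
    rule_id: str         # which rule was evaluated
    passed: bool         # outcome of the check
    evidence: List[str]  # evidence the check relied on


def build_audit_entry(
    recommendation_id: str, text: str, checks: List[CheckRecord]
) -> str:
    """Serialize one recommendation and all of its check results as JSON."""
    entry = {
        "recommendation_id": recommendation_id,
        "recommendation": text,
        "checked_at": datetime.now(timezone.utc).isoformat(),
        # An empty check list is treated as not passed (assumption).
        "passed": bool(checks) and all(c.passed for c in checks),
        "checks": [asdict(c) for c in checks],
    }
    return json.dumps(entry, sort_keys=True)
```

With entries like this, answering "which checks passed, and on what evidence?" becomes a lookup rather than a reconstruction.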
A Practical Rule Schema
A lightweight rule object might look like this:
{
  "rule_id": "REC-004",
  "name": "Recommendation must cite observed evidence",
  "severity": "high",
  "applies_to": ["training_recommendation"],
  "check_type": "evidence_required",
  "failure_action": "revise"
}
A rule that requires human review might look like this:
{
  "rule_id": "REC-011",
  "name": "High-risk recommendations require expert approval",
  "severity": "critical",
  "applies_to": ["healthcare_training", "compliance_training"],
  "check_type": "human_review_required",
  "failure_action": "route_to_reviewer"
}
Not all failures will result in regeneration. Some failure types will block output. Some failure types will route to a human for review. Some failure types will update your graph or feedback store.
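One way to act on different failure types is to pick the most restrictive `failure_action` among the rules that failed. The sketch below assumes three action names, of which only `revise` and `route_to_reviewer` appear in the rule objects above; `block`, the `accept` result, and the priority order are assumptions for illustration.

```python
from typing import Dict, List

# Most restrictive first. The ordering itself is an assumption.
ACTION_PRIORITY = ["block", "route_to_reviewer", "revise"]


def choose_action(rules: List[Dict], failed_rule_ids: List[str]) -> str:
    """Return the most restrictive action required by the failed rules."""
    by_id = {rule["rule_id"]: rule for rule in rules}
    actions = set()
    for rule_id in failed_rule_ids:
        if rule_id not in by_id:
            raise KeyError(f"Unknown rule id: {rule_id}")
        actions.add(by_id[rule_id]["failure_action"])
    for action in ACTION_PRIORITY:
        if action in actions:
            return action
    if actions:
        raise ValueError(f"Unrecognized failure actions: {sorted(actions)}")
    return "accept"  # nothing failed


rules = [
    {"rule_id": "REC-004", "failure_action": "revise"},
    {"rule_id": "REC-011", "failure_action": "route_to_reviewer"},
]
```

Here, `choose_action(rules, ["REC-004", "REC-011"])` resolves to `route_to_reviewer`, because human review outranks an automatic revision in the assumed ordering.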
How a Knowledge Graph Helps
Rules become more powerful when they can query structured knowledge. For instance:
Is this resource connected to the detected gap?
Is this concept a prerequisite for the target skill?
Has this user already completed the same resource?
Is the assessment connected to the recommended training module?
Does the policy node allow this action for this role?
Those are graph questions. A vector database alone is poorly suited to answering them.
The rule layer can use graph relationships to verify whether a recommendation has a valid path:
Interaction → PerformanceGap → DomainConcept → TrainingResource → AssessmentItem
If no such path exists, then the recommendation may lack a basis.
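The path check can be sketched without a graph database by walking typed nodes one layer at a time. Everything here is illustrative: the dictionaries stand in for a real graph store, and `has_valid_path`, the node ids, and the requirement that the path pass through the recommended resource are assumptions.

```python
from typing import Dict, List, Sequence

EXPECTED_PATH = (
    "Interaction",
    "PerformanceGap",
    "DomainConcept",
    "TrainingResource",
    "AssessmentItem",
)


def has_valid_path(
    node_types: Dict[str, str],
    edges: Dict[str, List[str]],
    start: str,
    resource_id: str,
    expected: Sequence[str] = EXPECTED_PATH,
) -> bool:
    """Check that a typed path exists from start through resource_id."""
    if node_types.get(start) != expected[0]:
        return False
    frontier = {start}
    for wanted in expected[1:]:
        # Follow outgoing edges, keeping only neighbors of the wanted type.
        frontier = {
            neighbor
            for node in frontier
            for neighbor in edges.get(node, [])
            if node_types.get(neighbor) == wanted
        }
        if wanted == "TrainingResource":
            frontier &= {resource_id}  # the path must use this resource
        if not frontier:
            return False
    return True


node_types = {
    "int-1": "Interaction", "gap-1": "PerformanceGap",
    "concept-1": "DomainConcept", "res-1": "TrainingResource",
    "res-2": "TrainingResource", "quiz-1": "AssessmentItem",
}
edges = {
    "int-1": ["gap-1"], "gap-1": ["concept-1"],
    "concept-1": ["res-1"], "res-1": ["quiz-1"], "res-2": ["quiz-1"],
}
```

In this toy graph, `res-1` is reachable from the interaction and leads to an assessment, while `res-2` has an assessment but no link back to the detected gap, so a recommendation for `res-2` would fail the check.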
What to Measure
In addition to answer quality, you should also consider measuring:
- Rule compliance rate
- Accuracy of cited supporting evidence
- Number of revisions needed per recommendation
- Rate of human override
- Acceptance rate by user
- Rate of repeat recommendations
- Rate of policy violations
- Time until recommendation is verified
A system that provides excellent suggestions but fails the rule checks is not ready for production.
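Several of these metrics fall out directly from logged rule-check results. The sketch below assumes each logged record is a dictionary with `passed`, `revisions`, `human_override`, and `accepted` fields; those field names and the `summarize_metrics` function are hypothetical and would need to match whatever your logging actually stores.

```python
from typing import Dict, List


def summarize_metrics(records: List[Dict]) -> Dict[str, float]:
    """Aggregate simple rates from per-recommendation log records."""
    total = len(records)
    if total == 0:
        return {}
    return {
        # Share of recommendations that passed every rule check.
        "rule_compliance_rate": sum(r["passed"] for r in records) / total,
        # Average number of revision rounds per recommendation.
        "avg_revisions": sum(r["revisions"] for r in records) / total,
        # Share of recommendations a human reviewer overrode.
        "human_override_rate": sum(r["human_override"] for r in records) / total,
        # Share of recommendations the user accepted.
        "user_acceptance_rate": sum(r["accepted"] for r in records) / total,
    }
```

Tracking these alongside answer quality makes it visible when a prompt or retrieval change improves fluency but lowers compliance.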
Common Mistakes
Writing vague rules. "Be helpful" is not a rule; "The recommendation must contain a specific, measurable next step" is a rule.
Writing all rules as prompt instructions. Some rules belong in code, some in policy engines, some in graph constraints, and some in human review workflows.
Not logging failed check results. Failed validations are important data. They indicate where the prompt, retrieval, graph, or rule design needs improvement.
Using rules only after generation. Some rules should shape retrieval before generation happens. For example, if a user has already completed a resource, retrieval should filter that resource out.
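The pre-generation case can be as small as a filter applied to retrieval results before they reach the model. This is a minimal sketch; `filter_candidates` and the `resource_id` field on each candidate are assumed names.

```python
from typing import Dict, List, Set


def filter_candidates(
    candidates: List[Dict], completed_resources: Set[str]
) -> List[Dict]:
    """Drop retrieved resources the user has already completed."""
    return [
        candidate
        for candidate in candidates
        if candidate["resource_id"] not in completed_resources
    ]


candidates = [
    {"resource_id": "password_reset_basics"},
    {"resource_id": "escalation_simulation"},
]
remaining = filter_candidates(candidates, {"password_reset_basics"})
```

Filtering here means the model never sees the completed resource, so the post-generation check for repeats becomes a safety net rather than the only defense.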
Final Thoughts
LLMs are effective at generating recommendations; however, they should not be the sole control point for them.
Prompts can guide behavior. Rules can verify behavior.
That distinction matters.
When a system suggests an action, especially in professional, compliance, training, or support settings, the question is no longer just "Does this look like something I would do?" The question becomes, "Can the system provide evidence that this suggested action was appropriate and allowed?"
Adding rules to the generation process allows you to answer that question.