ITBench, Part 3: IT Compliance Automation with GenAI CISO Assessment Agent

The CISO Assessment Agent converts natural language compliance requirements into executable policies, enabling organizations to scale security operations.

Yuji Watanabe

Takumi Yanagawa

Hirokuni Kitahara

Anca Sailer

Dec. 12, 25 · Tutorial


Developed as part of IBM's ITBench framework, which we introduced in ITBench, Part 1: Next-Gen Benchmarking for IT Automation Evaluation, the Chief Information Security Officer (CISO) Compliance Assessment Agent (CAA) automates cybersecurity compliance processes in modern IT environments. This AI-powered agent addresses the challenge of scaling security compliance operations across complex, rapidly evolving environments and technologies.

Traditional compliance approaches that rely on dedicated security teams to manually identify weaknesses and assess compliance posture are no longer viable for modern organizations operating at scale.

The CISO Assessment Agent represents a paradigm shift from manual to automated processes, delivering scalable solutions that keep pace with the compliance demands of modern software development and environments. By leveraging large language models (LLMs) and specialized agentic frameworks, this agent bridges the gap between natural language compliance requirements and executable policy code, addressing a critical need in enterprise security operations.

The Challenge of Modern Compliance

The complexity of modern application and infrastructure environments has rendered traditional security approaches obsolete. Organizations now operate in multi-layered environments where security and regulatory controls span multiple teams with varying levels of cybersecurity and AI expertise.

This distributed responsibility model creates several challenges:

Scale and Frequency: Modern organizations require daily or on-demand compliance scans across complex environments, making manual processes unfeasible. The sheer volume of systems, configurations, and continuous deployment cycles overwhelms human capacity for thorough compliance assessment.
Knowledge Gap: Translating natural language compliance recommendations into executable policy scripts requires technical knowledge that compliance teams, typically focused on legal and regulatory matters, rarely have. This creates a bottleneck where domain expertise doesn't align with technical implementation capabilities.
Synchronization Requirements: Automating compliance validation demands a high degree of trust and coordination across different business units and expert domains. The disconnect between compliance authoring (done in natural language by CISOs and regulators) and policy validation (requiring specialized programming languages and tools) creates operational friction.
Tool Diversity: The compliance ecosystem encompasses various policy engines, e.g., Kyverno and OPA Gatekeeper for Kubernetes, Ansible for Platform-as-a-Service environments, and Cloud Security Posture Management (CSPM) solutions for cloud infrastructure. Each tool requires specific expertise and generates different types of validation scripts.

The CISO Agent Technology, Architecture, and Tooling

The CISO Assessment Agent addresses these challenges through a comprehensive AI-driven approach that automates the entire compliance assessment workflow. Built using open-source frameworks CrewAI and LangGraph, the agent demonstrates sophisticated capabilities in understanding natural language requirements and translating them into executable policies.

Innovative AI Technologies

Code Generation from Natural Language: The agent's primary innovation lies in its ability to interpret compliance requirements written in plain English, then automatically generate corresponding policy code. This capability eliminates the technical barrier that prevents compliance teams from directly implementing their requirements.

Prompt Declaration Language (PDL): PDL is an open-source language developed by IBM to declaratively define and modularize LLM prompts. It allows the agent to express prompting patterns and aims to improve productivity through high-level, intuitive prompt interpretation and low-level action controls.

Compliance Automation Architecture

As illustrated in Figure 1 below, the CISO Assessment Agent is a hierarchical AI system in which a central orchestrator, the Planner, coordinates multiple specialized agents, each focused on specific security domains such as Kubernetes policy management, OPA compliance checking, or GitOps automation. The agent is triggered when a user inputs a high-level security goal, such as "assess the posture of our new RHEL9 security policy" or "update the compliance policies for our Kubernetes infrastructure." This goal flows into the CISO Assessment Agent Planner, which serves as the intelligent orchestration layer for the entire system.

The orchestration layer consists of two primary components working in tandem: Planner and Router. The Planner receives the user's goal and breaks it down into a structured execution plan, determining which specialized agents need to be involved and in what sequence.

This plan is then handed off to the Router, which is built using LangGraph—a robust framework for creating stateful agent workflows. The Router acts as the system's dynamic traffic controller, invoking the appropriate specialized agents based on the plan, receiving their responses, and coordinating the overall execution flow. This separation between planning and execution allows the system to handle complex, multi-step assessment scenarios that might require iterative refinement based on intermediate results.
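The split between planning and execution described above can be sketched in plain Python. This is a minimal illustration, not the agent's implementation: the keyword-based `plan_goal` stands in for the LLM-based Planner, the `Router` class is a simple loop rather than a LangGraph graph, and the agent identifiers (`k8s_kyverno`, `kyverno_gitops`, `ansible_opa`) are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# An agent takes an instruction plus the shared state and returns a result.
AgentFn = Callable[[str, Dict[str, str]], str]


@dataclass
class Step:
    agent: str        # hypothetical agent identifier
    instruction: str  # what the agent is asked to do


def plan_goal(goal: str) -> List[Step]:
    """Toy planner: keyword rules stand in for the LLM-based Planner."""
    text = goal.lower()
    if "rhel9" in text:
        return [Step("ansible_opa", "collect host evidence and check it with OPA")]
    if "kubernetes" in text:
        return [
            Step("k8s_kyverno", "generate a Kyverno policy"),
            Step("kyverno_gitops", "open a pull request with the policy"),
        ]
    return []


@dataclass
class Router:
    """Invokes agents in plan order and keeps their results in shared state."""
    agents: Dict[str, AgentFn]
    state: Dict[str, str] = field(default_factory=dict)

    def run(self, plan: List[Step]) -> Dict[str, str]:
        for step in plan:
            agent = self.agents[step.agent]
            # Later agents can read what earlier agents produced.
            self.state[step.agent] = agent(step.instruction, self.state)
        return self.state


# Stub agents; real agents would call an LLM and tools.
def k8s_kyverno(instruction: str, state: Dict[str, str]) -> str:
    return "policy.yaml"


def kyverno_gitops(instruction: str, state: Dict[str, str]) -> str:
    return f"PR for {state['k8s_kyverno']}"


router = Router({"k8s_kyverno": k8s_kyverno, "kyverno_gitops": kyverno_gitops})
result = router.run(
    plan_goal("update the compliance policies for our Kubernetes infrastructure")
)
```

Keeping the plan as plain data means the router can be replaced, for example by a stateful graph framework, without changing how plans are produced.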

Below the orchestration layer sits a collection of specialized agents, each implemented using PDL to ensure reliable, composable LLM prompts. This layer includes domain-expert agents such as the Kubernetes Kyverno Agent, Kubernetes OPA Agent, Ansible OPA Agent, and Kyverno GitOps Agent, among others.

Each agent brings deep expertise in its domain, enabling the system to generate sophisticated actions, analyses, or artifacts (e.g., code) that would typically require human specialists. When these agents need to perform concrete operations — whether that's querying a Kubernetes cluster, generating policy code, or executing compliance checks — they invoke the bottom layer of the architecture: a suite of Python-based tools.

The tools layer provides the actual execution capabilities that transform agent reasoning into concrete actions. Tools like Kubectl execute Kubernetes commands, GenKyverno and GenOPA generate policy scripts, while RunOPA and RunAnsible execute the scripts required by the evaluation tasks. This three-tier architecture—orchestration at the top, specialized agents in the middle, and executable tools at the bottom—creates an extensible system. This modular design means new agents can be added for emerging security domains, and new tools can be integrated as the security landscape evolves.
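As a rough sketch of how such a tools layer might be organized, the example below registers Python callables under tool names and looks them up at call time. The registry, the `tool` decorator, and the `dry_run` flag are assumptions made for illustration; only the tool name Kubectl comes from the text, and the command built here is the ordinary `kubectl get <resource> -n <namespace> -o json` form.

```python
import subprocess
from typing import Callable, Dict, List

# Hypothetical registry mapping tool names to Python callables.
TOOLS: Dict[str, Callable[..., str]] = {}


def tool(name: str) -> Callable[[Callable[..., str]], Callable[..., str]]:
    """Decorator that registers a function as a named tool."""
    def register(fn: Callable[..., str]) -> Callable[..., str]:
        TOOLS[name] = fn
        return fn
    return register


def kubectl_argv(resource: str, namespace: str) -> List[str]:
    """Build the argument list for a read-only kubectl query."""
    return ["kubectl", "get", resource, "-n", namespace, "-o", "json"]


@tool("Kubectl")
def run_kubectl(resource: str, namespace: str = "default", dry_run: bool = True) -> str:
    argv = kubectl_argv(resource, namespace)
    if dry_run:
        # Return the command instead of executing it (useful for review).
        return " ".join(argv)
    # Requires kubectl on PATH and access to a cluster.
    completed = subprocess.run(argv, capture_output=True, text=True, check=True)
    return completed.stdout


command = TOOLS["Kubectl"]("pods", namespace="prod")
```

An agent that only knows tool names can then call `TOOLS[name](...)`, which is one way new tools could be added without touching agent code.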

The end-to-end workflow follows the user's policy requirements through planning and routing to specialized agents, which leverage tools to perform operations, with results flowing back from the policy engines through the layers to deliver policy assessments and eventual normalization. This architecture embodies the promise of agentic AI in cybersecurity compliance: intelligent systems that can reason about complex policy challenges and autonomously validate them.

Figure 1. CISO Assessment Agent Hierarchical Architecture: Planner, Router, Agent Pool, Tool Pool

Core Tooling Capabilities

The current CISO Assessment Agent implementation leverages the following suite of tools:

Data Collection Automation: The agent automatically identifies appropriate collection mechanisms based on target system characteristics. For Kubernetes environments, it generates kubectl commands; for host configurations, it creates Ansible playbooks. This adaptive approach ensures compatibility across diverse infrastructure types.
GitOps Workflow Integration: Modern DevSecOps practices rely heavily on GitOps workflows for code management and deployment. The CISO Agent seamlessly integrates with these existing processes, managing policy code through git repositories and automating pull request creation and management.
Multi-Engine Policy Deployment: The agent supports deployment across various policy engines, including Kyverno for Kubernetes-specific configurations, Open Policy Agent (OPA) with Rego language for general scenarios, and Ansible for host-level compliance checks.

The current CISO Assessment Agent implementation and tool selection enable the following types of tasks to be autonomously accomplished on behalf of the CISO persona:

Identify Evidence Collector (IEC): Determines or selects the appropriate evidence collection mechanism based on the systems' characteristics or inventory.
Identify Policy Assessment Tool (IPA): Detects or selects suitable policy engines for the deployment, execution, and result evaluation of required policy checks.
Collect Evidence (CE): Executes the evidence collection through generated scripts.
Scan Assessment Posture (SAP): Executes the compliance posture assessments based on generated scripts and using the collected evidence.
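The four task types can be read as stages of one assessment. The sketch below chains them for a single target system, assuming they run in the order listed. The lookup tables are a deliberate simplification of the selection logic (the text also describes OPA being used for Kubernetes), and the target labels `kubernetes` and `host` are hypothetical.

```python
from enum import Enum
from typing import Dict, List


class Task(Enum):
    IEC = "Identify Evidence Collector"
    IPA = "Identify Policy Assessment Tool"
    CE = "Collect Evidence"
    SAP = "Scan Assessment Posture"


# Simplified, hypothetical selection tables based on the tooling described above.
COLLECTORS: Dict[str, str] = {"kubernetes": "kubectl", "host": "ansible"}
ENGINES: Dict[str, str] = {"kubernetes": "kyverno", "host": "opa"}


def assess(target: str) -> List[str]:
    """Return a trace of the four tasks, in order, for one target system."""
    collector = COLLECTORS[target]       # IEC: pick the evidence collector
    engine = ENGINES.get(target, "opa")  # IPA: pick the policy engine
    return [
        f"{Task.IEC.name}: {collector}",
        f"{Task.IPA.name}: {engine}",
        f"{Task.CE.name}: run generated {collector} script",
        f"{Task.SAP.name}: evaluate evidence with {engine}",
    ]


trace = assess("host")
```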

Real-World Benchmarking Framework

The effectiveness of the CISO Agent is validated through ITBench, a comprehensive benchmarking framework that simulates real-world compliance scenarios. This framework, illustrated in Figure 2, addresses a critical gap in AI agent evaluation by providing systematic methods to assess agent effectiveness before production deployment.

Benchmarking Methodology

ITBench employs Center for Internet Security (CIS) Benchmarks as the foundation for creating realistic compliance scenarios. These industry-recognized standards provide authentic compliance requirements that mirror real-world security challenges. The framework categorizes scenarios by complexity:

Easy (25%): Basic policy generation and deployment scenarios
Medium (50%): Multi-component scenarios requiring both evidence collection and policy validation
Hard (25%): Complex scenarios involving policy updates and advanced orchestration

Figure 2. ITBench Framework: The registered agent (CAA) is evaluated against the pre-registered scenarios (for Security and Compliance) in a real-world setup; the results are displayed and ranked in the Leaderboard.

Scenario Classes

The benchmarking framework includes four primary scenario classes:

NEW-K8S-CIS-B-KYVERNO (Easy)

Focuses on Kubernetes Pod Security Policy compliance using the Kyverno policy engine. These scenarios test the agent's ability to generate and deploy Kyverno policies based on CIS benchmark requirements.

NEW-K8S-CIS-B-OPAREGO (Medium)

Involves dual output generation for Kubernetes environments, requiring both kubectl command scripts for evidence collection and OPA Rego policies for compliance verification.

NEW-RHEL9-CIS-B-ANSIBLE-OPA (Medium)

Targets RHEL9 host compliance using Ansible playbooks for evidence collection and OPA policies for validation, demonstrating the agent's versatility across different infrastructure types.

UPDATE-K8S-CIS-B-KYVERNO (Hard)

Involves updating existing Kyverno policies based on modified requirements, testing the agent's ability to understand and implement changes to existing compliance frameworks.
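For readers who want the four scenario classes in a form they can filter or iterate over, the catalog below restates them as data. The class names, difficulty levels, targets, and expected outputs come from the descriptions above; the `ScenarioClass` type and its field names are assumptions for this sketch and are not part of ITBench.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ScenarioClass:
    name: str                 # scenario class identifier from the article
    difficulty: str           # Easy, Medium, or Hard
    target: str               # system under assessment
    outputs: Tuple[str, ...]  # artifacts the agent is expected to produce


SCENARIO_CLASSES: Tuple[ScenarioClass, ...] = (
    ScenarioClass("NEW-K8S-CIS-B-KYVERNO", "Easy", "Kubernetes",
                  ("Kyverno policy",)),
    ScenarioClass("NEW-K8S-CIS-B-OPAREGO", "Medium", "Kubernetes",
                  ("kubectl script", "OPA Rego policy")),
    ScenarioClass("NEW-RHEL9-CIS-B-ANSIBLE-OPA", "Medium", "RHEL9",
                  ("Ansible playbook", "OPA policy")),
    ScenarioClass("UPDATE-K8S-CIS-B-KYVERNO", "Hard", "Kubernetes",
                  ("updated Kyverno policy",)),
)


def by_difficulty(classes: Tuple[ScenarioClass, ...]) -> Dict[str, List[str]]:
    """Group scenario class names by difficulty, preserving catalog order."""
    groups: Dict[str, List[str]] = {}
    for sc in classes:
        groups.setdefault(sc.difficulty, []).append(sc.name)
    return groups


groups = by_difficulty(SCENARIO_CLASSES)
```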

Performance Evaluation and Results

The CISO Agent's effectiveness is measured using rigorous metrics that reflect real-world operational requirements:

Evaluation Metrics

Success Rate (pass@1): Measures the agent's ability to correctly assess compliance posture, distinguishing between compliant ("pass") and non-compliant ("fail") configurations. This metric provides an unbiased estimator of correctness across all scenarios.

Time to Process (TTP): Quantifies the efficiency of compliance assessment by measuring the time required for successful posture identification. This metric is crucial for operational environments where rapid response to compliance issues is essential.
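To make the two metrics concrete, the sketch below computes them from raw run records. It assumes pass@1 follows the standard unbiased pass@k estimator from the code-generation literature, 1 - C(n-c, k) / C(n, k) for n runs with c successes, which reduces to c/n when k = 1. It also assumes TTP is averaged over successful runs only. The article does not spell out either formula, so both are assumptions, and the sample numbers are made up.

```python
from math import comb
from statistics import mean
from typing import List, Optional, Tuple


def pass_at_k(n: int, c: int, k: int = 1) -> float:
    """Unbiased pass@k estimate from n runs of which c succeeded."""
    if n - c < k:
        # Every size-k sample must contain at least one success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_ttp(runs: List[Tuple[bool, float]]) -> Optional[float]:
    """Mean time in seconds over successful runs; None if there are none."""
    times = [seconds for succeeded, seconds in runs if succeeded]
    return mean(times) if times else None


# Made-up run records: (succeeded, seconds taken).
runs = [(True, 10.0), (False, 99.0), (True, 20.0)]
successes = sum(1 for succeeded, _ in runs if succeeded)
score = pass_at_k(n=len(runs), c=successes, k=1)
ttp = mean_ttp(runs)
```

Excluding failed runs from the time average keeps a fast but wrong agent from looking efficient, which matches the description of TTP as the time required for successful posture identification.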

Comparative Performance

Evaluation across multiple LLMs reveals significant performance variations. GPT-4o demonstrates superior performance, with pass@1 rates nearly twice those of alternatives like Llama-3.3-70B-instruct and Granite-3.1-8B-instruct. The GPT-based models also excel in Time to Process metrics, handling scenarios with minimal processing time across all complexity classes.

The CISO Assessment Agent evaluation, illustrated in Figure 3, confirms the expected effect of scenario complexity: all models show declining performance as scenarios progress from Easy to Hard. This both validates the benchmark design and highlights areas for improvement in agent capabilities. The evaluation also shows the impact of prompt quality on overall performance: agents upgraded with PDL (Prompt Declaration Language) deliver a significant improvement over the initial version.

Figure 3. CISO Assessment Agent performance results comparing the initial basic version of the agent with the upgraded version using PDL (Prompt Declaration Language)

Implications for Enterprise Security

The CISO Assessment Agent represents a significant advancement in automated compliance management with several important implications for enterprise security operations, including scalability, reliability, and knowledge democratization.

Organizations can now implement continuous compliance monitoring across complex environments without proportional increases in human resources. This addresses the fundamental scalability challenge in modern security operations. Automated policy generation reduces the risk of human error and supports consistent, reliable implementation of compliance requirements across diverse infrastructure environments. In addition, by translating natural-language requirements into executable policies, the agent lowers the technical barrier that prevents compliance teams from directly implementing their requirements. This democratizes access to compliance automation capabilities.

Among the strategic benefits, the agent emphasizes GitOps integration, aligning with modern software development practices and enabling security-by-design principles without disrupting existing workflows. Organizations benefit from accelerated compliance programs, as they can rapidly adopt new regulatory programs by automating the traditionally manual process of translating requirements into operational controls. Finally, automating evidence collection and validation significantly reduces operational costs while improving compliance coverage and frequency.

Conclusion

The CISO Assessment Agent represents a transformational approach to compliance assessment automation, addressing critical challenges in modern enterprise security and AI operations. Through its innovative combination of natural language understanding, policy generation, and automated deployment capabilities, the agent bridges the gap between compliance requirements and technical implementation.

The comprehensive evaluation through ITBench demonstrates the agent's effectiveness across realistic scenarios while highlighting areas for continued improvement. The success of GPT-based models in particular suggests that advanced language models are well-suited for compliance automation tasks, though performance varies significantly across levels of complexity.

As organizations continue to grapple with increasing compliance requirements and growing infrastructure complexity, solutions like the CISO Assessment Agent become essential tools for maintaining security posture at scale. The open-source nature of the implementation, combined with its integration capabilities and proven performance, positions it as a valuable contribution to the enterprise security automation landscape.

Coming Next

Stay tuned for upcoming posts, in which we will dive into the evolution of our other open-source agents, FinOps and SRE, featuring new scenarios, tasks, and evaluation metrics.

Learn More

For an in-depth technical presentation of Agentic AI architecture, please refer to "What is a ReAct agent?".
For more information on Prompt Declaration Language (PDL), its open-source repo is available at https://github.com/IBM/prompt-declaration-language/

Have questions or need help getting started with CISO Assessment Agent and ITBench?

Create a GitHub issue for bug reports or feature requests
Join our Discord community for real-time discussions

Below are the links to our other articles in this series:

ITBench, Part 1: Next-Gen Benchmarking for IT Automation Evaluation
ITBench, Part 2: ITBench User Experience: Democratizing AI Agent Evaluation
