Automation of IT Incident Reports Using Observability API and GenAI

A summary of the integration of observability API and GenAI to automate preliminary incident reporting with a sample incident report from on-prem and cloud models.

Pandian Ramaiah

Jul. 18, 25 · Tutorial

Likes (0)

Comment

Save

3.6K Views

Recently, I was involved in automating the preparation of IT incident reports by integrating monitoring solutions and generative AI. This post is to share my observations during this ongoing exercise. It outlines how to leverage the power of the AI service to interpret alerts generated by an observability solution (in this case, eG Enterprise Suite).

It utilized the Gemma AI model, running locally via Ollama for on-premise users, and the Llama-3.1 AI model with OpenAI for those with internet connectivity. The outcome of this exercise resulted in faster understanding of system issues and potential remediation steps. It automates the incident report creation by a Level 1 engineer.

Prerequisites:

An active subscription to the eG Innovations SaaS service or installation of the eG Innovations Manager. You'll need your eG Manager's hostname or IP address, port, username, and API key or password.
For on-premise AI models
- A Windows server, preferably with a GPU card (this guide focuses on Microsoft Windows, but Ollama also supports macOS and Linux), or
- Internet connection for the initial Ollama and Gemma 2 setup.
For SaaS AI models, a valid OpenAI API key.

I. Setting Up Ollama for On-Premise Environments:

1. Download and install Ollama:

Visit the Ollama website.
Click the “Download” button. Choose the Windows version.
Install the application by following the on-screen instructions.

2. Verify Ollama server status:

After installation, Ollama should automatically start. You can verify this by checking the taskbar. An Ollama icon (the white icon shown below) should be present.

3. Confirm Ollama is running in your browser:

Open your web browser (e.g., Chrome, Firefox, Edge) and navigate to http://localhost:11434/.
You should see the following response: Ollama is running.

4. Configure command line access to Ollama:

Open an elevated command prompt (run as administrator).
Set the Ollama installation folder in your system's PATH environment variable. This allows you to run Ollama commands from any directory.
set PATH=%PATH%;C:\Users\<your_username>\AppData\Local\Programs\Ollama
Important: Replace <your_username> with your actual Windows username.

5. (Optional) Make the PATH permanent: To make the PATH change permanent, you can set it through the System Properties.

Open Control Panel > System and Security > System > Advanced System Settings > Environment Variables.
Under User variables, find the Path variable, select it, and click Edit.
Add the Ollama directory (C:\Users<your_username>\AppData\Local\Programs\Ollama) to the list.
Restart the command prompt or terminal for the changes to take effect.

II. Downloading the Gemma 2 Model:

Open an elevated command prompt: Ensure you have a command prompt open with administrator privileges.
Pull (Download) the Gemma 2 model: Execute the following command to download the Gemma 2 model from the Ollama library:
Ollama run gemma2

Ollama will start downloading the necessary components. This process may take a while, depending on your internet connection speed, as the download size is approximately 5.4 GB.

You should see progress indicators in the command prompt:

    Shell
   
 

   ollama run gemma2
pulling manifest
pulling ff1d1fc78170... 100% ▕████████████████████████████████████████████████████████▏ 5.4 GB
pulling 109037bec39c... 100% ▕████████████████████████████████████████████████████████▏  136 B
pulling 097a36493f71... 100% ▕████████████████████████████████████████████████████████▏ 8.4 KB
pulling 2490e7468436... 100% ▕████████████████████████████████████████████████████████▏   65 B
pulling 10aa81da732e... 100% ▕████████████████████████████████████████████████████████▏  487 B
verifying sha256 digest
writing manifest
success
  

The success message indicates that the model has been downloaded and installed correctly.

III. Integrating With eG Innovations (or Any Other Observability Solution)

Collect eG alarms using the API: I used Spring Boot with Spring AI libraries to connect to the eG Innovations Manager API and retrieve the alerts.

    Java
   
 

   private List<Alarm> getAlarmsFromEGManager() {
	    String egManagerUrl = "https://<eg-manager-hostname>:443/api/eg/analytics/getAlerts";
	    HttpHeaders headers = new HttpHeaders();
	    headers.setContentType(MediaType.APPLICATION_JSON);
	    headers.set("managerurl", "https://<eg-manager-hostname>:443");
	    headers.set("user", "<user name>");
	    headers.set("pwd", Base64.getEncoder().encodeToString("<api key>".getBytes()));
	    
	    String requestBody = "{\"filterBy\":\"zone\",\"filterValues\":\"ALL\"}";
	    HttpEntity<String> entity = new HttpEntity<>(requestBody, headers);

	    RestTemplate restTemplate = new RestTemplate();
	    ResponseEntity<String> response = restTemplate.exchange(egManagerUrl, HttpMethod.POST, entity, String.class);

	    return parseAlarmsFromResponse(response.getBody());
	}
  

Refer to the previous code snippets for this, ensuring I am using the /api/eg/analytics/getAlerts endpoint, including the filterBy and filterValues in the request body. The parsing of the response and preparing a collection of the entity objects is also completed. It is not shown here as it is insignificant.

Iterate through the alerts and construct the prompt: For each alert retrieved from the eG Manager API, construct a prompt that will be sent to the AI model.

    Java
   
 

   @Override
	public void run(String... args) throws Exception {
		while (true) {
			List<Alarm> alarmsFromEGManager = getAlarmsFromEGManager();
			for (Alarm alarm:alarmsFromEGManager) {
				StringBuilder sb = new StringBuilder();
				sb.append("This is an alarm from eG Innovations. ");
				sb.append("Component name:"+alarm.getComponentName());
				sb.append("\n");
				sb.append("Component type:"+alarm.getComponentType());
				sb.append("\n");
				sb.append("Description:"+alarm.getDescription());
				sb.append("\n");
				sb.append("Layer:"+alarm.getLayer());
				sb.append("\n");
				sb.append("Measure:"+alarm.getMeasure());
				sb.append("\n");
				sb.append("Priority:"+alarm.getPriority());
				sb.append("\n");
				sb.append("Business Service affected by this alert:"+alarm.getService());
				sb.append("\n");
				sb.append("Alarm start time:"+alarm.getStartTime());
				sb.append("\n");
				sb.append("Test or Probe name:"+alarm.getTest());
				sb.append("\n");
				sb.append("Interpret this alert for me in simple terms. Provide me potential causes. Provide remediation if any.");
				String promptString = sb.toString();
				processPrompt(promptString);
			}
		}
	}
  

IV. Processing the Prompt With Ollama:

The prompt prepared in the previous step is passed over here. It is processed in four steps:

Create a CountDownLatch to manage asynchronous operation completion.
Sends the prompt to an AI model (gemma2) using a streaming API.
Subscribes to the response stream:
1. Prints each part of the response as it arrives.
2. Handles errors by printing them and releasing the latch.
3. Signals completion by printing a message and releasing the latch.
Waits for the latch to be released (operation completion) or a 300-second timeout.
In essence, it processes an AI prompt asynchronously, prints the response in real-time, and ensures the operation completes (or times out) before proceeding.

    Java
   
 

   	private void processPrompt(String promptString) throws InterruptedException {
		CountDownLatch latch = new CountDownLatch(1);
		
		Flux<ChatResponse> responseFlux = chatModel.stream(new Prompt(
		        promptString,
		        OllamaOptions.create()
		            .withModel("gemma2")
		            .withTemperature(0.4D)
		    ));
		// Block and collect all responses
		
		responseFlux.subscribe(
			    chatResponse -> System.out.print(chatResponse.getResult().getOutput().getContent()),
			    error -> {
			        System.err.println("An error occurred: " + error.getMessage());
			        latch.countDown();
			    },
			    () -> {
			        System.out.println("\nGeneration complete.");
			        latch.countDown();
			    }
			);
		
		// Wait for the operation to complete
		latch.await(300, TimeUnit.SECONDS);
		System.out.println("3");
	}
  

A simple prompt created during the feasibility check is given below. It is an ineffective prompt. The coming sections will explain how I improved it later on.

    Shell
   
 

   This is an alarm from eG Innovations.
Component name: reverse_proxy_srv1:443
Component type: Apache Web
Description: The web page 'HomePage' running in reverse_proxy_srv1:443 is unavailable
Layer: Web Server
Measure: Web availability
Priority: critical
Business Service affected by this alert: -
Alarm start time: Feb 15, 2025 14:06
Test or Probe name: HTTP
Interpret this alert for me in simple terms. Provide me with potential causes. Provide remediation steps if any.
  

Here's an example:

    Shell
   
   ##  eG Innovations Alarm: Your Website is Down!

This alarm means that the webpage "HomePage" on your web server (called "reverse_proxy_srv1") running on port 443 (usually used for secure websites) is currently unavailable.  

**Think of it like this:** Imagine your website is a shop, and "HomePage" is the front door. This alarm is saying the front door is locked, and customers can't get in!

**Possible Causes:**

* **Server Crash:** The web server itself might have crashed or frozen.
* **Network Issues:** There could be problems with your internet connection or the network connecting your server to the outside world.
* **Software Glitch:** A bug in the website software (Apache Web) or a related program could be causing the problem.
* **Configuration Error:**  Something might be wrong with how the web server is set up, preventing it from serving the "HomePage" correctly.

**What to Do:**

1. **Check your Server Status:** Log into your web server and see if it's running. Look for any error messages.
2. **Test your Network Connection:** Make sure you have a stable internet connection. Try accessing other websites to rule out a broader network problem.
3. **Restart the Web Server:**  Sometimes a simple restart can fix things. 
4. **Review Logs:** Check the web server's logs for any clues about what went wrong. These logs often contain detailed error messages that can help pinpoint the issue.
5. **Contact Support:** If you're unable to resolve the problem yourself, contact your hosting provider or IT support for assistance.

Remember:  The sooner you address this alarm, the faster you can get your website back online

The prompt can still be customized with additional inputs from other API gateways to improve the accuracy of the response.

V. Addressing the Requirements of On-Premise and Internet-Based Users

While performance data is generally considered less sensitive, it contains sensitive information such as server hostnames, IP addresses, process locations, and usernames.

For on-premise deployments of the Ollama service, generating reports locally is a secure and efficient approach. This method eliminates the need to transmit sensitive data externally and allows for AI model responses without internet connectivity.

However, we have observed intermittent attempts by the AI model to establish internet connections, triggering alerts from my antivirus web safety features.

Attempt to reach internet from on-premise AI service.

On-premise Ollama application provides AI response generation, working reliably even with a blocked internet connection. Specifically, it uses Ollama with gemma:7b in production, running on an Ubuntu server without internet access.

For users who prefer AI SaaS services, I implemented access to the llama-3.1-sonar-small-128k-online model through Perplexity's API gateway. The application's architecture includes an AIService interface with a generateResponse() method, implemented by both OllamaService and OpenAiService. The preferred AI service is selected using the ai.provider=openai parameter in application.properties.

The following flowchart may help you to visualize it.

On-premise and cloud connectivity

To enhance data security and control, we conducted a detailed analysis of data transmission. The goal was to simplify the configuration of what data is sent and what data is masked. The implemented techniques include replacing original values with proxy values and providing options to selectively include or exclude data from transmission. This allows for greater flexibility and security in data handling.

VI. Preparation of An Effective Prompt:

In the world of observability, interpreting alerts is not enough. Correlation is essential. As illustrated in Processing the Prompt with Ollama (Section IV), which demonstrates the limitations of GenAI hallucination, simple alert interpretation is ineffective.

To generate meaningful incident reports, we must provide related data to enhance our prompts. Most monitoring solutions expose their data through APIs or CLIs. I suggest using four API calls to prepare the prompt for this requirement:

Retrieving the history of alarms raised in the target environment: This provides all alarm records and their parameters.
Retrieving historical data of a measure: This provides the trend of the measurement for which the alert is raised.
Retrieving detailed diagnosis of a measure: This provides additional data to justify the alert. For example, when CPU utilization increases, this reveals which processes consumed CPU cycles.
Retrieving live measures of a component: This provides all other metrics collected for the problematic server, application, or network equipment. This is a key parameter in correlation. For instance, if a database server is struggling to cope, we need numerous metric points—such as wait types, wait times, error logs, user processes, client processes, and top queries—to determine the root cause. The product already provides a KPI analysis report for this purpose. The email contains the alert interpretation even without GenAI. But the aim of this exercise is to integrate GenAI capabilities with the same data and automate it.

With that in mind, let's examine an incident report generated using a prompt that incorporates inputs from these four APIs. The report is provided in the next section.

A Sample Incident Report

There was an alert on the CPU utilization of a Linux server this morning.

Below is the response from the AI service to prepare incident report:

Response From Open AI Through Perplexity API

### Comprehensive Analysis of CPU Utilization Alert

#### Trend Analysis

**Summary:**
The CPU utilization on localhost.localdomain has experienced a significant spike over the past hour. The data shows a steady increase in CPU utilization, culminating in a critical alert at 06:31.

**Average and Maximum CPU Utilization:**
- **Average CPU Utilization:** The average CPU utilization over the last hour is roughly 10.5% (calculated from the given data points).
- **Maximum CPU Utilization:** The maximum CPU utilization reached 100% at 06:31, indicating severe CPU contention.

**Trend Description:**
The trend over the last hour can be described as follows:
- The CPU utilization starts at 0% at 05:36.
- It gradually increases to 1% and remains at 1% until 05:46.
- There is a minor spike to 4% at 06:01.
- The CPU utilization then increases sharply to 13% by 06:06.
- It drops to 0% multiple times until the critical spike at 06:31, reaching 100%.

**Conclusion:**
There was a consistent trend of increasing CPU utilization leading up to the critical alert, with significant spikes and drops. The final critical spike indicates severe CPU contention.

#### Root Cause Identification

**Detailed Diagnosis:**
The detailed diagnosis data indicates the following processes contributing to the high CPU utilization:
- **ollama:** The `ollama` process consumed 100% of the CPU at 06:31.
- **top:** The `top` process consumed an additional 25% of the CPU at 06:31.
- **kworker/2:2-events:** A kernel worker process consumed 1.55% of the CPU at 06:31.

**Correlation with Time-Series Data:**
The time-series data shows that the CPU utilization remained stable until the sharp increase at 06:31, where it reached 100%. This correlates with the `ollama` process consuming 100% of the CPU, indicating a dominant contributor to the high CPU utilization.

#### Recommendations

**Immediate Actions:**
1. **Identify and Terminate the Dominant Process:**
- The `ollama` process is identified as the primary cause of the high CPU utilization. Immediate action should be taken to identify and terminate this process if it is not crucial for the system's operation.

2. **Check for Resource Intensive Programs:**
- Review other processes running on the system to determine if any other applications are consuming significant CPU resources.

3. **Monitor System Logs:**
- Check system logs to see if there are any error messages related to the `ollama` process or other resource-intensive applications.

**Long-term Solutions:**
1. **Implement CPU Monitoring:**
- Ensure that continuous CPU monitoring is in place to detect such spikes promptly.

2. **Resource Optimization:**
- Optimize resource allocation to prevent future occurrences of high CPU utilization.

3. **Regular Maintenance:**
- Regularly review system performance and optimize resource usage to avoid future alerts.

#### eG Enterprise Suite Module Suggestions

**Monitoring Application Models:**
To effectively monitor and manage CPU utilization, the following eG Enterprise Suite monitoring applications can be beneficial:
- **System Details Test:** This test provides comprehensive details about system health, including CPU utilization.
- **Process Details Test:** This test provides detailed information about running processes, including their CPU usage.
- **Memory Details Test:** This test helps in monitoring memory utilization, which can be correlated with CPU usage to identify potential bottlenecks.

**Benefits:**
These monitoring models will provide direct value in solving the current CPU utilization issue by:
- Identifying contributing processes.
- Providing historical data for trend analysis.
- Offering real-time monitoring capabilities to detect future occurrences.

### Output Format

**Trend Analysis:**
- **CPU Utilization Trend:** The CPU utilization experienced a consistent increase leading to a critical spike.
- **Average CPU Utilization:** 10.5%
- **Maximum CPU Utilization:** 100%
- **Trend Description:** Gradual increase with minor spikes and drops culminating in a critical spike at 06:31.

**Root Cause Identification:**
- **Primary Cause:** The `ollama` process consumed 100% of the CPU.
- **Other Contributing Processes:** The `top` and `kworker/2:2-events` processes also consumed significant CPU resources.

**Recommendations:**
1. **Immediate Action:** Identify and terminate the `ollama` process.
2. **Check for Resource Intensive Programs:** Review other processes for significant CPU usage.
3. **Monitor System Logs:** Check for error messages related to resource-intensive applications.

**eG Enterprise Suite Module Suggestions:**
1. **System Details Test:** Comprehensive system health monitoring.
2. **Process Details Test:** Detailed process information including CPU usage.
3. **Memory Details Test:** Correlate memory and CPU usage for optimization.

Response From Ollama With Gemma Model

**Trend Analysis:**
Over the last hour, CPU utilization has been steadily increasing with a slight dip around 30 minutes ago. The average CPU utilization during this period is approximately **72.115%**, while the maximum reached was **85.421%**.

There seems to be no consistent trend; instead, it appears that there were multiple short-lived spikes in CPU utilization, possibly indicating intermittent system overload or resource-intensive processes running concurrently.

**Root Cause Identification:**
Based on the detailed diagnosis data, I identify the following contributing factors to high CPU utilization:

* **System Processes:** Several system processes are consistently consuming high CPU resources, `ollama.service`. These might be causing unnecessary overhead.
* **Resource-Intensive Programs:** One or more resource-intensive programs are running concurrently, likely contributing to the increased CPU load. These could be applications, services, or scripts that are not optimized for performance.

**Recommendations:**
To mitigate this alert on CPU utilization:

1. **Optimize System Processes:**
* Investigate and potentially disable or adjust system processes (`ollama.service`) to reduce their resource consumption.
2. **Identify Resource-Intensive Programs:**
* Use tools like `top`, `htop`, or `ps` to identify the specific programs consuming high CPU resources.
* Analyze these applications' performance and consider optimizing, terminating, or adjusting their configurations to minimize overhead.
3. **Configure System Resources:**
* Adjust system resource allocation (e.g., memory, CPU) as needed to opt optimize performance.

**eG Enterprise Suite Module Suggestions:**
To gain deeper insights into the system's performance and better monitor CPU utilization:

1. **Operating System Component Monitoring:** Use eG Enterprise's
Operating System component monitoring model to track CPU utilization,
memory usage, and other vital system metrics.
2. **Application Performance Monitoring:** Utilize eG Enterprise's
Application Performance monitoring module to analyze the performance of
resource-intensive programs and identify potential bottlenecks.

By implementing these recommendations and incorporating the suggested monitoring modules, you will be able to better monitor and manage CPU utilization, ultimately improving overall system performance.

Cost of Using AI to Generate the Report

It is important to monitor the Ollama server's CPU and memory utilization to understand the impact of running the Gemma model. As the monitoring agent is placed on the server, it tracks these metrics. Following are the screenshots from my virtual machine in the development environment, which has 16GB of RAM, 4 virtual CPUs and no GPU.

CPU usage on the Ollama server (on-premise)

Memory usage of the Ollama server (on-premise)

For Perplexity APIs, I do not have a per-prompt cost breakdown. Based on tracking the API credit balance in the development environment, I found that each incident report costs around $0.01 to $0.03.

Cost factors for SaaS-based GenAI service

Important Considerations

On-Premise Security: Because you're running Ollama and Gemma locally, your data remains within your own environment. This is crucial for security and compliance.
Effective response by SaaS AI models: When you use SaaS-based AI services, you have access to latest AI models that empower your report.
Hardware Requirements: Running LLMs like Gemma 2 can be resource-intensive. Ensure you provision a VM with sufficient CPU, GPU, and RAM. Consider the larger Gemma models if your hardware supports it, for potentially improved results.
Prompt Engineering: The quality of the output from Gemma depends heavily on the quality of your prompts. Experiment with different prompt structures to get the best results. When the output of related eG manager’s API are combined, it will replace the generic response with accurate root-cause.
API Rate Limiting: If you're processing a large number of alerts, be mindful of potential over-leverage of the eG Innovations Manager API. Provision sufficient resources on its key components.
Asynchronous Processing: Implement proper error handling and asynchronous processing to prevent your application from becoming unresponsive. I would suggest the producer (the one who prepares the prompt using monitoring APIs) and consumer (the one who sends the prompt to AI service) get a response and do post processing such as prepare email or updating tickets with PDF reports,

This exercise provides a more comprehensive and user-friendly guide for integrating alerts of an observability tool with on-premise and SaaS-based GenAI services. Remember to replace the placeholder values with your actual information. The output could be enhanced to HTML or PDF format to send the information to IT service management systems, email, or any other medium.

AI API systems

Opinions expressed by DZone contributors are their own.

Related

Trending