Creating and managing agents that automatically analyze and respond to infrastructure events
Unpage agents are the core building blocks for automating infrastructure operations with LLMs. Unlike traditional automation that requires extensive coding, Unpage agents are defined with simple YAML configurations that specify what they analyze and how they respond.
The description specifies when the agent should be used. This is particularly important for the routing system, which automatically selects the most appropriate agent for each incoming alert.
description: >
  Use this agent to analyze alerts that meet the following criteria:
  - CPU usage exceeds 90% on production EC2 instances
  - The alert has been active for more than 15 minutes
  - The instance is part of the main application cluster
The prompt contains detailed instructions for the agent, defining:
What information to gather
How to analyze the situation
What actions to take
Any constraints or guidelines to follow
prompt: >
  You are responsible for analyzing high CPU usage alerts on production EC2 instances.

  Follow these steps:
  1. Verify that CPU usage is indeed exceeding 90% using current metrics
  2. Check if the instance is experiencing unusual load by examining:
     - Recent log entries for errors or warnings
     - Unusual network traffic patterns
     - Scheduled jobs that might be running
  3. Determine if this is an isolated incident or affecting multiple instances
  4. If the issue appears to be temporary and non-critical:
     - Post a status update indicating it's likely transient
  5. If the issue appears serious:
     - Post a detailed analysis with recommended next steps
     - Highlight any potential service impacts

  Always include specific metrics and timestamps in your analysis.
  Never make assumptions without data to support them.
The tools section explicitly grants the agent permission to use specific infrastructure tools. This follows the principle of least privilege, ensuring agents only have access to the tools they need.
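As an illustration, a tools section for the EC2 CPU alert agent described above might look like the following sketch. The tool names are borrowed from the complete example on this page; which tools are actually available depends on your installation, so treat this list as an assumption rather than a recommended set.

```yaml
# Hypothetical minimal tool set for a CPU alert agent.
# Only tools listed here can be called by the agent.
tools:
  - "core_current_datetime"            # anchor timestamps in the analysis
  - "metrics_get_metrics_for_node"     # verify current and recent CPU usage
  - "solarwinds_search_logs"           # look for errors around the alert time
  - "aws_describe_ec2_instance"        # confirm instance details
  - "pagerduty_post_status_update"     # report findings on the incident
```

Leaving a tool out of the list is how access is withheld: an agent with this configuration could post a status update but could not, for example, explore neighboring resources in the graph.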
Agent configuration files are stored in your Unpage profile directory:
~/.unpage/profiles/<profile_name>/agents/
Each agent has its own YAML file, named after the agent (e.g., cpu-alert-agent.yaml). You can have multiple agents, each specialized for different tasks.
Here’s an example of a complete agent configuration:
# cpu-alert-agent.yaml

# Used by the router to determine which agent to use for an alert
description: >
  Use this agent to analyze alerts that meet the following criteria:
  - The alert is related to CPU usage exceeding thresholds
  - The alert comes from AWS CloudWatch or Datadog
  - The affected resource is a compute instance (EC2, container, etc.)

# Instructions for the agent
prompt: >
  You are an agent specialized in analyzing high CPU usage alerts.

  When investigating a CPU alert, follow these steps:

  1. Check the current CPU metrics to verify the alert is still active
  2. Look at CPU metrics for the past hour to see if this is a spike or sustained usage
  3. Check logs from around the time the alert started for any errors or unusual activity
  4. Look for any recent deployments or changes that might explain the high usage
  5. Check if similar resources are experiencing the same issue

  Based on your findings, update the incident with:
  - Current status of the issue
  - Likely cause based on available evidence
  - Recommended next steps
  - Whether this appears to be a critical issue requiring immediate human attention

  Be concise but thorough. Include specific metrics, timestamps, and log entries
  that support your analysis. NEVER make up information or assume values you
  haven't verified.

# Tools the agent can use
tools:
  - "core_current_datetime"
  - "core_convert_to_timezone"
  - "metrics_get_metrics_for_node"
  - "metrics_list_available_metrics_for_node"
  - "graph_get_resource_details"
  - "graph_get_neighboring_resources"
  - "graph_get_resource_topology"
  - "solarwinds_search_logs"
  - "pagerduty_post_status_update"
  - "pagerduty_get_incident_details"
  - "aws_describe_ec2_instance"
# Run with a JSON payload from a file
unpage agent run cpu-alert-agent @path/to/alert.json

# Run with a direct JSON payload
unpage agent run cpu-alert-agent '{"alert": "CPU usage exceeds 90%"}'
Unpage agents provide a configuration-driven approach to infrastructure automation with LLMs. By defining specialized agents with clear instructions and appropriate tool access, you can create a responsive system that handles routine operations and provides valuable insight for complex incidents.
To learn more about how alerts are routed to the appropriate agent, see the Router documentation.