AIOps Agent is an Alibaba Cloud intelligent operations and maintenance (O&M) platform built on large language models and agent technology. It gives enterprises intelligent observability and O&M capabilities, including runtime observability, data insights, intelligent diagnosis, and automated self-healing, to safeguard applications and services in real time.
Get started
- Quick starts:
  - Intelligent Session quick start: Query data using natural language to explore AIOps Agent's Intelligent Session capabilities.
  - Mission quick start: Walk through creating a Mission and viewing an inspection report, using the “Daily scheduled inspection for a Kubernetes cluster” example.
- Configure permissions: Configure permissions for RAM users and RAM roles.
For more information, see “What is AIOps Agent?”
Core advantages
| Advantage | Description |
| --- | --- |
| Unified data platform | Built on the unified data foundation of Alibaba Cloud's observability platform, it provides unified storage for logs, topologies, metrics, and traces. The platform supports petabyte-scale daily ingestion, exabyte-scale storage, and analysis of hundreds of billions of data points within seconds. A multi-availability zone deployment ensures 99.99% reliability. |
| O&M digital twin | Creates a digital twin of your system's runtime state based on UModel, enabling unified modeling of applications, services, resources, topologies, alerts, and change relationships. Supports custom extensions, real-time topology inference, and causal analysis. |
| Data analysis operators | Provides general-purpose data analysis and observability AI operators for metric anomaly detection, log clustering, trace analysis, performance profiling, and change history tracking. These operators improve root cause analysis (RCA) efficiency and reduce model inference costs. |
| Flexible integration options | Supports OpenAPI, page embedding, and IM integration with platforms such as DingTalk and Lark for flexible integration into your existing workflows. |

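
As a rough illustration of what a metric anomaly detection operator does, the following sketch flags points that deviate strongly from a baseline window. It is a generic z-score check, not the operator that AIOps Agent provides; the function name `detect_anomalies`, the threshold, and the sample latency values are all hypothetical.

```python
from statistics import mean, pstdev


def detect_anomalies(baseline, recent, threshold=3.0):
    """Return indices in `recent` that deviate from the baseline window.

    A point is flagged when its distance from the baseline mean exceeds
    `threshold` baseline standard deviations (a plain z-score test).
    """
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        # Flat baseline: any value different from it counts as a deviation.
        return [i for i, v in enumerate(recent) if v != mu]
    return [i for i, v in enumerate(recent) if abs(v - mu) / sigma > threshold]


# Hypothetical latency samples in milliseconds.
baseline_latency_ms = [102, 98, 101, 99, 100, 97, 103, 100, 99, 101]
recent_latency_ms = [100, 104, 480, 99]

print(detect_anomalies(baseline_latency_ms, recent_latency_ms))
```

Computing the statistics from a separate baseline window keeps a large spike from inflating the standard deviation it is being compared against.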
Security and compliance
- Fine-grained authorization policies: Layered authorization through operator and Digital Employee RAM roles separates permissions into “what a person can do” and “what an agent can access”, enabling least-privilege access and significantly reducing the risk of unauthorized operations.
- Human-in-the-loop intervention: Integrates with your tools via MCP and lets you configure a human-in-the-loop (HIL) process that converts high-risk write operations and dangerous commands into secure workflows requiring manual confirmation. As a final safeguard, the interception engine blocks abnormal executions to prevent misoperations and malicious actions.
- Agent behavior auditing: Retains a complete record of conversation history, runtime artifacts, tool calls, CLI commands, and data access throughout the agent lifecycle, providing traceable and reviewable audit evidence for compliance and security reviews.
- End-to-end data encryption: Uses HTTPS/TLS for encrypted data transmission. Observability data at rest can be encrypted with KMS, and agent runtime artifacts are also encrypted, ensuring data privacy and integrity across the entire data path.
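
The layered authorization and human-in-the-loop ideas above can be sketched as a small decision function. This is a conceptual model only, not AIOps Agent's implementation or the RAM policy syntax. It assumes that an action must be permitted by both the operator layer and the agent layer, and the action names, the `HIGH_RISK_ACTIONS` set, and the `decide` function are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical set of actions treated as high-risk write operations.
HIGH_RISK_ACTIONS = {"restart_service", "delete_pod"}


@dataclass
class Request:
    operator_permissions: set  # "what a person can do"
    agent_permissions: set     # "what an agent can access"
    action: str


def decide(request, confirmed_by_human=False):
    """Return 'deny', 'needs_confirmation', or 'allow' for a requested action."""
    # Assumption: both layers must permit the action (least privilege).
    if request.action not in request.operator_permissions:
        return "deny"
    if request.action not in request.agent_permissions:
        return "deny"
    # Human-in-the-loop: high-risk actions wait for manual confirmation.
    if request.action in HIGH_RISK_ACTIONS and not confirmed_by_human:
        return "needs_confirmation"
    return "allow"


operator = {"read_metrics", "restart_service"}
agent = {"read_metrics", "restart_service", "delete_pod"}

print(decide(Request(operator, agent, "read_metrics")))
print(decide(Request(operator, agent, "restart_service")))
print(decide(Request(operator, agent, "restart_service"), confirmed_by_human=True))
print(decide(Request(operator, agent, "delete_pod")))
```

In this sketch, the agent role alone granting `delete_pod` is not enough, because the operator layer does not permit it.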
Use cases
- Scheduled intelligent inspections for Kubernetes clusters: Automatically inspect daily cluster health, generate structured reports, and compare them with historical data.
- High availability assurance for core services: Continuously monitor core services and automatically perform root cause analysis (RCA) when an alert is triggered.
- Natural language-driven fault diagnosis: Narrow down the scope of investigation through multi-turn conversations and perform correlation analysis using the UModel topology.
- Periodic data quality checks: Regularly check the health of data pipelines and automatically send notifications upon detecting anomalies.
- Automated O&M report generation: Automatically aggregate O&M data and generate structured reports on a weekly or monthly basis.