STAROps Skill Integration
The alibabacloud-starops-chat Agent Skill lets you invoke STAROps Digital Employees for AIOps diagnostics. You can also add and manage custom Skills for Digital Employees in the STAROps console.
Use Cases
| Scenario | Description | Example Prompt |
|---|---|---|
| Service error root cause analysis | Analyzes root causes of errors for a specified service using multi-step reasoning across traces, logs, and metrics. | Identify the root cause of errors in the inventory service. |
| Workspace and service queries | Retrieves service lists, counts, language distributions, and other metadata for the current workspace. | How many APM services are in the current workspace? |
| APM metrics analysis | Analyzes request volume, error rates, latency, and other APM metrics with sorting and Top-N ranking by dimension. | Which service has the highest request volume? |
| Service topology and classification | Displays programming languages, upstream and downstream dependencies, and resource states in the service topology. | Show me services grouped by programming language in the current workspace. |
| Multi-turn investigation | Continues follow-up questions within the same thread based on prior diagnostic conclusions to progressively narrow the scope. | Based on the notification service timeout identified earlier, check its error logs. |
Prerequisites
- STAROps is activated on your Alibaba Cloud account.
- The Digital Employee has access to APM, SLS, UModel, and other data sources. Without connected data sources, diagnostics cannot produce meaningful conclusions.
- You have Alibaba Cloud account credentials with access to the target workspace and the following RAM permissions:

  | API Name | Action | Resource |
  |---|---|---|
  | CreateThread | starops:CreateThread | acs:starops:<region>:<uid>:digitalemployee/<employee_name> |
  | CreateChat | starops:CreateChat | acs:starops:<region>:<uid>:digitalemployee/<employee_name> |

- Alibaba Cloud CLI is installed and credentials are configured (recommended). The Skill resolves credentials through the Alibaba Cloud Credentials SDK default chain, which automatically reads ~/.aliyun/config.json. No additional configuration is needed.

  Warning: To prevent credential leakage, do not paste your AccessKey ID or AccessKey Secret into Agent conversations. Use the Alibaba Cloud CLI configuration file to manage credentials; the Skill reads them automatically.

- Python 3 is installed on the local machine to run the Skill's built-in diagnostic scripts.
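The two RAM permissions listed above can be expressed as a custom policy document. The following is a minimal sketch that assembles such a document in Python, assuming the standard RAM policy layout (Version "1" with a Statement list). The function name build_starops_policy and the sample region, UID, and employee name are illustrative placeholders, not values from the product.

```python
import json


def build_starops_policy(region, uid, employee_name):
    """Build a RAM policy dict granting the two STAROps actions on one Digital Employee."""
    # Resource pattern taken from the permissions table; all three parts are caller-supplied.
    resource = f"acs:starops:{region}:{uid}:digitalemployee/{employee_name}"
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["starops:CreateThread", "starops:CreateChat"],
                "Resource": resource,
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder values only; substitute your own region, UID, and employee name.
    policy = build_starops_policy("cn-hangzhou", "1234567890", "my-employee")
    print(json.dumps(policy, indent=2))
```

The printed JSON can be reviewed before you attach it to a RAM user or role through your usual permission workflow.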
Supported Agents
The alibabacloud-starops-chat Skill follows an open Skill specification and works with major coding agents, including Qwen Code, Claude Code, Codex, Qoder, and OpenClaw.
Any custom Agent that supports the Skill specification can also use this Skill. The specification requires the Agent to have the following capabilities:
- Parse the SKILL.md description file to retrieve the Skill's metadata, instructions, and tool definitions.
- Support running the Skill's built-in scripts via Bash tool calls.
- Inject environment variables and credentials as declared in SKILL.md.
Custom Agents that meet these requirements (such as agents built on LangChain, AutoGen, or Dify) can load this Skill by placing the Skill files in a recognized skills directory.
Install the Skill
The alibabacloud-starops-chat Skill is published on Alibaba Cloud Skills and ClawHub. The following installation methods are supported.
Method 1 (Recommended): Install via npx
The npx command is bundled with Node.js. Before installing, confirm that your local environment is ready:
node -v
npx -v
If the terminal reports that node or npx is not found, download and install Node.js from the Node.js website.
Run the following command to install the alibabacloud-starops-chat Skill:
npx skills add aliyun/alibabacloud-aiops-skills --skill alibabacloud-starops-chat
After installation, confirm that the alibabacloud-starops-chat directory exists in your skills directory, then restart the Agent to activate the Skill.
Method 2: Install Manually
Download the alibabacloud-starops-chat package from the GitHub Release page. Extract the archive and copy the files to the skills directory of your Agent.
After copying, confirm that the alibabacloud-starops-chat directory exists in your skills directory, then restart the Agent to load the Skill.
The skills installation directories for common Agents are listed below.
| Agent | Project-level Directory | User-level Directory |
|---|---|---|
| Claude Code | | |
| Codex | | |
| Qoder | | |
| Qwen Code | | |
| OpenClaw | | |
Configure Environment Variables
The Skill uses the following environment variables to locate the target Digital Employee and workspace. If your platform does not inject them automatically, set them manually before invoking the Skill:
| Variable | Required | Description | How to Obtain |
|---|---|---|---|
| STAROPS_AGENT_EMPLOYEE | Yes | Digital Employee name | STAROps console > Digital Employees > Digital Employee name |
| STAROPS_AGENT_WORKSPACE | Yes | Workspace identifier | CMS 2.0 console > Select Workspace |
| STAROPS_AGENT_UID | Yes | Alibaba Cloud account UID that owns the workspace | Alibaba Cloud console > Account Management > Account ID |
| STAROPS_AGENT_ENDPOINT | No | Custom endpoint | Default: |
| | No | Region | Default: |
export STAROPS_AGENT_EMPLOYEE="<Digital Employee name>"
export STAROPS_AGENT_WORKSPACE="<workspace identifier>"
export STAROPS_AGENT_UID="<Alibaba Cloud account UID>"
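Before invoking the Skill, it can help to confirm that the three required variables are set and non-empty. The following is a minimal sketch of such a check; the function name missing_starops_vars is hypothetical and this is not the pre-flight script shipped with the Skill.

```python
import os

# The three required variables named in this section.
REQUIRED_VARS = (
    "STAROPS_AGENT_EMPLOYEE",
    "STAROPS_AGENT_WORKSPACE",
    "STAROPS_AGENT_UID",
)


def missing_starops_vars(env=None):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_starops_vars()
    if missing:
        print("Missing required STAROps environment variables: " + ", ".join(missing))
    else:
        print("All required STAROps environment variables are set.")
```

Passing a dictionary instead of reading os.environ makes the check easy to exercise in a test or a CI step.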
Configure Credentials
The Skill resolves credentials through the Alibaba Cloud Credentials SDK default chain. No Skill-specific AccessKey variables are required. We recommend configuring credentials via the Alibaba Cloud CLI — the Skill reads them automatically.
Method 1 (Recommended): Configure via Alibaba Cloud CLI
If the Alibaba Cloud CLI is not yet installed, refer to the Alibaba Cloud CLI installation guide to install it. Then run the following command to configure credentials:
aliyun configure
Follow the prompts to enter your AccessKey ID, AccessKey Secret, and default Region ID. After configuration, credentials are saved in ~/.aliyun/config.json, which the Skill reads automatically.
To verify that the CLI configuration is working, run:
aliyun sts GetCallerIdentity
If the command returns your account UID and identity information, the credentials are configured correctly.
The Alibaba Cloud CLI supports multiple credential modes via the --mode parameter:
# AK mode (default)
aliyun configure --mode AK
# STS Token mode (temporary credentials)
aliyun configure --mode StsToken
# RAM Role (ECS instance role)
aliyun configure --mode EcsRamRole
# RAM Role ARN (role assumption)
aliyun configure --mode RamRoleArn
Method 2: Configure via Environment Variables
If you prefer not to use the Alibaba Cloud CLI, you can set standard environment variables directly:
export ALIBABA_CLOUD_ACCESS_KEY_ID="<YOUR-ACCESS-KEY-ID>"
export ALIBABA_CLOUD_ACCESS_KEY_SECRET="<YOUR-ACCESS-KEY-SECRET>"
This method is suitable for CI pipelines and temporary debugging. It is not recommended for production environments.
Method 3: Other Credential Sources
The Credentials SDK default chain also supports the following sources, in priority order from highest to lowest:
- Environment variables (ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET)
- Alibaba Cloud CLI configuration file (~/.aliyun/config.json)
- STS Token
- RAM Role (ECS or container instance metadata)
For local development environments where instance metadata lookup is not needed, set export ALIBABA_CLOUD_ECS_METADATA_DISABLED=true to avoid unnecessary timeout delays.
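To see which of the locally observable sources described above are present on a machine, a small report can be useful. The following is a sketch only: it inspects the environment and the CLI configuration file path, does not call the Credentials SDK, and does not validate that any credential actually works. The function name local_credential_report is hypothetical, and treating the value "true" case-insensitively is an assumption.

```python
import os
from pathlib import Path


def local_credential_report(env=None, config_path=None):
    """Report which locally visible credential sources appear to be configured."""
    env = os.environ if env is None else env
    # Default location of the Alibaba Cloud CLI configuration file.
    path = Path(config_path) if config_path else Path.home() / ".aliyun" / "config.json"
    has_env_keys = bool(
        env.get("ALIBABA_CLOUD_ACCESS_KEY_ID")
        and env.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET")
    )
    return {
        "env_vars": has_env_keys,
        "cli_config": path.is_file(),
        # Assumption: the flag is considered set when its value is "true" in any letter case.
        "metadata_lookup_disabled": env.get("ALIBABA_CLOUD_ECS_METADATA_DISABLED", "").lower() == "true",
    }


if __name__ == "__main__":
    for source, present in local_credential_report().items():
        print(f"{source}: {present}")
```

If both env_vars and cli_config are true, the priority order listed above means the environment variables take precedence.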
Invoke the STAROps Agent
After installation and configuration, describe your diagnostic needs in the Agent to trigger the Skill. The Agent automatically executes the following process:
- Checks that environment variables and the credential chain are ready.
- Calls CreateThread to create a session thread, returning a threadId and a link to the STAROps console for that thread.
- Calls CreateChat to send the user's question and subscribes to the SSE streaming response.
- Streams tool invocation status ([tool:started] / [tool:running] / [tool:done]) and diagnostic report fragments to stderr in real time.
- Outputs the final diagnostic conclusion to stdout, delimited by === STAROPS ANSWER BEGIN === and === STAROPS ANSWER END ===.
On first invocation, the Agent guides you through installing Python dependencies (pip3 install -r scripts/requirements.txt) and configuring environment variables.
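Because the final conclusion is wrapped in fixed delimiters on stdout, a downstream tool can isolate it with plain string handling. The following is a minimal sketch, assuming the captured stdout is already available as a string; extract_answer is a hypothetical helper, not part of the Skill.

```python
BEGIN = "=== STAROPS ANSWER BEGIN ==="
END = "=== STAROPS ANSWER END ==="


def extract_answer(stdout_text):
    """Return the text between the two delimiters, or None if either one is missing."""
    start = stdout_text.find(BEGIN)
    if start == -1:
        return None
    start += len(BEGIN)
    end = stdout_text.find(END, start)
    if end == -1:
        return None
    return stdout_text[start:end].strip()


if __name__ == "__main__":
    sample = f"progress...\n{BEGIN}\nThe workspace contains 49 APM services.\n{END}\n"
    print(extract_answer(sample))
```

Returning None when a delimiter is missing lets the caller distinguish an incomplete run from an empty answer.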
Prompt Best Practices
A single diagnostic session can take several minutes and trigger multiple internal tool calls. The quality of your prompt directly affects the quality of the diagnosis. Include the following information in your prompts:
- Target workspace and service name (or application, component, or APM service).
- A clear diagnostic intent, for example "analyze root cause," "list potential impact scope," or "provide mitigation recommendations."
- A time range, for example "last 30 minutes" or "2026-05-19 10:00 to 11:00 (Beijing time)."
- Any existing clues, such as alert content, TraceID, error keywords, or SLS Project/Logstore.
- The expected decision, for example "whether to scale up," "whether to roll back the release," or "whether to contact the upstream team."
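The elements listed above can be assembled into a prompt programmatically, which is handy when prompts are generated from alert payloads. The following is one possible sketch; the function name, parameter names, and sentence layout are illustrative choices, not a format required by STAROps.

```python
def build_diagnostic_prompt(service, intent, time_range, clues=(), decision=None, workspace=None):
    """Compose a prompt from target, intent, time range, clues, and expected decision."""
    target = f"service {service}"
    if workspace:
        target += f" in workspace {workspace}"
    parts = [f"{intent} for {target}.", f"Time range: {time_range}."]
    if clues:
        parts.append("Known clues: " + "; ".join(clues) + ".")
    if decision:
        parts.append(f"Decision needed: {decision}.")
    return " ".join(parts)


if __name__ == "__main__":
    print(
        build_diagnostic_prompt(
            service="inventory",
            intent="Analyze the root cause of 5xx errors",
            time_range="last 30 minutes",
            clues=["alert: checkout avg latency 3.6s"],
            decision="whether to roll back the release",
            workspace="rca-benchmark",
        )
    )
```

Optional elements are simply omitted when not supplied, so the same helper works for quick metadata queries and for full incident prompts.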
To maintain thread context continuity, follow these two rules:
- All follow-up questions in a single investigation must reuse the threadId returned from the first call. Creating a new thread discards all previous diagnostic evidence and intermediate conclusions.
- Always run the invocation script with --pipe. Without this flag, the output does not include the === STAROPS ANSWER BEGIN === delimiter, and downstream parsing becomes unreliable.
Scenario 1: Query APM Service Count
Prompt
How many APM services are in the current workspace?
Example Agent Output
The current workspace rca-benchmark contains 49 APM services.
Language distribution overview:
| Language | Count |
|---------------------------|---------------------------------------------|
| Java | Most (ad, fraud-detection, accounting, ...) |
| Go | Several (product-catalog, checkout, ...) |
| Node.js | Several (frontend, frontend-web, ...) |
| .NET | Several (cart, accounting, ...) |
| Python/Rust/PHP/Ruby/C++ | A few each |
These services cover a complete e-commerce and financial scenario, including frontend, product-catalog, cart, checkout, payment, shipping, fraud-detection, and other core microservices.
Scenario 2: View Service Distribution by Language
Prompt
Show me services grouped by programming language in the current workspace.
Example Agent Output
The current workspace rca-benchmark contains 49 APM services, grouped by programming language:
| Language | Count | Services |
|----------|-------|---------------------------------------------------------------------------|
| Java | 28 | ad, fraud-detection, inventory, loyalty, review, promotion, wishlist, ... |
| Golang | 3 | product-catalog, flagd, checkout |
| Node.js | 3 | frontend, payment, loadgenerator |
| .NET | 2 | cart, accounting |
| Python | 1 | recommendation |
| C++ | 1 | currency |
| Rust | 1 | shipping |
| PHP | 1 | quote |
| Ruby | 1 | email |
| WebJS | 1 | frontend-web |
| Unknown | 2 | image-provider, frontend-proxy |
A total of 10 programming languages are represented. Java services are the most common (28 services).
Scenario 3: Identify Service Error Root Cause
Prompt
Identify the root cause of errors in the inventory service.
Example Agent Output
## inventory Service 5xx Error Root Cause Analysis
### Conclusion
Root cause: The inventory service timed out (5 seconds) when calling the downstream notification service, causing cascading 5xx errors.
### Evidence Chain
#### 1. Trace Analysis
61 error traces were analyzed; all errors point to the same step:
| Error Span | Service | Duration | Diagnosis |
|-------------------------|--------------|------------|----------------|
| notification POST | notification | ~5000ms | error_and_slow |
| checkout orders publish | checkout | 1366-1927ms| slow |
Call path:
frontend-proxy -> frontend -> checkout -> inventory -> notification (5s timeout)
#### 2. Related Alerts
- [Health Rule] latency_avg_threshold_critical (CRITICAL): checkout avg latency 3.6s
- frontend slow response (CRITICAL): frontend POST avg response time > 1000ms
- Container memory > 85% (CRITICAL): kafka container memory 88.52%
#### 3. notification Service Resource Status
| Metric | Value | Status |
|--------------------|--------|---------|
| Pod status | Running| Normal |
| Memory usage/limit | 66.6% | Normal |
| Memory usage/request | 133.2% | Exceeds |
### Possible Causes
1. Kafka memory pressure (88.52%) may be delaying message processing in notification.
2. notification memory exceeds 133% of request value — likely triggering GC under load.
3. Connection pool exhaustion between inventory and notification.
### Mitigation
1. Check notification service logs and Kafka cluster status.
2. Scale up notification (increase resources.limits.memory).
3. Add a circuit breaker and appropriate timeout in inventory to prevent cascading failures.
Scenario 4: Multi-turn Investigation
The STAROps Skill supports multi-turn interactions. By continuing within the same thread, you can progressively narrow the investigation scope.
Turn 1 Prompt
Identify the root cause of errors in the inventory service.
Turn 2 Prompt (same thread)
Based on the notification service timeout identified earlier, check its error logs for the last 30 minutes to determine whether the issue is internal to notification or caused by its downstream Kafka.
Turn 3 Prompt (continue drilling down)
Kafka container memory usage is at 88.52%. Provide scale-up recommendations and a temporary mitigation plan.
Reusing the same threadId allows the STAROps Agent to reason from previously accumulated tool call results (metrics, traces, logs) without repeating data scans.
Data Security and Privacy
The STAROps Skill calls STAROps Digital Employees via Alibaba Cloud OpenAPI. The process follows these security principles:
- All requests are transmitted over HTTPS with ACS3-HMAC-SHA256 signing. Diagnostic data does not pass through any third-party services.
- Credential information (AccessKey, STS Token, RAM Role) is resolved through the Alibaba Cloud Credentials default chain and never appears in Agent conversations or script output.
- The Skill only creates diagnostic threads and sends conversation requests. It does not directly modify cloud resources such as ECS, OSS, RDS, SLS, or RAM. Any remediation actions recommended by the STAROps Agent must go through your normal change approval process.
- When the script returns an HTTP 401 or 403 error, the Skill stops immediately and reports the error. It does not retry with other credentials or fabricate diagnostic conclusions from prior knowledge.
Limits
| Limit | Description |
|---|---|
| Task duration | A single diagnostic task times out after 30 minutes by default. If no SSE events are received for an extended period, the Skill raises an idle error based on |
| Data sources | Diagnostic quality depends on the APM, SLS, and UModel data sources connected to the workspace. Missing or disconnected data sources cannot be analyzed. |
| Resource management | The Skill performs reasoning and diagnostics only. It does not directly execute resource changes on ECS, OSS, RDS, SLS, or RAM. |
| Runtime environment | Python 3 must be installed, and dependencies must be installed via |
FAQ
Do I need to create a thread in the console before using the Skill?
No. The Skill automatically creates a session thread via CreateThread on first invocation and prints the STAROPS_URL. You can use this URL to navigate directly to the STAROps console and view all messages and tool call records for that thread.
How do I configure Alibaba Cloud account credentials?
The recommended approach is to run aliyun configure to configure credentials via the Alibaba Cloud CLI. The Skill automatically reads ~/.aliyun/config.json. For detailed steps, see "Configure Credentials" above.
If you use other credential sources, the priority order from highest to lowest is:
- Environment variables: Set ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET. Suitable for CI pipelines and temporary debugging.
- Alibaba Cloud CLI configuration file (~/.aliyun/config.json): Recommended. Run aliyun configure to set it up.
- STS Token: Temporary credentials injected by the platform. Suitable for CI pipelines and sandbox environments.
- RAM Role: ECS or container workloads obtain credentials from instance metadata.
Can I use a custom endpoint?
Yes. Set the environment variable STAROPS_AGENT_ENDPOINT=<domain> to specify a dedicated or private network endpoint.
What should I do if the diagnostic results are not specific enough?
Consider the following improvements:
- Include key evidence in your prompt: service name, time range, TraceID, alert content, and error keywords.
- Verify that the workspace has APM, SLS, and UModel data sources connected. Missing data sources prevent the STAROps Agent from gathering evidence.
- Use --thread to continue multi-turn follow-up questions, allowing the STAROps Agent to drill down from existing conclusions instead of starting a new session.
- If STAROps returns (No assistant answer was returned.) or only a generic response, retry once using the same thread. If the issue persists, inform the user that STAROps did not return valid diagnostic data. Do not fabricate conclusions from prior knowledge.
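The retry guidance in the last item above (retry once on the same thread, then report that no valid answer came back) can be sketched as a small wrapper. Here ask is a hypothetical callable that takes a thread ID and a prompt and returns the answer text; it stands in for however your integration invokes the Skill and is not part of the Skill itself.

```python
NO_ANSWER = "(No assistant answer was returned.)"


def ask_with_one_retry(ask, thread_id, prompt):
    """Call ask on the given thread; retry once on the same thread if no usable answer."""
    for _ in range(2):  # first attempt plus exactly one retry
        answer = ask(thread_id, prompt)
        if answer and answer.strip() and answer.strip() != NO_ANSWER:
            return answer
    # Caller should report that no valid diagnostic data was returned.
    return None


if __name__ == "__main__":
    replies = iter([NO_ANSWER, "Root cause: notification timeout."])
    print(ask_with_one_retry(lambda thread, question: next(replies), "thread-1", "Why 5xx?"))
```

The wrapper never switches threads, so evidence gathered in the first attempt remains available to the retry.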
Troubleshooting
HTTP 401 Unauthorized
The credential chain did not resolve an identity with STAROps permissions.
Resolution:
- Verify that the Credentials default chain can resolve at least one of: STS Token, RAM Role, CLI profile, or instance metadata.
- Verify that the resolved identity's RAM policy includes starops:CreateThread and starops:CreateChat.
- If using an STS Token, confirm it has not expired and the assumed role includes the required permissions.
- When authentication fails, the Skill terminates immediately without retrying. Grant the required permissions before reissuing the request.
HTTP 404 Not Found
The Digital Employee name, workspace, or UID does not match the actual resources.
Resolution: Verify that STAROPS_AGENT_EMPLOYEE, STAROPS_AGENT_WORKSPACE, and STAROPS_AGENT_UID all correspond to the same set of actual resources. The UID must be the primary account UID that owns the workspace.
ConfigError: Missing required STAROps environment variables
One or more of STAROPS_AGENT_EMPLOYEE, STAROPS_AGENT_WORKSPACE, or STAROPS_AGENT_UID is not set or is empty.
Resolution: Run the pre-flight check script in SKILL.md to confirm all variables are set, then retry.
CredentialError
The Alibaba Cloud Credentials SDK did not find any available credential source.
Resolution: Run aliyun configure to configure credentials via the Alibaba Cloud CLI — the Skill automatically reads ~/.aliyun/config.json. You can also provide credentials via environment variables, STS Token, or RAM Role. For local development where instance metadata is not needed, set export ALIBABA_CLOUD_ECS_METADATA_DISABLED=true to reduce timeout delays.
Idle Timeout Error
No SSE events were received within the --idle-timeout window, indicating the STAROps Agent may be stuck.
Resolution: Retry once using the same --thread. For complex tasks expected to be silent for extended periods, increase --idle-timeout accordingly.
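The idea behind an idle timeout can be illustrated with a generic event consumer: if no event arrives within the window, the consumer gives up instead of waiting indefinitely. The following is a conceptual sketch using a standard-library queue, not the Skill's actual implementation; the IdleTimeout class and drain_events function are hypothetical names, and a None item is assumed to mark the end of the stream.

```python
import queue


class IdleTimeout(Exception):
    """Raised when no event arrives within the idle window."""


def drain_events(events, idle_timeout):
    """Yield items from a queue.Queue until a None sentinel; raise IdleTimeout if idle."""
    while True:
        try:
            item = events.get(timeout=idle_timeout)
        except queue.Empty:
            raise IdleTimeout(f"no event received for {idle_timeout} seconds") from None
        if item is None:  # assumed end-of-stream marker
            return
        yield item


if __name__ == "__main__":
    q = queue.Queue()
    for event in ("[tool:started]", "[tool:done]", None):
        q.put(event)
    print(list(drain_events(q, idle_timeout=1.0)))
```

Each received event resets the wait, so a long task that keeps emitting progress is never cut off by the idle window.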
ModuleNotFoundError
Python dependencies are not installed.
Resolution: Run pip3 install -r scripts/requirements.txt in the Skill root directory. The dependency file is in the scripts/ subdirectory, not the project root.
What Is a Digital Employee Skill?
In addition to invoking STAROps Digital Employees through the alibabacloud-starops-chat Skill in an AI Agent, you can add and manage custom Skills for Digital Employees directly in the STAROps console.
A Digital Employee Skill is a reusable instruction module that encapsulates domain knowledge and workflows into a standardized capability unit. The Digital Employee follows the predefined process in a Skill to perform specific tasks.
| Feature | Description |
|---|---|
| Progressive loading | The Digital Employee loads a Skill's full content only when needed, conserving context space. |
| Knowledge reuse | Encapsulates domain expertise into reusable modules to ensure consistent execution. |
| On-demand triggering | Automatically matches and activates relevant Skills based on the conversation content. |
| Easy to maintain | Based on the Markdown file format, so Skills can be created and modified without programming. |
Key Concepts
| Term | Description |
|---|---|
| Skill | A folder containing instructions, scripts, and resources that the Digital Employee can dynamically load to perform a specific task. |
| SKILL.md | The Skill definition file: a Markdown file containing metadata and execution instructions. It is the core component of every Skill. |
| Frontmatter | YAML-format configuration at the top of a SKILL.md file that defines the Skill's name, description, and other basic information. |
| Trigger condition | The method by which a Skill is activated. Three trigger types are supported: automatic matching based on conversation content, explicit mention of the Skill name in a conversation (for example, "Run a cluster health inspection using the k8s-cluster-health-inspection Skill"), and selection via the |
Skill Loading Mechanism
Skill loading proceeds in three phases:
- Discovery phase: When the Digital Employee starts, it loads only the name and description of each Skill, not the full content. This allows the Digital Employee to manage a large number of Skills simultaneously without consuming excessive context space.
- Activation phase: When a user's task matches a Skill's description, the system loads the complete SKILL.md instruction content for that Skill in preparation for execution.
- Execution phase: The Digital Employee follows the instructions step by step, loading reference files and script resources as needed. The Skill context is released after execution completes.
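The discovery phase described above only needs the frontmatter, not the instruction body. The following is a minimal sketch of reading flat key/value pairs from the leading frontmatter block of a SKILL.md file. It is an illustration rather than the loader STAROps uses: it handles only simple "key: value" lines, whereas a full YAML parser supports much more. The sample description text is invented for the example.

```python
def read_frontmatter(skill_md_text):
    """Return flat key/value pairs from the leading '---' delimited block, or {}."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return meta  # closing delimiter reached
        # Only top-level "key: value" lines; skip indented lines and comments.
        if ":" in line and not line.startswith((" ", "\t", "#")):
            key, _, value = line.partition(":")
            meta[key.strip()] = value.strip().strip("\"'")
    return {}  # no closing delimiter, so treat as absent


SAMPLE = """---
name: k8s-cluster-health-inspection
description: Inspect cluster health and summarize abnormal nodes.
---
# Steps
1. Collect node status.
"""

if __name__ == "__main__":
    meta = read_frontmatter(SAMPLE)
    # Discovery keeps only these two fields; the body is loaded later on activation.
    print(meta.get("name"), "|", meta.get("description"))
```

Keeping discovery limited to these two fields is what allows many Skills to be registered without loading their full instructions.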
Add a Skill
Prerequisites
- At least one Digital Employee has been created.
- Your account has been granted the cms:CreateDigitalEmployeeSkill permission.
Procedure
- Log on to the STAROps console.
- In the left navigation pane, click Digital Employees.
- In the Digital Employees list, click the target Digital Employee to open its details page.
- Click the Skill Management tab, then click Add Skill.
- Configure the Skill parameters. Choose one of the following methods:
  - Direct entry: Fill in the following parameters in the console:

    | Parameter | Description |
    |---|---|
    | Skill name | The unique identifier for the Skill, used for internal system references and triggering. Use lowercase letters and hyphens, for example k8s-cluster-health-inspection. |
    | Display name | The name shown for the Skill in the console, for example "K8s Cluster Health Inspection." |
    | Description | A functional description that helps the Digital Employee determine when to activate this Skill. A more precise description improves automatic matching accuracy. |
    | Skill design | The execution instructions for the Skill, that is, the content of SKILL.md. Defines the specific steps and constraints the Digital Employee follows after loading this Skill. |

  - Upload: Upload a local Skill folder. The folder must contain a SKILL.md file along with any required scripts and reference resources. After uploading, the system automatically parses the folder contents and switches to the direct entry view. Confirm the parameters and click Add Now to complete creation.
- Click Add Now.
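The Skill name rule in the procedure above (lowercase letters and hyphens) can be checked locally before you submit the form. The following is a sketch under two assumptions that the console may not share: digits are accepted because the documented example k8s-cluster-health-inspection contains one, and leading, trailing, or doubled hyphens are rejected. The helper name is hypothetical.

```python
import re

# Assumption: digits are allowed because the documented example contains "8".
# Assumption: hyphens may only separate non-empty segments.
SKILL_NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def is_valid_skill_name(name):
    """Return True if the name uses only lowercase letters, digits, and single hyphens."""
    return SKILL_NAME_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("k8s-cluster-health-inspection", "K8s Cluster", "-bad", "a--b"):
        print(candidate, is_valid_skill_name(candidate))
```

If the console accepts or rejects a name differently, treat the console's validation as authoritative.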
Manage Existing Skills
On the Skill Management tab, you can perform the following operations on added Skills:
| Operation | Description |
|---|---|
| Edit | Modify the Skill's name, description, or execution instructions. |
| Delete | Remove Skills that are no longer needed. |
Deleted Skills cannot be recovered. Before deleting, confirm that the Skill is not referenced by any active conversations of a Digital Employee.