STAROps Skill Integration


The alibabacloud-starops-chat Agent Skill lets you invoke STAROps Digital Employees for AIOps diagnostics. You can also add and manage custom Skills for Digital Employees in the STAROps console.

Use Cases

| Scenario | Description | Example Prompt |
|----------|-------------|----------------|
| Service error root cause analysis | Analyzes root causes of errors for a specified service using multi-step reasoning across traces, logs, and metrics. | Identify the root cause of errors in the inventory service. |
| Workspace and service queries | Retrieves service lists, counts, language distributions, and other metadata for the current workspace. | How many APM services are in the current workspace? |
| APM metrics analysis | Analyzes request volume, error rates, latency, and other APM metrics with sorting and Top-N ranking by dimension. | Which service has the highest request volume? |
| Service topology and classification | Displays programming languages, upstream and downstream dependencies, and resource states in the service topology. | Show me services grouped by programming language in the current workspace. |
| Multi-turn investigation | Continues follow-up questions within the same thread based on prior diagnostic conclusions to progressively narrow the scope. | Based on the notification service timeout identified earlier, check its error logs. |

Prerequisites

  • STAROps is activated on your Alibaba Cloud account.

  • The Digital Employee has access to APM, SLS, UModel, and other data sources. Without connected data sources, diagnostics cannot produce meaningful conclusions.

  • You have Alibaba Cloud account credentials with access to the target workspace and the following RAM permissions:

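    | API Name     | Action               | Resource                                                   |
    |--------------|----------------------|------------------------------------------------------------|
    | CreateThread | starops:CreateThread | acs:starops:<region>:<uid>:digitalemployee/<employee_name> |
    | CreateChat   | starops:CreateChat   | acs:starops:<region>:<uid>:digitalemployee/<employee_name> |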

  • Alibaba Cloud CLI is installed and credentials are configured (recommended). The Skill resolves credentials through the Alibaba Cloud Credentials SDK default chain, which automatically reads ~/.aliyun/config.json. No additional configuration is needed.

    Warning

    To prevent credential leakage, do not paste your AccessKey ID or AccessKey Secret into Agent conversations. Use the Alibaba Cloud CLI configuration file to manage credentials — the Skill reads them automatically.

  • Python 3 is installed on the local machine to run the Skill's built-in diagnostic scripts.
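The RAM permissions listed in the prerequisites above could be granted through a custom policy. The following is a minimal sketch that assembles such a policy document. The helper name build_starops_policy and the sample values are illustrative assumptions, and the layout assumes the standard RAM policy structure (Version, Statement, Effect, Action, Resource).

```python
import json


def build_starops_policy(region: str, uid: str, employee_name: str) -> dict:
    """Return a RAM policy document that allows CreateThread and CreateChat.

    build_starops_policy is a hypothetical helper, not part of the Skill.
    """
    resource = f"acs:starops:{region}:{uid}:digitalemployee/{employee_name}"
    return {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["starops:CreateThread", "starops:CreateChat"],
                "Resource": [resource],
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_starops_policy("cn-beijing", "1234567890123456", "my-employee")
print(json.dumps(policy, indent=2))
```

Scoping the Resource to a single Digital Employee keeps the grant narrow; whether a wildcard is acceptable depends on your own access policy.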

Supported Agents

The alibabacloud-starops-chat Skill follows an open Skill specification and works with all major coding agents, including Qwen Code, Claude Code, Codex, Qoder, OpenClaw, and others.

Any custom Agent that supports the Skill specification can also use this Skill. The specification requires the Agent to have the following capabilities:

  1. Parse the SKILL.md description file to retrieve the Skill's metadata, instructions, and tool definitions.

  2. Support running Skill built-in scripts via Bash tool calls.

  3. Inject environment variables and credentials as declared in SKILL.md.

Custom Agents that meet these requirements (such as agents built on LangChain, AutoGen, or Dify) can load this Skill by placing the Skill files in a recognized skills directory.

Install the Skill

The alibabacloud-starops-chat Skill is published on Alibaba Cloud Skills and ClawHub. The following installation methods are supported.

Method 1 (Recommended): Install via npx

The npx command is bundled with Node.js. Before installing, confirm that your local environment is ready:

node -v
npx -v

If the terminal reports that node or npx is not found, download and install Node.js from the Node.js website.

Run the following command to install the alibabacloud-starops-chat Skill:

npx skills add aliyun/alibabacloud-aiops-skills --skill alibabacloud-starops-chat

After installation, confirm that the alibabacloud-starops-chat directory exists in your skills directory, then restart the Agent to activate the Skill.

Method 2: Install Manually

Download the alibabacloud-starops-chat package from the GitHub Release page. Extract the archive and copy the files to the skills directory of your Agent.

After copying, confirm that the alibabacloud-starops-chat directory exists in your skills directory, then restart the Agent to load the Skill.

The skills installation directories for common Agents are listed below.

| Agent       | Project-level Directory | User-level Directory |
|-------------|-------------------------|----------------------|
| Claude Code | .claude/skills          | ~/.claude/skills     |
| Codex       | .agents/skills          | ~/.agents/skills     |
| Qoder       | .qoder/skills           | ~/.qoder/skills      |
| Qwen Code   | .qwen/skills            | ~/.qwen/skills       |
| OpenClaw    | .openclaw/skills        | ~/.openclaw/skills   |
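To confirm that the Skill directory exists after installation, a check along the following lines may help. This is a sketch only: find_installed_skill is a hypothetical helper, and the directory names are copied from the list above.

```python
from pathlib import Path
from typing import List

SKILL_NAME = "alibabacloud-starops-chat"

# (project-level, user-level) skills directories for each Agent.
SKILL_DIRS = {
    "Claude Code": (".claude/skills", "~/.claude/skills"),
    "Codex": (".agents/skills", "~/.agents/skills"),
    "Qoder": (".qoder/skills", "~/.qoder/skills"),
    "Qwen Code": (".qwen/skills", "~/.qwen/skills"),
    "OpenClaw": (".openclaw/skills", "~/.openclaw/skills"),
}


def find_installed_skill(agent: str, project_root: Path) -> List[Path]:
    """Return every location where the Skill directory exists for an Agent."""
    project_dir, user_dir = SKILL_DIRS[agent]
    candidates = [
        project_root / project_dir / SKILL_NAME,
        Path(user_dir).expanduser() / SKILL_NAME,
    ]
    return [path for path in candidates if path.is_dir()]
```

An empty result for your Agent suggests the installation did not land in a recognized skills directory.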

Configure Environment Variables

The Skill uses the following environment variables to locate the target Digital Employee and workspace. If your platform does not inject them automatically, set them manually before invoking the Skill:

| Variable                | Required | Description                                       | How to Obtain                                               |
|-------------------------|----------|---------------------------------------------------|-------------------------------------------------------------|
| STAROPS_AGENT_EMPLOYEE  | Yes      | Digital Employee name                             | STAROps console > Digital Employees > Digital Employee name |
| STAROPS_AGENT_WORKSPACE | Yes      | Workspace identifier                              | CMS 2.0 console > Select Workspace                          |
| STAROPS_AGENT_UID       | Yes      | Alibaba Cloud account UID that owns the workspace | Alibaba Cloud console > Account Management > Account ID     |
| STAROPS_AGENT_ENDPOINT  | No       | Custom endpoint                                   | Default: starops.cn-beijing.aliyuncs.com                    |
| STAROPS_AGENT_REGION    | No       | Region                                            | Default: cn-beijing                                         |

export STAROPS_AGENT_EMPLOYEE="<Digital Employee name>"
export STAROPS_AGENT_WORKSPACE="<workspace identifier>"
export STAROPS_AGENT_UID="<Alibaba Cloud account UID>"
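Before invoking the Skill, the variables above can be verified with a short check. The sketch below is not the pre-flight script shipped in SKILL.md; load_starops_config is a hypothetical helper that applies the documented defaults for the two optional variables.

```python
import os

REQUIRED_VARS = (
    "STAROPS_AGENT_EMPLOYEE",
    "STAROPS_AGENT_WORKSPACE",
    "STAROPS_AGENT_UID",
)
# Documented defaults for the optional variables.
DEFAULTS = {
    "STAROPS_AGENT_ENDPOINT": "starops.cn-beijing.aliyuncs.com",
    "STAROPS_AGENT_REGION": "cn-beijing",
}


def load_starops_config(env=None) -> dict:
    """Collect the STAROps variables, failing early if a required one is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name, "").strip()]
    if missing:
        raise ValueError(
            "Missing required STAROps environment variables: " + ", ".join(missing)
        )
    config = {name: env[name] for name in REQUIRED_VARS}
    for name, default in DEFAULTS.items():
        # An unset or empty optional variable falls back to its default.
        config[name] = env.get(name) or default
    return config
```

Passing a plain dictionary as env makes the check easy to exercise without touching the real environment.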

Configure Credentials

The Skill resolves credentials through the Alibaba Cloud Credentials SDK default chain. No Skill-specific AccessKey variables are required. We recommend configuring credentials via the Alibaba Cloud CLI — the Skill reads them automatically.

Method 1 (Recommended): Configure via Alibaba Cloud CLI

If the Alibaba Cloud CLI is not yet installed, refer to the Alibaba Cloud CLI installation guide to install it. Then run the following command to configure credentials:

aliyun configure

Follow the prompts to enter your AccessKey ID, AccessKey Secret, and default Region ID. After configuration, credentials are saved in ~/.aliyun/config.json, which the Skill reads automatically.

To verify that the CLI configuration is working, run:

aliyun sts GetCallerIdentity

If the command returns your account UID and identity information, the credentials are configured correctly.

The Alibaba Cloud CLI supports multiple credential modes via the --mode parameter:

# AK mode (default)
aliyun configure --mode AK

# STS Token mode (temporary credentials)
aliyun configure --mode StsToken

# RAM Role (ECS instance role)
aliyun configure --mode EcsRamRole

# RAM Role ARN (role assumption)
aliyun configure --mode RamRoleArn

Method 2: Configure via Environment Variables

If you prefer not to use the Alibaba Cloud CLI, you can set standard environment variables directly:

export ALIBABA_CLOUD_ACCESS_KEY_ID="<YOUR-ACCESS-KEY-ID>"
export ALIBABA_CLOUD_ACCESS_KEY_SECRET="<YOUR-ACCESS-KEY-SECRET>"

This method is suitable for CI pipelines and temporary debugging. It is not recommended for production environments.

Method 3: Other Credential Sources

The Credentials SDK default chain checks the following sources, including the two methods above, in priority order from highest to lowest:

  1. Environment variables (ALIBABA_CLOUD_ACCESS_KEY_ID / ALIBABA_CLOUD_ACCESS_KEY_SECRET)

  2. Alibaba Cloud CLI configuration file (~/.aliyun/config.json)

  3. STS Token

  4. RAM Role (ECS or container instance metadata)

For local development environments where instance metadata lookup is not needed, set export ALIBABA_CLOUD_ECS_METADATA_DISABLED=true to avoid unnecessary timeout delays.
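To see which of these sources are visible on a local machine, a rough check such as the following can be used. It is a sketch under stated assumptions: it does not call the Credentials SDK, it only inspects the environment variables and the CLI configuration file named above, and it does not probe platform-injected STS Tokens. The function name is hypothetical.

```python
import os
from pathlib import Path
from typing import List


def locally_visible_credential_sources(env=None, home=None) -> List[str]:
    """List credential sources that can be detected without calling the SDK.

    The order follows the priority list above. This is a diagnostic aid only;
    the actual resolution is performed by the Credentials SDK.
    """
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    sources = []
    if env.get("ALIBABA_CLOUD_ACCESS_KEY_ID") and env.get("ALIBABA_CLOUD_ACCESS_KEY_SECRET"):
        sources.append("environment variables")
    if (home / ".aliyun" / "config.json").is_file():
        sources.append("CLI configuration file")
    if env.get("ALIBABA_CLOUD_ECS_METADATA_DISABLED", "").lower() != "true":
        sources.append("instance metadata lookup (not disabled)")
    return sources
```

If the list contains only the instance metadata entry on a laptop, the chain is likely to wait on a metadata lookup before failing, which is the delay that ALIBABA_CLOUD_ECS_METADATA_DISABLED=true avoids.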

Invoke the STAROps Agent

After installation and configuration, describe your diagnostic needs in the Agent to trigger the Skill. The Agent automatically executes the following process:

  1. Checks that environment variables and the credential chain are ready.

  2. Calls CreateThread to create a session thread, returning a threadId and a link to the STAROps console for that thread.

  3. Calls CreateChat to send the user's question and subscribes to the SSE streaming response.

  4. Streams tool invocation status ([tool:started] / [tool:running] / [tool:done]) and diagnostic report fragments to stderr in real time.

  5. Outputs the final diagnostic conclusion to stdout, delimited by === STAROPS ANSWER BEGIN === and === STAROPS ANSWER END ===.

On first invocation, the Agent guides you through installing Python dependencies (pip3 install -r scripts/requirements.txt) and configuring environment variables.
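Because the final conclusion in step 5 is wrapped in fixed delimiters, downstream tooling can recover it from the captured stdout. The following is a minimal sketch; extract_answer is a hypothetical helper, and only the two delimiter strings come from the Skill's documented output.

```python
from typing import Optional

BEGIN = "=== STAROPS ANSWER BEGIN ==="
END = "=== STAROPS ANSWER END ==="


def extract_answer(stdout_text: str) -> Optional[str]:
    """Return the text between the delimiters, or None if either one is missing."""
    start = stdout_text.find(BEGIN)
    if start == -1:
        return None
    start += len(BEGIN)
    end = stdout_text.find(END, start)
    if end == -1:
        return None
    return stdout_text[start:end].strip()
```

Returning None rather than the raw text makes a missing delimiter an explicit condition that the caller has to handle.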

Prompt Best Practices

A single diagnostic session can take several minutes and trigger multiple internal tool calls. The quality of your prompt directly affects the quality of the diagnosis. Include the following information in your prompts:

  • Target workspace and service name (or application, component, or APM service).

  • A clear diagnostic intent, for example "analyze root cause," "list potential impact scope," or "provide mitigation recommendations."

  • A time range, for example "last 30 minutes" or "2026-05-19 10:00 to 11:00 (Beijing time)."

  • Any existing clues, such as alert content, TraceID, error keywords, or SLS Project/Logstore.

  • The expected decision, for example "whether to scale up," "whether to roll back the release," or "whether to contact the upstream team."
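The elements listed above can be assembled programmatically when prompts are generated from alerts or tickets. The sketch below is one possible arrangement; build_diagnostic_prompt and its parameter names are assumptions made for illustration, and the wording of the generated sentences is not prescribed by STAROps.

```python
from typing import Optional, Sequence


def build_diagnostic_prompt(
    service: str,
    intent: str,
    time_range: str,
    workspace: Optional[str] = None,
    clues: Sequence[str] = (),
    decision: Optional[str] = None,
) -> str:
    """Join the recommended prompt elements into a single request."""
    target = f"service {service}"
    if workspace:
        target += f" in workspace {workspace}"
    sentences = [f"For {target}, {intent} over {time_range}."]
    if clues:
        sentences.append("Known clues: " + "; ".join(clues) + ".")
    if decision:
        sentences.append(f"I need to decide {decision}.")
    return " ".join(sentences)


prompt = build_diagnostic_prompt(
    service="inventory",
    intent="analyze the root cause of 5xx errors",
    time_range="the last 30 minutes",
    workspace="rca-benchmark",
    clues=["notification POST spans take about 5000 ms"],
    decision="whether to roll back the release",
)
print(prompt)
```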

To maintain thread context continuity, follow these two rules:

  • All follow-up questions in a single investigation must reuse the threadId returned from the first call. Creating a new thread discards all previous diagnostic evidence and intermediate conclusions.

  • Always run the invocation script with --pipe. Without this flag, the output does not include the === STAROPS ANSWER BEGIN === delimiter, and downstream parsing becomes unreliable.

Scenario 1: Query APM Service Count

Prompt

How many APM services are in the current workspace?

Example Agent Output

The current workspace rca-benchmark contains 49 APM services.

Language distribution overview:

| Language                   | Count                                       |
|---------------------------|---------------------------------------------|
| Java                      | Most (ad, fraud-detection, accounting, ...) |
| Go                        | Several (product-catalog, checkout, ...)    |
| Node.js                   | Several (frontend, frontend-web, ...)       |
| .NET                      | Several (cart, accounting, ...)             |
| Python/Rust/PHP/Ruby/C++  | A few each                                  |

These services cover a complete e-commerce and financial scenario, including frontend, product-catalog, cart, checkout, payment, shipping, fraud-detection, and other core microservices.

Scenario 2: View Service Distribution by Language

Prompt

Show me services grouped by programming language in the current workspace.

Example Agent Output

The current workspace rca-benchmark contains 49 APM services, grouped by programming language:

| Language  | Count | Services                                                                  |
|----------|-------|---------------------------------------------------------------------------|
| Java     | 28    | ad, fraud-detection, inventory, loyalty, review, promotion, wishlist, ... |
| Golang   | 3     | product-catalog, flagd, checkout                                          |
| Node.js  | 3     | frontend, payment, loadgenerator                                          |
| .NET     | 2     | cart, accounting                                                          |
| Python   | 1     | recommendation                                                            |
| C++      | 1     | currency                                                                  |
| Rust     | 1     | shipping                                                                  |
| PHP      | 1     | quote                                                                     |
| Ruby     | 1     | email                                                                     |
| WebJS    | 1     | frontend-web                                                              |
| Unknown  | 2     | image-provider, frontend-proxy                                            |

A total of 10 programming languages are represented. Java services are the most common (28 services).

Scenario 3: Identify Service Error Root Cause

Prompt

Identify the root cause of errors in the inventory service.

Example Agent Output

## inventory Service 5xx Error Root Cause Analysis

### Conclusion

Root cause: The inventory service timed out (5 seconds) when calling the downstream notification service, causing cascading 5xx errors.

### Evidence Chain

#### 1. Trace Analysis

61 error traces were analyzed; all errors point to the same step:

| Error Span               | Service      | Duration   | Diagnosis      |
|-------------------------|--------------|------------|----------------|
| notification POST        | notification | ~5000ms    | error_and_slow |
| checkout orders publish  | checkout     | 1366-1927ms| slow           |

Call path:
frontend-proxy -> frontend -> checkout -> inventory -> notification (5s timeout)

#### 2. Related Alerts

- [Health Rule] latency_avg_threshold_critical (CRITICAL): checkout avg latency 3.6s
- frontend slow response (CRITICAL): frontend POST avg response time > 1000ms
- Container memory > 85% (CRITICAL): kafka container memory 88.52%

#### 3. notification Service Resource Status

| Metric               | Value  | Status  |
|--------------------|--------|---------|
| Pod status           | Running| Normal  |
| Memory usage/limit   | 66.6%  | Normal  |
| Memory usage/request | 133.2% | Exceeds |

### Possible Causes

1. Kafka memory pressure (88.52%) may be delaying message processing in notification.
2. notification memory exceeds 133% of request value — likely triggering GC under load.
3. Connection pool exhaustion between inventory and notification.

### Mitigation

1. Check notification service logs and Kafka cluster status.
2. Scale up notification (increase resources.limits.memory).
3. Add a circuit breaker and appropriate timeout in inventory to prevent cascading failures.

Scenario 4: Multi-turn Investigation

The STAROps Skill supports multi-turn interactions. By continuing within the same thread, you can progressively narrow the investigation scope.

Turn 1 Prompt

Identify the root cause of errors in the inventory service.

Turn 2 Prompt (same thread)

Based on the notification service timeout identified earlier, check its error logs for the last 30 minutes to determine whether the issue is internal to notification or caused by its downstream Kafka.

Turn 3 Prompt (continue drilling down)

Kafka container memory usage is at 88.52%. Provide scale-up recommendations and a temporary mitigation plan.

Reusing the same threadId allows the STAROps Agent to reason from previously accumulated tool call results (metrics, traces, logs) without repeating data scans.
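When scripting a multi-turn investigation, the threadId from the first turn has to be carried into every later call. The following sketch keeps that state in a small object. InvestigationSession is hypothetical, and it assumes that the invocation script accepts the thread as `--thread <threadId>`; only the flag names --thread and --pipe come from this document.

```python
from typing import List, Optional


class InvestigationSession:
    """Remember the threadId from the first turn and reuse it afterwards."""

    def __init__(self) -> None:
        self.thread_id: Optional[str] = None

    def flags_for_next_turn(self) -> List[str]:
        """Return the flags to pass to the invocation script for the next turn."""
        flags = ["--pipe"]  # always requested so the answer delimiters are emitted
        if self.thread_id is not None:
            # Assumed form: the flag is followed by the threadId value.
            flags += ["--thread", self.thread_id]
        return flags

    def record_thread(self, thread_id: str) -> None:
        """Store the threadId returned by the first call; later values are ignored."""
        if self.thread_id is None:
            self.thread_id = thread_id
```

Ignoring later thread IDs is a deliberate choice in this sketch: it prevents an accidental new thread from replacing the one that holds the accumulated evidence.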

Data Security and Privacy

The STAROps Skill calls STAROps Digital Employees via Alibaba Cloud OpenAPI. The process follows these security principles:

  • All requests are transmitted over HTTPS with ACS3-HMAC-SHA256 signing. Diagnostic data does not pass through any third-party services.

  • Credential information (AccessKey, STS Token, RAM Role) is resolved through the Alibaba Cloud Credentials default chain and never appears in Agent conversations or script output.

  • The Skill only creates diagnostic threads and sends conversation requests. It does not directly modify cloud resources such as ECS, OSS, RDS, SLS, or RAM. Any remediation actions recommended by the STAROps Agent must go through your normal change approval process.

  • When the script returns an HTTP 401 or 403 error, the Skill stops immediately and reports the error. It does not retry with other credentials or fabricate diagnostic conclusions from prior knowledge.

Limits

| Limit | Description |
|-------|-------------|
| Task duration | A single diagnostic task times out after 30 minutes by default. If no SSE events arrive within the --idle-timeout window (default: 60 seconds), the Skill raises an idle timeout error. |
| Data sources | Diagnostic quality depends on the APM, SLS, and UModel data sources connected to the workspace. Missing or disconnected data sources cannot be analyzed. |
| Resource management | The Skill performs reasoning and diagnostics only. It does not directly execute resource changes on ECS, OSS, RDS, SLS, or RAM. |
| Runtime environment | Python 3 must be installed, and dependencies must be installed via pip3 install -r scripts/requirements.txt. |

FAQ

Do I need to create a thread in the console before using the Skill?

No. The Skill automatically creates a session thread via CreateThread on first invocation and prints the STAROPS_URL. You can use this URL to navigate directly to the STAROps console and view all messages and tool call records for that thread.

How do I configure Alibaba Cloud account credentials?

The recommended approach is to run aliyun configure to configure credentials via the Alibaba Cloud CLI. The Skill automatically reads ~/.aliyun/config.json. For detailed steps, see "Configure Credentials" above.

If you use other credential sources, the priority order from highest to lowest is:

  1. Environment variables: Set ALIBABA_CLOUD_ACCESS_KEY_ID and ALIBABA_CLOUD_ACCESS_KEY_SECRET. Suitable for CI pipelines and temporary debugging.

  2. Alibaba Cloud CLI configuration file (~/.aliyun/config.json): Recommended. Run aliyun configure to set it up.

  3. STS Token: Temporary credentials injected by the platform. Suitable for CI pipelines and sandbox environments.

  4. RAM Role: ECS or container workloads obtain credentials from instance metadata.

Can I use a custom endpoint?

Yes. Set the environment variable STAROPS_AGENT_ENDPOINT=<domain> to specify a dedicated or private network endpoint.

What should I do if the diagnostic results are not specific enough?

Consider the following improvements:

  • Include key evidence in your prompt: service name, time range, TraceID, alert content, and error keywords.

  • Verify that the workspace has APM, SLS, and UModel data sources connected. Missing data sources prevent the STAROps Agent from gathering evidence.

  • Use --thread to continue multi-turn follow-up questions, allowing the STAROps Agent to drill down from existing conclusions instead of starting a new session.

  • If STAROps returns (No assistant answer was returned.) or only a generic response, retry once using the same thread. If the issue persists, treat it as STAROps not having returned valid diagnostic data. The Agent should report this to you rather than fabricate conclusions from prior knowledge.
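The retry-once rule for empty answers can be sketched as follows. ask_with_one_retry is a hypothetical helper, and ask stands for whatever callable sends the question on a given thread and returns the extracted answer text. Detecting a "generic" response is left out because it has no fixed marker.

```python
from typing import Callable, Optional

EMPTY_ANSWER = "(No assistant answer was returned.)"


def ask_with_one_retry(ask: Callable[[str], str], thread_id: str) -> Optional[str]:
    """Call ask(thread_id) and retry once on an empty answer, reusing the thread."""
    for _ in range(2):  # first attempt plus a single retry
        answer = ask(thread_id).strip()
        if answer and answer != EMPTY_ANSWER:
            return answer
    # Both attempts were empty: report that no valid diagnostic data came back.
    return None
```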

Troubleshooting

HTTP 401 Unauthorized

The credential chain did not resolve an identity with STAROps permissions.

Resolution:

  • Verify that the Credentials default chain can resolve at least one source: environment variables, the CLI configuration file (~/.aliyun/config.json), an STS Token, or a RAM Role obtained from instance metadata.

  • Verify that the resolved identity's RAM policy includes starops:CreateThread and starops:CreateChat.

  • If using an STS Token, confirm it has not expired and the assumed role includes the required permissions.

  • When authentication fails, the Skill terminates immediately without retrying. Grant the required permissions before reissuing the request.

HTTP 404 Not Found

The Digital Employee name, workspace, or UID does not match the actual resources.

Resolution: Verify that STAROPS_AGENT_EMPLOYEE, STAROPS_AGENT_WORKSPACE, and STAROPS_AGENT_UID all correspond to the same set of actual resources. The UID must be the primary account UID that owns the workspace.

ConfigError: Missing required STAROps environment variables

One or more of STAROPS_AGENT_EMPLOYEE, STAROPS_AGENT_WORKSPACE, or STAROPS_AGENT_UID is not set or is empty.

Resolution: Run the pre-flight check script in SKILL.md to confirm all variables are set, then retry.

CredentialError

The Alibaba Cloud Credentials SDK did not find any available credential source.

Resolution: Run aliyun configure to configure credentials via the Alibaba Cloud CLI — the Skill automatically reads ~/.aliyun/config.json. You can also provide credentials via environment variables, STS Token, or RAM Role. For local development where instance metadata is not needed, set export ALIBABA_CLOUD_ECS_METADATA_DISABLED=true to reduce timeout delays.

Idle Timeout Error

No SSE events were received within the --idle-timeout window, indicating the STAROps Agent may be stuck.

Resolution: Retry once using the same --thread. For complex tasks expected to be silent for extended periods, increase --idle-timeout accordingly.
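The relationship between the idle timeout and the overall task timeout can be illustrated with a small watchdog. This is a sketch of the general technique, not the Skill's implementation; StreamWatchdog is hypothetical, and the defaults mirror the documented values of 60 seconds and 30 minutes.

```python
import time


class StreamWatchdog:
    """Track SSE activity and raise when the idle or task limit is exceeded."""

    def __init__(self, idle_timeout=60.0, task_timeout=30 * 60.0, clock=time.monotonic):
        self._clock = clock
        self._idle_timeout = idle_timeout
        self._task_timeout = task_timeout
        self._started = clock()
        self._last_event = self._started

    def on_event(self) -> None:
        """Record that an SSE event has just been received."""
        self._last_event = self._clock()

    def check(self) -> None:
        """Raise TimeoutError if either limit has been exceeded."""
        now = self._clock()
        if now - self._started > self._task_timeout:
            raise TimeoutError("task timeout exceeded")
        if now - self._last_event > self._idle_timeout:
            raise TimeoutError("idle timeout: no SSE events received")
```

Each received event resets only the idle window, so a chatty but very long task still ends at the task limit, while a silent one ends after the idle limit.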

ModuleNotFoundError

Python dependencies are not installed.

Resolution: Run pip3 install -r scripts/requirements.txt in the Skill root directory. The dependency file is in the scripts/ subdirectory, not the project root.

What Is a Digital Employee Skill?

In addition to invoking STAROps Digital Employees through the alibabacloud-starops-chat Skill in an AI Agent, you can add and manage custom Skills for Digital Employees directly in the STAROps console.

A Digital Employee Skill is a reusable instruction module that encapsulates domain knowledge and workflows into a standardized capability unit. The Digital Employee follows the predefined process in a Skill to perform specific tasks.

| Feature | Description |
|---------|-------------|
| Progressive loading | The Digital Employee loads a Skill's full content only when needed, conserving context space. |
| Knowledge reuse | Encapsulates domain expertise into reusable modules to ensure consistent execution. |
| On-demand triggering | Automatically matches and activates relevant Skills based on the conversation content. |
| Easy to maintain | Skills are plain Markdown files, so they can be created and modified without programming. |

Key Concepts

| Term | Description |
|------|-------------|
| Skill | A folder containing instructions, scripts, and resources that the Digital Employee can dynamically load to perform a specific task. |
| SKILL.md | The Skill definition file: a Markdown file containing metadata and execution instructions. It is the core component of every Skill. |
| Frontmatter | YAML-format configuration at the top of a SKILL.md file that defines the Skill's name, description, and other basic information. |
| Trigger condition | The method by which a Skill is activated. Three trigger types are supported: automatic matching based on conversation content, explicit mention of the Skill name in a conversation (for example, "Run a cluster health inspection using the k8s-cluster-health-inspection Skill"), and selection via the /skill command. |

Skill Loading Mechanism

Skill loading proceeds in three phases:

  1. Discovery phase: When the Digital Employee starts, it loads only the name and description of each Skill — not the full content. This allows the Digital Employee to manage a large number of Skills simultaneously without consuming excessive context space.

  2. Activation phase: When a user's task matches a Skill's description, the system loads the complete SKILL.md instruction content for that Skill in preparation for execution.

  3. Execution phase: The Digital Employee follows the instructions step by step, loading reference files and script resources as needed. The Skill context is released after execution completes.
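The three phases above can be illustrated with a small registry that keeps only names and descriptions until a Skill is activated. This is a conceptual sketch, not how STAROps implements loading. SkillRegistry and read_frontmatter are hypothetical, and the parser assumes the frontmatter is delimited by `---` lines and contains simple `key: value` pairs.

```python
from pathlib import Path


def read_frontmatter(skill_md_text: str) -> dict:
    """Parse simple `key: value` pairs between the leading --- lines."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    meta = {}
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


class SkillRegistry:
    def __init__(self) -> None:
        self._index = {}  # name -> (description, path to SKILL.md)

    def discover(self, skills_dir: Path) -> None:
        """Discovery: retain only the name and description of each Skill."""
        for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
            meta = read_frontmatter(skill_md.read_text(encoding="utf-8"))
            if "name" in meta:
                self._index[meta["name"]] = (meta.get("description", ""), skill_md)

    def summaries(self) -> dict:
        """Return the lightweight view used for matching a task to a Skill."""
        return {name: desc for name, (desc, _path) in self._index.items()}

    def activate(self, name: str) -> str:
        """Activation: load the complete SKILL.md content for one Skill."""
        return self._index[name][1].read_text(encoding="utf-8")
```

Keeping only the summaries in memory is what allows many Skills to be registered at once; the full instructions enter the context only for the Skill that matches.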

Add a Skill

Prerequisites

  • At least one Digital Employee has been created.

  • Your account has been granted the cms:CreateDigitalEmployeeSkill permission.

Procedure

  1. Log on to the STAROps console.

  2. In the left navigation pane, click Digital Employees.

  3. In the Digital Employees list, click the target Digital Employee to open its details page.

  4. Click the Skill Management tab, then click Add Skill.

  5. Configure the Skill parameters. Choose one of the following methods:

    • Direct entry: Fill in the following parameters in the console:

      | Parameter | Description |
      |-----------|-------------|
      | Skill name | The unique identifier for the Skill, used for internal system references and triggering. Use lowercase letters and hyphens, for example k8s-cluster-health-inspection. |
      | Display name | The name shown for the Skill in the console, for example "K8s Cluster Health Inspection." |
      | Description | A functional description that helps the Digital Employee determine when to activate this Skill. A more precise description improves automatic matching accuracy. |
      | Skill design | The execution instructions for the Skill, that is, the content of SKILL.md. Defines the specific steps and constraints the Digital Employee follows after loading this Skill. |

    • Upload: Upload a local Skill folder. The folder must contain a SKILL.md file along with any required scripts and reference resources. After uploading, the system automatically parses the folder contents and switches to the direct entry view, where you can review and adjust the parsed parameters.

  6. Click Add Now.
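Before adding a Skill, the name and folder can be checked locally. The sketch below is an illustration only: both helpers are hypothetical, and the pattern assumes that digits are allowed alongside lowercase letters and hyphens, since the example name k8s-cluster-health-inspection contains one. The console may apply different rules.

```python
import re
from pathlib import Path

# Assumed rule: lowercase letters and digits in hyphen-separated segments.
SKILL_NAME_PATTERN = re.compile(r"[a-z0-9]+(-[a-z0-9]+)*")


def is_valid_skill_name(name: str) -> bool:
    """Check a Skill name against the assumed naming rule."""
    return SKILL_NAME_PATTERN.fullmatch(name) is not None


def is_uploadable_skill_folder(folder: Path) -> bool:
    """Check that the folder contains the SKILL.md file required for upload."""
    return (folder / "SKILL.md").is_file()
```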

Manage Existing Skills

On the Skill Management tab, you can perform the following operations on added Skills:

| Operation | Description |
|-----------|-------------|
| Edit | Modify the Skill's name, description, or execution instructions. |
| Delete | Remove Skills that are no longer needed. |

Warning

Deleted Skills cannot be recovered. Before deleting, confirm that the Skill is not referenced by any active conversations of a Digital Employee.