Enable LLM tracing in PAI EAS with ARMS integration-Platform For AI(PAI)-阿里云帮助中心

Overview

EAS integrates with ARMS to add observability to LLM applications, enabling you to trace call chains, identify performance bottlenecks, and surface errors.

Tracing provides:

Call chain visualization: Maps the full request path with clear logs.
Performance monitoring: Tracks response time, token consumption, and error counts to identify bottlenecks.
Issue identification: Pinpoints issues via Trace IDs with contextual root-cause analysis.
Evaluation: Validates accuracy and reliability of LLM outputs using trace data.

Key concepts

Trace

The complete execution path of a request in a distributed system, composed of multiple spans. Each trace has a unique TraceID for tracking request flows and pinpointing bottlenecks or errors.
Span

A basic unit in a trace representing a specific operation. Records operation name, start time, and end time.
Python probe

Automatically collects call chain data and performance metrics from Python applications. Install a Python probe when deploying an EAS service to enable tracing.
Evaluation

Assesses LLM application answers across multiple dimensions. Contact your business manager to confirm specific evaluation dimension names.

Limitations

This feature supports only Python-based LLM applications developed with LangChain, LlamaIndex, or DashScope.

Prerequisites

Activate ARMS Application Monitoring. For more information, see Activate ARMS.
Activate LangStudio by following Authorize the PAI service account.
If using a Resource Access Management (RAM) user or RAM role, grant the AliyunPAILLMTraceFullAccess permission to the user or role. For more information, see Manage permissions for a RAM role and Manage RAM user permissions.

Step 1: Prepare sample code

This walkthrough uses a simple prediction service to demonstrate deployment, invocation, and trace viewing.

This example uses the DashScope API with Flask to build a web service that calls Generation.call for text generation. Activate DashScope and obtain an API key as described in Call the Tongyi Qianwen API for the first time. Set DASHSCOPE_API_KEY as an environment variable when deploying the service. Example app.py:

import os
import json
import flask
import dashscope

app = flask.Flask(__name__)

def run_query(query):
    """Run a query."""

    response = dashscope.Generation.call(
        api_key=os.getenv('DASHSCOPE_API_KEY'),
        model="qwen-plus",
        messages=[
            {'role': 'system', 'content': 'You are a helpful assistant.'},
            {'role': 'user', 'content': query}
        ],
        result_format='message'
    )
    return response


@app.route('/api/trace_demo', methods=['POST'])
def query():
    """
    Post data example:
    {
        "query": "capital of china"
    }
    """
    data = flask.request.get_data("utf-8")
    query = json.loads(data).get('query', '')
    response = run_query(query)
    return response.output.choices[0].message.content


if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

Step 2: Enable tracing

When deploying an EAS service, turn on Tracing in Service Features. If your image lacks a built-in tracing component, configure the probe installation and startup commands. Manually install a Python probe.

Image with built-in tracing: Turn on Tracing. No additional configuration required.

Image without built-in tracing: Turn on Tracing and configure third-party libraries and the startup command.

Parameter	Description
Command to Run	Add `aliyun-bootstrap -a install && aliyun-instrument python app.py`. This installs and starts the ARMS Python probe. app.py is the main prediction service file. Also add aliyun-bootstrap to third-party libraries for the probe installer.
Third-party Library Settings	Add `aliyun-bootstrap` to download the probe installer from PyPI.

This example uses an image without built-in tracing. Key deployment parameters are listed below. Deploy a service in the console. After deployment:

View the service deployment status on the EAS page.
View the registered application on the Application List page in the ARMS console. The application name is the same as the EAS service name.

Parameter		Description
Environment Information	Deployment Method	Select Image-based Deployment.
	Image Configuration	This topic uses the default image: Alibaba Cloud Image. You can also enter a prepared custom image on the Image Address tab.
	Directly Mount	Mount the sample code to the service instance. For this example, click OSS and configure: Uri: Select the OSS folder containing the sample code. Upload instructions: Quick Start in the console. Mount Path: Set this to `/mnt/data/`. The code file will be read from this path. Skip this step if your custom image already includes the prediction service file.
	Command	This topic sets the command to `aliyun-bootstrap -a install && aliyun-instrument python /mnt/data/app.py`. `/mnt/data/app.py` is the mounted sample code file.
	Environment Variables	The sample code calls the DashScope API. Click Add and add this environment variable: Key: `DASHSCOPE_API_KEY` Value: Your DashScope API key. Obtain one from Call the Tongyi Qianwen API for the first time.
	Third-party Library Settings	Set the third-party libraries to `aliyun-bootstrap flask dashscope`.
Service Access	VPC	To use the tracing feature, you must configure a virtual private cloud (VPC). Select a VPC, vSwitch, and Security group in the region. EAS services cannot access the Internet by default. The sample code requires Internet access for DashScope API calls. Configure access following Scenario 1: Allow an EAS service to access the Internet.
	vSwitch
	Security Group Name
Features	Tracing Analysis	Turn on the Tracing switch and configure the third-party libraries and startup command in the Environment Context section.

Step 3: View traces

An evaluation tool based on traces helps developers verify the accuracy and reliability of LLM application outputs:

Call the EAS service

This example uses online debugging. You can also invoke the service via API calls.

On the Elastic Algorithm Service (EAS) page, find the target service and click > Online Debugging in the Actions column.
On the Body tab, send request data to the specified address based on your defined prediction service.

This topic uses the service interface defined in the sample app.py file. The following figure shows an example result:

View trace information

Note

By default, trace data is stored for 30 days from the time it is generated. To extend the storage period, contact the ARMS team for custom configuration.

Switch to the Trace Query tab on the Tracing Analysis tab to view trace information.
Find the target trace and click View Traces in the Actions column to go to the Trace details page.

View the service input, output, and log details on this page.

Note: If you use a RAM user or a RAM role, you must grant the AliyunARMSReadOnlyAccess permission to use this feature.
To share trace details, click the share icon to generate a one-time address.

Note: If you use a RAM user or a RAM role, you must grant the cms:CreateTicket permission before you can use this feature.

Step 4: Evaluate application performance

EAS supports two trace-based evaluation methods:

Method 1: Evaluate a single trace: Manually select a trace from the EAS service for evaluation. Suitable for the development or testing phase to debug a specific trace and verify its logic and performance.
Method 2: Evaluate traces in batches online: Periodically evaluate sampled traces from the running EAS service. Suitable for large-scale performance testing or feature validation to understand overall system status.

Note

By default, trace data is stored for 30 days from the time it is generated. To extend the storage period, contact the ARMS team for custom configuration.

Method 1: Evaluate a single trace

On the Trace Query tab of the Tracing Analysis tab, find the target trace and click Evaluate in the Actions column. Then, in the Evaluate configuration panel, configure the following parameters.

Evaluation Metrics: This is a fixed configuration and cannot be changed. The evaluation is performed based on the following dimensions.

Evaluation Metric	Description
Correctness	Determines whether the answer correctly addresses the question based on the input and reference text.
Faithfulness	Determines whether the answer is generated based on the input and reference text and whether it contains hallucinations.
Retrieval Relevance	Determines whether the retrieved results are relevant to the input question. It includes the following four metrics: nDCG: Normalized Discounted Cumulative Gain Hit Rate Precision@K MRR: Mean Reciprocal Rank

Model Settings: The LLM used to evaluate the trace. After initial setup, this configuration is automatically backfilled for subsequent evaluations.

Parameter

Description

Model Selection

The following two models are supported:

PAI Judge Model
qwen-max (Model Studio model)
Note
- To use a Model Studio model, configure an Internet connection for EAS.
- Model Studio model calls are billed separately. Billing.

Model Token

Enter the token for the selected model:

Judge model: Go to the Judge Model page, activate the PAI judge model, and obtain a token.
qwen-max: Obtain a token from Call the Tongyi Qianwen API for the first time.

Extraction Configuration: In the Query Extraction Configuration, Answer Extraction Configuration, and Context Extraction Configuration sections, configure the parameters in the following table to extract the corresponding content:

Query Extraction Configuration: Extracts the user query content (input).
Answer Extraction Configuration: Extracts the system-generated answer (output).
Context Extraction Configuration: Extracts the text or background information provided to the system (documents).

Parameter	Description
SpanName	Finds a span that matches the SpanName.
JsonPathInSpan	The format is a.b.c. This parameter cannot be empty. It extracts a value from a specified element of the matched span.
JsonPathInSpanValue	The format is a.b.c. This parameter can be empty. After the element corresponding to JsonPathInSpan is found, if the element's content is a JSON string, JsonPathInSpanValue is used to extract the corresponding value.

You can click View Traces in the Actions column to obtain the configuration content from the Trace details page. The following table shows configuration examples:

Extraction Configuration	How to obtain	Example value
Query Extraction Configuration	This topic provides an example where JsonPathInSpanValue has no value: For an example where JsonPathInSpanValue has a value, see the following figure.	JsonPathInSpanValue has no value SpanName: LLM JsonPathInSpan: attributes.input.value JsonPathInSpanValue: Because the content of the JsonPathInSpan element is not a JSON string, this parameter is empty. JsonPathInSpanValue has a value SpanName: LLM JsonPathInSpan: `attributes.input.value` JsonPathInSpanValue: Because the content of the JsonPathInSpan element is a JSON string, enter `text[0]` here.
Answer Extraction Configuration		SpanName: LLM JsonPathInSpan: `attributes.output.value` JsonPathInSpanValue: This parameter is empty.
Context Extraction Configuration	The sample service in this topic does not include a context extraction configuration. For an example, see the following figure:	SpanName: retrieve JsonPathInSpan: `attributes.retrieval.documents[].document.content` Important* Only the context configuration supports using an asterisk (). JsonPathInSpanValue: Because the content of the JsonPathInSpan* element is not a JSON string, this parameter is empty.

After you configure the parameters, click Confirm.

When a result appears in the Evaluation Result column (as shown in the following figure), the evaluation is complete. Click the result to view its details.

Method 2: Evaluate traces in batches online

On the Online Evaluation tab of the Tracing Analysis tab, click New Assessment.

On the Create Evaluation Task page, configure the following parameters and then click Confirm.

Parameter	Description
Basic Configuration
Task Name	Enter a custom task name.
Evaluation Configuration
Evaluation Metrics	This is a fixed configuration and cannot be changed. The evaluation is performed based on the following dimensions: Correctness: Determines whether the answer correctly addresses the question based on the input and reference text. Faithfulness: Determines whether the answer is generated based on the input and reference text and whether it contains hallucinations. Retrieval Relevance: Determines whether the retrieved content is relevant to the input question. It includes the following four metrics: nDCG: Normalized Discounted Cumulative Gain Hit Rate Precision@K MRR: Mean Reciprocal Rank
Model Selection	The following two models are supported: PAI Judge Model qwen-max (Model Studio model) Note To use a Model Studio model, configure an Internet connection for EAS. Model Studio model calls are billed separately. Billing.
Model Token	Enter the token for the selected model: Judge model: Go to the Judge Model page, activate the PAI judge model, and obtain a token. qwen-max: Obtain a token from Call the Tongyi Qianwen API for the first time.
Sampling Start and End Time	Select the start and end dates for sampling.
Sampling Policy	The following sampling policies are available: Time window-based sampling: Captures one trace at regular intervals (every x minutes). Probability-based sampling: Randomly captures a specified percentage of traces.
QCA Extraction Configuration: Trace data is formatted as a JSON string. The QCA extraction configuration defines the path to the QCA within this JSON structure. The value at this path represents the specific QCA content.
Query Extraction Configuration	Query Extraction Configuration: Extracts the user query content (input). Answer Extraction Configuration: Extracts the system-generated answer (output). Context Extraction Configuration: Extracts the text or background information provided to the system (documents). Configure SpanName, JsonPathInSpan, and JsonPathInSpanValue to extract the corresponding content. Configuration details: Extraction Configuration.
Answer Extraction Configuration
Context Extraction Configuration

When the task Status shows Completed, all evaluations are finished and no new results will be generated.

After the evaluation is complete, you can view the results in the Evaluation Result column of the evaluation task. You can also click the task name to view its details.
- View evaluation results: The system calculates and displays the average score from all successful evaluations. A score closer to 1 indicates higher quality.
- View evaluation details:

After creation, manage the task by updating, stopping, deleting, or cloning it. Cloning duplicates the configuration into a new task.