Reinforcement learning training observability tracing-阿里云帮助中心

This document demonstrates how to configure observability for reinforcement learning training using the agentic-rl-example project: add OpenTelemetry dependencies, set up tracing, and view traces and metrics on the console.

Overview

Reinforcement learning training uses OpenTelemetry for tracing. By adding a few decorators and wrapper calls to your function code, every LLM call, tool invocation, and scoring detail is automatically recorded and then exported to Application Real-Time Monitoring Service (ARMS) for visualization in the Model Studio console.

This topic walks you through the end-to-end process from initial setup to viewing results in the console, using the agentic-rl-example project, a CalcX calculator use case. It also provides a metric reference dictionary, troubleshooting guidelines, and a workflow for investigating FAILED statuses. For training submission and hyperparameter configuration, see Reinforcement learning training configuration — submission and configuration.

Integrate tracing

This guide demonstrates how to integrate tracing in three steps using the CalcXRolloutProcessor from the agentic-rl-example project.

Step 1: Add dependencies

Add the following OpenTelemetry dependencies to the requirements.txt file in your project's root directory:

opentelemetry-api==1.41.1
opentelemetry-sdk==1.41.1
opentelemetry-exporter-otlp-proto-http==1.41.1
opentelemetry-processor-baggage==0.62b1
loongsuite-util-genai==0.4.0

Note

The dashscope, fastapi, uvicorn, and pyyaml packages are pre-installed in the runtime and do not need to be listed in requirements.txt.

Step 2: Add instrumentation code

The code changes involve five decorators/functions, which automatically nest to form a complete trace:

[ENTRY: ROLLOUT] @observe_processor          ← Rollout processor entry point
├── [LLM] trace_client / @observe_llm        ← LLM call
│   └── (OpenAI / LangChain / DashScope API)
├── [TOOL] trace_tool / @observe_tool        ← Tool call
│   ├── tool: calculator (MCP)
│   └── tool: response_scorer (custom)
└── [custom] rollout_metrics                 ← Rollout custom metrics

[ENTRY: REWARD] @observe_processor           ← Reward processor entry point
├── [LLM] trace_client / @observe_llm        ← LLM call (for scoring)
├── [TOOL] trace_tool / @observe_tool        ← Tool call
└── [custom] reward_metrics                  ← Reward custom metrics

@observe_processor: Trace the processor entry point

Apply this decorator to the process() method to create the top-level ENTRY span. The SDK automatically identifies the type based on the parent class. Inheriting from AbstractRolloutProcessor creates a ROLLOUT-type span, and inheriting from AbstractRewardProcessor creates a REWARD-type span. The decorator automatically records the input, output, duration, and success/failure status of each call.

For Rollout (functions/rollout/rollout.py):

from dashscope.finetune.reinforcement.component.observability import (
    observe_processor
)

class CalcXRolloutProcessor(AbstractRolloutProcessor):
    @observe_processor  # Span type = ROLLOUT
    async def process(self, input: RolloutInput) -> RolloutOutput:
        await self._async_setup()
        return await self._async_process(input)

For Reward (functions/reward/reward.py):

class DemoRewardProcessor(AbstractRewardProcessor):
    @observe_processor  # Span type = REWARD
    async def process(self, input: RewardInput) -> RewardOutput:
        score = await evaluate(content, input.ground_truth)
        return RewardOutput(
            reward=Reward(
                reward_score=score,
                reward_metrics={"test1": 0.5, "test2": 0.3}
            ),
            status=TaskStatus.SUCCESS,
        )

trace_client(): Trace an LLM client

Call this function in your initialization method to wrap an LLM client instance. All subsequent LLM requests from this client will automatically generate an LLM span that records the model name, request content, token usage, latency, and more.

Supported client types (auto-detected via duck typing):

OpenAI clients (AsyncOpenAI / OpenAI)
OpenAI completions resources (.chat.completions)
LangChain classes like ChatOpenAI (via .client / .async_client)
DashScope Generation class (pass the class itself, not an instance)

Example (functions/rollout/rollout.py):

from langchain_openai import ChatOpenAI
from dashscope.finetune.reinforcement.component.observability import (
    trace_client
)

class CalcXRolloutProcessor(AbstractRolloutProcessor):
    def _build_llm(self, input: RolloutInput) -> ChatOpenAI:
        llm = ChatOpenAI(
            model=input.model_resource.model_name,
            openai_api_key=api_key,
            openai_api_base=input.model_resource.base_url,
            ...
        )
        trace_client(llm)  # Wrap the client to automatically trace all LLM calls
        return llm

trace_tool(): Trace tool calls

After retrieving tool instances, call this function to wrap them. Each subsequent tool call will generate a TOOL span, recording the tool name, parameters, return value, and latency.

Supported input formats:

A single LangChain BaseTool
A list, tuple, or dictionary (iterated over automatically)
A LangGraph ToolNode (automatically expands .tools_by_name)
MCP tools (auto-detected, with provider set to "mcp")

Example (functions/rollout/rollout.py):

from dashscope.finetune.reinforcement.component.observability import (
    trace_tool
)

class CalcXRolloutProcessor(AbstractRolloutProcessor):
    async def _init_resources_async(self):
        client = MultiServerMCPClient({
            "calculator": {"url": "http://localhost:10086/sse"}
        })
        tools = await client.get_tools()
        trace_tool(tools)  # Must be called after get_tools()

Warning

Special consideration for MCP: The MCP server and client run in different processes. The @observe_tool decorator on the server has no effect on the client. You must call trace_tool(tools) on the client with the list of tools returned by get_tools().

@observe_llm: Custom LLM function

If trace_client() cannot automatically detect your LLM client, use this decorator to manually mark a function as an LLM call.

Signature requirement: The function must include the model and messages keyword arguments after a *.

from dashscope.finetune.reinforcement.component.observability import (
    observe_llm
)

@observe_llm  # Mark as an LLM span
async def call_custom_llm(*, model: str, messages: list, **kwargs):
    # Custom LLM call logic
    ...

@observe_tool: Custom tool function

If trace_tool() cannot automatically detect your tool (for example, a standard Python function), use this decorator to mark it manually. You can customize the span name using the name parameter.

from dashscope.finetune.reinforcement.component.observability import (
    observe_tool
)

@observe_tool(name="response_scorer")  # Mark as a TOOL span
def score_response(*, messages: list) -> float:
    # Custom scoring logic
    ...

Step 3: Submit the job

When submitting a job, you can leave the env field in the runtime empty. Tracing is enabled by default.

from dashscope.finetune.agentic_rl import AgenticRL
from dashscope.finetune.reinforcement import (
    RolloutFunctionComponent, RewardFunctionComponent,
    FunctionComponentModel, FunctionComponentRuntime
)

client = AgenticRL()

rollout_runtime = FunctionComponentRuntime(
    cpu=2, memory_size=4096, disk_size=512,
    concurrency=30, capacity=30,
    min_capacity=30, max_capacity=60,
    env={}  # Leave empty to enable tracing by default
)

reward_runtime = FunctionComponentRuntime(
    cpu=2, memory_size=4096, disk_size=512,
    concurrency=30, capacity=30,
    min_capacity=30, max_capacity=60,
    env={}
)

result = await client.run(
    model="qwen3.5-9b",
    functions=[
        RolloutFunctionComponent(
            name="rollout-1",
            fcmodel=FunctionComponentModel(
                classpath="functions.rollout.rollout.CalcXRolloutProcessor"),
            runtime=rollout_runtime,
        ),
        RewardFunctionComponent(
            name="reward-1",
            weight=1.0,
            fcmodel=FunctionComponentModel(
                classpath="functions.reward.reward.DemoRewardProcessor"),
            runtime=reward_runtime,
        ),
    ],
    ...
)

Note

To disable tracing (and save costs), set {"ENABLE_TRAJECTORY": "false"} in the env field. Disabling tracing stops the collection of trace data but does not affect system metrics such as Actor, Critic, and Perf.

Tracing: Performance and cost trade-offs

Data flow and cost drivers

Data flow: Function code (core decorators) → OpenTelemetry SDK → ARMS → Trace/Metrics tabs in the console
Cost drivers: ARMS span storage (pay-as-you-go), minor CPU and network overhead on the function side, and a slight increase in training latency.

Development vs. large-scale training strategy

Stage	Tracing status	Data collected	Data not collected	Cost impact
Development / Small-batch debugging	Fully Enabled	All data (trajectories, reward analysis, tool calls, system metrics)	—	Low
Canary release / Pre-release validation	Enabled	All data	—	Medium
Large-scale production training (e.g., 9B model, tens of thousands of samples, multiple epochs)	Recommended to disable	System metrics: actor, critic, trajectory, perf, timing	Trajectory replay, tool call details, and custom metric curves under `trace/`	Significantly reduced

To disable tracing, set runtime.env = {"ENABLE_TRAJECTORY": "false"}.

Cost management for custom metrics

The number and cardinality of metrics in reward_metrics and rollout_metrics affect ARMS storage costs.
Avoid using high-cardinality fields, such as user_id or request_id, as metric keys.
Limit key metrics to 10 or fewer, and remove temporary debugging metrics.
Sub-dimensional metrics from @sub_reward_func are aggregated into the reward_metrics of the corresponding Reward function. This aggregation helps keep the data volume manageable.

Custom metrics

Key-value metrics you define in your code using the following entry points automatically appear on the Metrics tab under the trace/ group and on the Reward Analysis page in the console:

Entry point	Code location	Console path
reward_metrics	`Reward(reward_metrics={"acc": 0.8})`	`trace/reward_metrics/{reward-name}/acc/{avg,sum}`
rollout_metrics	`AgentOutput(rollout_metrics={"latency": 1.2})`	`trace/rollout_metrics/latency/{avg,sum}`
sub-dimension score	The `reward_metrics` returned by `@sub_reward_func("toxicity")`	Merged into the `reward_metrics` of the corresponding reward function

Multiple reward functions: Set a unique name for each reward function using RewardFunctionComponent(name="reward-1"). This name distinguishes them in the metric path (e.g., trace/reward_metrics/reward-1/...). You can also use reward_metric_weight to set the weight of each sub-metric in the overall score.

Console overview

After the training job starts, go to the model fine-tuning page in the Model Studio console and click the job name to open its details. This section maps the actions in your code to the data displayed in the console.

The job details page has the following five tabs. We recommend using them in this order: first check the progress, then analyze the behavior, and finally drill down if you encounter issues.

Tab	Purpose	Use case
Details	Job progress and status	Always check this tab first
Trajectory	What the model did and why it received its score	Verify model behavior and diagnose low-scoring samples
Metrics	Quantitative training trends	Analyze curves to determine convergence and identify inflection points
Outputs	checkpoint list and publishing	Select a model after training is complete
Logs	stdout, stderr, and error stack traces	Troubleshoot FAILED jobs

Details tab

View basic job information, such as the job ID, training model, training method (RL full-parameter training), and data configuration. Pay close attention to the job status:

PENDING: The job is queued and waiting for resource allocation.
RUNNING: Training is in progress. You can switch to other tabs to view real-time data.
SUCCESS: Training is complete. Switch to the Outputs tab to publish the model.
FAILED: Training failed. Check the Logs tab or retrieve logs using the SDK or CLI: AgenticRL.logs(job_id="ft-xxx") or dashscope rl logs "ft-xxx".

Trajectory Details — maps to @observe_processor

On the Trajectory tab, open the Trajectory Details subpage to view the complete interaction process for each rollout:

Trajectory List: Displays all sampled trajectories. You can filter them by sample ID, trajectory ID, epoch, or step.
Conversation Process: Shows the complete multi-turn interaction (user → assistant → tool_call → tool_result → assistant), clearly showing the model's reasoning chain.
Reward Score: Displays the reward score and status (SUCCESS/FAILED) for each step.

Key questions to consider: Is the tool calling correct? Is the number of conversation turns reasonable? Is the model repeating ineffective actions?

Tool Call Analysis — maps to trace_tool / @observe_tool

On the Trajectory tab, open the Tool Call Analysis subpage to view:

Tool Call Records: Tool name, call parameters, result, and latency.
Tracing subtab: The span tree for each trajectory. You can expand it to see full details of each tool call and LLM request.

Typical use case: Troubleshoot an agent's tool calling failures. Which tool returned an error? Were the parameters passed correctly? Is the latency too high?

Reward Analysis — maps to reward_metrics

Key concepts:

sample: An original sample from the training data, such as a question, an instruction, or a prompt.
trajectory: A specific interaction trajectory generated from a single sample over n_rollouts sampling attempts.
Relationship: One sample generates N trajectories, where N equals n_rollouts.

On the Trajectory tab, open the Reward Analysis subpage to evaluate training performance from three perspectives:

Step dimension: Select a training step to view the aggregated reward metrics (average score, success rate, and trend chart) for all samples at that step. This helps you assess the overall training trend.
Sample dimension: Select a sample ID to compare the reward scores for the same sample across trajectories. This helps you identify problematic samples.
Trajectory dimension: View the raw scores for each scoring dimension of a single trajectory. Use this for attribution analysis.

Metrics tab — maps to rollout_metrics / reward_metrics

On the Metrics tab, under the trace/ group, you can view aggregated curves (avg / sum) for all the custom metrics you defined in your code.

For a complete list of all 13 groups and 121 metrics, see §Training Metrics Reference. To learn how to diagnose anomalies, see §Troubleshooting Decision Tree (P1–P9).

Outputs tab

After training is complete, the Outputs tab displays a list of checkpoints. Each row includes a checkpoint ID, publishing status, and remaining retention time.

Select the target checkpoint and click the Publish button.
Wait for the publishing process to complete (the status changes from "To Be Published" to "Published").
After the model is published, you can call it via API by its model name.

Training completion does not mean training success. You must check the validation/data/reward/mean@1 metric to select the best checkpoint, which is not always the last one. For information on checkpoint retention policies, resuming training, and the SFT→DPO→RL transition path, see §Resume Training, Checkpoints, and Progressive Training in "Reinforcement Learning Training Configuration — Submission and Configuration."

Logs tab

On the Logs tab, you can view the training run logs. You can also retrieve them using the SDK or CLI: AgenticRL.logs(job_id="ft-xxx", lines=100) or dashscope rl logs "ft-xxx" --lines 100.

For instructions on troubleshooting a FAILED job, see §Troubleshooting workflow for FAILED training jobs in the FAQ section.

Training metrics

Key metrics

The following metrics are organized by monitoring dimension to help you quickly assess the health of your training job:

Monitoring dimension	Key metric	Description	Health criteria
Task performance and generalization	`critic/rewards/mean`	The North Star metric. It measures the mean reward for the current training batch, showing whether the model is learning an effective policy.	Healthy: Increases steadily until convergence.
Task performance and generalization	`validation/data/reward/mean@1`	Measures the quality of the model's first response to new prompts, reflecting its generalization ability.	Healthy: Increases steadily, consistent with the training reward trend.
Training stability	`actor/entropy`	Measures the uncertainty in the policy's output, which is crucial for balancing exploration and exploitation.	Healthy: High at the beginning of training, then gradually decreases while remaining at a non-zero level.
	`actor/ppo_kl`	Measures how much the current policy has diverged from the initial policy.	Healthy: Stays within the range of 0.01 to 0.05.
	`actor/pg_clipfrac`	The fraction of updates clipped by importance sampling, indicating the rate of policy drift.	Healthy: Typically remains below 1%.
System efficiency and boundaries	`trajectory/response_length_non_aborted/mean`	The mean token count for non-truncated responses.	Healthy: Aligns with your task requirements.
	`trajectory/.../clip_ratio`	The fraction of prompts or responses forcibly truncated for exceeding the maximum length.	Healthy: Extremely low, close to 0%.
	`timing/s/step`	The total time for a complete reinforcement learning (RL) step, which includes generation, evaluation, and update.	Healthy: Remains relatively stable.

Troubleshooting framework

Interpreting metrics in layers

Essentials (8-10 metrics for daily health checks): The north star metric, three core metrics, truncation rate, and single-step latency (see the Key Metric Quick Reference Table for health criteria).
Scenario-specific (by use case): For agent scenarios, monitor trace/num_llm_calls and trajectory/num_turns. For long-text scenarios, review the length-related metrics.
Troubleshooting drill-down: timing sub-items / fully_async queue / critic distribution max/min

To learn when to use each of the five tabs in the console, see the tab quick reference table in the "View Results in the Console" section above.

Troubleshooting decision tree (P1–P9)

When you observe an abnormal curve on the Metrics tab, use this section to pinpoint the problem and tune parameters as recommended (see "Reinforcement Learning Training Configuration — Submission and Configuration" § Parameter tuning decision table).

P1: Non-convergence and P2: oscillation

P1 primary signal: A long plateau in critic/rewards/mean; actor/ppo_kl ≈ 0; and actor/grad_norm is extremely small.
P1 root cause: The learning rate is too small, the reward function consistently returns the same value, or there is insufficient data.
P2 primary signal: Major oscillation in rewards/mean, a spike in grad_norm, and pg_clipfrac > 5%.
P2 root cause: The learning rate is too high, the batch size is too small, or the advantage variance is not normalized.

P3: KL explosion and P4: sudden entropy drop (mode collapse)

P3 primary signal: actor/ppo_kl is consistently > 0.1, pg_clipfrac increases concurrently, and the trajectory shows garbled or repetitive output.
P3 recommended action: Increase kl_loss_coef (e.g., from 0.001 to 0.01), decrease the learning rate, and decrease ppo_mini_batch_size.
P4 primary signal: actor/entropy drops to nearly 0 within a few dozen steps, rewards/mean shows a false increase, and sample outputs are repetitive and template-like.
P4 root cause: A single mode is excessively rewarded, the KL constraint is too weak, or the temperature is too low.

P5: Reward hacking and P6: validation set collapse

P5 primary signal: The training reward increases while the validation reward stagnates or decreases; response_length_non_aborted/mean increases linearly and rapidly; and a specific sub-dimension of reward_metrics shows exclusive gains.
P5 attribution path: Analyze the abnormal metric, then sample the trajectory. Check if the assistant is padding its output, copying the prompt, or using tools to exploit loopholes in the reward function (for details, see "Reinforcement Learning Development Guide" § Identifying and preventing reward hacking).
P6 primary signal: validation/data/reward/mean@1 shows a downward inflection point while the training reward continues to increase.
P6 recommended action: Apply early stopping (select an earlier checkpoint from the Outputs tab), increase the KL constraint, and expand the validation set.

P7: Slow training, P8: garbled output, and P9: high truncation rate

P7 primary signal: A sudden increase in timing/s/step.
P7 drill-down: Investigate rollout/agent_loop_latency and update_actor. Possible root causes include an increase in response length, delays in FC auto-scaling (check the fully_async queue), or high tool latency.
P8 primary signal: The trajectory contains garbled or non-natural language output. This is often caused by a KL explosion (P3). Address this issue using the P3 solution.
P9 primary signal: A high value for trajectory/response_length/clip_ratio or prompt_length/clip_ratio.
P9 recommended action: Increase max_length or shorten the prompt. Also, check if the reward function inadvertently encourages longer outputs.

Three-dimensional console attribution

Using Step, Sample, and Trajectory

Standard workflow: Spot an inflection point on the Metrics tab → Use Step to pinpoint the failing training segment (trend inflection) → Use Sample to find which samples are skewing the mean → Use Trajectory to examine the reasoning chain or tool calling details → Return to the code to modify the reward function or data.

Tool calling analysis and tracing span tree

Tool calling analysis: Tool name, parameters, return value, and duration—essential for troubleshooting agent tool failures.
The Tracing subtab provides a span tree for each trajectory (ENTRY:ROLLOUT → LLM → TOOL → REWARD).
Typical use case: If the reward suddenly drops, check if a tool has failed or the MCP server is unavailable.
Note: The server-side @observe_tool decorator does not affect the client. You must use trace_tool(tools) on the client side.

Custom metrics for attribution (rollout_metrics and reward_metrics)

System metrics tell you that an error occurred, but custom metrics pinpoint which dimension is failing.
Recommended naming convention: Name your metrics according to business sub-dimensions, such as accuracy, format_score, tool_success_rate, or answer_length.
Handling multiple reward functions: Distinguish them using RewardFunctionComponent(name=...) and adjust their weights using reward_metric_weight.
Anti-pattern: Avoid stuffing log or debugging information into metrics, as this can cause a cardinality explosion.

Troubleshooting with logs, metrics, traces, and tracing

Each of the four observation sources answers a different question: logs show when a failure occurred and provide stack traces; metrics indicate whether an issue is occurring and its severity; traces reveal why an issue occurred and pinpoint the specific sample; and tracing identifies slow code segments and failing external dependencies. Select an observation source based on the issue type:

Issue type	Primary source	Secondary source	Not needed
Training divergence	metric	trace	log
Task failure	log	metric	trace
Stagnant reward	metric → trace	`reward_metrics`	log
Tool call error	tracing	log	—
Training slowdown	timing metrics	tracing	—
Poor output quality	trace	`reward_metrics`	—

When to look beyond the console

If you suspect an issue with the reward implementation, reproduce it locally using test_functions.
If you suspect data issues, sample the JSONL file and inspect rollout_extra.
If you suspect incorrect hyperparameters, review the submission script and check it against the "Reinforcement Learning Training Configuration — Submission and Configuration" guide.

Metric groups

Metrics generated during training are grouped by prefix. See the collapsible panels below for a detailed description of the metrics in each group.

Group prefix	Metric count	Type	Description
actor/	8	System	PPO policy network metrics: loss, entropy, KL divergence, clip ratio, gradient norm, and learning rate
critic/	12	System	Reward and value metrics: mean, max, and min of score, rewards, advantages, and returns
trajectory/	16	System	Trajectory statistics: response length, prompt length, truncation rate, abort rate, and conversation turns
trace/	40+	Hybrid	Observability metrics: epoch, LLM calls, success rate, and custom reward_metrics and rollout_metrics
timing/	17	System	Timing analysis: duration of each Trainer phase (in seconds), rollout duration, and per-token processing time (in milliseconds)
perf/	3	System	Performance overview: total tokens, time per step, and throughput per GPU
fully_async/	24	System	Asynchronous training scheduling: queue status, parameter versions, staleness statistics, and processing latency distribution

Metric groups

Click to expand the complete list of metrics for each group:

actor/ — Policy network metrics (8)

Metric	Description
`actor/loss`	Total loss (pg + entropy + ...).
`actor/pg_loss`	Policy Gradient (PG) loss.
`actor/entropy`	Mean token entropy of the current policy, which indicates the degree of exploration.
`actor/ppo_kl`	The KL divergence between the current policy and the initial policy (the SFT model).
`actor/pg_clipfrac`	Fraction of updates clipped by importance sampling, which indicates the rate of policy drift.
`actor/pg_clipfrac_lower`	Lower-bound clip fraction for dual-clip (always 0 if dual-clip is not enabled).
`actor/grad_norm`	Gradient norm.
`actor/lr`	Current learning rate.

critic/ — Reward and value evaluation metrics (12)

Metric	Description
`critic/score/{mean,max,min}`	Statistics for the raw reward score (before subtracting the KL penalty).
`critic/rewards/{mean,max,min}`	Statistics for the final training reward after subtracting the KL penalty.
`critic/advantages/{mean,max,min}`	Statistics for the advantage function, which indicates the improvement of the current policy over the baseline.
`critic/returns/{mean,max,min}`	Statistics for returns (target for the Critic).

trajectory/ — Trajectory statistics (16)

Metric	Description
`trajectory/response_length/{mean,max,min}`	Statistics for the number of response tokens, including aborted samples.
`trajectory/response/aborted_ratio`	Ratio of trajectories with a zero-length response, which can result from a Rollout error or cancellation.
`trajectory/response_length_non_aborted/{mean,max,min}`	Statistics for the number of tokens in valid responses, excluding aborted samples.
`trajectory/response_length/clip_ratio`	Ratio of responses truncated at the maximum length.
`trajectory/prompt_length/{mean,max,min}`	Statistics for the number of prompt tokens.
`trajectory/prompt_length/clip_ratio`	Ratio of prompts truncated at the maximum length.
`trajectory/num_turns/{mean,max,min}`	Statistics for the number of interaction turns between the Agent and the LLM.

trace/ — Observability metrics

Metric	Description
`trace/training/epoch`	Current training epoch.
`trace/num_llm_calls/{avg,sum}`	Average number of LLM calls per trajectory / total number of LLM calls.
`trace/success_rate/agent/{avg,sum}`	Agent task success rate / cumulative number of successful tasks.
`trace/success_rate/reward/{avg,sum}`	Reward calculation success rate / cumulative number of successful calculations.
`trace/attempts/agent/{avg,sum}`	Average / cumulative number of HTTP attempts by the Agent (including retries).
`trace/reward/<reward_name>/{avg,sum}`	Average / cumulative value for an individual reward function.
`trace/reward_metrics/<reward_name>/<metric>/...`	User-defined custom sub-metrics returned by a reward function.

timing/ — Timing analysis (17)

Divided into three subgroups: timing/s/* (Trainer stage latency, in seconds), timing/s/rollout/* (Rollout-side latency, in seconds), and timing/ms/*_per_token (per-token latency, in milliseconds).

Metric	Description
`timing/s/step`	Total time for a complete Trainer step.
`timing/s/trainer_fetch_batch`	Time the Trainer spends waiting for and fetching a batch.
`timing/s/old_log_prob`	Time to compute the log-prob of the old policy.
`timing/s/adv`	Time to compute the advantage function.
`timing/s/update_actor`	Time taken for the Actor's backpropagation and optimizer step.
`timing/s/param_sync`	Time taken for the Rollouter to sync parameters from the Trainer.
`timing/s/rollout/agent_loop_latency/avg`	Total latency for a single Rollout, including Agent interaction, reward calculation, and post-processing.
`timing/s/rollout/model_latency/avg`	Cumulative LLM inference latency per trajectory.
`timing/s/rollout/reward_latency/avg`	Latency of reward function calls.
`timing/ms/gen_per_token`	Per-token latency during the generation phase.
`timing/ms/update_actor_per_token`	Per-token latency during the Actor update phase.

perf/ — Performance overview (3)

Metric	Description
`perf/total_num_tokens`	Total number of tokens processed in this step.
`perf/time_per_step`	Total time elapsed per step.
`perf/throughput`	Throughput per GPU (tokens/s/GPU).

Custom metrics

See the custom metrics section for a complete definition.

FAQ

Troubleshooting quick reference

Issue keyword	Primary signal	Recommended action	See also
Stagnant reward	`critic/rewards/mean` plateaus	Check the learning rate and reward function	P1 in this topic
KL explosion / Garbled output	`actor/ppo_kl` > 0.1	Increase `kl_loss_coef` or decrease the learning rate	P3 / P8 in this topic
Reward hacking	Training reward increases while validation reward decreases	Switch signals / Apply reward shaping	P5 in this topic + "Reinforcement Learning Development Guide"
High truncation rate	High `clip_ratio`	Increase `max_length`	P9 in this topic
FAILED · Function registration failed	Incorrect classpath / Missing dependency	Check `requirements.txt`	"Reinforcement Learning Training Configuration — Submission and Configuration"
FAILED · Rollout timeout	Individual request timeout	Increase the `timeout` or use tracing to find bottlenecks	See "Troubleshooting FAILED jobs" below
FAILED · Insufficient resources	MTU or FC container issues	Check the `fully_async` queue	"Reinforcement Learning Training Configuration — Submission and Configuration"
Cannot see trace data	Console is blank	Check ARMS authorization / `requirements.txt`	See "Troubleshooting missing trace data" below

Troubleshooting FAILED jobs

This section covers troubleshooting FAILED jobs caused by metric abnormalities during training. For infrastructure-related failures, like FC function registration failures, upload timeouts, or insufficient resources, refer to the FAQ section in "Reinforcement Learning Training Configuration — Submission and Configuration".

Standard troubleshooting workflow

Step 1: Details tab Confirm the job status and the time of failure.
Step 2: Logs tab Inspect the last 100-500 lines of the log (SDK: AgenticRL.logs(job_id, lines=100) / CLI: dashscope rl logs --lines 100).
Step 3: Identify the error layer:
- User function error (exceptions in rollout/reward functions) → Reproduce locally with test_functions → Fix the code → Re-register and run.
- Framework error (OOM, insufficient resources, network issues) → Check the fully_async queue and adjust concurrency / capacity.
- Data error (JSONL parsing failure) → Validate the format of individual records and the rollout_extra field.

Common error patterns

Error pattern	Primary symptom	Recommended action
Rollout timeout	`timeout` is triggered, LLM is slow, or a tool is slow.	Increase the timeout value and use tracing to identify the bottleneck.
Intermittent reward FAILED	Empty messages are unhandled or there is an encoding error.	Add a `try/except` block to return `TaskStatus.FAILED` and the error.
Insufficient resources	Insufficient MTU or FC auto-scaling cannot keep up.	Check the fully_async queue and adjust `concurrency` or `capacity`.
Incorrect data format	JSONL parsing fails.	Validate line-by-line JSON format, the roles in `messages`, and the `rollout_extra` field.

Viewing trace data

After authorizing ARMS in the Model Studio console:

Trajectory tab → View Trajectory Details, Reward Analysis, and Tool Call Analysis.
Metrics tab → Check the trace/ group.

Troubleshooting missing trace data

Follow these steps to troubleshoot the issue:

Ensure that ENABLE_TRAJECTORY=false is not set in the runtime env. (Tracing is enabled by default and requires no extra configuration.)
Ensure you have authorized the ARMS service in the Model Studio console.
Check that requirements.txt includes the required OpenTelemetry dependencies.
Check whether the process() method is decorated with @observe_processor.

Performance impact of tracing

Tracing adds minor storage and latency overhead. We recommend enabling it during development and debugging. For large-scale production training, if you only need to view training metrics (such as Actor, Critic, and Perf) and do not need trace details, you can disable tracing by setting {"ENABLE_TRAJECTORY": "false"} in the runtime env. Disabling tracing does not affect system metrics.

Distinguishing reward functions

Set a unique name for each reward function by using RewardFunctionComponent(name="reward-1"). This name appears in the metric path (e.g., trace/reward_metrics/reward-1/...), and the console automatically groups and displays the metrics by name.

Disabling tracing to reduce costs

Set {"ENABLE_TRAJECTORY": "false"} in the runtime env, either by using FunctionComponentRuntime(env={"ENABLE_TRAJECTORY": "false"}) or by setting env: {ENABLE_TRAJECTORY: false} in a YAML file.