Reinforcement learning observability and metrics

更新时间:
复制 MD 格式

This document demonstrates how to configure observability for reinforcement learning training using the agentic-rl-example project: add OpenTelemetry dependencies, set up tracing, and view traces and metrics on the console.

Overview

Reinforcement learning training uses OpenTelemetry for tracing. By adding a few decorators and wrapper calls to your function code, every LLM call, tool invocation, and scoring detail is automatically recorded and then exported to Application Real-Time Monitoring Service (ARMS) for visualization in the Model Studio console.

This topic walks you through the end-to-end process from initial setup to viewing results in the console, using the agentic-rl-example project, a CalcX calculator use case. It also provides a metric reference dictionary, troubleshooting guidelines, and a workflow for investigating FAILED statuses. For training submission and hyperparameter configuration, see Reinforcement learning training configuration — submission and configuration.

Integrate tracing

This guide demonstrates how to integrate tracing in three steps using the CalcXRolloutProcessor from the agentic-rl-example project.

Step 1: Add dependencies

Add the following OpenTelemetry dependencies to the requirements.txt file in your project's root directory:

opentelemetry-api==1.41.1
opentelemetry-sdk==1.41.1
opentelemetry-exporter-otlp-proto-http==1.41.1
opentelemetry-processor-baggage==0.62b1
loongsuite-util-genai==0.4.0
Note

The dashscope, fastapi, uvicorn, and pyyaml packages are pre-installed in the runtime and do not need to be listed in requirements.txt.

Step 2: Add instrumentation code

The code changes involve five decorators/functions, which automatically nest to form a complete trace:

[ENTRY: ROLLOUT] @observe_processor          ← Rollout processor entry point
├── [LLM] trace_client / @observe_llm        ← LLM call
│   └── (OpenAI / LangChain / DashScope API)
├── [TOOL] trace_tool / @observe_tool        ← Tool call
│   ├── tool: calculator (MCP)
│   └── tool: response_scorer (custom)
└── [custom] rollout_metrics                 ← Rollout custom metrics

[ENTRY: REWARD] @observe_processor           ← Reward processor entry point
├── [LLM] trace_client / @observe_llm        ← LLM call (for scoring)
├── [TOOL] trace_tool / @observe_tool        ← Tool call
└── [custom] reward_metrics                  ← Reward custom metrics

@observe_processor: Trace the processor entry point

Apply this decorator to the process() method to create the top-level ENTRY span. The SDK automatically identifies the type based on the parent class. Inheriting from AbstractRolloutProcessor creates a ROLLOUT-type span, and inheriting from AbstractRewardProcessor creates a REWARD-type span. The decorator automatically records the input, output, duration, and success/failure status of each call.

For Rollout (functions/rollout/rollout.py):

from dashscope.finetune.reinforcement.component.observability import (
    observe_processor
)

class CalcXRolloutProcessor(AbstractRolloutProcessor):
    @observe_processor  # Span type = ROLLOUT
    async def process(self, input: RolloutInput) -> RolloutOutput:
        await self._async_setup()
        return await self._async_process(input)

For Reward (functions/reward/reward.py):

class DemoRewardProcessor(AbstractRewardProcessor):
    @observe_processor  # Span type = REWARD
    async def process(self, input: RewardInput) -> RewardOutput:
        score = await evaluate(content, input.ground_truth)
        return RewardOutput(
            reward=Reward(
                reward_score=score,
                reward_metrics={"test1": 0.5, "test2": 0.3}
            ),
            status=TaskStatus.SUCCESS,
        )

trace_client(): Trace an LLM client

Call this function in your initialization method to wrap an LLM client instance. All subsequent LLM requests from this client will automatically generate an LLM span that records the model name, request content, token usage, latency, and more.

Supported client types (auto-detected via duck typing):

  • OpenAI clients (AsyncOpenAI / OpenAI)

  • OpenAI completions resources (.chat.completions)

  • LangChain classes like ChatOpenAI (via .client / .async_client)

  • DashScope Generation class (pass the class itself, not an instance)

Example (functions/rollout/rollout.py):

from langchain_openai import ChatOpenAI
from dashscope.finetune.reinforcement.component.observability import (
    trace_client
)

class CalcXRolloutProcessor(AbstractRolloutProcessor):
    def _build_llm(self, input: RolloutInput) -> ChatOpenAI:
        llm = ChatOpenAI(
            model=input.model_resource.model_name,
            openai_api_key=api_key,
            openai_api_base=input.model_resource.base_url,
            ...
        )
        trace_client(llm)  # Wrap the client to automatically trace all LLM calls
        return llm

trace_tool(): Trace tool calls

After retrieving tool instances, call this function to wrap them. Each subsequent tool call will generate a TOOL span, recording the tool name, parameters, return value, and latency.

Supported input formats:

  • A single LangChain BaseTool

  • A list, tuple, or dictionary (iterated over automatically)

  • A LangGraph ToolNode (automatically expands .tools_by_name)

  • MCP tools (auto-detected, with provider set to "mcp")

Example (functions/rollout/rollout.py):

from dashscope.finetune.reinforcement.component.observability import (
    trace_tool
)

class CalcXRolloutProcessor(AbstractRolloutProcessor):
    async def _init_resources_async(self):
        client = MultiServerMCPClient({
            "calculator": {"url": "http://localhost:10086/sse"}
        })
        tools = await client.get_tools()
        trace_tool(tools)  # Must be called after get_tools()
Warning

Special consideration for MCP: The MCP server and client run in different processes. The @observe_tool decorator on the server has no effect on the client. You must call trace_tool(tools) on the client with the list of tools returned by get_tools().

@observe_llm: Custom LLM function

If trace_client() cannot automatically detect your LLM client, use this decorator to manually mark a function as an LLM call.

Signature requirement: The function must include the model and messages keyword arguments after a *.

from dashscope.finetune.reinforcement.component.observability import (
    observe_llm
)

@observe_llm  # Mark as an LLM span
async def call_custom_llm(*, model: str, messages: list, **kwargs):
    # Custom LLM call logic
    ...

@observe_tool: Custom tool function

If trace_tool() cannot automatically detect your tool (for example, a standard Python function), use this decorator to mark it manually. You can customize the span name using the name parameter.

from dashscope.finetune.reinforcement.component.observability import (
    observe_tool
)

@observe_tool(name="response_scorer")  # Mark as a TOOL span
def score_response(*, messages: list) -> float:
    # Custom scoring logic
    ...

Step 3: Submit the job

When submitting a job, you can leave the env field in the runtime empty. Tracing is enabled by default.

from dashscope.finetune.agentic_rl import AgenticRL
from dashscope.finetune.reinforcement import (
    RolloutFunctionComponent, RewardFunctionComponent,
    FunctionComponentModel, FunctionComponentRuntime
)

client = AgenticRL()

rollout_runtime = FunctionComponentRuntime(
    cpu=2, memory_size=4096, disk_size=512,
    concurrency=30, capacity=30,
    min_capacity=30, max_capacity=60,
    env={}  # Leave empty to enable tracing by default
)

reward_runtime = FunctionComponentRuntime(
    cpu=2, memory_size=4096, disk_size=512,
    concurrency=30, capacity=30,
    min_capacity=30, max_capacity=60,
    env={}
)

result = await client.run(
    model="qwen3.5-9b",
    functions=[
        RolloutFunctionComponent(
            name="rollout-1",
            fcmodel=FunctionComponentModel(
                classpath="functions.rollout.rollout.CalcXRolloutProcessor"),
            runtime=rollout_runtime,
        ),
        RewardFunctionComponent(
            name="reward-1",
            weight=1.0,
            fcmodel=FunctionComponentModel(
                classpath="functions.reward.reward.DemoRewardProcessor"),
            runtime=reward_runtime,
        ),
    ],
    ...
)
Note

To disable tracing (and save costs), set {"ENABLE_TRAJECTORY": "false"} in the env field. Disabling tracing stops the collection of trace data but does not affect system metrics such as Actor, Critic, and Perf.

Tracing: Performance and cost trade-offs

Data flow and cost drivers

  • Data flow: Function code (core decorators) → OpenTelemetry SDK → ARMS → Trace/Metrics tabs in the console

  • Cost drivers: ARMS span storage (pay-as-you-go), minor CPU and network overhead on the function side, and a slight increase in training latency.

Development vs. large-scale training strategy

Stage

Tracing status

Data collected

Data not collected

Cost impact

Development / Small-batch debugging

Fully Enabled

All data (trajectories, reward analysis, tool calls, system metrics)

Low

Canary release / Pre-release validation

Enabled

All data

Medium

Large-scale production training (e.g., 9B model, tens of thousands of samples, multiple epochs)

Recommended to disable

System metrics: actor, critic, trajectory, perf, timing

Trajectory replay, tool call details, and custom metric curves under trace/

Significantly reduced

To disable tracing, set runtime.env = {"ENABLE_TRAJECTORY": "false"}.

Cost management for custom metrics

  • The number and cardinality of metrics in reward_metrics and rollout_metrics affect ARMS storage costs.

  • Avoid using high-cardinality fields, such as user_id or request_id, as metric keys.

  • Limit key metrics to 10 or fewer, and remove temporary debugging metrics.

  • Sub-dimensional metrics from @sub_reward_func are aggregated into the reward_metrics of the corresponding Reward function. This aggregation helps keep the data volume manageable.

Custom metrics

Key-value metrics you define in your code using the following entry points automatically appear on the Metrics tab under the trace/ group and on the Reward Analysis page in the console:

Entry point

Code location

Console path

reward_metrics

Reward(reward_metrics={"acc": 0.8})

trace/reward_metrics/{reward-name}/acc/{avg,sum}

rollout_metrics

AgentOutput(rollout_metrics={"latency": 1.2})

trace/rollout_metrics/latency/{avg,sum}

sub-dimension score

The reward_metrics returned by @sub_reward_func("toxicity")

Merged into the reward_metrics of the corresponding reward function

Multiple reward functions: Set a unique name for each reward function using RewardFunctionComponent(name="reward-1"). This name distinguishes them in the metric path (e.g., trace/reward_metrics/reward-1/...). You can also use reward_metric_weight to set the weight of each sub-metric in the overall score.

Console overview

After the training job starts, go to the model fine-tuning page in the Model Studio console and click the job name to open its details. This section maps the actions in your code to the data displayed in the console.

The job details page has the following five tabs. We recommend using them in this order: first check the progress, then analyze the behavior, and finally drill down if you encounter issues.

Tab

Purpose

Use case

Details

Job progress and status

Always check this tab first

Trajectory

What the model did and why it received its score

Verify model behavior and diagnose low-scoring samples

Metrics

Quantitative training trends

Analyze curves to determine convergence and identify inflection points

Outputs

checkpoint list and publishing

Select a model after training is complete

Logs

stdout, stderr, and error stack traces

Troubleshoot FAILED jobs

Details tab

View basic job information, such as the job ID, training model, training method (RL full-parameter training), and data configuration. Pay close attention to the job status:

  • PENDING: The job is queued and waiting for resource allocation.

  • RUNNING: Training is in progress. You can switch to other tabs to view real-time data.

  • SUCCESS: Training is complete. Switch to the Outputs tab to publish the model.

  • FAILED: Training failed. Check the Logs tab or retrieve logs using the SDK or CLI: AgenticRL.logs(job_id="ft-xxx") or dashscope rl logs "ft-xxx".

Trajectory Details — maps to @observe_processor

On the Trajectory tab, open the Trajectory Details subpage to view the complete interaction process for each rollout:

  • Trajectory List: Displays all sampled trajectories. You can filter them by sample ID, trajectory ID, epoch, or step.

  • Conversation Process: Shows the complete multi-turn interaction (user → assistant → tool_call → tool_result → assistant), clearly showing the model's reasoning chain.

  • Reward Score: Displays the reward score and status (SUCCESS/FAILED) for each step.

Key questions to consider: Is the tool calling correct? Is the number of conversation turns reasonable? Is the model repeating ineffective actions?

Tool Call Analysis — maps to trace_tool / @observe_tool

On the Trajectory tab, open the Tool Call Analysis subpage to view:

  • Tool Call Records: Tool name, call parameters, result, and latency.

  • Tracing subtab: The span tree for each trajectory. You can expand it to see full details of each tool call and LLM request.

Typical use case: Troubleshoot an agent's tool calling failures. Which tool returned an error? Were the parameters passed correctly? Is the latency too high?

Reward Analysis — maps to reward_metrics

Key concepts:

  • sample: An original sample from the training data, such as a question, an instruction, or a prompt.

  • trajectory: A specific interaction trajectory generated from a single sample over n_rollouts sampling attempts.

  • Relationship: One sample generates N trajectories, where N equals n_rollouts.

On the Trajectory tab, open the Reward Analysis subpage to evaluate training performance from three perspectives:

  • Step dimension: Select a training step to view the aggregated reward metrics (average score, success rate, and trend chart) for all samples at that step. This helps you assess the overall training trend.

  • Sample dimension: Select a sample ID to compare the reward scores for the same sample across trajectories. This helps you identify problematic samples.

  • Trajectory dimension: View the raw scores for each scoring dimension of a single trajectory. Use this for attribution analysis.

Metrics tab — maps to rollout_metrics / reward_metrics

On the Metrics tab, under the trace/ group, you can view aggregated curves (avg / sum) for all the custom metrics you defined in your code.

For a complete list of all 13 groups and 121 metrics, see §Training Metrics Reference. To learn how to diagnose anomalies, see §Troubleshooting Decision Tree (P1–P9).

Outputs tab

After training is complete, the Outputs tab displays a list of checkpoints. Each row includes a checkpoint ID, publishing status, and remaining retention time.

  1. Select the target checkpoint and click the Publish button.

  2. Wait for the publishing process to complete (the status changes from "To Be Published" to "Published").

  3. After the model is published, you can call it via API by its model name.

Training completion does not mean training success. You must check the validation/data/reward/mean@1 metric to select the best checkpoint, which is not always the last one. For information on checkpoint retention policies, resuming training, and the SFT→DPO→RL transition path, see §Resume Training, Checkpoints, and Progressive Training in "Reinforcement Learning Training Configuration — Submission and Configuration."

Logs tab

On the Logs tab, you can view the training run logs. You can also retrieve them using the SDK or CLI: AgenticRL.logs(job_id="ft-xxx", lines=100) or dashscope rl logs "ft-xxx" --lines 100.

For instructions on troubleshooting a FAILED job, see §Troubleshooting workflow for FAILED training jobs in the FAQ section.

Training metrics

Key metrics

The following metrics are organized by monitoring dimension to help you quickly assess the health of your training job:

Monitoring dimension

Key metric

Description

Health criteria

Task performance and generalization

critic/rewards/mean

The North Star metric. It measures the mean reward for the current training batch, showing whether the model is learning an effective policy.

Healthy: Increases steadily until convergence.

validation/data/reward/mean@1

Measures the quality of the model's first response to new prompts, reflecting its generalization ability.

Healthy: Increases steadily, consistent with the training reward trend.

Training stability

actor/entropy

Measures the uncertainty in the policy's output, which is crucial for balancing exploration and exploitation.

Healthy: High at the beginning of training, then gradually decreases while remaining at a non-zero level.

actor/ppo_kl

Measures how much the current policy has diverged from the initial policy.

Healthy: Stays within the range of 0.01 to 0.05.

actor/pg_clipfrac

The fraction of updates clipped by importance sampling, indicating the rate of policy drift.

Healthy: Typically remains below 1%.

System efficiency and boundaries

trajectory/response_length_non_aborted/mean

The mean token count for non-truncated responses.

Healthy: Aligns with your task requirements.

trajectory/.../clip_ratio

The fraction of prompts or responses forcibly truncated for exceeding the maximum length.

Healthy: Extremely low, close to 0%.

timing/s/step

The total time for a complete reinforcement learning (RL) step, which includes generation, evaluation, and update.

Healthy: Remains relatively stable.

Troubleshooting framework

Interpreting metrics in layers

  • Essentials (8-10 metrics for daily health checks): The north star metric, three core metrics, truncation rate, and single-step latency (see the Key Metric Quick Reference Table for health criteria).

  • Scenario-specific (by use case): For agent scenarios, monitor trace/num_llm_calls and trajectory/num_turns. For long-text scenarios, review the length-related metrics.

  • Troubleshooting drill-down: timing sub-items / fully_async queue / critic distribution max/min

To learn when to use each of the five tabs in the console, see the tab quick reference table in the "View Results in the Console" section above.

Troubleshooting decision tree (P1–P9)

When you observe an abnormal curve on the Metrics tab, use this section to pinpoint the problem and tune parameters as recommended (see "Reinforcement Learning Training Configuration — Submission and Configuration" § Parameter tuning decision table).

P1: Non-convergence and P2: oscillation

  • P1 primary signal: A long plateau in critic/rewards/mean; actor/ppo_kl ≈ 0; and actor/grad_norm is extremely small.

  • P1 root cause: The learning rate is too small, the reward function consistently returns the same value, or there is insufficient data.

  • P2 primary signal: Major oscillation in rewards/mean, a spike in grad_norm, and pg_clipfrac > 5%.

  • P2 root cause: The learning rate is too high, the batch size is too small, or the advantage variance is not normalized.

P3: KL explosion and P4: sudden entropy drop (mode collapse)

  • P3 primary signal: actor/ppo_kl is consistently > 0.1, pg_clipfrac increases concurrently, and the trajectory shows garbled or repetitive output.

  • P3 recommended action: Increase kl_loss_coef (e.g., from 0.001 to 0.01), decrease the learning rate, and decrease ppo_mini_batch_size.

  • P4 primary signal: actor/entropy drops to nearly 0 within a few dozen steps, rewards/mean shows a false increase, and sample outputs are repetitive and template-like.

  • P4 root cause: A single mode is excessively rewarded, the KL constraint is too weak, or the temperature is too low.

P5: Reward hacking and P6: validation set collapse

  • P5 primary signal: The training reward increases while the validation reward stagnates or decreases; response_length_non_aborted/mean increases linearly and rapidly; and a specific sub-dimension of reward_metrics shows exclusive gains.

  • P5 attribution path: Analyze the abnormal metric, then sample the trajectory. Check if the assistant is padding its output, copying the prompt, or using tools to exploit loopholes in the reward function (for details, see "Reinforcement Learning Development Guide" § Identifying and preventing reward hacking).

  • P6 primary signal: validation/data/reward/mean@1 shows a downward inflection point while the training reward continues to increase.

  • P6 recommended action: Apply early stopping (select an earlier checkpoint from the Outputs tab), increase the KL constraint, and expand the validation set.

P7: Slow training, P8: garbled output, and P9: high truncation rate

  • P7 primary signal: A sudden increase in timing/s/step.

  • P7 drill-down: Investigate rollout/agent_loop_latency and update_actor. Possible root causes include an increase in response length, delays in FC auto-scaling (check the fully_async queue), or high tool latency.

  • P8 primary signal: The trajectory contains garbled or non-natural language output. This is often caused by a KL explosion (P3). Address this issue using the P3 solution.

  • P9 primary signal: A high value for trajectory/response_length/clip_ratio or prompt_length/clip_ratio.

  • P9 recommended action: Increase max_length or shorten the prompt. Also, check if the reward function inadvertently encourages longer outputs.

Three-dimensional console attribution

Using Step, Sample, and Trajectory

Standard workflow: Spot an inflection point on the Metrics tab → Use Step to pinpoint the failing training segment (trend inflection) → Use Sample to find which samples are skewing the mean → Use Trajectory to examine the reasoning chain or tool calling details → Return to the code to modify the reward function or data.

Tool calling analysis and tracing span tree

  • Tool calling analysis: Tool name, parameters, return value, and duration—essential for troubleshooting agent tool failures.

  • The Tracing subtab provides a span tree for each trajectory (ENTRY:ROLLOUT → LLM → TOOL → REWARD).

  • Typical use case: If the reward suddenly drops, check if a tool has failed or the MCP server is unavailable.

  • Note: The server-side @observe_tool decorator does not affect the client. You must use trace_tool(tools) on the client side.

Custom metrics for attribution (rollout_metrics and reward_metrics)

  • System metrics tell you that an error occurred, but custom metrics pinpoint which dimension is failing.

  • Recommended naming convention: Name your metrics according to business sub-dimensions, such as accuracy, format_score, tool_success_rate, or answer_length.

  • Handling multiple reward functions: Distinguish them using RewardFunctionComponent(name=...) and adjust their weights using reward_metric_weight.

  • Anti-pattern: Avoid stuffing log or debugging information into metrics, as this can cause a cardinality explosion.

Troubleshooting with logs, metrics, traces, and tracing

Each of the four observation sources answers a different question: logs show when a failure occurred and provide stack traces; metrics indicate whether an issue is occurring and its severity; traces reveal why an issue occurred and pinpoint the specific sample; and tracing identifies slow code segments and failing external dependencies. Select an observation source based on the issue type:

Issue type

Primary source

Secondary source

Not needed

Training divergence

metric

trace

log

Task failure

log

metric

trace

Stagnant reward

metric → trace

reward_metrics

log

Tool call error

tracing

log

Training slowdown

timing metrics

tracing

Poor output quality

trace

reward_metrics

When to look beyond the console

  • If you suspect an issue with the reward implementation, reproduce it locally using test_functions.

  • If you suspect data issues, sample the JSONL file and inspect rollout_extra.

  • If you suspect incorrect hyperparameters, review the submission script and check it against the "Reinforcement Learning Training Configuration — Submission and Configuration" guide.

Metric groups

Metrics generated during training are grouped by prefix. See the collapsible panels below for a detailed description of the metrics in each group.

Group prefix

Metric count

Type

Description

actor/

8

System

PPO policy network metrics: loss, entropy, KL divergence, clip ratio, gradient norm, and learning rate

critic/

12

System

Reward and value metrics: mean, max, and min of score, rewards, advantages, and returns

trajectory/

16

System

Trajectory statistics: response length, prompt length, truncation rate, abort rate, and conversation turns

trace/

40+

Hybrid

Observability metrics: epoch, LLM calls, success rate, and custom reward_metrics and rollout_metrics

timing/

17

System

Timing analysis: duration of each Trainer phase (in seconds), rollout duration, and per-token processing time (in milliseconds)

perf/

3

System

Performance overview: total tokens, time per step, and throughput per GPU

fully_async/

24

System

Asynchronous training scheduling: queue status, parameter versions, staleness statistics, and processing latency distribution

Metric groups

Click to expand the complete list of metrics for each group:

actor/ — Policy network metrics (8)

Metric

Description

actor/loss

Total loss (pg + entropy + ...).

actor/pg_loss

Policy Gradient (PG) loss.

actor/entropy

Mean token entropy of the current policy, which indicates the degree of exploration.

actor/ppo_kl

The KL divergence between the current policy and the initial policy (the SFT model).

actor/pg_clipfrac

Fraction of updates clipped by importance sampling, which indicates the rate of policy drift.

actor/pg_clipfrac_lower

Lower-bound clip fraction for dual-clip (always 0 if dual-clip is not enabled).

actor/grad_norm

Gradient norm.

actor/lr

Current learning rate.

critic/ — Reward and value evaluation metrics (12)

Metric

Description

critic/score/{mean,max,min}

Statistics for the raw reward score (before subtracting the KL penalty).

critic/rewards/{mean,max,min}

Statistics for the final training reward after subtracting the KL penalty.

critic/advantages/{mean,max,min}

Statistics for the advantage function, which indicates the improvement of the current policy over the baseline.

critic/returns/{mean,max,min}

Statistics for returns (target for the Critic).

trajectory/ — Trajectory statistics (16)

Metric

Description

trajectory/response_length/{mean,max,min}

Statistics for the number of response tokens, including aborted samples.

trajectory/response/aborted_ratio

Ratio of trajectories with a zero-length response, which can result from a Rollout error or cancellation.

trajectory/response_length_non_aborted/{mean,max,min}

Statistics for the number of tokens in valid responses, excluding aborted samples.

trajectory/response_length/clip_ratio

Ratio of responses truncated at the maximum length.

trajectory/prompt_length/{mean,max,min}

Statistics for the number of prompt tokens.

trajectory/prompt_length/clip_ratio

Ratio of prompts truncated at the maximum length.

trajectory/num_turns/{mean,max,min}

Statistics for the number of interaction turns between the Agent and the LLM.

trace/ — Observability metrics

Metric

Description

trace/training/epoch

Current training epoch.

trace/num_llm_calls/{avg,sum}

Average number of LLM calls per trajectory / total number of LLM calls.

trace/success_rate/agent/{avg,sum}

Agent task success rate / cumulative number of successful tasks.

trace/success_rate/reward/{avg,sum}

Reward calculation success rate / cumulative number of successful calculations.

trace/attempts/agent/{avg,sum}

Average / cumulative number of HTTP attempts by the Agent (including retries).

trace/reward/<reward_name>/{avg,sum}

Average / cumulative value for an individual reward function.

trace/reward_metrics/<reward_name>/<metric>/...

User-defined custom sub-metrics returned by a reward function.

timing/ — Timing analysis (17)

Divided into three subgroups: timing/s/* (Trainer stage latency, in seconds), timing/s/rollout/* (Rollout-side latency, in seconds), and timing/ms/*_per_token (per-token latency, in milliseconds).

Metric

Description

timing/s/step

Total time for a complete Trainer step.

timing/s/trainer_fetch_batch

Time the Trainer spends waiting for and fetching a batch.

timing/s/old_log_prob

Time to compute the log-prob of the old policy.

timing/s/adv

Time to compute the advantage function.

timing/s/update_actor

Time taken for the Actor's backpropagation and optimizer step.

timing/s/param_sync

Time taken for the Rollouter to sync parameters from the Trainer.

timing/s/rollout/agent_loop_latency/avg

Total latency for a single Rollout, including Agent interaction, reward calculation, and post-processing.

timing/s/rollout/model_latency/avg

Cumulative LLM inference latency per trajectory.

timing/s/rollout/reward_latency/avg

Latency of reward function calls.

timing/ms/gen_per_token

Per-token latency during the generation phase.

timing/ms/update_actor_per_token

Per-token latency during the Actor update phase.

perf/ — Performance overview (3)

Metric

Description

perf/total_num_tokens

Total number of tokens processed in this step.

perf/time_per_step

Total time elapsed per step.

perf/throughput

Throughput per GPU (tokens/s/GPU).

Custom metrics

See the custom metrics section for a complete definition.

FAQ

Troubleshooting quick reference

Issue keyword

Primary signal

Recommended action

See also

Stagnant reward

critic/rewards/mean plateaus

Check the learning rate and reward function

P1 in this topic

KL explosion / Garbled output

actor/ppo_kl > 0.1

Increase kl_loss_coef or decrease the learning rate

P3 / P8 in this topic

Reward hacking

Training reward increases while validation reward decreases

Switch signals / Apply reward shaping

P5 in this topic + "Reinforcement Learning Development Guide"

High truncation rate

High clip_ratio

Increase max_length

P9 in this topic

FAILED · Function registration failed

Incorrect classpath / Missing dependency

Check requirements.txt

"Reinforcement Learning Training Configuration — Submission and Configuration"

FAILED · Rollout timeout

Individual request timeout

Increase the timeout or use tracing to find bottlenecks

See "Troubleshooting FAILED jobs" below

FAILED · Insufficient resources

MTU or FC container issues

Check the fully_async queue

"Reinforcement Learning Training Configuration — Submission and Configuration"

Cannot see trace data

Console is blank

Check ARMS authorization / requirements.txt

See "Troubleshooting missing trace data" below

Troubleshooting FAILED jobs

This section covers troubleshooting FAILED jobs caused by metric abnormalities during training. For infrastructure-related failures, like FC function registration failures, upload timeouts, or insufficient resources, refer to the FAQ section in "Reinforcement Learning Training Configuration — Submission and Configuration".

Standard troubleshooting workflow

  1. Step 1: Details tab Confirm the job status and the time of failure.

  2. Step 2: Logs tab Inspect the last 100-500 lines of the log (SDK: AgenticRL.logs(job_id, lines=100) / CLI: dashscope rl logs --lines 100).

  3. Step 3: Identify the error layer:

    • User function error (exceptions in rollout/reward functions) → Reproduce locally with test_functions → Fix the code → Re-register and run.

    • Framework error (OOM, insufficient resources, network issues) → Check the fully_async queue and adjust concurrency / capacity.

    • Data error (JSONL parsing failure) → Validate the format of individual records and the rollout_extra field.

Common error patterns

Error pattern

Primary symptom

Recommended action

Rollout timeout

timeout is triggered, LLM is slow, or a tool is slow.

Increase the timeout value and use tracing to identify the bottleneck.

Intermittent reward FAILED

Empty messages are unhandled or there is an encoding error.

Add a try/except block to return TaskStatus.FAILED and the error.

Insufficient resources

Insufficient MTU or FC auto-scaling cannot keep up.

Check the fully_async queue and adjust concurrency or capacity.

Incorrect data format

JSONL parsing fails.

Validate line-by-line JSON format, the roles in messages, and the rollout_extra field.

Viewing trace data

After authorizing ARMS in the Model Studio console:

  • Trajectory tab → View Trajectory Details, Reward Analysis, and Tool Call Analysis.

  • Metrics tab → Check the trace/ group.

Troubleshooting missing trace data

Follow these steps to troubleshoot the issue:

  1. Ensure that ENABLE_TRAJECTORY=false is not set in the runtime env. (Tracing is enabled by default and requires no extra configuration.)

  2. Ensure you have authorized the ARMS service in the Model Studio console.

  3. Check that requirements.txt includes the required OpenTelemetry dependencies.

  4. Check whether the process() method is decorated with @observe_processor.

Performance impact of tracing

Tracing adds minor storage and latency overhead. We recommend enabling it during development and debugging. For large-scale production training, if you only need to view training metrics (such as Actor, Critic, and Perf) and do not need trace details, you can disable tracing by setting {"ENABLE_TRAJECTORY": "false"} in the runtime env. Disabling tracing does not affect system metrics.

Distinguishing reward functions

Set a unique name for each reward function by using RewardFunctionComponent(name="reward-1"). This name appears in the metric path (e.g., trace/reward_metrics/reward-1/...), and the console automatically groups and displays the metrics by name.

Disabling tracing to reduce costs

Set {"ENABLE_TRAJECTORY": "false"} in the runtime env, either by using FunctionComponentRuntime(env={"ENABLE_TRAJECTORY": "false"}) or by setting env: {ENABLE_TRAJECTORY: false} in a YAML file.