Reinforcement learning training

更新时间:
复制 MD 格式

This guide walks you through submitting and monitoring your first reinforcement learning (RL) training job using an end-to-end example.

Learn about the basic concepts and use cases of RL training.

Model Studio offers several model fine-tuning methods that are not mutually exclusive. Instead, they form a progressive and complementary workflow.

continued pre-training (CPT) (optional) → supervised fine-tuning (SFT) → direct preference optimization (DPO) (optional) → RL (optional)

  1. Continued pre-training (CPT) - Fills knowledge gaps. General models have broad but shallow knowledge and often cannot meet the depth and accuracy requirements of specialized domains.

    • Financial model: Learns financial terminology

    • Medical model: Memorizes drugs and pathologies

    • Legal model: Understands legal articles and case law

  2. Supervised fine-tuning (SFT) - Learns to perform tasks

    • Customer service bot: Learns customer service procedures

    • Code assistant: Learns programming paradigms

    • tool calling (agent): Learns to use Model Context Protocol (MCP)

  3. Direct preference optimization (DPO) - Improves performance

    • Safety and responsibility: Rejects harmful suggestions

    • Conciseness and effectiveness: Provides crisp and clear answers

    • Objectivity and neutrality: Gives fair and unbiased evaluations

  4. Reinforcement learning (RL) - Learns to reason. The model autonomously explores optimal strategies guided by a reward signal, without needing a standard answer.

    • Mathematical reasoning: Solves problems more accurately

    • tool calling (agent): Makes more reliable calls

    • Code generation: Writes executable code

RL training workflow

The RL training process consists of five steps, from environment setup to model deployment:

  1. Set up the environment — Install the offline SDK, get an API key, and complete RL service authorization.

  2. Prepare data and code — Download the demo package to understand the data format and project structure.

  3. Submit the training job — Run the demo to deploy functions, upload data, and submit the job in one step.

  4. Monitor training status — View the training status, logs, and reward trends.

  5. Publish and deploy the model — Publish a checkpoint to "My Models" and deploy it for API access.

How RL training works

RL training continuously improves the model's capabilities through a "generate-score-optimize" loop. The data flow for each training round is as follows:

Training data (user questions + reference answers)
            │
            ▼
Rollout: The model generates candidate results based on the current policy.
            │
            ▼
Reward: Scores the result on a scale of 0 to 1 by comparing it with the reference answer. This step can be merged with the rollout.
            │
            ▼
Policy update: Reinforce high-scoring policies and suppress low-scoring ones based on the reward signal.
            │
            ▼
Iteration: Resample using the updated model to improve quality in each round.

Supported models

Model name

Model code

MoE

Recommended MTUs

Qwen3.5-9B

qwen3.5-9b

No

Type IV model training unit * 24

Qwen3.5-35B-A3B

qwen3.5-35b-a3b

Yes

-

Training unit pricing

Billing method

Description

Billing formula

Model training unit · subscription

Purchase dedicated model training units on a monthly subscription for faster, queue-free training. The minimum billing unit is one hour. If you cancel your subscription early, you are charged at 1.2 times the hourly rate. To purchase or manage subscription training units, go to the Model Fine-tuning console and click Manage Training Resources.

Unit price × Number of instances × Subscription duration

Model training unit · pay-as-you-go

Use dedicated training resources for faster, queue-free training. You are billed for the actual duration of use. This option is enabled automatically when you create a job and does not require a pre-purchase.

Hourly unit price × Number of instances × Hours of use

Training unit type

Billing method

Unit price

Minimum billing unit

Type IV (MTU4)

subscription (monthly)

CNY 19,914.00/month/instance

1 hour

pay-as-you-go (per minute)

CNY 41.00/hour/instance

1 minute

Note

Contact your account manager for rules on subscription scaling, renewal, and cancellation, or for specific model training unit (MTU) capacity calculations.

Environment setup

Warning

RL training only supports billing based on model training units (MTUs). Billing by token is not supported.

Follow these steps to configure your environment:

  1. Get an API key: Obtain a DashScope API key from the Model Studio console.

  2. Authorize the RL service: On the Model Fine-tuning page in the Model Studio console, complete the one-click authorization. The first time you use RL training, the console prompts you to authorize Alibaba Cloud OpenTelemetry, Function Compute (FC), and Log Service (SLS).

    image

  3. Set environment variables: The local SDK reads the following variables to control packaging, uploading, and deployment:

    • DASHSCOPE_API_KEY (Required): Your API key.

    • FC_PYPI_LIB (Required): Specifies the offline dashscope package to be installed when the FC container starts. The SDK automatically injects this variable as FC_SDK_PACKAGE during deployment, so you do not need to install it manually in FC. The .whl filename must exactly match this variable and be placed in the project's root directory.

    • LOG_LEVEL (Optional, default: info): Set to debug to output complete request/response information, which is useful for troubleshooting submission failures.

    export DASHSCOPE_API_KEY="sk-your-api-key"
    export FC_PYPI_LIB="dashscope-1.25.16-py3-none-any.whl"
    export LOG_LEVEL="info"
    

Data and code preparation

  1. Download the demo package agentic-rl-example.zip. (This demo requires a minimum of 24 Type IV model training units to run.)

  2. After downloading the demo package, install the DashScope SDK (requires Python >= 3.10). Follow the instructions in the demo package to install the required dependencies:

    pip install dashscope-1.25.16-py3-none-any.whl  # Install the local .whl package
    # Or use the PyPI source: pip install dashscope==1.25.16

The project has the following structure:

agentic-rl-example/
├── submit_job.py                         ← Main entry point for submitting the training job
├── test_functions.py                     ← Function registration and remote testing
├── config.yaml                           ← Optional YAML configuration file
├── dashscope-1.25.16-py3-none-any.whl   ← Offline SDK installation package
├── requirements.txt                      ← List of dependencies
├── data/
│   ├── calc_train.jsonl                  ← Training data (full version)
│   ├── calc_train_min.jsonl              ← Training data (demo uses the minimal version by default)
│   ├── calc_validation.jsonl             ← Validation data (full version)
│   └── calc_validation_min.jsonl         ← Validation data (demo uses the minimal version by default)
├── functions/
│   ├── __init__.py
│   ├── rollout/
│   │   ├── rollout.py                    ← Rollout function (LangGraph + MCP tool)
│   │   └── rollout_only.py              ← Rollout function (simple version, with trajectory example)
│   └── reward/
│       ├── reward.py                     ← Reward function (basic scoring)
│       ├── reward_decorator.py           ← Reward function (decorator style)
│       └── group_reward.py              ← Reward function (group scoring)
└── resources/
    ├── rollout_input.json               ← Rollout test input
    ├── reward_input.json                ← Reward test input
    ├── reward_decorator_input.json      ← Decorator reward test input
    └── group_reward_input.json          ← Group reward test input

Demo

This demo implements a calculation agent:

  • Rollout function: Calls the model being trained, using a calculator tool to answer math problems.

  • Reward function: Compares the model's output with the reference answer to determine if the response is correct and assigns a score.

Note

The rollout function in the demo uses LangGraph to orchestrate the agent workflow and calls the calculator tool via the Model Context Protocol (MCP), a tool-calling protocol for LLMs.

To write your own rollout and reward functions, see the Reinforcement Learning Development Guide.

Data format

The training data is in JSONL format, with each line being a JSON object containing messages and rollout_extra fields:

{"messages": [{"content": "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May?", "role": "user"}], "rollout_extra": {"solution": "72"}}
  • messages: The user's question, in a format consistent with ChatML.

  • rollout_extra: Contains the reference answer or other business data, which is passed to the reward function for scoring. You can customize the fields in this object.

Data in rollout_extra passes through to RolloutInput.rollout_extra in the rollout function and RewardInput.agent_output.rollout_extra in the reward function, making it available for your scoring logic. This field supports custom keys and is not limited to the solution key used in the example.

Recommended dataset size

  • Getting started: A few dozen to a few hundred entries are enough to validate your approach (the data volume must be greater than the batch_size).

  • Production training: A larger dataset generally yields better results. Large datasets also support a larger batch size, which helps improve training stability.

Job submission

The demo already includes the rollout and reward functions. Run the following command to deploy the functions, upload data, and submit the training job in a single step:

python submit_job.py

Full code and parameter descriptions

The following code is the complete training script. Its comments explain each parameter and provide recommended values:

import asyncio
from dashscope.finetune.reinforcement import (
    RolloutFunctionComponent, RewardFunctionComponent,
    FunctionComponentModel, FunctionComponentRuntime,
    TrainingDataset, ValidationDataset, DataSourceType,
)
from dashscope.finetune.agentic_rl import AgenticRL

async def main():
    client = AgenticRL()

    # client.run() handles everything in one step: registering functions to FC, uploading training data, and submitting the training job
    result = await client.run(
        model="qwen3.5-9b",                              # Base model
        training_datasets=[TrainingDataset(              # Training dataset
            data_source_type=DataSourceType.FILE_ID,
            file_name="./data/calc_train_min.jsonl",
        )],
        validation_datasets=[ValidationDataset(          # Validation dataset (optional)
            data_source_type=DataSourceType.FILE_ID,
            file_name="./data/calc_validation_min.jsonl",
        )],
        functions=[
            # Rollout function: Defines how the model performs policy sampling
            RolloutFunctionComponent(
                name="rollout-1",
                timeout=600,
                fcmodel=FunctionComponentModel(
                    classpath="functions.rollout.rollout.CalcXRolloutProcessor"),
                runtime=FunctionComponentRuntime(
                    cpu=2, memory_size=4096, disk_size=512,
                    concurrency=30, capacity=30,
                    env={"ENABLE_TRAJECTORY": "true"})),  # Environment variables for the FC function
            RewardFunctionComponent(
                name="reward-1",
                weight=1.0,
                timeout=120,
                fcmodel=FunctionComponentModel(
                    classpath="functions.reward.reward.DemoRewardProcessor"),
                runtime=FunctionComponentRuntime(
                    cpu=2, memory_size=4096, disk_size=512,
                    concurrency=30, capacity=30)),
        ],
        # MTU billing configuration
        resource_config={
            "charge_type": "mtu_postpaid",   # pay-as-you-go
            "mtu_spec_code": "MTU4",         # Training unit specification
            "mtu_capacity": 1,               # Number of MTUs
        },
        # Hyperparameters for the GSPO algorithm (the following are all required hyperparameters for the non-MoE qwen3.5-9b model)
        hyper_parameters={
            "algorithm": "gspo",             # Training algorithm
            "n_epochs": 5,                   # Number of training epochs
            "learning_rate": 7e-6,           # Learning rate
            "lr_scheduler_type": "cosine",   # Learning rate scheduler policy
            "lr_warmup_steps": -1,           # Learning rate warmup steps (-1 = auto)
            "batch_size": 128,               # Batch size
            "n_rollouts": 8,                 # Number of rollouts per sample
            "ppo_mini_batch_size": 16,       # PPO mini-batch size
            "kl_loss_coef": 0.001,           # KL divergence loss coefficient
            "max_length": 8192,              # Maximum sequence length (prompt + response)
            "max_prompt_length": 2048,       # Maximum prompt length
            "eval_steps": 50,                # Perform evaluation every N steps
            "save_steps": 100,               # Save a checkpoint every N steps
            "save_strategy": "steps",        # Save strategy (steps or epoch)
            "save_total_limit": 3,           # Maximum number of checkpoints to keep
        })

    if result.status_code == 200:
        print(f"Training job submitted. Job ID: {result.output.job_id}")
    else:
        print(f"Submission failed: {result}")

asyncio.run(main())
Note

You can also submit training jobs via a configuration file or the CLI (dashscope rl run). For detailed explanations of all three submission methods and their fields, see Training Configuration — Submission and Configuration.

Getting started tips

After your first successful run, follow these four practices:

  • Start with a minimal dataset: The demo defaults to using calc_train_min.jsonl and n_epochs=1. Run this once to confirm the entire pipeline works before switching to the full dataset. This helps you avoid wasting 24 MTUs on a large dataset only to discover a configuration error.

  • Do not change the hyperparameters initially: The 13 hyperparameters listed above are the recommended starting point for qwen3.5-9b. Do not adjust them for your first training run. When you do make adjustments, change only one variable at a time (like the learning rate or batch size) and run for at least eval_steps × 3 steps before evaluating the trend.

  • Focus on one core metric: On the Metrics tab, watch critic/rewards/mean. A steady increase indicates the model is learning. If it stagnates or decreases, stop the job and investigate instead of letting it run idly.

  • If the job fails, check the logs first: If a job status is FAILED, go to the Logs tab and check the error messages at the end, or use the SDK to pull the logs: AgenticRL.logs(job_id="ft-xxx", lines=100). Configuration or dependency errors are usually easy to spot.

Note

For algorithm selection, hyperparameter starting points, and tuning decisions, see "Training Configuration — Submission and Configuration." For reward function design principles and defense against hacking, see the "Reinforcement Learning Development Guide." For guidance on attributing the cause of abnormal metrics, see "Observability Configuration and Metric Reference for Reinforcement Learning."

Training monitoring and analysis

After you submit the job, the job_id returned by run() is the unique identifier for querying the training status. Training typically takes from tens of minutes to several hours, depending on the dataset size and model.

In the Model Studio console, go to the Model Fine-tuning page and click a job to open its details page, which contains the following five tabs:

Tab

Purpose

Main content

Details

Job metadata

Job ID, status, base model, training method, start/end times, data configuration

Trajectory

Agent behavior replay

Complete dialogue process, reward analysis (across step, sample, and trajectory dimensions), and tool calling details

Metrics

Training health

13 groups of metric charts (actor, critic, trajectory, trace, timing, perf, fully_async, etc.), including user-defined metrics

Logs

Runtime troubleshooting

stdout/stderr from the training process, used to locate the cause of errors when a job fails

Output

Model assets

List of checkpoints and their publishing status. You can publish a checkpoint to "My Models" for deployment and use.

image

For a detailed description of each tab, see Observability Configuration for Reinforcement Learning.

Model publishing and deployment

When training is complete, the final checkpoint is automatically published to My Models. To publish an intermediate checkpoint, go to the Model Fine-tuning > Output page in the Model Studio console and do so manually. For details, see Model Fine-tuning Console (Model Studio documentation, online only).

After a model is published, you can deploy it from the My Models page. Once deployed, you can call the model via an API. For details, see Model Deployment (Model Studio documentation, online only).

Next steps