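This topic describes how to use VeRL in an ACK cluster to deploy and run a typical reinforcement learning task based on the Qwen2.5-3B-Instruct model. It covers environment preparation, image building, job submission, resource monitoring, and best practices.
Container Service for Kubernetes (ACK) provides enterprises with an efficient, elastic, and scalable containerization platform. Reinforcement learning (RL), an important branch of artificial intelligence, typically requires large amounts of compute, distributed training, and complex environment simulation. With ACK, you can deploy, manage, and scale RL training jobs while making full use of Kubernetes scheduling and the elastic infrastructure of Alibaba Cloud. The following figure shows the component architecture used to run the job in this topic.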

Prerequisites
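- We recommend GPU instance nodes to accelerate training. The example in this topic uses one Lingjun node with eight GU8TF GPUs.
- (Optional) Object Storage Service (OSS) is activated so that model checkpoints, logs, and training data can be persisted.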
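Step 1: Prepare the reinforcement learning training image
This example runs the reinforcement learning task with the VeRL framework. You can use the official VeRL image or build your own. A self-built image must already contain the required dependencies, such as VeRL, vLLM, SGLang, and Ray. The following is a sample Dockerfile: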
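# Start from the official VeRL image
FROM verl/verl:vllm012.latest
WORKDIR /home/verl
# Copy the build context (expected to be the VeRL source tree) into the image
COPY . .
RUN apt update && apt install -y openssh-server vim
# Remove the distro-packaged blinker, which can block pip from replacing it, then install VeRL in editable mode
RUN apt remove python3-blinker -y; pip install -e .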
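Before pushing a self-built image, it can help to confirm that the dependencies named above are importable inside it. The following is a minimal sketch, not part of VeRL; the import names in `REQUIRED_MODULES` are assumptions derived from the package names and may need adjusting for your image.

```python
import importlib.util

# Assumed import names for the dependencies listed above; adjust to your image.
REQUIRED_MODULES = ["verl", "vllm", "sglang", "ray"]


def missing_modules(names):
    """Return the module names that cannot be found on the import path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# An empty list means every listed package can be imported.
print("missing:", missing_modules(REQUIRED_MODULES))
```

Running this script inside the container (for example with `docker run <image> python3 check.py`, where `check.py` is a hypothetical file name) prints the packages that still need to be installed.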
Step 2: Configure the MCP Server and use ACS Sandbox
- Install the MCP Server and Sandbox components.
  Open source version
# Clone the repository
git clone https://github.com/openkruise/agents
cd agents
# Generate the deployment YAML for the agents operator
kubectl kustomize config/default > operator-install.yaml
# Adjust the settings in operator-install.yaml as needed, then apply it
kubectl apply -f operator-install.yaml
# Deploy a test sandbox-manager; see https://github.com/openkruise/agents/blob/master/config/sandbox-manager/README_zh-CH.md
kubectl kustomize config/sandbox-manager > sandbox-manager.yaml
# The MCP code has not been merged yet, so manually change the sandbox-manager image in sandbox-manager.yaml to
# baicun-business-registry.cn-beijing.cr.aliyuncs.com/baicun-dev/sandbox:sandbox-manager-v12
kubectl apply -f sandbox-manager.yaml
# Confirm that the management pods are running
kubectl get pod -l "app.kubernetes.io/name=sandbox-manager" -A
kubectl get pod -l "app.kubernetes.io/name=sandbox-controller-manager" -A
  Marketplace version
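  - On the Clusters page, click the name of the target cluster, and then choose Component Management in the left-side navigation pane.
  - Install an Ingress controller and the Sandbox components.
    - Install the ack-agent-sandbox-controller component with the default configuration.
    - Install the ack-sandbox-manager component.
      - Prepare an E2B domain name. For details on preparing the domain name, configuring DNS resolution, and requesting a certificate, see Use in production environments.
      - Configure the component parameters. Set className to alb (this example assumes that the ALB Ingress Controller component is installed), set domain to your actual domain name, and set adminApiKey to a custom API key. Keep the default values for all other parameters. After the component is installed, an Ingress named sandbox-manager is created in the sandbox-system namespace.
      - If you use the ALB Ingress Controller, also add an HTTPS:443 listener to both the ALB instance and the Ingress.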
- Save the following content as sandbox.yaml, and then run kubectl apply -f sandbox.yaml to deploy the Sandbox definition. The SandboxSet creates a warm pool of three sandboxes. During reinforcement learning, SandboxManager keeps taking sandboxes from this pool and using them.
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-sandbox
spec:
  selector:
    app.kubernetes.io/instance: release-name
    app.kubernetes.io/name: ack-sandbox-manager
    component: sandbox-manager
  type: ClusterIP
  # sessionAffinityConfig is omitted because Kubernetes rejects it when sessionAffinity is None
  sessionAffinity: None
  ports:
    - name: vllm
      protocol: TCP
      port: 8000
      targetPort: 18082
---
apiVersion: agents.kruise.io/v1alpha1
kind: SandboxSet
metadata:
  annotations:
    # Enable the envd initialization capability of SandboxManager
    e2b.agents.kruise.io/should-init-envd: "true"
  name: code-interpreter
  namespace: default
spec:
  # Size of the warm pool; set it slightly above the expected request burst
  replicas: 3
  template:
    spec:
      initContainers:
        - name: init
          image: registry-cn-hangzhou.ack.aliyuncs.com/acs/agent-runtime:v0.0.1
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          volumeMounts:
            - name: envd-volume
              mountPath: /mnt/envd
          env:
            - name: ENVD_DIR
              value: /mnt/envd
          restartPolicy: Always
      containers:
        - name: sandbox
          image: acs-image-test-01-registry.cn-hangzhou.cr.aliyuncs.com/e2b/code-interpreter:v1.6
          imagePullPolicy: IfNotPresent
          terminationMessagePolicy: File
          env:
            - name: ENVD_DIR
              value: /mnt/envd
          volumeMounts:
            - name: envd-volume
              mountPath: /mnt/envd
          lifecycle:
            postStart:
              exec:
                command:
                  - bash
                  - /mnt/envd/envd-run.sh
          startupProbe:
            failureThreshold: 20
            successThreshold: 1
            httpGet:
              path: /health
              port: 49999
              scheme: HTTP
            initialDelaySeconds: 1
            periodSeconds: 2
            timeoutSeconds: 1
      # Terminate containers quickly to raise the chance of reuse
      terminationGracePeriodSeconds: 1
      restartPolicy: Always
      dnsPolicy: ClusterFirst
      volumes:
        - name: envd-volume
          emptyDir: { }
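The warm pool behavior described above can be pictured with a toy model. This is only an illustration of the claim-and-refill idea under simplifying assumptions (refills are instantaneous and sandboxes are plain strings); the `WarmPool` class is hypothetical and does not reflect how the SandboxSet controller or SandboxManager is implemented.

```python
from collections import deque


class WarmPool:
    """Toy model of a pre-warmed pool: a claim pops a ready sandbox, then the pool refills."""

    def __init__(self, replicas):
        self.replicas = replicas
        self._next_id = 0
        self.ready = deque()
        self._refill()

    def _refill(self):
        # Create new sandboxes until the pool is back at its target size.
        while len(self.ready) < self.replicas:
            self.ready.append(f"sandbox-{self._next_id}")
            self._next_id += 1

    def claim(self):
        # Hand out the oldest ready sandbox and immediately top the pool up.
        sandbox = self.ready.popleft()
        self._refill()
        return sandbox


pool = WarmPool(replicas=3)
claimed = [pool.claim() for _ in range(5)]
print(claimed)
print(len(pool.ready))
```

In a real cluster a replacement sandbox takes time to start, which is why the pool size is best set somewhat above the expected burst of concurrent requests.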
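(Optional) Step 3: Prepare the reinforcement learning dataset
VeRL can download a dataset from a remote source when you specify data.train_files. However, datasets are usually large and often need preprocessing, so in production we recommend running a preprocessing job that downloads the data, preprocesses it, and pushes the result to cloud storage.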
- Save the following content as data.yaml, and then run kubectl apply -f data.yaml to download the data from Hugging Face, preprocess it, and push it to an OSS bucket.
apiVersion: v1
kind: Secret
metadata:
  name: hf-oss-credentials
  namespace: default
type: Opaque
stringData:
  # Hugging Face token
  HF_TOKEN: "hf_xxxxx"
  # Alibaba Cloud OSS credentials (the alibabacloud-oss-v2 SDK authenticates through environment variables)
  akId: "xxx"
  akSecret: "xxx"
  OSS_REGION: "xxx"
  OSS_BUCKET: "xxx"
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: preprocess-script
  namespace: default
data:
  preprocess.py: |
    #!/usr/bin/env python3
    """Example dataset preprocessing script."""
    import os
    import re

    from datasets import load_from_disk

    # Dataset label written into every record (GSM8K-style data is assumed).
    data_source = "openai/gsm8k"


    def extract_solution(solution_str):
        """Return the number that follows '#### ' in a GSM8K answer."""
        solution = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
        assert solution is not None, "answer has no '#### <number>' marker"
        return solution.group(1).replace(",", "")


    def preprocess_dataset(input_dir, output_dir):
        """Preprocess the dataset."""
        print(f"Loading dataset from {input_dir}")
        dataset = load_from_disk(input_dir)
        train_dataset = dataset["train"]
        test_dataset = dataset["test"]

        instruction_following = "Let's think step by step and output the final answer after `####`."

        # Add a row to each data item that represents a unique id
        def make_map_fn(split):
            def process_fn(example, idx):
                question_raw = example.pop("question")
                question = question_raw + " " + instruction_following
                answer_raw = example.pop("answer")
                solution = extract_solution(answer_raw)
                data = {
                    "data_source": data_source,
                    "agent_name": "tool_agent",
                    "prompt": [
                        {
                            "role": "system",
                            "content": (
                                "You are a math expert. You are given a question and you need to solve it step by step. "
                                "Reasoning step by step before any tool call. "
                                "You should use the `calc_gsm8k_reward` tool after step by step solving the question, "
                                "before generate final answer at least once and refine your answer if necessary. "
                                "Put your final answer in the format of `#### <answer>`."
                            ),
                        },
                        {
                            "role": "user",
                            "content": question,
                        },
                    ],
                    "ability": "math",
                    "reward_model": {"style": "rule", "ground_truth": solution},
                    "extra_info": {
                        "split": split,
                        "index": idx,
                        "answer": answer_raw,
                        "question": question_raw,
                        "need_tools_kwargs": True,
                        "tools_kwargs": {
                            "calc_gsm8k_reward": {
                                "create_kwargs": {"ground_truth": solution},
                                # "execute_kwargs": {},
                                # "calc_reward_kwargs": {},
                                # "release_kwargs": {},
                            },
                        },
                        "interaction_kwargs": {
                            "query": question,
                            "ground_truth": solution,
                        },
                    },
                }
                return data

            return process_fn

        train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True, num_proc=8)
        test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True, num_proc=8)

        # Save the processed dataset
        os.makedirs(output_dir, exist_ok=True)
        train_dataset.to_parquet(os.path.join(output_dir, "train.parquet"))
        test_dataset.to_parquet(os.path.join(output_dir, "test.parquet"))
        print(f"Processed dataset saved to {output_dir}")
        return output_dir


    if __name__ == "__main__":
        input_path = os.environ.get("INPUT_PATH", "/data/raw")
        output_path = os.environ.get("OUTPUT_PATH", "/data/processed")
        preprocess_dataset(input_path, output_path)
---
apiVersion: batch/v1
kind: Job
metadata:
  name: dataset-pipeline
  namespace: default
  labels:
    app: dataset-pipeline
spec:
  backoffLimit: 3
  template:
    metadata:
      labels:
        app: dataset-pipeline
    spec:
      restartPolicy: OnFailure
      volumes:
        # Preprocessing script
        - name: scripts
          configMap:
            name: preprocess-script
            defaultMode: 0755
      containers:
        - name: dataset-pipeline
          image: python:3.10-slim
          command:
            - /bin/bash
            - -c
            - |
              set -e
              #==========================================
              # Step 1: Install all dependencies
              #==========================================
              echo "=== Installing dependencies ==="
              pip install --no-cache-dir datasets huggingface_hub pandas numpy alibabacloud-oss-v2 Pillow

              #==========================================
              # Step 2: Download the dataset from Hugging Face
              #==========================================
              echo "=== Downloading dataset from HuggingFace ==="
              python3 << 'EOF'
              import os
              from datasets import load_dataset
              from huggingface_hub import login

              # Log in to Hugging Face (needed only for private datasets)
              hf_token = os.environ.get("HF_TOKEN")
              if hf_token:
                  login(token=hf_token)

              # Download the dataset
              dataset_name = os.environ.get("DATASET_NAME", "openai/gsm8k")
              dataset_config = os.environ.get("DATASET_CONFIG", None)
              print(f"Downloading dataset: {dataset_name}")
              dataset = load_dataset(dataset_name, dataset_config)

              # Save it locally
              output_path = "/data/raw"
              dataset.save_to_disk(output_path)
              print(f"Dataset saved to {output_path}")
              EOF
              echo "=== Download completed ==="

              #==========================================
              # Step 3: Run the preprocessing script
              #==========================================
              echo "=== Running preprocessing script ==="
              python3 /scripts/preprocess.py
              echo "=== Preprocessing completed ==="

              #==========================================
              # Step 4: Upload to OSS (with the alibabacloud-oss-v2 SDK)
              #==========================================
              echo "=== Uploading to OSS ==="
              python3 << 'EOF'
              import os
              from pathlib import Path
              import alibabacloud_oss_v2 as oss

              # OSS settings
              bucket_name = os.environ["OSS_BUCKET"]
              region = os.environ["OSS_REGION"]
              oss_prefix = os.environ.get("OSS_PREFIX", "data/geo3k-processed/")
              local_path = os.environ.get("OUTPUT_PATH", "/data/processed")

              # Credentials provider that reads OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET from the environment
              credentials_provider = oss.credentials.EnvironmentVariableCredentialsProvider()

              # Load the default configuration and set the credentials provider
              cfg = oss.config.load_default()
              cfg.credentials_provider = credentials_provider
              cfg.region = region

              # Create the OSS client
              client = oss.Client(cfg)

              def upload_directory(local_dir, oss_prefix):
                  """Recursively upload a directory to OSS."""
                  local_path = Path(local_dir)
                  uploaded_count = 0
                  failed_count = 0
                  for file_path in local_path.rglob("*"):
                      if file_path.is_file():
                          relative_path = file_path.relative_to(local_path)
                          oss_key = f"{oss_prefix}{relative_path}"
                          try:
                              # Read the file content
                              with open(file_path, 'rb') as f:
                                  data = f.read()
                              # Upload to OSS
                              result = client.put_object(oss.PutObjectRequest(
                                  bucket=bucket_name,
                                  key=oss_key,
                                  body=data,
                              ))
                              print(f"Uploaded: {file_path} -> {oss_key} (status: {result.status_code})")
                              uploaded_count += 1
                          except Exception as e:
                              print(f"Failed to upload {file_path}: {e}")
                              failed_count += 1
                  return uploaded_count, failed_count

              uploaded, failed = upload_directory(local_path, oss_prefix)
              print(f"=== Upload completed: {uploaded} files uploaded, {failed} files failed ===")
              if failed > 0:
                  raise Exception(f"{failed} files failed to upload")
              EOF
              echo "=== Pipeline completed successfully ==="
          env:
            # Hugging Face settings
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: HF_TOKEN
            # The preprocessing script expects GSM8K-style question/answer fields
            - name: DATASET_NAME
              value: "openai/gsm8k"
            - name: DATASET_CONFIG
              value: "main"
            - name: HF_HOME
              value: "/tmp/huggingface"
            # Preprocessing settings
            - name: INPUT_PATH
              value: "/data/raw"
            - name: OUTPUT_PATH
              value: "/data/processed"
            # OSS settings
            - name: OSS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: akId
            - name: OSS_ACCESS_KEY_SECRET
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: akSecret
            - name: OSS_REGION
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: OSS_REGION
            - name: OSS_BUCKET
              valueFrom:
                secretKeyRef:
                  name: hf-oss-credentials
                  key: OSS_BUCKET
            # Object prefix for the processed files; keep it in line with the paths the training job reads
            - name: OSS_PREFIX
              value: "data/geo3k-processed/"
          volumeMounts:
            - name: scripts
              mountPath: /scripts
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "16Gi"
              cpu: "4"
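Each processed record stores its ground truth under `reward_model` with `"style": "rule"`. As a rough illustration of what a rule-based reward of this kind does, the sketch below compares the final `#### <answer>` in a model response with the stored ground truth. The function names are hypothetical and this is not VeRL's own scoring code.

```python
import re


def parse_final_answer(response):
    """Return the last '#### <number>' value in a response, or None if absent."""
    matches = re.findall(r"#### (-?[0-9.,]+)", response)
    if not matches:
        return None
    # Drop thousands separators so "1,000" compares equal to "1000".
    return matches[-1].replace(",", "")


def rule_reward(response, ground_truth):
    """Score 1.0 when the parsed answer equals the ground truth, else 0.0."""
    return 1.0 if parse_final_answer(response) == ground_truth else 0.0


print(rule_reward("48 + 24 = 72\n#### 72", "72"))
print(rule_reward("I am not sure.", "72"))
```

The comparison is an exact string match, so a response that omits the `####` marker or adds trailing punctuation to the number scores 0.0 in this sketch.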
Step 4: Submit the configuration for the reinforcement learning task
- Save the following content as pvpvc.yaml, and then run kubectl apply -f pvpvc.yaml to mount statically provisioned OSS volumes through PVs and PVCs. The following example authenticates with an AccessKey pair (AK/SK). For RRSA authentication, see Use an ossfs 2.0 statically provisioned volume.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ym-dataset
  labels:
    alicloud-pvname: ym-dataset
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: ym-dataset # Must match the PV name.
    nodePublishSecretRef:
      name: hf-oss-credentials
      namespace: default
    volumeAttributes:
      bucket: "xxxx" # Replace with the actual bucket name.
      url: "oss-ap-southeast-1-internal.aliyuncs.com" # Replace with the actual OSS endpoint.
      otherOpts: "-o umask=022 -o max_stat_cache_size=100000 -o allow_other -o dbglevel=debug -o curldbg"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ym-dataset
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: ym-dataset
# (Optional) The model can instead be downloaded at run time by passing a Hugging Face repository path
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: ym-models
  labels:
    alicloud-pvname: ym-models
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: ym-models # Must match the PV name.
    nodePublishSecretRef:
      name: hf-oss-credentials
      namespace: default
    volumeAttributes:
      bucket: "xxxx" # Replace with the actual bucket name.
      url: "oss-ap-southeast-1-internal.aliyuncs.com" # Replace with the actual OSS endpoint.
      otherOpts: "-o umask=022 -o max_stat_cache_size=100000 -o allow_other -o dbglevel=debug -o curldbg"
      path: "/"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ym-models
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: ym-models
- Save the following content as configs.yaml, and then run kubectl apply -f configs.yaml to submit the task configuration.
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: gsm8k-configs
  namespace: default
data:
  gsm8k_multiturn_grpo.yaml: |
    hydra:
      searchpath:
        - file://verl/trainer/config
    defaults:
      - ppo_trainer
      - _self_
    data:
      max_prompt_length: 1024
      max_response_length: 1024
      train_batch_size: 256
      return_raw_chat: True
    actor_rollout_ref:
      hybrid_engine: True
      rollout:
        name: vllm
        multi_turn:
          enable: True
          max_assistant_turns: 5
  # In mcp_server.json, replace url with the Ingress address of the sandbox MCP service.
  # If an api_key is required, you can add an nginx container to the Ray cluster to act as a proxy.
  # JSON does not allow comments, so these notes are kept outside the file content.
  mcp_server.json: |
    {
      "mcpServers": {
        "Tavily Expert": {
          "url": "xxxxx",
          "api_key": "xxxxx"
        }
      }
    }
  gsm8k_mcp_tool_config.yaml: |
    tools:
      - class_name: verl.tools.mcp_search_tool.MCPSearchTool
        config:
          rate_limit: 120
          timeout: 120
          type: mcp
        mcp:
          mcp_servers_config_path: /var/configs/mcp_server.json
          tool_selected_list:
            - run_code_once
      - class_name: "verl.tools.gsm8k_tool.Gsm8kTool"
        config:
          type: native
        tool_schema:
          type: "function"
          function:
            name: "calc_gsm8k_reward"
            description: "A tool for calculating the reward of gsm8k. (1.0 if parsed answer is correct, 0.0 if parsed answer is incorrect or not correctly parsed)"
            parameters:
              type: "object"
              properties:
                answer:
                  type: "string"
                  description: "The model's answer to the GSM8K math problem, must be a digits"
              required: ["answer"]
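Because mcp_server.json is read as strict JSON, a stray `#` note or trailing comma makes the file unparseable. The sketch below is one way to check the file content before creating the ConfigMap; `load_mcp_servers` is a hypothetical helper and the URL is a placeholder, not a real endpoint.

```python
import json


def load_mcp_servers(text):
    """Parse an MCP server config and return {server name: url}."""
    # json.loads raises json.JSONDecodeError on comments or trailing commas.
    config = json.loads(text)
    return {name: entry["url"] for name, entry in config["mcpServers"].items()}


valid = '{"mcpServers": {"Tavily Expert": {"url": "https://example.invalid/mcp"}}}'
print(load_mcp_servers(valid))

with_comment = '{"mcpServers": {"Tavily Expert": {"url": "x" # note\n}}}'
try:
    load_mcp_servers(with_comment)
except json.JSONDecodeError as err:
    print("rejected:", err.msg)
```

Running the same check against the file you plan to mount at /var/configs/mcp_server.json surfaces syntax problems before the training job starts.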
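Step 5: Submit the reinforcement learning task
VeRL can discover the tools exposed by an MCP server through MCPSearchTool. At the start of each sample, the AgentLoop connects to the MCP server and then calls the tools during the multi-turn conversation.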
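The multi-turn pattern described above can be sketched as a loop that alternates assistant turns with tool results until the model answers or a turn cap is reached. Everything in this sketch is hypothetical (`run_episode`, the scripted policy, and the canned tool); it only illustrates the control flow and is not VeRL's AgentLoop implementation.

```python
def run_episode(policy, tools, question, max_assistant_turns=5):
    """Alternate assistant turns and tool results until a final answer or the turn cap."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_assistant_turns):
        action = policy(messages)
        messages.append({"role": "assistant", "content": action["content"]})
        call = action.get("tool_call")
        if call is None:
            break  # no tool requested: treat this turn as the final answer
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    return messages


def scripted_policy(messages):
    # Hypothetical stand-in for the model: call the tool once, then answer.
    if not any(m["role"] == "tool" for m in messages):
        return {
            "content": "Let me compute.",
            "tool_call": {"name": "run_code_once", "arguments": {"code": "6 * 7"}},
        }
    return {"content": "#### 42"}


# Hypothetical local stand-in with a canned result; a real tool would run in a remote sandbox.
tools = {"run_code_once": lambda code: {"6 * 7": 42}[code]}

transcript = run_episode(scripted_policy, tools, "What is 6 * 7?")
print([m["role"] for m in transcript])
```

The `max_assistant_turns` argument plays the same role as the setting of that name in the multi-turn rollout configuration: it bounds how many times the model may respond within one sample.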
- Save the following content as rayjob.yaml, and then run kubectl apply -f rayjob.yaml to submit the reinforcement learning task.
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-example
  namespace: default
spec:
  shutdownAfterJobFinishes: false
  # ttlSecondsAfterFinished: 300
  runtimeEnvYAML: |
    working_dir: /home/verl
  submissionMode: SidecarMode
  entrypoint: |
    python3 -m verl.trainer.main_ppo \
      --config-path=/var/configs \
      --config-name='gsm8k_multiturn_grpo' \
      algorithm.adv_estimator=grpo \
      data.train_batch_size=16 \
      data.max_prompt_length=1024 \
      data.max_response_length=1024 \
      data.filter_overlong_prompts=True \
      data.truncation='error' \
      data.return_raw_chat=True \
      actor_rollout_ref.model.path=/var/model/Qwen2.5-3B-Instruct \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.model.use_remove_padding=True \
      actor_rollout_ref.actor.ppo_mini_batch_size=8 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.actor.use_kl_loss=True \
      actor_rollout_ref.actor.kl_loss_coef=0.001 \
      actor_rollout_ref.actor.kl_loss_type=low_var_kl \
      actor_rollout_ref.actor.entropy_coeff=0 \
      actor_rollout_ref.model.enable_gradient_checkpointing=True \
      actor_rollout_ref.actor.fsdp_config.param_offload=False \
      actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
      actor_rollout_ref.rollout.name=vllm \
      actor_rollout_ref.rollout.mode=async \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
      actor_rollout_ref.rollout.n=16 \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.ref.fsdp_config.param_offload=True \
      actor_rollout_ref.rollout.trace.backend=mlflow \
      actor_rollout_ref.rollout.trace.token2text=True \
      algorithm.use_kl_in_reward=False \
      trainer.critic_warmup=0 \
      trainer.logger='["console","mlflow"]' \
      trainer.project_name='gsm8k_tool-agent' \
      trainer.experiment_name='qwen2.5-3b_function_rm-gsm8k-vllm-tool-agent-verify-n16' \
      trainer.n_gpus_per_node=8 \
      trainer.nnodes=1 \
      trainer.save_freq=1 \
      trainer.test_freq=20 \
      trainer.total_training_steps=1 \
      data.train_files=/var/model-dataset/processed-gsm8k/train20.parquet \
      data.val_files=/var/model-dataset/processed-gsm8k/test100.parquet \
      actor_rollout_ref.rollout.multi_turn.tool_config_path="/var/configs/gsm8k_mcp_tool_config.yaml" \
      actor_rollout_ref.actor.checkpoint.save_contents='["hf_model", "model"]' \
      trainer.total_epochs=1
  rayClusterSpec:
    headGroupSpec:
      rayStartParams:
        dashboard-host: 0.0.0.0
      serviceType: ClusterIP
      template:
        metadata:
          annotations:
          labels:
        spec:
          affinity: {}
          tolerations:
            - key: node-role.alibabacloud.com/lingjun
          containers:
            - env:
                - name: VERL_ROOT
                  value: /home/verl
              image: registry-ap-southeast-1.ack.aliyuncs.com/dev/verl:vllm012.latest.43dc9a44
              imagePullPolicy: IfNotPresent
              name: ray-head
              resources:
                limits:
                  cpu: "100"
                  memory: 500Gi
                  nvidia.com/gpu: "8"
              securityContext:
                runAsUser: 0
              volumeMounts:
                - mountPath: /var/configs
                  name: configs
                - mountPath: /var/model
                  name: model
                - mountPath: /var/model-dataset
                  name: model-dataset
          imagePullSecrets:
            - name: regcred-hangzhou
            - name: regcred-ap-southeast
          volumes:
            - name: configs
              configMap:
                name: gsm8k-configs
            - name: model
              persistentVolumeClaim:
                claimName: ym-models
            - name: model-dataset
              persistentVolumeClaim:
                claimName: ym-dataset
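The batch-related overrides in the entrypoint interact with one another, so a quick arithmetic check can catch inconsistent values before the job is submitted. The sketch below assumes the usual reading of these options, in which batch sizes count prompts and each rollout engine replica spans `tensor_model_parallel_size` GPUs; `rollout_plan` is a hypothetical helper, not a VeRL function.

```python
def rollout_plan(train_batch_size, rollout_n, ppo_mini_batch_size, n_gpus, tensor_parallel_size):
    """Derive per-step quantities from the batch and parallelism settings."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size must be divisible by ppo_mini_batch_size")
    if n_gpus % tensor_parallel_size != 0:
        raise ValueError("GPU count must be divisible by tensor_model_parallel_size")
    return {
        # Every prompt in the batch is sampled rollout_n times.
        "responses_per_step": train_batch_size * rollout_n,
        # Each mini-batch of prompts drives one policy update.
        "policy_updates_per_step": train_batch_size // ppo_mini_batch_size,
        # Each rollout engine replica occupies tensor_parallel_size GPUs.
        "rollout_replicas": n_gpus // tensor_parallel_size,
    }


# Values taken from the entrypoint: batch 16, n=16, mini-batch 8, 8 GPUs, TP=2.
plan = rollout_plan(16, 16, 8, 8, 2)
print(plan)
```

With the values used in this topic, each training step generates 256 responses, performs 2 policy updates, and runs 4 rollout replicas on the 8-GPU node.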