Run reinforcement learning tasks on ACK

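This topic describes how to use VeRL to deploy and run a typical reinforcement learning task on an ACK cluster with the Qwen2.5-3B-Instruct model. It covers environment preparation, image building, task submission, resource monitoring, and best practices.

Container Service for Kubernetes (ACK) provides enterprises with an efficient, elastic, and scalable container platform. Reinforcement learning (RL), an important branch of artificial intelligence, typically involves large amounts of compute, distributed training, and complex environment simulation. With ACK, you can deploy, manage, and scale RL training tasks while taking advantage of Kubernetes scheduling and the elastic infrastructure of Alibaba Cloud. The following figure shows the component architecture of the job in this topic.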

[Figure: component architecture of the job]

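Prerequisites

  1. Create an ACK managed cluster.

    • GPU-accelerated nodes are recommended to speed up training. The example in this topic uses one Lingjun node with eight GU8TF GPUs.

  2. Obtain the cluster kubeconfig and connect to the cluster with kubectl.

  3. Install the KubeRay Operator component.

  4. (Optional) Activate Object Storage Service (OSS) to persist model checkpoints, logs, and training data.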

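Step 1: Prepare the reinforcement learning training image

This example uses the VeRL framework to run the reinforcement learning task. You can use the official VeRL image or build your own. A custom image must have the required dependencies installed, such as VeRL, vLLM, SGLang, and Ray. The following is a sample Dockerfile: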

FROM verl/verl:vllm012.latest
WORKDIR /home/verl
COPY . .
RUN apt update && apt install -y openssh-server vim
RUN apt remove python3-blinker -y; pip install -e .
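
When you build a custom image, it can help to confirm that the expected packages can be located before you submit a job. The following is a minimal sketch and is not part of VeRL: the module names in REQUIRED_MODULES (verl, vllm, ray) are assumptions based on the dependencies listed above, and find_missing is a hypothetical helper.

```python
import importlib.util

# Module names are assumptions; adjust them to the packages your image needs.
REQUIRED_MODULES = ["verl", "vllm", "ray"]


def find_missing(modules):
    """Return the top-level module names that cannot be located in this environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Report the result without importing the (potentially heavy) packages.
missing = find_missing(REQUIRED_MODULES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

You could run a script like this inside the built image as a quick smoke test, and extend the list (for example with sglang) if your rollout backend needs it.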

Step 2: Configure the MCP Server and use ACS Sandbox

  1. Install the MCP Server Sandbox components.

    Open-source version

    # Clone the repository
    git clone https://github.com/openkruise/agents
    cd agents
    # Generate the deployment YAML for the agents operator
    kubectl kustomize config/default >operator-install.yaml
    # Adjust the settings in operator-install.yaml as needed
    kubectl apply -f operator-install.yaml
    # Deploy a test sandbox-manager; see https://github.com/openkruise/agents/blob/master/config/sandbox-manager/README_zh-CH.md
    kubectl kustomize config/sandbox-manager >sandbox-manager.yaml
    # The MCP code has not been merged yet, so manually change the sandbox-manager image in sandbox-manager.yaml to
    # baicun-business-registry.cn-beijing.cr.aliyuncs.com/baicun-dev/sandbox:sandbox-manager-v12
    kubectl apply -f sandbox-manager.yaml
    # Confirm that the management pods are running
    kubectl get pod -l "app.kubernetes.io/name=sandbox-manager" -A
    kubectl get pod -l "app.kubernetes.io/name=sandbox-controller-manager" -A

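    Marketplace version

    1. On the Clusters page, click the name of the target cluster. In the left navigation pane, choose Add-ons.

    2. Install the Ingress Controller and the Sandbox components.

      1. Install the ack-agent-sandbox-controller component.

        Install the component with the default configuration.

      2. Install the ack-sandbox-manager component.

        1. Prepare the E2B domain name.

          For details on preparing the domain name, configuring DNS resolution, and applying for a certificate, see the topic on use in production environments.

        2. Configure the component parameters.

          Set className to alb (this example assumes that the ALB Ingress Controller component is installed), set domain to your actual domain name, and set adminApiKey to a custom API key. Keep the defaults for the other settings. After the component is installed, an Ingress named sandbox-manager is created in the sandbox-system namespace.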

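          Detailed parameter descriptions

          | Category | Parameter | Description |
          | --- | --- | --- |
          | sandboxManager | replicaCount | Number of sandbox-manager replicas. Default: 3. |
          | E2B | domain | The E2B domain name, that is, the domain name prepared in the "Prepare the E2B domain name" step. |
          | E2B | Enable E2B_API_KEY verification | Whether to enable API_KEY authentication. Enabled by default. |
          | E2B | adminApiKey | With authentication enabled, the initial key used at first installation. Replace it with a custom API key. |
          | Controller | logLevel | Controller log level. Default: 1. |
          | Controller | resources.requests.cpu | Controller CPU request. Default: 2. |
          | Controller | resources.requests.memory | Controller memory request. Default: 4Gi. |
          | Proxy | resources.requests.cpu | Proxy CPU request. Default: 2. |
          | Proxy | resources.requests.memory | Proxy memory request. Default: 4Gi. |
          | Ingress | className | Name of an IngressClass configured in the cluster, such as alb or mse. |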

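        3. If you use the ALB Ingress Controller, also add an HTTPS:443 listener configuration to both the ALB instance and the Ingress.

          Update the AlbConfig to add an HTTPS:443 listener to the ALB instance.

          1. In the left navigation pane, choose Workloads > Custom Resources. On the resource object browser tab, search for AlbConfig and click the search result.

          2. In the AlbConfig resource object list, find the target resource alb and click Edit YAML in its Actions column.

          3. Add the spec.listeners.port: 443 and spec.listeners.protocol: HTTPS fields, and then click OK.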

            spec:
                config:
                  addressAllocatedMode: Fixed
                  addressType: Internet
                  zoneMappings:
                    - vSwitchId: vsw-xxx
                    - vSwitchId: vsw-xxx
                listeners:
                  - port: 80
                    protocol: HTTP
                  - port: 443
                    protocol: HTTPS

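          Update the Ingress to associate it with the HTTPS:443 listener.

          1. In the left navigation pane, choose Network > Ingresses. In the Actions column of the sandbox-manager Ingress, click Update.

          2. Add the following configuration, and then click OK.

            • Annotation: alb.ingress.kubernetes.io/listen-ports: [{"HTTP": 80}, {"HTTPS": 443}]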

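  2. Save the following content as sandbox.yaml, and then run kubectl apply -f sandbox.yaml to deploy the Sandbox definition. The SandboxSet creates a warm pool of size 3. During reinforcement learning, SandboxManager continuously takes Sandboxes from this warm pool and uses them.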

    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: mcp-sandbox
    spec:
      selector:
        app.kubernetes.io/instance: release-name
        app.kubernetes.io/name: ack-sandbox-manager
        component: sandbox-manager
      type: ClusterIP
      sessionAffinity: None
      ports:
      - name: vllm
        protocol: TCP
        port: 8000
        targetPort: 18082
    ---
    apiVersion: agents.kruise.io/v1alpha1
    kind: SandboxSet
    metadata:
      annotations:
        # Enable the Envd initialization capability of SandboxManager
        e2b.agents.kruise.io/should-init-envd: "true"
      name: code-interpreter
      namespace: default
    spec:
      # Size of the warm pool; set it slightly larger than the expected request burst
      replicas: 3
      template:
        spec:
          initContainers:
            - name: init
              image: registry-cn-hangzhou.ack.aliyuncs.com/acs/agent-runtime:v0.0.1
              imagePullPolicy: IfNotPresent
              terminationMessagePolicy: File
              volumeMounts:
                - name: envd-volume
                  mountPath: /mnt/envd
              env:
                - name: ENVD_DIR
                  value: /mnt/envd
              restartPolicy: Always
          containers:
            - name: sandbox
              image: acs-image-test-01-registry.cn-hangzhou.cr.aliyuncs.com/e2b/code-interpreter:v1.6
              imagePullPolicy: IfNotPresent
              terminationMessagePolicy: File
              env:
                - name: ENVD_DIR
                  value: /mnt/envd
              volumeMounts:
                - name: envd-volume
                  mountPath: /mnt/envd
              lifecycle:
                postStart:
                  exec:
                    command:
                      - bash
                      - /mnt/envd/envd-run.sh
              startupProbe:
                failureThreshold: 20
                successThreshold: 1
                httpGet:
                  path: /health
                  port: 49999
                  scheme: HTTP
                initialDelaySeconds: 1
                periodSeconds: 2
                timeoutSeconds: 1
          # Terminate containers quickly to raise the chance of reuse
          terminationGracePeriodSeconds: 1
          restartPolicy: Always
          dnsPolicy: ClusterFirst
          volumes:
            - name: envd-volume
              emptyDir: { }
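
The comment on replicas suggests keeping the warm pool slightly larger than the expected request burst. One way to express that rule is sketched below. The function name warm_pool_replicas and the 20% default headroom are assumptions made for illustration; they are not values taken from the SandboxSet documentation.

```python
def warm_pool_replicas(expected_burst: int, headroom_percent: int = 20) -> int:
    """Return a replica count that exceeds the expected burst by the given headroom.

    Integer arithmetic is used so that rounding up is exact.
    """
    if expected_burst < 0 or headroom_percent < 0:
        raise ValueError("expected_burst and headroom_percent must be non-negative")
    scaled = expected_burst * (100 + headroom_percent)
    return (scaled + 99) // 100  # ceiling division by 100


# Hypothetical burst sizes, for illustration only.
for burst in (3, 10, 64):
    print(burst, "->", warm_pool_replicas(burst))
```

The resulting number would go into spec.replicas of the SandboxSet. How large the burst actually is depends on how many rollouts call the tool at the same time, which you would need to measure for your workload.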

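(Optional) Step 3: Prepare the reinforcement learning dataset

In VeRL, you can set data.train_files to download a dataset from a remote location. However, datasets are usually large and often need preprocessing. In production, we recommend running a preprocessing job that downloads the data, preprocesses it, and pushes the result to cloud storage.

  1. Save the following content as data.yaml, and then run kubectl apply -f data.yaml to download the data from Hugging Face, preprocess it, and push it to an OSS bucket.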

    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-oss-credentials
      namespace: default
    type: Opaque
    stringData:
      # HuggingFace Token
      HF_TOKEN: "hf_xxxxx"
      # Alibaba Cloud OSS credentials (the alibabacloud-oss-v2 SDK authenticates through environment variables)
      akId: "xxx"
      akSecret: "xxx"
      OSS_REGION: "xxx"
      OSS_BUCKET: "xxx"
    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: preprocess-script
      namespace: default
    data:
      preprocess.py: |
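        #!/usr/bin/env python3
        """
        Example dataset preprocessing script.
        """
        import os
        import re
        from datasets import load_from_disk
        # Dataset identifier written to each sample's data_source field.
        data_source = "openai/gsm8k"
        def extract_solution(solution_str):
            """Return the final answer that follows '####' in a GSM8K solution."""
            match = re.search(r"#### (\-?[0-9\.\,]+)", solution_str)
            assert match is not None, "no '#### <answer>' marker found"
            return match.group(1).replace(",", "")
        def preprocess_dataset(input_dir, output_dir):
            """Preprocess the dataset."""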
            print(f"Loading dataset from {input_dir}")
            dataset = load_from_disk(input_dir)
            train_dataset = dataset["train"]
            test_dataset = dataset["test"]
            instruction_following = "Let's think step by step and output the final answer after `####`."
            # add a row to each data item that represents a unique id
            def make_map_fn(split):
                def process_fn(example, idx):
                    question_raw = example.pop("question")
                    question = question_raw + " " + instruction_following
                    answer_raw = example.pop("answer")
                    solution = extract_solution(answer_raw)
                    data = {
                        "data_source": data_source,
                        "agent_name": "tool_agent",
                        "prompt": [
                            {
                                "role": "system",
                                "content": (
                                    "You are a math expert. You are given a question and you need to solve it step by step. "
                                    "Reasoning step by step before any tool call. "
                                    "You should use the `calc_gsm8k_reward` tool after step by step solving the question, "
                                    "before generate final answer at least once and refine your answer if necessary. "
                                    "Put your final answer in the format of `#### <answer>`."
                                ),
                            },
                            {
                                "role": "user",
                                "content": question,
                            },
                        ],
                        "ability": "math",
                        "reward_model": {"style": "rule", "ground_truth": solution},
                        "extra_info": {
                            "split": split,
                            "index": idx,
                            "answer": answer_raw,
                            "question": question_raw,
                            "need_tools_kwargs": True,
                            "tools_kwargs": {
                                "calc_gsm8k_reward": {
                                    "create_kwargs": {"ground_truth": solution},
                                    # "execute_kwargs": {},
                                    # "calc_reward_kwargs": {},
                                    # "release_kwargs": {},
                                },
                            },
                            "interaction_kwargs": {
                                "query": question,
                                "ground_truth": solution,
                            },
                        },
                    }
                    return data
                return process_fn
            train_dataset = train_dataset.map(function=make_map_fn("train"), with_indices=True, num_proc=8)
            test_dataset = test_dataset.map(function=make_map_fn("test"), with_indices=True, num_proc=8)
            # Save the processed dataset
            os.makedirs(output_dir, exist_ok=True)
            train_dataset.to_parquet(os.path.join(output_dir, "train.parquet"))
            test_dataset.to_parquet(os.path.join(output_dir, "test.parquet"))
            print(f"Processed dataset saved to {output_dir}")
            return output_dir
        if __name__ == "__main__":
            input_path = os.environ.get("INPUT_PATH", "/data/raw")
            output_path = os.environ.get("OUTPUT_PATH", "/data/processed")
            preprocess_dataset(input_path, output_path)
    ---
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: dataset-pipeline
      namespace: default
      labels:
        app: dataset-pipeline
    spec:
      backoffLimit: 3
      template:
        metadata:
          labels:
            app: dataset-pipeline
        spec:
          restartPolicy: OnFailure
          volumes:
            # Preprocessing script
            - name: scripts
              configMap:
                name: preprocess-script
                defaultMode: 0755
          containers:
            - name: dataset-pipeline
              image: python:3.10-slim
              command:
                - /bin/bash
                - -c
                - |
                  set -e
                  #==========================================
                  # Step 1: Install all dependencies
                  #==========================================
                  echo "=== Installing dependencies ==="
                  pip install --no-cache-dir datasets huggingface_hub pandas numpy alibabacloud-oss-v2 Pillow
                  #==========================================
                  # Step 2: Download the dataset from HuggingFace
                  #==========================================
                  echo "=== Downloading dataset from HuggingFace ==="
                  python3 << 'EOF'
                  import os
                  from datasets import load_dataset
                  from huggingface_hub import login
                  # Log in to HuggingFace (needed only for private datasets)
                  hf_token = os.environ.get("HF_TOKEN")
                  if hf_token:
                      login(token=hf_token)
                  # Download the dataset
                  dataset_name = os.environ.get("DATASET_NAME", "hiyouga/geometry3k")
                  dataset_config = os.environ.get("DATASET_CONFIG", None)
                  print(f"Downloading dataset: {dataset_name}")
                  dataset = load_dataset(dataset_name, dataset_config)
                  # Save to local disk
                  output_path = "/data/raw"
                  dataset.save_to_disk(output_path)
                  print(f"Dataset saved to {output_path}")
                  EOF
                  echo "=== Download completed ==="
                  #==========================================
                  # Step 3: Run the preprocessing script
                  #==========================================
                  echo "=== Running preprocessing script ==="
                  python3 /scripts/preprocess.py
                  echo "=== Preprocessing completed ==="
                  #==========================================
                  # Step 4: Upload to OSS (using the alibabacloud-oss-v2 SDK)
                  #==========================================
                  echo "=== Uploading to OSS ==="
                  python3 << 'EOF'
                  import os
                  from pathlib import Path
                  import alibabacloud_oss_v2 as oss
                  # OSS settings
                  bucket_name = os.environ["OSS_BUCKET"]
                  region = os.environ["OSS_REGION"]
                  oss_prefix = os.environ.get("OSS_PREFIX", "data/geo3k-processed/")
                  local_path = os.environ.get("OUTPUT_PATH", "/data/processed")
                  # Use the environment variable credentials provider (reads OSS_ACCESS_KEY_ID and OSS_ACCESS_KEY_SECRET automatically)
                  credentials_provider = oss.credentials.EnvironmentVariableCredentialsProvider()
                  # Load the default configuration and set the credentials provider
                  cfg = oss.config.load_default()
                  cfg.credentials_provider = credentials_provider
                  cfg.region = region
                  # Create the OSS client
                  client = oss.Client(cfg)
                  def upload_directory(local_dir, oss_prefix):
                      """Recursively upload a directory to OSS."""
                      local_path = Path(local_dir)
                      uploaded_count = 0
                      failed_count = 0
                      for file_path in local_path.rglob("*"):
                          if file_path.is_file():
                              relative_path = file_path.relative_to(local_path)
                              oss_key = f"{oss_prefix}{relative_path}"
                              try:
                                  # Read the file content
                                  with open(file_path, 'rb') as f:
                                      data = f.read()
                                  # Upload to OSS
                                  result = client.put_object(oss.PutObjectRequest(
                                      bucket=bucket_name,
                                      key=oss_key,
                                      body=data,
                                  ))
                                  print(f"Uploaded: {file_path} -> {oss_key} (status: {result.status_code})")
                                  uploaded_count += 1
                              except Exception as e:
                                  print(f"Failed to upload {file_path}: {e}")
                                  failed_count += 1
                      return uploaded_count, failed_count
                  uploaded, failed = upload_directory(local_path, oss_prefix)
                  print(f"=== Upload completed: {uploaded} files uploaded, {failed} files failed ===")
                  if failed > 0:
                      raise Exception(f"{failed} files failed to upload")
                  EOF
                  echo "=== Pipeline completed successfully ==="
              env:
                # HuggingFace settings
                - name: HF_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: hf-oss-credentials
                      key: HF_TOKEN
                - name: DATASET_NAME
                  value: "openai/gsm8k"
                - name: DATASET_CONFIG
                  value: "main"
                - name: HF_HOME
                  value: "/tmp/huggingface"
                # Preprocessing settings
                - name: INPUT_PATH
                  value: "/data/raw"
                - name: OUTPUT_PATH
                  value: "/data/processed"
                # OSS settings
                - name: OSS_ACCESS_KEY_ID
                  valueFrom:
                    secretKeyRef:
                      name: hf-oss-credentials
                      key: akId
                - name: OSS_ACCESS_KEY_SECRET
                  valueFrom:
                    secretKeyRef:
                      name: hf-oss-credentials
                      key: akSecret
                - name: OSS_REGION
                  valueFrom:
                    secretKeyRef:
                      name: hf-oss-credentials
                      key: OSS_REGION
                - name: OSS_BUCKET
                  valueFrom:
                    secretKeyRef:
                      name: hf-oss-credentials
                      key: OSS_BUCKET
                - name: OSS_PREFIX
                  value: "processed-gsm8k/"
              volumeMounts:
                - name: scripts
                  mountPath: /scripts
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "1"
                limits:
                  memory: "16Gi"
                  cpu: "4"
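
The preprocessing script relies on an extract_solution helper to pull the ground-truth answer out of each answer field. The sketch below shows that parsing step in isolation. It assumes the GSM8K convention in which the final answer follows a "####" marker; the sample text is made up for illustration and is not taken from the dataset.

```python
import re


def extract_solution(answer_text: str) -> str:
    """Return the number that follows '####', with thousands separators removed."""
    match = re.search(r"#### (-?[0-9.,]+)", answer_text)
    if match is None:
        raise ValueError("no '#### <answer>' marker found")
    return match.group(1).replace(",", "")


# Made-up sample in the GSM8K answer format.
sample_answer = "1200 + 34 = 1234\n#### 1,234"
print(extract_solution(sample_answer))  # prints 1234
```

The extracted string is what ends up in reward_model.ground_truth and in the create_kwargs of the calc_gsm8k_reward tool in each processed record.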
                                          

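Step 4: Submit the reinforcement learning task configuration

  1. Save the following content as pvpvc.yaml, and then run kubectl apply -f pvpvc.yaml to mount the OSS statically provisioned volumes through a PV and a PVC.

    The following example authenticates with an AccessKey pair (AK/SK). For the RRSA method, see the topic on using ossfs 2.0 statically provisioned volumes.
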
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: ym-dataset
      labels:
        alicloud-pvname: ym-dataset
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: ym-dataset # Must match the PV name.
        nodePublishSecretRef:
          name: hf-oss-credentials
          namespace: default
        volumeAttributes:
          bucket: "xxxx" # Replace with the actual bucket name.
          url: "oss-ap-southeast-1-internal.aliyuncs.com" # Replace with the actual OSS endpoint.
          otherOpts: "-o umask=022 -o max_stat_cache_size=100000 -o allow_other -o dbglevel=debug -o curldbg"
          path: "/"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ym-dataset
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
      selector:
        matchLabels:
          alicloud-pvname: ym-dataset
    # (Optional) The model can instead be downloaded at run time by passing a HuggingFace repository path
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: ym-models
      labels:
        alicloud-pvname: ym-models
    spec:
      capacity:
        storage: 20Gi
      accessModes:
        - ReadWriteMany
      persistentVolumeReclaimPolicy: Retain
      csi:
        driver: ossplugin.csi.alibabacloud.com
        volumeHandle: ym-models # Must match the PV name.
        nodePublishSecretRef:
          name: hf-oss-credentials
          namespace: default
        volumeAttributes:
          bucket: "xxxx" # Replace with the actual bucket name.
          url: "oss-ap-southeast-1-internal.aliyuncs.com" # Replace with the actual OSS endpoint.
          otherOpts: "-o umask=022 -o max_stat_cache_size=100000 -o allow_other -o dbglevel=debug -o curldbg"
          path: "/"
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: ym-models
    spec:
      accessModes:
        - ReadWriteMany
      resources:
        requests:
          storage: 20Gi
      selector:
        matchLabels:
          alicloud-pvname: ym-models

  2. Save the following content as configs.yaml, and then run kubectl apply -f configs.yaml to submit the task configuration.

    ---
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: gsm8k-configs
      namespace: default
    data:
      gsm8k_multiturn_grpo.yaml: |
        hydra:
          searchpath:
            - file://verl/trainer/config
        defaults:
          - ppo_trainer
          - _self_
        data:
          max_prompt_length: 1024
          max_response_length: 1024
          train_batch_size: 256
          return_raw_chat: True
        actor_rollout_ref:
          hybrid_engine: True
          rollout:
            name: vllm
            multi_turn:
              enable: True
              max_assistant_turns: 5
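      # In mcp_server.json, replace url with the MCP Ingress address of the sandbox.
      # If an api_key is required, you can add an nginx container to the Ray cluster to act as a proxy.
      mcp_server.json: |
        {
            "mcpServers": {
                "Tavily Expert": {
                    "url": "xxxxx",
                    "api_key": "xxxxx"
                }
            }
        }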
      gsm8k_mcp_tool_config.yaml: |
        tools:
        - class_name: verl.tools.mcp_search_tool.MCPSearchTool
          config:
            rate_limit: 120
            timeout: 120
            type: mcp
          mcp:
            mcp_servers_config_path: /var/configs/mcp_server.json
            tool_selected_list: 
              - run_code_once
        - class_name: "verl.tools.gsm8k_tool.Gsm8kTool"
          config: 
            type: native
          tool_schema:
            type: "function"
            function:
              name: "calc_gsm8k_reward"
              description: "A tool for calculating the reward of gsm8k. (1.0 if parsed answer is correct, 0.0 if parsed answer is incorrect or not correctly parsed)"
              parameters:
                type: "object"
                properties:
                  answer:
                    type: "string"
                    description: "The model's answer to the GSM8K math problem, must be digits"
                required: ["answer"]
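
mcp_server.json is a JSON document, and strict JSON parsers reject comments and trailing commas. The following sketch checks a config string before you place it in the ConfigMap, on the assumption that the file is read with a standard JSON parser. The helper load_mcp_servers and the example URL are hypothetical.

```python
import json


def load_mcp_servers(text: str) -> dict:
    """Parse MCP server config text and return the mcpServers mapping."""
    config = json.loads(text)  # raises json.JSONDecodeError on comments or trailing commas
    servers = config.get("mcpServers", {})
    for name, server in servers.items():
        if not server.get("url"):
            raise ValueError(f"MCP server {name!r} has no url")
    return servers


# Placeholder config; the URL is not a real endpoint.
example = '{"mcpServers": {"demo": {"url": "http://mcp.example.invalid/mcp"}}}'
print(sorted(load_mcp_servers(example)))

# A '#' comment inside the JSON text is rejected by the parser.
try:
    load_mcp_servers('{"mcpServers": {}  # note\n}')
except json.JSONDecodeError as err:
    print("invalid JSON:", err.msg)
```

Keeping explanatory notes in YAML comments outside the literal block, rather than inside the JSON text, avoids this class of parse error.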

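Step 5: Submit the reinforcement learning task

VeRL can query the tools provided by an MCP Server through MCPSearchTool. At the start of each sample, the AgentLoop connects to the MCP Server and then calls the tools during the multi-turn conversation.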
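
To make the multi-turn flow concrete, the following sketch imitates one episode with stand-ins for the model and the tool. It is not VeRL's AgentLoop implementation: run_episode, scripted_policy, and fake_run_code_once are hypothetical names, the message format is simplified, and the code argument of run_code_once is an assumption. Only the turn limit of 5 and the tool name come from the configuration in this topic.

```python
MAX_ASSISTANT_TURNS = 5  # same value as multi_turn.max_assistant_turns in the config


def run_episode(policy, tools, question, max_turns=MAX_ASSISTANT_TURNS):
    """Alternate between the policy and its tools until a final answer or the turn limit."""
    messages = [{"role": "user", "content": question}]
    for _ in range(max_turns):
        reply = policy(messages)
        messages.append({"role": "assistant", "content": reply["content"]})
        call = reply.get("tool_call")
        if call is None:  # no tool requested: treat the reply as the final answer
            break
        result = tools[call["name"]](**call["arguments"])
        messages.append({"role": "tool", "content": str(result)})
    return messages


def scripted_policy(messages):
    """Stand-in for the model: call the tool once, then answer with its result."""
    last = messages[-1]
    if last["role"] == "tool":
        return {"content": f"#### {last['content']}", "tool_call": None}
    return {
        "content": "",
        "tool_call": {"name": "run_code_once", "arguments": {"code": "print(6 * 7)"}},
    }


def fake_run_code_once(code):
    """Stand-in for the sandbox tool; a real call would execute the code remotely."""
    return "42"


history = run_episode(scripted_policy, {"run_code_once": fake_run_code_once}, "What is 6 * 7?")
for message in history:
    print(message["role"], "->", message["content"])
```

In the real job, the policy is the model being trained and the tool call is served by a Sandbox taken from the warm pool, so each episode can hold a Sandbox for several turns.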

  1. Save the following content as rayjob.yaml, and then run kubectl apply -f rayjob.yaml to submit the reinforcement learning task.

    ---
    apiVersion: ray.io/v1
    kind: RayJob
    metadata:
      name: rayjob-example
      namespace: default
    spec:
      shutdownAfterJobFinishes: false
      # ttlSecondsAfterFinished: 300
      runtimeEnvYAML: |
        working_dir: /home/verl
      submissionMode: SidecarMode
      entrypoint: |
        python3 -m verl.trainer.main_ppo \
          --config-path=/var/configs \
          --config-name='gsm8k_multiturn_grpo' \
          algorithm.adv_estimator=grpo \
          data.train_batch_size=16 \
          data.max_prompt_length=1024 \
          data.max_response_length=1024 \
          data.filter_overlong_prompts=True \
          data.truncation='error' \
          data.return_raw_chat=True \
          actor_rollout_ref.model.path=/var/model/Qwen2.5-3B-Instruct \
          actor_rollout_ref.actor.optim.lr=1e-6 \
          actor_rollout_ref.model.use_remove_padding=True \
          actor_rollout_ref.actor.ppo_mini_batch_size=8 \
          actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
          actor_rollout_ref.actor.use_kl_loss=True \
          actor_rollout_ref.actor.kl_loss_coef=0.001 \
          actor_rollout_ref.actor.kl_loss_type=low_var_kl \
          actor_rollout_ref.actor.entropy_coeff=0 \
          actor_rollout_ref.model.enable_gradient_checkpointing=True \
          actor_rollout_ref.actor.fsdp_config.param_offload=False \
          actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \
          actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
          actor_rollout_ref.rollout.tensor_model_parallel_size=2 \
          actor_rollout_ref.rollout.name=vllm \
          actor_rollout_ref.rollout.mode=async \
          actor_rollout_ref.rollout.gpu_memory_utilization=0.5 \
          actor_rollout_ref.rollout.n=16 \
          actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
          actor_rollout_ref.ref.fsdp_config.param_offload=True \
          actor_rollout_ref.rollout.trace.backend=mlflow \
          actor_rollout_ref.rollout.trace.token2text=True \
          algorithm.use_kl_in_reward=False \
          trainer.critic_warmup=0 \
          trainer.logger='["console","mlflow"]' \
          trainer.project_name='gsm8k_tool-agent' \
          trainer.experiment_name='qwen2.5-3b_function_rm-gsm8k-vllm-tool-agent-verify-n16' \
          trainer.n_gpus_per_node=8 \
          trainer.nnodes=1 \
          trainer.save_freq=1 \
          trainer.test_freq=20 \
          trainer.total_training_steps=1 \
          data.train_files=/var/model-dataset/processed-gsm8k/train20.parquet \
          data.val_files=/var/model-dataset/processed-gsm8k/test100.parquet \
          actor_rollout_ref.rollout.multi_turn.tool_config_path="/var/configs/gsm8k_mcp_tool_config.yaml" \
          actor_rollout_ref.actor.checkpoint.save_contents='["hf_model", "model"]' \
          trainer.total_epochs=1 
      rayClusterSpec:
        headGroupSpec:
          rayStartParams:
            dashboard-host: 0.0.0.0
          serviceType: ClusterIP
          template:
            metadata:
              annotations: 
              labels:
            spec:
              affinity: {}
              tolerations:
              - key: node-role.alibabacloud.com/lingjun
              containers:
              - env:
                - name: VERL_ROOT
                  value: /home/verl
                image: registry-ap-southeast-1.ack.aliyuncs.com/dev/verl:vllm012.latest.43dc9a44
                imagePullPolicy: IfNotPresent
                name: ray-head
                resources:
                  limits:
                    cpu: "100"
                    memory: 500Gi
                    nvidia.com/gpu: "8"
                securityContext:
                  runAsUser: 0
                volumeMounts:
                - mountPath: /var/configs
                  name: configs
                - mountPath: /var/model
                  name: model
                - mountPath: /var/model-dataset
                  name: model-dataset
              imagePullSecrets: 
              - name: regcred-hangzhou
              - name: regcred-ap-southeast
              volumes:
              - name: configs
                configMap:
                  name: gsm8k-configs
              - name: model
                persistentVolumeClaim:
                  claimName: ym-models
              - name: model-dataset
                persistentVolumeClaim:
                  claimName: ym-dataset
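
The batch and parallelism settings in the entrypoint interact with each other, and a quick calculation can catch inconsistent values before the job is submitted. The sketch below is a rough estimate under two assumptions: the prompt batch is split evenly into mini-batches of ppo_mini_batch_size prompts, and each rollout replica occupies tensor_model_parallel_size GPUs. The function rollout_plan is hypothetical and is not part of VeRL.

```python
def rollout_plan(train_batch_size, rollout_n, ppo_mini_batch_size, n_gpus, tensor_parallel_size):
    """Derive per-step quantities from the batch and parallelism settings."""
    if train_batch_size % ppo_mini_batch_size != 0:
        raise ValueError("train_batch_size must be divisible by ppo_mini_batch_size")
    if n_gpus % tensor_parallel_size != 0:
        raise ValueError("n_gpus must be divisible by tensor_parallel_size")
    return {
        # Each prompt is sampled rollout_n times.
        "responses_per_step": train_batch_size * rollout_n,
        # Number of mini-batches the prompt batch is split into.
        "mini_batches_per_step": train_batch_size // ppo_mini_batch_size,
        # Number of rollout engine replicas that fit on the GPUs.
        "rollout_replicas": n_gpus // tensor_parallel_size,
    }


# Values taken from the entrypoint above.
plan = rollout_plan(
    train_batch_size=16,
    rollout_n=16,
    ppo_mini_batch_size=8,
    n_gpus=8,
    tensor_parallel_size=2,
)
print(plan)
```

With the values used in this topic, one training step produces 256 responses. Because each response can call the sandbox tool over several turns, this number is also a useful input when you size the Sandbox warm pool.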