AGS Help Examples

This topic provides examples of how to use AGS.

Prerequisites

Log

  1. Run the ags config sls command to configure and install Log Service for AGS.

    Native Argo can only pull Pod logs from the local node. When a Pod or the node it runs on is deleted, the logs are lost as well, which makes it difficult to review errors and analyze their causes. If the logs are uploaded to Alibaba Cloud Log Service, they are persisted and can still be pulled from Log Service even after the node is gone.

  2. Run the ags logs command to view the logs of a workflow.

    In this example, run ags logs POD/WORKFLOW to view the logs of a Pod or a Workflow.

    # ags logs
    view logs of a workflow
    
    Usage:
     ags logs POD/WORKFLOW [flags]
    
    Flags:
     -c, --container string    Print the logs of this container (default "main")
     -f, --follow              Specify if the logs should be streamed.
     -h, --help                help for logs
     -l, --recent-line int     how many lines to show in one call (default 100)
         --since string        Only return logs newer than a relative duration like 5s, 2m, or 3h. Defaults to all logs. Only one of since-time / since may be used.
         --since-time string   Only return logs after a specific date (RFC3339). Defaults to all logs. Only one of since-time / since may be used.
         --tail int            Lines of recent log file to display. Defaults to -1 with no selector, showing all log lines otherwise 10, if a selector is provided. (default -1)
         --timestamps          Include timestamps on each line in the log output
     -w, --workflow            Specify that whole workflow logs should be printed
    Note
    • If the Pod still exists on its node, AGS retrieves the Pod logs locally. All flags are compatible with the native Argo command.

    • If the Pod has been deleted, AGS queries the logs from Alibaba Cloud Log Service. By default, the most recent 100 lines are returned; you can use the -l flag to specify how many lines to return (see the combined example below).
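
    A minimal sketch that combines the flags above to follow a whole workflow with timestamps, and to pull the most recent 200 lines of a deleted Pod from Log Service; the names are placeholders:

    # ags logs <workflow-name> -w -f --timestamps
    # ags logs <pod-name> -l 200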

List

You can use the --limit flag to specify the number of Workflow entries to list.

# ags remote list --limit 8
+-----------------------+-------------------------------+------------+
|       JOB NAME        |          CREATE TIME          | JOB STATUS |
+-----------------------+-------------------------------+------------+
| merge-6qk46           | 2020-09-02 16:52:34 +0000 UTC | Pending    |
| rna-mapping-gpu-ck4cl | 2020-09-02 14:47:57 +0000 UTC | Succeeded  |
| wgs-gpu-n5f5s         | 2020-09-02 13:14:14 +0000 UTC | Running    |
| merge-5zjhv           | 2020-09-02 12:03:11 +0000 UTC | Succeeded  |
| merge-jjcw4           | 2020-09-02 10:44:51 +0000 UTC | Succeeded  |
| wgs-gpu-nvxr2         | 2020-09-01 22:18:44 +0000 UTC | Succeeded  |
| merge-4vg42           | 2020-09-01 20:52:13 +0000 UTC | Succeeded  |
| rna-mapping-gpu-2ss6n | 2020-09-01 20:34:45 +0000 UTC | Succeeded  |

Integrated kubectl commands

You can run the following commands to check the status of Pods, or in any other scenario that requires kubectl.

# ags get test-v2
Name:                test-v2
Namespace:           default
ServiceAccount:      default
Status:              Running
Created:             Thu Nov 22 11:06:52 +0800 (2 minutes ago)
Started:             Thu Nov 22 11:06:52 +0800 (2 minutes ago)
Duration:            2 minutes 46 seconds

STEP           PODNAME             DURATION  MESSAGE
● test-v2
└---● bcl2fq  test-v2-2716811808  2m

# ags kubectl describe pod test-v2-2716811808
Name:               test-v2-2716811808
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               cn-shenzhen.i-wz9gwobtqrbjgfnqxl1k/192.168.0.94
Start Time:         Thu, 22 Nov 2018 11:06:52 +0800
Labels:             workflows.argoproj.io/completed=false
                   workflows.argoproj.io/workflow=test-v2
Annotations:        workflows.argoproj.io/node-name=test-v2[0].bcl2fq
                   workflows.argoproj.io/template={"name":"bcl2fq","inputs":{},"outputs":{},"metadata":{},"container":{"name":"main","image":"registry.cn-hangzhou.aliyuncs.com/dahu/curl-jp:1.2","command":["sh","-c"],"ar...
Status:             Running
IP:                 172.16.*.***
Controlled By:      Workflow/test-v2

With the ags kubectl command, you can view the status information from describe pod. AGS supports all native kubectl commands.
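
A minimal sketch of other native kubectl invocations routed through AGS; the label selector and names reuse the example above and are only illustrative:

# ags kubectl get pods -l workflows.argoproj.io/workflow=test-v2
# ags kubectl logs test-v2-2716811808 -c main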

Integrated ossutil commands

After AGS is initialized, you can run the following commands to upload and list files.

# ags oss cp test.fq.gz oss://my-test-shenzhen/fasq/
Succeed: Total num: 1, size: 690. OK num: 1(upload 1 files).

average speed 3000(byte/s)

0.210685(s) elapsed
# ags oss ls oss://my-test-shenzhen/fasq/
LastModifiedTime                   Size(B)  StorageClass   ETAG                                  ObjectName
2020-09-02 17:20:34 +0800 CST          690      Standard   9FDB86F70C6211B2EAF95A9B06B14F7E      oss://my-test-shenzhen/fasq/test.fq.gz
Object Number is: 1

0.117591(s) elapsed

With the ags oss command, you can upload and download files. AGS supports all native ossutil commands.
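
A minimal sketch of downloading the object back with the same ossutil syntax, reusing the paths from the example above:

# ags oss cp oss://my-test-shenzhen/fasq/test.fq.gz ./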

View Workflow resource usage

  1. Create the arguments-workflow-resource.yaml file, copy the following content into it, and run the ags submit arguments-workflow-resource.yaml command to specify the resource requests.

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      name: test-resource
    spec:
      arguments: {}
      entrypoint: test-resource-
      templates:
      - inputs: {}
        metadata: {}
        name: test-resource-
        outputs: {}
        parallelism: 1
        steps:
        - - arguments: {}
            name: bcl2fq
            template: bcl2fq
      - container:
          args:
          - id > /tmp/yyy;echo `date` > /tmp/aaa;ps -e -o comm,euid,fuid,ruid,suid,egid,fgid,gid,rgid,sgid,supgid
            > /tmp/ppp;ls -l /tmp/aaa;sleep 100;pwd
          command:
          - sh
          - -c
          image: registry.cn-hangzhou.aliyuncs.com/dahu/curl-jp:1.2
          name: main
          resources:                # do not request too many resources
            requests:
              memory: 320Mi
              cpu: 1000m
        inputs: {}
        metadata: {}
        name: bcl2fq
        outputs: {}
  2. Run the ags get test456 --show command to view the resource usage of the Workflow.

    In this example, the output shows the core-hours and GB-hours used by each Pod and by test456 as a whole (see the calculation sketch after the output).

    # ags get test456 --show
    Name:                test456
    Namespace:           default
    ServiceAccount:      default
    Status:              Succeeded
    Created:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
    Started:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
    Finished:            Thu Nov 22 14:43:30 +0800 (27 seconds ago)
    Duration:            1 minute 41 seconds
    Total CPU:           0.02806    (core*hour)
    Total Memory:        0.00877    (GB*hour)
    
    STEP           PODNAME             DURATION  MESSAGE  CPU(core*hour)  MEMORY(GB*hour)
    ✔ test456                                            0               0
    └---✔ bcl2fq  test456-4221301428  1m                 0.02806         0.00877
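
    The totals above can be reproduced from the resource requests and the duration (1 minute 41 seconds = 101 s). This is an assumption about how the usage is accounted, but the numbers match the output:

    CPU:    1000m = 1 core     ->  1 core    * 101 s / 3600 s ≈ 0.02806 core*hour
    Memory: 320Mi = 0.3125 GB  ->  0.3125 GB * 101 s / 3600 s ≈ 0.00877 GB*hour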

securityContext support

Create the arguments-security-context.yaml file, copy the following content into it, and run the ags submit arguments-security-context.yaml command to bind the corresponding PSP for permission control.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: test
spec:
  arguments: {}
  entrypoint: test-security-
  templates:
  - inputs: {}
    metadata: {}
    name: test-security-
    outputs: {}
    parallelism: 1
    steps:
    - - arguments: {}
        name: bcl2fq
        template: bcl2fq
  - container:
      args:
      - id > /tmp/yyy;echo `date` > /tmp/aaa;ps -e -o comm,euid,fuid,ruid,suid,egid,fgid,gid,rgid,sgid,supgid
        > /tmp/ppp;ls -l /tmp/aaa;sleep 100;pwd
      command:
      - sh
      - -c
      image: registry.cn-hangzhou.aliyuncs.com/dahu/curl-jp:1.2
      name: main
      resources:                # do not request too many resources
        requests:
          memory: 320Mi
          cpu: 1000m
    inputs: {}
    metadata: {}
    name: bcl2fq
    outputs: {}
    securityContext:
      runAsUser: 800
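
A minimal sketch of spot-checking the effective user once the workflow is running, assuming the template-level securityContext is propagated to the Pod spec; the Pod name is a placeholder:

# ags kubectl get pod <pod-name> -o jsonpath='{.spec.securityContext.runAsUser}'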

Define automatic retries in YAML

A bash command may fail for no obvious reason, and simply retrying it resolves the problem. AGS provides a YAML-based automatic retry mechanism: when the command in a Pod fails, the Pod is automatically restarted, and you can set the number of retries.

Create the arguments-auto-retry.yaml file, copy the following content into it, and run the ags submit arguments-auto-retry.yaml command to configure the automatic retry mechanism for the Workflow.

# This example demonstrates the use of retries for a single container.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: retry-container-
spec:
  entrypoint: retry-container
  templates:
  - name: retry-container
    retryStrategy:
      limit: 10
    container:
      image: python:alpine3.6
      command: ["python", "-c"]
      # fail with a 66% probability
      args: ["import random; import sys; exit_code = random.choice([0, 1, 1]); sys.exit(exit_code)"]
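
A minimal sketch of submitting this example and watching the retries; the generated workflow name below is a placeholder, and each retry attempt should appear as a child node of the retry step:

# ags submit arguments-auto-retry.yaml
# ags get retry-container-xxxxx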

Retry a whole Workflow from the most recent failure

While a Workflow is running, one of its steps may fail. In that case, you may want to retry the Workflow from the failed node, similar to resuming a transfer from a checkpoint.

  1. Run the ags get test456 --show command to see at which step workflow test456 failed.

    # ags get test456 --show
    Name:                test456
    Namespace:           default
    ServiceAccount:      default
    Status:              Succeeded
    Created:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
    Started:             Thu Nov 22 14:41:49 +0800 (2 minutes ago)
    Finished:            Thu Nov 22 14:43:30 +0800 (27 seconds ago)
    Duration:            1 minute 41 seconds
    Total CPU:           0.0572   (core*hour)
    Total Memory:        0.01754    (GB*hour)
    
    STEP           PODNAME             DURATION  MESSAGE  CPU(core*hour)  MEMORY(GB*hour)
     ✔ test456                                            0               0
     └---✔ bcl2fq  test456-4221301428  1m                 0.02806         0.00877
     └---X bcl2fq  test456-4221301238  1m                 0.02806         0.00877
  2. Run the ags retry test456 command to retry workflow test456 from the most recent failure point.
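
    A minimal sketch of the retry-and-check loop for the workflow above:

    # ags retry test456
    # ags get test456 --show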

Run workflows on ECI

For ECI operations, see Elastic Container Instance (ECI).

Before you configure and use ECI, install AGS. For more information, see AGS download and installation.

  1. Run the kubectl get cm -n argo command to get the name of the workflow controller ConfigMap.

    # kubectl get cm -n argo
    NAME                            DATA      AGE
    workflow-controller-configmap   1         4d
  2. Run the kubectl get cm -n argo workflow-controller-configmap -o yaml command to view the current configuration, then update the ConfigMap (for example, with kubectl edit cm workflow-controller-configmap -n argo) so that it contains the following content.

    apiVersion: v1
    data:
      config: |
        containerRuntimeExecutor: k8sapi
    kind: ConfigMap
  3. Run the kubectl delete pod <podName> -n argo command to restart the Argo workflow controller.

    Note

    Here, podName is the name of the Argo workflow-controller Pod. You can find it by running kubectl get pod -n argo.

  4. Create the arguments-workflow-eci.yaml file, copy the following content into it, and run the ags submit arguments-workflow-eci.yaml command. The nodeSelector and tolerations fields make the containers run on ECI (a scheduling check is sketched after the YAML).

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: hello-world-
    spec:
      entrypoint: whalesay
      templates:
      - name: whalesay
        container:
          image: docker/whalesay
          command: [env]
          #args: ["hello world"]
          resources:
            limits:
              memory: 32Mi
              cpu: 100m
        nodeSelector:               # add the nodeSelector
          type: virtual-kubelet
        tolerations:                # add the tolerations
        - key: virtual-kubelet.io/provider
          operator: Exists
        - key: alibabacloud.com
          effect: NoSchedule
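
    After submitting, one way to confirm that the Pod was scheduled onto the virtual node is to check the NODE column; this is only a sketch, and the exact node name depends on your cluster:

    # ags submit arguments-workflow-eci.yaml
    # ags kubectl get pod -o wide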

View the actual resource usage and peaks of a Workflow

The AGS workflow controller uses metrics-server to automatically collect the actual resource usage of each Pod every minute, and reports the totals as well as the peak usage of each Pod.

Run the ags get steps-jr6tw --metrics command to view the actual resource usage and peaks of the Workflow.

➜  ags get steps-jr6tw --metrics
Name:                steps-jr6tw
Namespace:           default
ServiceAccount:      default
Status:              Succeeded
Created:             Tue Apr 16 16:52:36 +0800 (21 hours ago)
Started:             Tue Apr 16 16:52:36 +0800 (21 hours ago)
Finished:            Tue Apr 16 19:39:18 +0800 (18 hours ago)
Duration:            2 hours 46 minutes
Total CPU:           0.00275    (core*hour)
Total Memory:        0.04528    (GB*hour)

STEP            PODNAME                 DURATION  MESSAGE  CPU(core*hour)  MEMORY(GB*hour)  MaxCpu(core)  MaxMemory(GB)
 ✔ steps-jr6tw                                             0               0                0             0
 └---✔ hello1   steps-jr6tw-2987978173  2h                 0.00275         0.04528          0.000005      0.00028

Set Workflow priorities

When some tasks are already running and an urgent task needs to run immediately, you can assign a high, medium, or low priority to a Workflow. High-priority tasks preempt the resources of lower-priority tasks.

  • You can set a high priority for a Pod, as in the following example:

    Create the arguments-high-priority-taskA.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-taskA.yaml command to set a high priority for task A.

    apiVersion: scheduling.k8s.io/v1beta1
    kind: PriorityClass
    metadata:
      name: high-priority
    value: 1000000
    globalDefault: false
    description: "This priority class should be used for XYZ service pods only."
  • You can also set a medium priority for a Pod, as in the following example:

    Create the arguments-high-priority-taskB.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-taskB.yaml command to set a priority for task B. (The priority classes created in these examples can be listed with the command shown after this list.)

    apiVersion: scheduling.k8s.io/v1beta1
    kind: PriorityClass
    metadata:
      name: medium-priority
    value: 100
    globalDefault: false
    description: "This priority class should be used for XYZ service pods only."
  • You can also set an entire Workflow to high priority, as in the following example:

    Create the arguments-high-priority-Workflow.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-Workflow.yaml command to set a high priority for all Pods in the Workflow.

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow                  # new type of k8s spec
    metadata:
      generateName: high-priority-   # name of the workflow spec
    spec:
      entrypoint: whalesay          # invoke the whalesay template
      podPriorityClassName: high-priority # workflow level priority
      templates:
      - name: whalesay              # name of the template
        container:
          image: ubuntu
          command: ["/bin/bash", "-c", "sleep 1000"]
          resources:
            requests:
              cpu: 3
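
The priority classes defined in the first two examples can be listed with native kubectl; this is a minimal check, assuming both classes were created successfully:

# ags kubectl get priorityclass high-priority medium-priority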
                            

In the following example, a Workflow contains two kinds of Pods: one is left at the default (lower) priority, and the other is set to high priority. The high-priority Pod can then preempt the resources of the lower-priority Pod.

  1. Create the arguments-high-priority-steps.yaml file, copy the following content into it, and run the ags submit arguments-high-priority-steps.yaml command to set the Pod priorities.

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: steps-
    spec:
      entrypoint: hello-hello-hello
    
      templates:
      - name: hello-hello-hello
        steps:
        - - name: low           
            template: low
        - - name: low-2          
            template: low
          - name: high           
            template: high
    
      - name: low
        container:
          image: ubuntu
          command: ["/bin/bash", "-c", "sleep 30"]
          resources:
            requests:
              cpu: 3
    
      - name: high
        priorityClassName: high-priority # step level priority
        container:
          image: ubuntu
          command: ["/bin/bash", "-c", "sleep 30"]
          resources:
            requests:
              cpu: 3
  2. The high-priority Pod preempts and deletes the old Pod. The execution result is as follows.

    Name:                steps-sxvrv
    Namespace:           default
    ServiceAccount:      default
    Status:              Failed
    Message:             child 'steps-sxvrv-1724235106' failed
    Created:             Wed Apr 17 15:06:16 +0800 (1 minute ago)
    Started:             Wed Apr 17 15:06:16 +0800 (1 minute ago)
    Finished:            Wed Apr 17 15:07:34 +0800 (now)
    Duration:            1 minute 18 seconds
    
    STEP            PODNAME                 DURATION  MESSAGE
     ✖ steps-sxvrv                                    child 'steps-sxvrv-1724235106' failed
     ├---✔ low      steps-sxvrv-3117418100  33s
     └-·-✔ high     steps-sxvrv-603461277   45s
       └-⚠ low-2    steps-sxvrv-1724235106  45s       pod deleted
    Note

    A high-priority task automatically preempts the resources occupied by lower-priority Pods: the lower-priority task is stopped and its running processes are interrupted, so use this feature with great caution.
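
    A heuristic sketch for inspecting preemption-related events; the exact event wording depends on the scheduler version:

    # ags kubectl get events | grep -i preempt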

Workflow Filter

For a large Workflow, ags get supports filters that list only the Pods in a specified state.

  1. Run the ags get <workflow name> --status Running command to list the Pods in the specified state.

    # ags get pod-limits-n262v --status Running
    Name:                pod-limits-n262v
    Namespace:           default
    ServiceAccount:      default
    Status:              Running
    Created:             Wed Apr 17 15:59:08 +0800 (1 minute ago)
    Started:             Wed Apr 17 15:59:08 +0800 (1 minute ago)
    Duration:            1 minute 17 seconds
    Parameters:
      limit:             300
    
    STEP                   PODNAME                      DURATION  MESSAGE
     ● pod-limits-n262v
       ├-● run-pod(13:13)  pod-limits-n262v-3643890604  1m
       ├-● run-pod(14:14)  pod-limits-n262v-4115394302  1m
       ├-● run-pod(16:16)  pod-limits-n262v-3924248206  1m
       ├-● run-pod(17:17)  pod-limits-n262v-3426515460  1m
       ├-● run-pod(18:18)  pod-limits-n262v-824163662   1m
       ├-● run-pod(20:20)  pod-limits-n262v-4224161940  1m
       ├-● run-pod(22:22)  pod-limits-n262v-1343920348  1m
       ├-● run-pod(2:2)    pod-limits-n262v-3426502220  1m
       ├-● run-pod(32:32)  pod-limits-n262v-2723363986  1m
       ├-● run-pod(34:34)  pod-limits-n262v-2453142434  1m
       ├-● run-pod(37:37)  pod-limits-n262v-3225742176  1m
       ├-● run-pod(3:3)    pod-limits-n262v-2455811176  1m
       ├-● run-pod(40:40)  pod-limits-n262v-2302085188  1m
       ├-● run-pod(6:6)    pod-limits-n262v-1370561340  1m
  2. Run the ags get <workflow name> --sum-info command to show summary statistics of the Pod states. It can be combined with --status, as in the following example.

    # ags get pod-limits-n262v --sum-info --status Error
    Name:                pod-limits-n262v
    Namespace:           default
    ServiceAccount:      default
    Status:              Running
    Created:             Wed Apr 17 15:59:08 +0800 (2 minutes ago)
    Started:             Wed Apr 17 15:59:08 +0800 (2 minutes ago)
    Duration:            2 minutes 6 seconds
    Pending:             198
    Running:             47
    Succeeded:           55
    Parameters:
      limit:             300
    
    STEP                 PODNAME  DURATION  MESSAGE
     ● pod-limits-n262v

Autoscaler usage in the Agile edition

To use the autoscaler in the Agile edition, you must have created, or already have, the following resources:

  • You have a VPC.

  • You have a vSwitch.

  • You have configured a security group.

  • You have obtained the internal API server address of the Agile edition cluster.

  • You have determined the instance type of the nodes to be scaled out.

  • You have created an ECS instance that has Internet access.

In the AGS command line, run the following command and enter the corresponding values as prompted.

$ ags config autoscaler
Please input vswitchs with comma separated
vsw-hp3cq3fnv47bpz7x58wfe
Please input security group id
sg-hp30vp05x6tlx13my0qu
Please input the instanceTypes with comma separated
ecs.c5.xlarge
Please input the new ecs ssh password
xxxxxxxx
Please input k8s cluster APIServer address like(192.168.1.100)
172.24.61.156
Please input the autoscaling mode (current: release. Type enter to skip.)
Please input the min size of group (current: 0. Type enter to skip.)
Please input the max size of group (current: 1000. Type enter to skip.)
Create scaling group successfully.
Create scaling group config successfully.
Enable scaling group successfully.
Succeed

After the configuration is complete, log on to the Auto Scaling console to view the scaling group that has been created.

Configure the ags ConfigMap

In this example, workflows are configured to use hostNetwork by default.

  1. Run the kubectl get cm -n argo command to get the name of the workflow controller ConfigMap.

    # kubectl get cm -n argo
    NAME                            DATA   AGE
    workflow-controller-configmap   1      6d23h
  2. Run the kubectl edit cm workflow-controller-configmap -n argo command to open the ConfigMap and add the following content to it.

    data:
      config: |
        extraConfig:
          enableHostNetwork: true
          defaultDnsPolicy: Default

    After the content is added, the complete workflow-controller-configmap is as follows.

    apiVersion: v1
    data:
      config: |
        extraConfig:
          enableHostNetwork: true
          defaultDnsPolicy: Default
    kind: ConfigMap
    metadata:
      name: workflow-controller-configmap
      namespace: argo
  3. After the configuration is complete, all newly deployed Workflows use hostNetwork by default and their dnsPolicy is Default (see the verification sketch at the end of this section).

  4. Optional: If a PSP is configured, add the following content to the corresponding PSP YAML file.

    hostNetwork: true
    Note

    If the YAML file already contains the hostNetwork parameter, change its value to true.

    The complete example YAML template is as follows:

    apiVersion: policy/v1beta1
    kind: PodSecurityPolicy
    metadata:
      name: restricted
      annotations:
        seccomp.security.alpha.kubernetes.io/allowedProfileNames: 'docker/default,runtime/default'
        apparmor.security.beta.kubernetes.io/allowedProfileNames: 'runtime/default'
        seccomp.security.alpha.kubernetes.io/defaultProfileName:  'runtime/default'
        apparmor.security.beta.kubernetes.io/defaultProfileName:  'runtime/default'
    spec:
      privileged: false
      # Required to prevent escalations to root.
      allowPrivilegeEscalation: false
      # This is redundant with non-root + disallow privilege escalation,
      # but we can provide it for defense in depth.
      requiredDropCapabilities:
        - ALL
      # Allow core volume types.
      volumes:
        - 'configMap'
        - 'emptyDir'
        - 'projected'
        - 'secret'
        - 'downwardAPI'
        # Assume that persistentVolumes set up by the cluster admin are safe to use.
        - 'persistentVolumeClaim'
      hostNetwork: true    # enabled as described in Step 4
      hostIPC: false
      hostPID: false
      runAsUser:
        # Require the container to run without root privileges.
        rule: 'MustRunAsNonRoot'
      seLinux:
        # This policy assumes the nodes are using AppArmor rather than SELinux.
        rule: 'RunAsAny'
      supplementalGroups:
        rule: 'MustRunAs'
        ranges:
          # Forbid adding the root group.
          - min: 1
            max: 65535
      fsGroup:
        rule: 'MustRunAs'
        ranges:
          # Forbid adding the root group.
          - min: 1
            max: 65535
      readOnlyRootFilesystem: false     
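
After the procedure above, you can spot-check a newly deployed workflow Pod to confirm that the host network settings took effect; this is a minimal sketch, and the Pod name is a placeholder:

# ags kubectl get pod <pod-name> -o jsonpath='{.spec.hostNetwork} {.spec.dnsPolicy}'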
    