Run Argo workflows on ECI

Serverless Kubernetes offers pod-level elasticity, with benefits such as startup within seconds, per-second billing, and the ability to create 2,000 pods per minute. This makes it an increasingly popular choice for running Argo. This topic describes how to run Argo workflows on Elastic Container Instance (ECI) in an Alibaba Cloud Container Service for Kubernetes (ACK) cluster.

Set up Kubernetes and Argo

  1. Set up an Alibaba Cloud Serverless Kubernetes cluster.

  2. Deploy Argo in the Kubernetes cluster.

    • (Recommended) Install the ack-workflow component. For more information, see Argo Workflows.

    • Deploy Argo manually. For more information, see the Argo Quick Start.

  3. Install the Argo CLI. For more information, see argo-workflows.

Optimize basic resource configurations

By default, after you deploy Argo, no resource requests or limits are specified for the pods of the two core components, argo-server and workflow-controller. Kubernetes therefore assigns these pods the BestEffort Quality of Service (QoS) class, the lowest class, which makes them susceptible to Out of Memory (OOM) kills or pod evictions when cluster resources are insufficient. We recommend that you adjust the resources for these two component pods based on your cluster's scale. As a starting point, set their requests or limits to 2 vCPUs and 4 GiB of memory or higher.
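
As a rough illustration of this recommendation, the following fragment sketches a Deployment patch that sets equal requests and limits on the workflow-controller container. The container name is an assumption based on the upstream Argo manifests, and the values simply mirror the starting point suggested above; check the actual Deployment in your cluster before applying anything similar. When every container in a pod has requests equal to its limits, Kubernetes assigns the pod the Guaranteed QoS class.

```yaml
# Sketch only: a Deployment patch fragment for the workflow-controller.
# The container name "workflow-controller" is an assumption taken from the
# upstream Argo manifests and may differ in your installation.
spec:
  template:
    spec:
      containers:
      - name: workflow-controller
        resources:
          requests:
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "2"
            memory: 4Gi
```

A similar fragment that targets the argo-server container would cover the second component.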

Use OSS as an artifact repository

By default, Argo uses MinIO as its artifact repository. In a production environment, the stability of the artifact repository is critical. The ack-workflow component supports using Alibaba Cloud Object Storage Service (OSS) as a durable and reliable artifact repository. For instructions on how to configure OSS as your artifact repository, see Configuring Alibaba Cloud OSS.
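
For orientation, the OSS section of the workflow-controller ConfigMap typically looks like the following sketch. The field names follow the upstream Argo Workflows OSS artifact configuration. The ConfigMap name, endpoint, bucket, and Secret names are placeholders or assumptions, not values from this topic, and may differ when you use the ack-workflow component.

```yaml
# Sketch only: OSS artifact repository settings in the workflow-controller ConfigMap.
# The ConfigMap name, endpoint, bucket, and Secret names are placeholders.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  artifactRepository: |
    oss:
      endpoint: http://oss-cn-hangzhou-internal.aliyuncs.com
      bucket: my-argo-artifacts
      # Both entries are Secret selectors: the Secret must already exist
      # in the same namespace and hold the AccessKey pair.
      accessKeySecret:
        name: my-oss-credentials
        key: accessKey
      secretKeySecret:
        name: my-oss-credentials
        key: secretKey
```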

After completing the configuration, use the following example to create a workflow and verify the setup.

  1. Save the following content as workflow-oss.yaml.

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: artifact-passing-
    spec:
      entrypoint: artifact-example
      templates:
      - name: artifact-example
        steps:
        - - name: generate-artifact
            template: whalesay
        - - name: consume-artifact
            template: print-message
            arguments:
              artifacts:
              # bind message to the hello-art artifact
              # generated by the generate-artifact step
              - name: message
                from: "{{steps.generate-artifact.outputs.artifacts.hello-art}}"
      - name: whalesay
        container:
          image: docker/whalesay:latest
          command: [sh, -c]
          args: ["cowsay hello world | tee /tmp/hello_world.txt"]
        outputs:
          artifacts:
          # generate hello-art artifact from /tmp/hello_world.txt
          # artifacts can be directories as well as files
          - name: hello-art
            path: /tmp/hello_world.txt
      - name: print-message
        inputs:
          artifacts:
          # unpack the message input artifact
          # and put it at /tmp/message
          - name: message
            path: /tmp/message
        container:
          image: alpine:latest
          command: [sh, -c]
          args: ["cat /tmp/message"]

  2. Create the workflow.

    argo -n argo submit workflow-oss.yaml

  3. View the execution result of the workflow.

    argo -n argo list

    Expected output:

    $ argo -n argo list
    NAME                    STATUS      AGE   DURATION   PRIORITY   MESSAGE
    artifact-passing-2746t  Succeeded   3m    30s        0

Choose an executor

Each Argo worker pod contains at least two containers:

  • The main container

    This is your application container where your business logic runs.

  • The wait container

    Argo injects this system component into the pod as a sidecar. Its core responsibilities are:

    • Startup phase

      • Load artifacts and inputs that the main container depends on.

    • Running phase

      • Wait for the main container to exit, then kill any associated sidecar containers.

      • Collect outputs and artifacts from the main container and report its status.

The executor acts as a bridge, allowing the wait container to access and control the main container. Argo abstracts this into the ContainerRuntimeExecutor interface, which defines the following operations:

  • GetFileContents: Gets output parameters (outputs/parameters) from the main container.

  • CopyFile: Gets output artifacts (outputs/artifacts) from the main container.

  • GetOutputStream: Gets the standard output (including standard error) of the main container.

  • Wait: Waits for the main container to exit.

  • Kill: Kills associated sidecar containers.

  • ListContainerNames: Lists the names of the containers within the pod.

Argo supports multiple executors with different underlying mechanisms, all designed for standard Kubernetes architectures. Because the architecture of Serverless Kubernetes differs from standard Kubernetes, you must choose a compatible executor. We recommend using the Emissary executor for running Argo in a Serverless Kubernetes environment. The following table details the available executors:

| Executor | Description |
| --- | --- |
| Emissary | It functions by sharing files through an emptyDir volume. This executor is recommended because it depends only on the standard emptyDir capability and has no other dependencies. |
| Kubernetes API | It uses the Kubernetes API, but its functionality is incomplete. Because this executor offers incomplete functionality and can pressure the Kubernetes control plane at scale, it is not recommended. |
| PNS | Relies on process namespace (PID) sharing and chroot within the pod. This approach pollutes the pod's process space and requires privilege. Serverless Kubernetes enforces a higher level of security isolation and does not support privileged containers. Therefore, this executor is incompatible. |
| Docker | Uses the Docker CLI to perform its functions, which requires direct access to the underlying Docker container runtime. Because Serverless Kubernetes does not expose underlying nodes, you cannot access the Docker daemon on the node. Therefore, this executor is incompatible. |
| Kubelet | Uses the Kubelet Client API to perform its functions, which requires access to the underlying Kubelet component on the node. Because Serverless Kubernetes does not expose underlying nodes, you cannot access the Kubelet component. Therefore, this executor is incompatible. |
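
In Argo versions that let you choose among several executors, the choice is made in the workflow-controller ConfigMap. The following is a minimal sketch, assuming the upstream ConfigMap name and the argo namespace used elsewhere in this topic. Newer Argo releases ship with Emissary as the only executor, in which case this key is not needed.

```yaml
# Sketch only: selecting the Emissary executor in the workflow-controller ConfigMap.
# The ConfigMap name is the upstream default and may differ in your installation.
apiVersion: v1
kind: ConfigMap
metadata:
  name: workflow-controller-configmap
  namespace: argo
data:
  containerRuntimeExecutor: emissary
```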

Schedule Argo tasks to ECI

An ACK Serverless cluster automatically schedules all pods to ECI, so no extra configuration is needed. For an ACK managed cluster, you must configure it to schedule pods to ECI. For more information, see Schedule pods to an x86-based virtual node.

The following YAML example demonstrates how to use a label for scheduling:

  • Add the label alibabacloud.com/eci: "true": This label automatically schedules the pod to ECI.

  • (Optional) Specify {"schedulerName": "eci-scheduler"}: This is recommended. During an upgrade or change of the virtual node, the admission webhook might be briefly unavailable. This setting prevents pods from being scheduled to regular nodes during this temporary unavailability.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limit1-
spec:
  entrypoint: parallelism-limit1
  parallelism: 10
  podSpecPatch: '{"schedulerName": "eci-scheduler"}'  # Schedule the pod to ECI.
  podMetadata:
    labels:
      alibabacloud.com/eci: "true"   # Use a label to schedule the pod to ECI.
  templates:
  - name: parallelism-limit1
    steps:
    - - name: sleep
        template: sleep
        withSequence:
          start: "1"
          end: "10"
  - name: sleep
    container:
      image: alpine:latest
      command: ["sh", "-c", "sleep 30"]

Improve pod creation success rate

In a production environment, an Argo workflow often involves multiple compute pods. The failure of any single pod can cause the entire workflow to fail. If your workflow success rate is low, you may need to perform multiple reruns, which impacts execution efficiency and increases costs. Therefore, you should adopt strategies to improve the pod creation success rate:

  • In your Argo workflow definition:

    • Configure an Argo retry policy to automatically retry failed steps. For more information, see Retries.

    • Configure a workflow timeout to limit the total runtime of the workflow. For more information, see Timeouts.

  • When creating ECI pods:

    • Configure multiple zones to prevent pod creation failures due to insufficient inventory in a single zone. For more information, see Deploy pods in multiple zones.

    • Specify multiple instance specifications to avoid creation failures due to insufficient inventory of a specific instance type. For more information, see Create pods by specifying multiple specifications.

    • Specify vCPU and memory requirements instead of a specific instance type. ECI automatically matches your request to an available instance specification based on current inventory.

    • Use instance specifications with 2 vCPU and 4 GiB of memory or more. These are enterprise-grade instances with dedicated resources, which ensures stable performance.

    • Configure a pod fault handling policy to define whether to retry pod creation upon failure. For more information, see Configure a fault handling policy for a pod.

The following is a sample configuration:

  1. Edit the eci-profile ConfigMap to configure multiple zones.

    kubectl edit -n kube-system cm eci-profile

    In the data section, configure vSwitchIds with the IDs of multiple vSwitches:

    data:
      # ...other configurations...
      vSwitchIds: vsw-2ze23nqzig8inprou****,vsw-2ze94pjtfuj9vaymf****  # Specify multiple vSwitch IDs to configure multiple zones.
      vpcId: vpc-2zeghwzptn5zii0w7****
      # ...other configurations...

  2. Use multiple strategies to improve the success rate when you create a pod.

    • Use the k8s.aliyun.com/eci-use-specs annotation to specify multiple instance specifications. In this example, three specifications are listed. ECI attempts to match them in order: ecs.c6.large, ecs.c5.large, and then 2-4Gi (2 vCPU, 4 GiB memory).

    • Use the k8s.aliyun.com/eci-schedule-strategy annotation to set the multi-zone scheduling strategy. This example uses VSwitchRandom, which schedules pods randomly across the configured zones.

    • Configure the retryStrategy to set the Argo retry policy. This example sets retryPolicy: "Always", which retries all failed steps.

    • Use the k8s.aliyun.com/eci-fail-strategy annotation to set the pod fault handling policy. This example uses fail-fast. If pod creation fails, the system immediately reports an error, and the pod status becomes ProviderFailed. The higher-level orchestration system can then decide whether to retry or schedule the pod to a regular node.

    apiVersion: argoproj.io/v1alpha1
    kind: Workflow
    metadata:
      generateName: parallelism-limit1-
    spec:
      entrypoint: parallelism-limit1
      parallelism: 10
      podSpecPatch: '{"schedulerName": "eci-scheduler"}'
      podMetadata:
        labels:
          alibabacloud.com/eci: "true"
        annotations:
          k8s.aliyun.com/eci-use-specs: "ecs.c6.large,ecs.c5.large,2-4Gi"
          k8s.aliyun.com/eci-schedule-strategy: "VSwitchRandom"
          k8s.aliyun.com/eci-fail-strategy: "fail-fast"
      templates:
      - name: parallelism-limit1
        steps:
        - - name: sleep
            template: sleep
            withSequence:
              start: "1"
              end: "10"
      - name: sleep
        retryStrategy:
          limit: "3"
          retryPolicy: "Always"
        container:
          image: alpine:latest
          command: [sh, -c, "sleep 30"]
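
The sample above sets a retry policy but no timeout. The following sketch shows one way to add the workflow timeout mentioned earlier, together with a retry backoff. The fields are standard Argo Workflow fields, but the workflow name and all durations are illustrative assumptions that you should tune for your workload.

```yaml
# Sketch only: timeout and retry backoff settings with illustrative values.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: timeout-retry-
spec:
  entrypoint: sleep
  activeDeadlineSeconds: 3600      # Terminate the whole workflow after 1 hour.
  podSpecPatch: '{"schedulerName": "eci-scheduler"}'
  podMetadata:
    labels:
      alibabacloud.com/eci: "true"
  templates:
  - name: sleep
    activeDeadlineSeconds: 300     # Fail a single step that runs longer than 5 minutes.
    retryStrategy:
      limit: "3"
      retryPolicy: "Always"
      backoff:
        duration: "10s"            # Wait 10 seconds before the first retry.
        factor: "2"                # Double the wait after each retry.
        maxDuration: "5m"          # Give up retrying after 5 minutes.
    container:
      image: alpine:latest
      command: [sh, -c, "sleep 30"]
```

A backoff spaces out retries, which can help when a failure is caused by a temporary inventory shortage rather than by the task itself.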

Optimize pod costs

ECI supports multiple billing methods. By choosing the right billing method for your workload, you can significantly reduce your compute costs.

Accelerate pod creation

A pod's startup time is often dominated by the image pull, a process dependent on image size and network speed. To accelerate pod creation, ECI provides an image cache feature. By creating an image cache from an image in advance, you can reduce or eliminate download time for subsequent pods that use the cache.

There are two types of image caches:

  • Automatic creation: This feature is enabled by default in ECI. When you create an ECI pod, if no exactly matching image cache is found, ECI automatically creates one from the pod's image.

  • Manual creation: You can manually create image caches by using a Custom Resource Definition (CRD).

    We recommend that you manually create an image cache before you run high-concurrency Argo tasks. After the image cache is created, you can specify it in your pod definition and set the pod's image pull policy to IfNotPresent. This allows the pod to skip the image pull step during startup, accelerating pod creation, reducing the runtime of Argo tasks, and lowering operational costs. For more information, see Use ImageCache to accelerate the creation of pods.
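
A manually created image cache might look like the following sketch. The API group, field names, and values are assumptions that should be verified against the ImageCache topic referenced above; the resource name is hypothetical.

```yaml
# Sketch only: an ImageCache resource for the image used in this topic.
# Verify the API version and fields against the ImageCache documentation.
apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: imagecache-alpine    # Hypothetical name.
spec:
  images:
  - alpine:latest
  imageCacheSize: 20         # Size of the image cache, in GiB.
  retentionDays: 7           # Number of days to keep the image cache.
```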

If you have already run the preceding examples, ECI has automatically created an image cache for you. You can log on to the Elastic Container Instance console to check the image cache status. You can use the following YAML to create a workflow that leverages the existing cache and test the pod startup speed.

apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: parallelism-limit1-
spec:
  entrypoint: parallelism-limit1
  parallelism: 100
  podSpecPatch: '{"schedulerName": "eci-scheduler"}'
  podMetadata:
    labels:
      alibabacloud.com/eci: "true"
    annotations:
      k8s.aliyun.com/eci-use-specs: "ecs.c6.large,ecs.c5.large,2-4Gi"
      k8s.aliyun.com/eci-schedule-strategy: "VSwitchRandom"
      k8s.aliyun.com/eci-fail-strategy: "fail-fast"
  templates:
  - name: parallelism-limit1
    steps:
    - - name: sleep
        template: sleep
        withSequence:
          start: "1"
          end: "100"
  - name: sleep
    retryStrategy:
      limit: "3"
      retryPolicy: "Always"
    container:
      imagePullPolicy: IfNotPresent
      image: alpine:latest
      command: [sh, -c, "sleep 30"]

After the workflow is created, you can check the ECI pod's events to see the ID of the matched image cache. The events also show that the image pull process was skipped during pod startup.

Events:
  Type    Reason                  Age    From        Message
  ----    ------                  ---    ----        -------
  Normal  SuccessfulHitImageCache  2m22s  EciService  [eci.imagecache]Successfully hit image cache imc-2zeedp8bxor2kcxxx, eci will be scheduled with this image cache.
  Normal  Pulled                   2m12s  kubelet     Container image "registry.cn-hangzhou.aliyuncs.com/acs/argoexec:v3.3-0d060b7" already present on machine
  Normal  Created                  2m11s  kubelet     Created container init
  Normal  Started                  2m11s  kubelet     Started container init
  Normal  Pulled                   2m11s  kubelet     Container image "registry.cn-hangzhou.aliyuncs.com/acs/argoexec:v3.3-0d060b7" already present on machine
  Normal  Created                  2m11s  kubelet     Created container wait
  Normal  Started                  2m10s  kubelet     Started container wait
  Normal  Pulled                   2m10s  kubelet     Container image "alpine:latest" already present on machine
  Normal  Created                  2m10s  kubelet     Created container main
  Normal  Started                  2m10s  kubelet     Started container main

Accelerate data loading

Argo is widely used in AI inference, where tasks often access large datasets. In compute-storage separation architectures, data loading efficiency directly impacts task duration and cost. Concurrent data access from many Argo tasks can create a storage bottleneck. For example, when concurrent Argo tasks load data from OSS and the OSS bucket's bandwidth is saturated, each compute node becomes blocked at the data loading stage. This increases task duration and compute costs while reducing efficiency.

Fluid, a data acceleration service, solves this problem. Before you run a batch computation, you can create and preload a Fluid dataset, which pre-caches data from OSS onto a small number of cache nodes. When you then start your concurrent Argo tasks, they read data from the cache nodes instead of directly from OSS. The cache nodes effectively expand the available bandwidth from OSS, improving data loading efficiency for the compute nodes. This approach boosts Argo task performance and reduces running costs. For more information about Fluid, see Overview of Fluid.

The following example demonstrates how to use Fluid to load a 10 GB test file from OSS and calculate its MD5 hash across 100 concurrent tasks.

  1. Deploy Fluid.

    1. Log on to the ACK console.

    2. In the left-side navigation pane, choose Marketplace > Marketplace.

    3. Find and click the ack-fluid card.

    4. On the ack-fluid page, click Deploy.

    5. In the panel that appears, select your target cluster, configure the parameters, and click OK.

      After the deployment is complete, you are redirected to the release details page for ack-fluid. If you return to the Helm page, you can see that the status of ack-fluid is Deployed. You can also run a kubectl command to verify that Fluid was deployed successfully.

      ~$ kubectl get pod -n fluid-system
      NAME                                  READY   STATUS    RESTARTS   AGE
      dataset-controller-6f9967d766-pm22l   1/1     Running   0          5m18s
      fluid-webhook-5777b78c-8mt4h          1/1     Running   0          5m18s

  2. Prepare test data.

    After deploying Fluid, use a Fluid dataset to accelerate data access. Before you proceed, upload a 10 GB test file to your OSS bucket.

    1. Generate a test file.

      dd if=/dev/zero of=/test.dat bs=1G count=10

    2. Upload the test file to your OSS bucket. For more information, see Simple upload.

  3. Create an accelerated dataset.

    1. Create the Dataset and JindoRuntime resources.

      kubectl -n argo apply -f dataset.yaml

      The following is an example of the dataset.yaml file. Replace the placeholder AccessKey and OSS bucket information with your values.

      apiVersion: v1
      kind: Secret
      metadata:
        name: access-key
      stringData:
        fs.oss.accessKeyId: ***************         # The AccessKey ID that has permissions to access the OSS bucket.
        fs.oss.accessKeySecret: ******************  # The AccessKey secret that has permissions to access the OSS bucket.
      ---
      apiVersion: data.fluid.io/v1alpha1
      kind: Dataset
      metadata:
        name: serverless-data
      spec:
        mounts:
        - mountPoint: oss://oss-bucket-name/            # The path to your OSS bucket.
          name: demo
          path: /
          options:
            fs.oss.endpoint: oss-cn-shanghai-internal.aliyuncs.com  # The endpoint of the OSS bucket.
          encryptOptions:
            - name: fs.oss.accessKeyId
              valueFrom:
                secretKeyRef:
                  name: access-key
                  key: fs.oss.accessKeyId
            - name: fs.oss.accessKeySecret
              valueFrom:
                secretKeyRef:
                  name: access-key
                  key: fs.oss.accessKeySecret
      ---
      apiVersion: data.fluid.io/v1alpha1
      kind: JindoRuntime
      metadata:
        name: serverless-data
      spec:
        replicas: 10         # The number of JindoRuntime cache nodes to create.
        podMetadata:
          annotations:
            k8s.aliyun.com/eci-use-specs: ecs.g6.2xlarge  # Specify a suitable instance specification.
            k8s.aliyun.com/eci-image-cache: "true"
          labels:
            alibabacloud.com/eci: "true"
        worker:
          podMetadata:
            annotations:
              k8s.aliyun.com/eci-use-specs: ecs.g6.2xlarge # Specify a suitable instance specification.
        tieredstore:
          levels:
            - mediumtype: MEM          # Cache medium type. Use MEM for memory, or LoadRaid0 if you specify an instance with local disks.
              volumeType: emptyDir
              path: /local-storage     # Cache path.
              quota: 12Gi              # Maximum cache capacity.
              high: "0.99"             # High watermark for storage capacity.
              low: "0.99"              # Low watermark for storage capacity.

      Note: This example uses ECI pods as the cache nodes and their memory as the cache medium. Because each ECI pod has a dedicated VPC network interface, its bandwidth is not affected by other pods.

    2. View the results.

      • Check the status of the dataset. A PHASE of Bound indicates successful creation.

        kubectl -n argo get dataset

        Expected output:

        $ kubectl -n argo get dataset
        NAME               UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
        serverless-data    10.00GiB         0.00B    24.00GiB         0.0%                Bound   92s

      • Check the pod information. You can see that 10 JindoRuntime cache nodes have been created by the dataset.

        kubectl -n argo get pods

        Expected output:

        ~$ kubectl -n argo get pods
        NAME                                READY   STATUS    RESTARTS   AGE
        ack-workflow-ddd86b88c-r8fcj        1/1     Running   0          100m
        argo-server-84d69d65dd-1f2hj        1/1     Running   0          100m
        serverless-data-jindofs-master-0    1/1     Running   0          10m
        serverless-data-jindofs-worker-0    1/1     Running   0          9m20s
        serverless-data-jindofs-worker-1    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-2    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-3    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-4    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-5    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-6    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-7    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-8    1/1     Running   0          9m19s
        serverless-data-jindofs-worker-9    1/1     Running   0          9m19s

  4. Preload the data.

    After the dataset is ready, create a DataLoad resource to trigger data preloading.

    1. Create the DataLoad resource.

      kubectl -n argo apply -f dataload.yaml

      The following is an example of the dataload.yaml file:

      apiVersion: data.fluid.io/v1alpha1
      kind: DataLoad
      metadata:
        name: serverless-data-warmup
        namespace: argo
      spec:
        dataset:
          name: serverless-data
          namespace: argo
        loadMetadata: true

    2. Check the progress of the DataLoad operation.

      kubectl -n argo get dataload

      The expected output shows that even though the test file is 10 GB, preloading completes in 14 seconds in this example.

      $ kubectl -n argo get dataload
      NAME                      DATASET          PHASE      AGE   DURATION
      serverless-data-warmup    serverless-data  Complete   30s   14s

  5. Run the Argo workflow.

    After the data is preloaded, you can run the concurrent Argo tasks. For best results, combine this approach with an image cache.

    1. Prepare the Argo workflow configuration file, argo-test.yaml.

      The following is an example of the argo-test.yaml file:

      apiVersion: argoproj.io/v1alpha1
      kind: Workflow
      metadata:
        generateName: parallelism-fluid-
      spec:
        entrypoint: parallelism-fluid
        parallelism: 100
        podSpecPatch: '{"terminationGracePeriodSeconds": 0, "schedulerName": "eci-scheduler"}'
        podMetadata:
          labels:
            alibabacloud.com/fluid-sidecar-target: eci
            alibabacloud.com/eci: "true"
          annotations:
            k8s.aliyun.com/eci-use-specs: 8-16Gi
        templates:
        - name: parallelism-fluid
          steps:
          - - name: domd5sum
              template: md5sum
              withSequence:
                start: "1"
                end: "100"
        - name: md5sum
          container:
            imagePullPolicy: IfNotPresent
            image: alpine:latest
            command: ["sh", "-c", "cp /data/test.dat /test.dat && md5sum /test.dat"]
            volumeMounts:
            - name: data-vol
              mountPath: /data
          volumes:
          - name: data-vol
            persistentVolumeClaim:
              claimName: serverless-data

    2. Create the workflow.

      argo -n argo submit argo-test.yaml

    3. View the execution result of the workflow.

      argo -n argo list

      Expected output:

      $ argo -n argo list
      NAME                     STATUS    AGE   DURATION  PRIORITY  MESSAGE
      parallelism-fluid-56g2q  Running   8s    8s        0

      You can use the kubectl get pod -n argo --watch command to monitor the pod execution progress. In this example, the 100 Argo tasks are completed in about 2 to 4 minutes.

      parallelism-fluid-56g2q-412240702    0/3     Completed   0          3m17s
      parallelism-fluid-56g2q-563802762    0/3     Completed   0          3m19s
      parallelism-fluid-56g2q-693297214    0/3     Completed   0          3m17s
      parallelism-fluid-56g2q-615226358    0/3     Completed   0          3m18s
      parallelism-fluid-56g2q-982629280    0/3     Completed   0          3m20s
      parallelism-fluid-56g2q-918428816    0/3     Completed   0          3m16s
      parallelism-fluid-56g2q-3815880026   0/3     Completed   0          3m18s
      parallelism-fluid-56g2q-2992875428   0/3     Completed   0          3m19s
      parallelism-fluid-56g2q-3800105418   0/3     Completed   0          3m19s
      parallelism-fluid-56g2q-1897482410   0/3     Completed   0          3m17s

      In contrast, when the same Argo tasks run without data acceleration and load the 10 GB test file directly from OSS, they take about 14 to 15 minutes to calculate the MD5 hash.

      parallelism-fluid-fdr2j-2392572892    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-1295033972    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-2462229879    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-4192350503    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-4157125527    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-4173654052    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-1270167245    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-1595813521    0/2    Completed    0    14m
      parallelism-fluid-fdr2j-1829788936    0/2    Completed    0    14m

      In this comparison, Fluid shortens task duration from about 14 minutes to about 3 minutes, which improves computing efficiency and reduces compute costs.