Troubleshoot host deployments

This topic describes common issues and solutions for installation and host deployment with Alibaba Cloud DevOps.

Host agent issues — Legacy agent

Self-hosted instance is offline

Verify that the agent on your server is online. If it is offline, restart it. The basic agent management commands are:

  • Check status:

    /home/staragent/bin/staragentctl status
  • Restart:

    /home/staragent/bin/staragentctl restart
  • Uninstall:

    /home/staragent/bin/staragentctl stop
    rm -rf /home/staragent
    rm /usr/sbin/staragent_sn

Host deployment fails without error logs

If you are using a non-ECS machine, check if the self-hosted instance was created from an image. If so, uninstall the agent, add the self-hosted instance again, and retry the deployment.

Cannot add a non-ECS self-hosted instance to a host group

For non-ECS self-hosted instances, we recommend using the hybrid cloud hosting mode, which registers a non-Alibaba Cloud machine as an Alibaba Cloud-managed instance. For more information, see https://help.aliyun.com/document_detail/201140.html. To add the instance:

  1. Click New host group.

  2. In the Self-hosted Instance · Add Host dialog box, set Add Method to Hybrid Cloud Hosting.

  3. Select a service authorization and region.

  4. In the Available Hosts list, view and add your non-ECS self-hosted instance.

Host agent issues — Flow Runner

Errors during cluster creation or runner installation

The following error message is returned: Failed to install runner for instance i-2ze99uh20nyrykgdobvb! (3000005).

In Alibaba Cloud DevOps Flow, you can create a host group or build cluster by either selecting an ECS instance or installing an agent on a self-hosted instance. However, you cannot use both methods to add the same machine to the same Alibaba Cloud DevOps organization.

For example, if you have already added Host A as an ECS instance, you cannot add it again as a self-hosted instance.

  • To switch the registration method, you must first uninstall the Flow Runner from the host. Use the following commands to uninstall the Flow Runner:

    # Stop the specified systemd service.
    systemctl stop runner-{version}-{tenant-name}.service
    # Delete the systemd configuration file for the service.
    rm -rf /etc/systemd/system/runner-{version}-{tenant-name}.service
    # Delete the runner configuration directory for the tenant.
    rm -rf /root/yunxiao/{tenant-name}/runner/config
  • If the deployment command runs successfully when executed manually on the host but fails in the Flow pipeline, try the following:

    • Load the relevant environment variables before the command. For example: source /root/.bash_profile; source /etc/profile.

    • Use absolute paths in deployment scripts (for example, /home/admin/app/deploy.sh) instead of relative paths.
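The two fixes for a command that works manually but fails in the pipeline can be combined in a small wrapper. The following is a minimal sketch, not part of Alibaba Cloud DevOps: the helper names load_env and to_abs_path are hypothetical, and /home/admin/app is only the example directory mentioned above. It assumes the pipeline shell may not load login profiles automatically.

```shell
#!/bin/sh
# Hypothetical helpers for a deployment script; the names are illustrative only.

# Source each readable profile file so that variables such as PATH or
# JAVA_HOME are defined before the deployment command runs.
load_env() {
  for profile in "$@"; do
    if [ -r "$profile" ]; then
      . "$profile"
    fi
  done
  return 0
}

# Print an absolute path: keep absolute inputs as they are, and prefix
# relative ones with the given base directory (default: current directory).
to_abs_path() {
  case "$1" in
    /*) printf '%s\n' "$1" ;;
    *)  printf '%s/%s\n' "${2:-$PWD}" "$1" ;;
  esac
}
```

In a deployment command, this might be used as load_env /etc/profile /root/.bash_profile followed by sh "$(to_abs_path deploy.sh /home/admin/app)".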

Flow Runner configuration conflicts from ECS image cloning

During a host deployment, if you clone Machine A to create Machine B, Machine B reuses the Flow Runner configuration from Machine A. This can cause channel conflicts, leading to the following issues:

  • Incorrect deployment target: Tasks intended for Machine B are incorrectly dispatched to Machine A.

  • Root cause: After the image was cloned, the unique identifier of the Flow Runner (such as its registration information) was not reset. As a result, the scheduling system cannot distinguish between Machine A and Machine B.

  • Key impact: This conflict disrupts automated workflows and can lead to environment contamination or service overlap.

  • Solution:

    Note

    If you clone a machine with a configured Flow Runner, you must reinstall the Flow Runner on the new machine (Machine B) to ensure that Machine A and Machine B have unique channel IDs.

    # Uninstall the Flow Runner on Machine B
    # Check the Flow Runner service name. The name is typically in the format runner-{version}-{tenant-name}.service
    ls -al /etc/systemd/system | grep runner
    systemctl stop runner-{version}-{tenant-name}.service
    rm -rf /etc/systemd/system/runner-{version}-{tenant-name}.service
    rm -rf /root/yunxiao/{tenant-name}/runner/config
    # Delete the runner-related directory and its files on Machine B
    rm -rf /root/yunxiao/ 
    # Verify that the Flow Runner has been completely removed
    ps -ef | grep runner
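The {version} and {tenant-name} placeholders in the commands above can be derived from the unit file name. The following sketch shows one way to do this; the function names are hypothetical, and it assumes that the version segment contains no hyphen (as in v0.0.7) and that the config directory follows the /root/yunxiao/{tenant-name}/runner/config layout shown above.

```shell
#!/bin/sh
# Print "<version> <tenant-name>" for a unit named
# runner-{version}-{tenant-name}.service.
# Assumption: the version segment contains no hyphen (for example, v0.0.7).
runner_unit_parts() {
  unit="${1##*/}"            # drop any leading directory
  name="${unit#runner-}"     # drop the "runner-" prefix
  name="${name%.service}"    # drop the ".service" suffix
  printf '%s %s\n' "${name%%-*}" "${name#*-}"
}

# Print the config directory that the uninstall steps remove for this unit.
runner_config_dir() {
  parts="$(runner_unit_parts "$1")"
  printf '/root/yunxiao/%s/runner/config\n' "${parts#* }"
}
```

Printing the derived path with runner_config_dir and reviewing it before running rm -rf reduces the risk of deleting the wrong tenant's configuration.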

Troubleshoot Flow Runner issues

Important

Before you troubleshoot issues related to host deployments or environment provisioning for private build machines, first check the Flow Runner status on the machine.

Use the diagnostic tool

This tool supports only Linux systems.

  1. Download the tool:

    wget "https://rdc-public-software.oss-cn-hangzhou.aliyuncs.com/runner/runnerStatusCheck" -O runnerStatusCheck
  2. Grant execute permissions:

    chmod u+x runnerStatusCheck
  3. Run the tool:

    ./runnerStatusCheck
  4. Follow the instructions in the tool's output, as shown in the example below.

    [INFO] prepare to check disk
    check disk output
    Filesystem      Size  Used Avail Use% Mounted on
    /dev/root        99G   59G   36G  62% /
    devtmpfs        3.7G     0  3.7G   0% /dev
    tmpfs           3.7G     0  3.7G   0% /dev/shm
    tmpfs           3.7G  600K  3.7G   1% /run
    tmpfs           3.7G     0  3.7G   0% /sys/fs/cgroup
    tmpfs           748M     0  748M   0% /run/user/0
    [INFO] prepare to check runner status
    RunnerService: runner-v0.0.7-be-1fy1ifhg1ylcvqv7rwinqpgh.service  status is active running
    RunnerService: runner-v0.0.7-be-3n5goas4jkp3clmcaopzeebj.service  status is active running
    RunnerService: runner-v0.0.7-be-6fcm3z72s2fsnrkcjl7pbpsa.service  status is not running , please exeute 'systemctl restart runner-v0.0.7-be-6fcm3z72s2fsnrkcjl7pbpsa.service ' to restart runner service
    RunnerService: runner-v0.0.7-be-7srpetjhkcy40btoontpsy8a.service  status is not running , please exeute 'systemctl restart runner-v0.0.7-be-7srpetjhkcy40btoontpsy8a.service ' to restart runner service
    RunnerService: runner-v0.0.7-be-7ypo76lza21lljapgp5xtopi.service  status is active running
    RunnerService: runner-v0.0.7-be-9zouh99ta6ycx2dq3iggekk9.service  status is not running , please exeute 'systemctl restart runner-v0.0.7-be-9zouh99ta6ycx2dq3iggekk9.service ' to restart runner service
    RunnerService: runner-v0.0.7-be-ao81clhgw56bo0jopj5jgwn3.service  status is active running
    RunnerService: runner-v0.0.7-be-bn9wuh8n4kiba0aa0a8dcirb.service  status is active running
    RunnerService: runner-v0.0.7-be-fsr2gvdefe5yxl3nyijp36d3.service  status is active running
    RunnerService: runner-v0.0.7-be-haphfbrhuudbxkzwpcuhn0hk.service  status is not running , please exeute 'systemctl restart runner-v0.0.7-be-haphfbrhuudbxkzwpcuhn0hk.service ' to restart runner service
    RunnerService: runner-v0.0.7-be-mwkcqcv2hbwozrwmmczopzgm.service  status is active running
    RunnerService: runner-v0.0.7-be-qdiyjifmn0bbkafobtfpkrst.service  status is not running , please exeute 'systemctl restart runner-v0.0.7-be-qdiyjifmn0bbkafobtfpkrst.service ' to restart runner service
    RunnerService: runner-v0.0.7-be-qx6gaxojjjpzeblor692hrwk.service  status is active running
    RunnerService: runner-v0.0.7-be-s5f7tixziynazefazqhls9ob.service  status is active running
    RunnerService: runner-v0.0.7-be-uwdlmn516wd8emqebhgqwq6x.service  status is active running
    RunnerService: runner-v0.0.7-be-xjmxrlbzrtksuafv4oyg8qtf.service  status is active running
    Check runner service status finished.
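When many runner services are listed, the stopped ones can be filtered out of the tool's output. The following is a minimal sketch that assumes the output keeps the line format shown in the example above (this may differ between tool versions); the function name not_running_services is hypothetical.

```shell
#!/bin/sh
# Read runnerStatusCheck output on stdin and print the name of every
# runner service that is reported as not running.
not_running_services() {
  awk '$1 == "RunnerService:" && /status is not running/ { print $2 }'
}
```

For example, ./runnerStatusCheck | not_running_services prints one service name per line, and each name can then be passed to systemctl restart as the tool suggests.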

Manual troubleshooting

  1. Check if your Linux version is supported.

    Run the lsb_release -a command to check your Linux distribution. The unified Flow Runner supports the following Linux distributions:

    • CentOS 6+

    • Ubuntu 16.04+

    • Alibaba Cloud Linux 2/3

  2. Check the Flow Runner service status and logs.

    Run the ls -al /etc/systemd/system | grep runner command to find the service name. The service name is typically in the format runner-{version}-{tenant-name}.service.

    [root@ecs-for-batch-deploy-1 ~]# lsb_release -a
    LSB Version:    :core-4.1-amd64:core-4.1-noarch
    Distributor ID: CentOS
    Description:    CentOS Linux release 8.2.2004 (Core)
    Release:        8.2.2004
    Codename:       Core
    [root@ecs-for-batch-deploy-1 ~]# ls -al /etc/systemd/system | grep runner
    -rw-r--r--    1 root root  550 Oct 11 10:50 runner-v0.xxx.service
    [root@ecs-for-batch-deploy-1 ~]#

    Run the systemctl status runner-{version}-{tenant-name}.service command to check the service status. If the status is active (running), the service is operating correctly.

    [root@ecs-for-batch-deploy-1 ~]# ls -al /etc/systemd/system | gr
    -rw-r--r--   1 root root  550 Oct 11 10:50 runner-v0.0.5-be-bnyxxx...yt.service
    [root@ecs-for-batch-deploy-1 ~]# systemctl status runner-v0.0.5-be-bnyxxx...yt.service
    ● runner-v0.0.5-be-bnxxx...yt.service - Aliyun yunxiao runner polling jo
         Loaded: loaded (/etc/systemd/system/runner-v0.xxx...yt.service)
         Active: active (running) since Tue 2022-10-11 10:50:56 CST; 2h ago
       Main PID: 1405243 (runner)
          Tasks: 10 (limit: 48875)
         Memory: 13.6M
         CGroup: /system.slice/runner-v0.0.5-be-brxxx...yt.service
                 └─1405243 /usr/local/share/yunxiao-runner/v0.0.5/runner run --configPath=/r...
    Oct 11 11:01:03 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:01:03+08:00]
    Oct 11 11:01:03 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:01:03+08:00]
    Oct 11 11:01:18 ecs-for-batch-deploy-1 runner[1405243]: WARN[2022-10-11T11:01:18+08:00]

    To view real-time execution logs, run the journalctl -u runner-{version}-{tenant-name}.service -a --no-pager --since '5 minutes ago' -f command.
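If a host runs several Flow Runner services, the status check in step 2 can be repeated for each unit file. The following sketch is one possible approach, not an official tool: the function name runner_states and the SYSTEMCTL override are hypothetical, and it assumes that the unit files are located in /etc/systemd/system as described above.

```shell
#!/bin/sh
# Print "<unit> <state>" for every runner unit file in a directory.
# SYSTEMCTL can be overridden (a hypothetical hook for dry runs and tests).
runner_states() {
  dir="${1:-/etc/systemd/system}"
  for unit_file in "$dir"/runner-*.service; do
    [ -e "$unit_file" ] || continue
    unit="${unit_file##*/}"
    # is-active prints the state and returns non-zero if the unit is not active.
    state="$("${SYSTEMCTL:-systemctl}" is-active "$unit" 2>/dev/null)" || true
    printf '%s %s\n' "$unit" "${state:-unknown}"
  done
}
```

Any unit whose state is not active is a candidate for the journalctl command above.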

ECS instance shows a "deploy channel error" or appears offline

  • On the ECS instance, verify that Cloud Assistant is running correctly. If not, restart it. For more information, see View execution results and troubleshoot common issues.

  • If Cloud Assistant is working properly, check if the ECS instance disk is full. If so, free up disk space.

Host deployment fails to download the latest artifact

Check the logs for messages indicating a full disk. If the server's disk is full, free up space.
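To confirm a full disk from the command line, the output of df can be filtered by usage. The following is a minimal sketch: the function name full_filesystems and the default threshold of 90 percent are assumptions, and mount points that contain spaces are not handled.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print "<mount point> <use%>" for every
# file system whose usage is at or above the threshold (default: 90).
full_filesystems() {
  awk -v limit="${1:-90}" 'NR > 1 {
    used = $5
    sub(/%/, "", used)
    if (used + 0 >= limit + 0) print $6, $5
  }'
}
```

For example, df -P | full_filesystems 90 lists only the file systems that are at least 90 percent full.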

Roll back multiple tasks

Rollbacks are performed on a per-task basis. If you have configured three tasks, you must roll back each one individually by selecting the task from the drop-down menu in the deployment history.

Deployment fails and the service does not start

Run the deployment commands directly on the server to debug your deployment script and ensure it is free of errors.

  1. Alibaba Cloud DevOps executes the commands you configure in the deployment settings. Copy these commands and run them manually on the server. If the failure persists, debug the deployment script. For example, you can create a new .sh file on the server, paste the commands from your deployment script into it, and then execute the file to debug.

  2. If the script runs successfully on the server but fails to start the service when run by Alibaba Cloud DevOps, check for relative paths in your script. Replace them with absolute paths and try again.

Windows host deployment fails with a "deploy channel error"

Alibaba Cloud DevOps does not currently support adding Windows hosts. Try one of the following workarounds:

  1. Use a Linux server as an intermediary. Write commands in your deployment script that run on the Linux server to interact with the Windows host.

  2. Use an Alibaba Cloud DevOps component to upload the build artifact to your own OSS bucket. The Windows host can then download the artifact from OSS for deployment.

Host deployment fails with a 'not a valid identifier' error

+ export pause_strategy=FirstBatchPause
+ pause_strategy=FirstBatchPause
+ export rootMixActionCode=ENTER
+ rootMixActionCode=ENTER
+ export TIMESTAMP=1607427603743
+ TIMESTAMP=1607427603743
+ export appId=0
+ appId=0
+ export execute_user=root
+ execute_user=root
+ export ENGINE_PIPELINE_CREATOR_ALIYUN_PK=1218858987274713
+ ENGINE_PIPELINE_CREATOR_ALIYUN_PK=1218858987274713
+ export triggerMode=1
+ triggerMode=1
+ export abc=hello world
+ abc=hello
/tmp/rdc_deploy_command_1607427664248_globalParams.sh: line 16: export: 'world': not a valid identifier
[ERROR] The environment variable contains special characters. Please follow the documentation to troubleshoot.
https://thoughts.aliyun.com/sharespace/5e86a419546fd9001aee81f2/docs/5fbca20e550406001d388abc
------Print environment variables------
]
[2020-12-08 19:41:09]Execution result code:[
exit 0
]
DeployCommand execution completed

This error occurs because an environment variable contains special characters, such as spaces. To use these variables, you must configure your pipeline as follows:

  1. In the host deployment task, select the Encode variables checkbox.

  2. In your deployment script, Base64-decode each environment variable that you need to use. For example, to use the PIPELINE_ID environment variable, add the following line at the beginning of your script: export PIPELINE_ID="$(echo "$PIPELINE_ID" | base64 -d)". The double quotes keep values that contain spaces intact.

    + export PIPELINE_NAME=6aKE5Y+RYmFzZTY05rWL6K+V5rWB5rC057q/
    + PIPELINE_NAME=6aKE5Y+RYmFzZTY05rWL6K+V5rWB5rC057q/
    + export BUILD_NUMBER=Ng==
    + BUILD_NUMBER=Ng==
    + export FLOW_ENGINE_ENCODE_GLOBAL_PARAM=dHJ1ZQ==
    + FLOW_ENGINE_ENCODE_GLOBAL_PARAM=dHJ1ZQ==
    + export DATETIME=MjAyMC0xMi0wOC0yMC0yMi01MA==
    + DATETIME=MjAyMC0xMi0wOC0yMC0yMi01MA==
    + export parentMixFlowInstId=MA==
    + parentMixFlowInstId=MA==
    + export GIT_BRANCH=null
    + GIT_BRANCH=null
    + export CI_COMMIT_SHA=MmZhNThkMTk1OWFjYjcwZDQzNDZjZDBlM2E2YzhhYzQyNmE2MTE4NQ==
    + CI_COMMIT_SHA=MmZhNThkMTk1OWFjYjcwZDQzNDZjZDBlM2E2YzhhYzQyNmE2MTE4NQ==
    + export machine_group_id=MTE5MDA=
    + machine_group_id=MTE5MDA=
    + export ENGINE_PIPELINE_INST_ID=MjExNzYxMw==
    + ENGINE_PIPELINE_INST_ID=MjExNzYxMw==
    ------Print environment variables------
    hello world
    ]
    [2020-12-08 20:23:46]Execution result code:[
    exit 0
    ]
    DeployCommand execution completed
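The decoding step can be wrapped in a helper when a script uses several variables. The following is a minimal sketch: the function name decode_flow_var is hypothetical, and the special case for null is an assumption based on the sample log above, where GIT_BRANCH=null appears without encoding.

```shell
#!/bin/sh
# Decode one Base64 value. Pass the literal "null" through unchanged,
# because the sample log shows such values without encoding.
decode_flow_var() {
  if [ "$1" = "null" ]; then
    printf '%s\n' "$1"
  else
    printf '%s' "$1" | base64 -d
  fi
}
```

For example, abc="$(decode_flow_var "$abc")" restores a value such as hello world, and decode_flow_var Ng== prints 6, which matches the BUILD_NUMBER value in the log.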

Host deployment fails with a User.NoPermission error

The deployment details show the User.NoPermission error code, which indicates that the user lacks the required API permissions. Verify that the service connection for the deployment group is configured correctly.

Pipeline status is stuck on "Deploying" after successful service start

A host deployment script that remains in the "Deploying" state typically has one of two causes: an incorrectly passed exit code or a child process that fails to exit.

For a sample script, see https://atomgit.com/flow-example/spring-boot/blob/master/deploy.sh.

To diagnose and resolve the issue:

  1. Verify the exit code

    • After critical commands in your script, such as the service startup command, add echo $? to confirm that the exit code is 0. If a non-zero value is returned, check the error handling logic for that command.

    • Explicitly declare exit 0 at the end of your script to avoid relying on the exit code of the last command.

  2. Manage child processes

    • If you use nohup to start a background process, such as a Java service, ensure you use the standard format: nohup java -jar ${JAR_NAME} > ${JAVA_OUT} 2>&1 &. The ampersand (&) at the end runs the process in the background and prevents it from blocking the main script.

    • Check other commands that might create child processes, such as services started through systemd or docker run -d, to ensure they detach from the process context correctly.

  3. Implement a robust timeout mechanism

    • If your service takes a long time to start, add a polling check (such as an HTTP health check) to your script. The script should only exit after the service is ready, instead of relying on a fixed timeout.

Example of a corrected script:

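# Start the service in the background so that it does not block the script
nohup java -jar app.jar > log.txt 2>&1 &
# For a background command, $? is 0 once the job is launched; it does not show whether the service started correctly
echo "Service launched with exit code: $?"
# Explicitly exit the script with a success code
exit 0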
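The polling check recommended in step 3 can be added before the final exit. The following is a minimal sketch, not the sample script linked above: the function name wait_until is hypothetical, and the health check URL in the usage note is a placeholder for your own service endpoint.

```shell
#!/bin/sh
# Run a check command up to <attempts> times, waiting <interval> seconds
# after each failed attempt. Return 0 as soon as the check succeeds,
# or 1 if every attempt fails.
wait_until() {
  attempts="$1"
  interval="$2"
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}
```

For example, wait_until 30 2 curl -fsS http://127.0.0.1:8080/health || exit 1 makes the deployment fail with a non-zero exit code if the service is not ready within about one minute, instead of reporting success unconditionally.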

If the issue persists after these optimizations, check the pipeline configuration parameters for the current stage, or check if the Flow Runner service status is abnormal.

"Clean Workspace" task gets stuck during a host deployment

If you manually installed the Docker service, use the following commands to check its status and perform related operations.

# Check Docker status
systemctl status docker
# Key output information:
# Active: active (running): The service is running.
# Active: inactive (dead): The service is not running.
# Loaded: loaded (...): The service configuration has been loaded correctly.
# List the running containers. This also confirms that the Docker daemon responds.
docker ps
# Restart the Docker service
sudo systemctl restart docker

No logs generated or host is offline during build or deployment

  • Run df -hl to check if the host's disk is full. If it is, free up space and retry.

  • Check the Flow Runner service status. If the status is not active (running), restart the service by following these steps:

    1. Get the service name. The format is runner-{version}-{tenant-name}.service.

      systemctl | grep "runner-v" | awk '{print $1}'
    2. Replace $SERVICE_NAME in the following command with the service name and run it to restart the Flow Runner service.

      systemctl restart $SERVICE_NAME
  • Check for network connectivity issues.

    • Run systemctl status runner-{version}-{tenant-name}.service to query the Flow Runner service status and find the --configPath parameter, as shown below:

      [root@ecs-for-batch-deploy-1 ~]# systemctl status runner-v0.0.5-be-bnyxxx1llzpci0q606yt.service
      ● runner-v0.0.5-be-xxx...yt.service - Aliyun yunxiao runner polling jobs to execute
         Loaded: loaded (/etc/systemd/system/runner-v0.0.5-be-bxxx...yt.service; enabled; vendor preset: disabled)
         Active: active (running) since Tue 2022-10-11 10:50:56 CST; 41min ago
       Main PID: 1405243 (runner)
          Tasks: 10 (limit: 48875)
         Memory: 13.7M
         CGroup: /system.slice/runner-v0.0.5-be-bnyxxx...yt.service
                 └─1405243 /usr/local/share/yunxiao-runner/v0.0.5/runner run --configPath=/root/yunxiao/be-bnykv60t831llzpci0q606yt/runner/config
      Oct 11 11:31:46 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:31:46+08:00] [runner] no new job, skip.                    runner=9e7073c92b2fa8b35bc0efd3eec051f8
      Oct 11 11:31:46 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:31:46+08:00] counter current: 0, limit: 50
      Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: WARN[2022-10-11T11:32:01+08:00] Response status code: 204 body data:
      Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:32:01+08:00] POST /api/v2/builds/request, time spent 15.07 s
      Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:32:01+08:00] [runner] no new job, skip.                    runner=9e7073c92b2fa8b35bc0efd3eec051f8
      Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:32:01+08:00] counter current: 0, limit: 50
      Oct 11 11:32:16 ecs-for-batch-deploy-1 runner[1405243]: WARN[2022-10-11T11:32:16+08:00] ...
    • View the URL in the configPath by running cat {PATH_TO_CONFIG}/config.yml | grep url.

      [root@ecs-for-batch-deploy-1 config]# cat /root/yunxiao/be-bnykv60t831llzpci0q606yt/runner/config/config.yml | grep url
      url: https://xxx.com
      [root@ecs-for-batch-deploy-1 config]#
    • Run the following command to check URL accessibility.

      # Note: Replace {url} in the command below with your actual URL.
      curl '{url}/api/v2/runner/storage/latest?os=linux&arch=amd64'
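The last two steps can be combined so that the URL does not have to be copied by hand. The following is a minimal sketch: the function names are hypothetical, and it assumes that config.yml contains an unquoted url: entry, as in the sample output above.

```shell
#!/bin/sh
# Print the first "url:" value found in a runner config.yml.
runner_url() {
  awk '$1 == "url:" { print $2; exit }' "$1"
}

# Build the storage endpoint used in the accessibility check above.
storage_check_url() {
  printf '%s/api/v2/runner/storage/latest?os=linux&arch=amd64\n' "$(runner_url "$1")"
}
```

For example, curl -fsS --max-time 10 "$(storage_check_url {PATH_TO_CONFIG}/config.yml)" returns a non-zero exit code if the endpoint cannot be reached within 10 seconds.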