This topic describes common issues and solutions for installation and host deployment with Alibaba Cloud DevOps.
Host agent issues — Legacy agent
Self-hosted instance is offline
Verify that the agent on your server is online. If it is offline, restart it. The basic agent management commands are:
-
Check status:
/home/staragent/bin/staragentctl status
-
Restart:
/home/staragent/bin/staragentctl restart
-
Uninstall:
/home/staragent/bin/staragentctl stop
rm -rf /home/staragent
rm /usr/sbin/staragent_sn
Host deployment fails without error logs
If you are using a non-ECS machine, check if the self-hosted instance was created from an image. If so, uninstall the agent, add the self-hosted instance again, and retry the deployment.
Cannot add a non-ECS self-hosted instance to a host group
For non-ECS self-hosted instances, we recommend using the hybrid cloud hosting mode, which registers a non-Alibaba Cloud machine as an Alibaba Cloud-managed instance. For more information, see https://help.aliyun.com/document_detail/201140.html. Click New host group. In the Self-hosted Instance · Add Host dialog box, set Add Method to Hybrid Cloud Hosting. Then, select a service authorization and region. You can then view and add your non-ECS self-hosted instance from the Available Hosts list.
Host agent issues — Flow Runner
Errors during cluster creation or runner installation
The error message is Failed to install runner for instance i-2ze99uh20nyrykgdobvb! (3000005).
In Alibaba Cloud DevOps Flow, you can create a host group or build cluster by either selecting an ECS instance or installing an agent on a self-hosted instance. However, you cannot use both methods to add the same machine to the same Alibaba Cloud DevOps organization.
For example, if you have already added Host A as an ECS instance, you cannot add it again as a self-hosted instance.
-
To switch the registration method, you must first uninstall the Flow Runner from the host. Use the following commands to uninstall the Flow Runner:
# Stop the specified systemd service.
systemctl stop runner-{version}-{tenant-name}.service
# Delete the systemd configuration file for the service.
rm -rf /etc/systemd/system/runner-{version}-{tenant-name}.service
# Delete the runner configuration directory for the tenant.
rm -rf /root/yunxiao/{tenant-name}/runner/config
-
The deployment command runs successfully when executed manually on the host but fails in the Flow pipeline.
-
Add the relevant environment variables for the command. For example:
source /root/.bash_profile; source /etc/profile
-
Use absolute paths in deployment scripts (for example, /home/admin/app/deploy.sh) instead of relative paths.
-
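The two fixes above can be combined at the top of a deployment script. The following is a minimal sketch, not Flow's own mechanism: the helper names load_login_env and to_abs_path are hypothetical, and the profile files listed are an assumption about where your variables are defined.

```shell
#!/bin/bash

# Hypothetical helper: source the profile files that an interactive login
# shell would normally load, skipping any that do not exist.
load_login_env() {
  local profile
  for profile in /etc/profile "$HOME/.bash_profile"; do
    if [ -r "$profile" ]; then
      . "$profile"
    fi
  done
}

# Hypothetical helper: print the path unchanged if it is already absolute,
# otherwise resolve it against the given base directory.
to_abs_path() {
  local base_dir=$1 path_arg=$2
  case "$path_arg" in
    /*) printf '%s\n' "$path_arg" ;;
    *)  printf '%s/%s\n' "${base_dir%/}" "$path_arg" ;;
  esac
}

# Example usage inside a deployment script (paths are illustrative):
#   load_login_env
#   "$(to_abs_path /home/admin/app deploy.sh)"
```

Resolving paths explicitly keeps the script independent of whatever working directory the pipeline happens to start it in.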
Flow Runner configuration conflicts from ECS image cloning
During a host deployment, if you clone Machine A to create Machine B, Machine B reuses the Flow Runner configuration from Machine A. This can cause channel conflicts, leading to the following issues:
-
Incorrect deployment target: Tasks intended for Machine B are incorrectly dispatched to Machine A.
-
Root cause: After the image was cloned, the unique identifier of the Flow Runner (such as its registration information) was not reset. As a result, the scheduling system cannot distinguish between Machine A and Machine B.
-
Key impact: This conflict disrupts automated workflows and can lead to environment contamination or service overlap.
-
Solution:
Note: If you clone a machine with a configured Flow Runner, you must reinstall the Flow Runner on the new machine (Machine B) to ensure that Machine A and Machine B have unique channel IDs.
# Uninstall the Flow Runner on Machine B
# Check the Flow Runner service name. The name is typically in the format runner-{version}-{tenant-name}.service
ls -al /etc/systemd/system | grep runner
systemctl stop runner-{version}-{tenant-name}.service
rm -rf /etc/systemd/system/runner-{version}-{tenant-name}.service
rm -rf /root/yunxiao/{tenant-name}/runner/config
# Delete the runner-related directory and its files on Machine B
rm -rf /root/yunxiao/
# Verify that the Flow Runner has been completely removed
ps -ef | grep runner
Troubleshoot Flow Runner issues
Before you troubleshoot issues related to host deployments or environment provisioning for private build machines, first check the Flow Runner status on the machine.
Use the diagnostic tool
This tool supports only Linux systems.
-
Download the tool:
wget "https://rdc-public-software.oss-cn-hangzhou.aliyuncs.com/runner/runnerStatusCheck" -O runnerStatusCheck
-
Grant execute permissions:
chmod u+x runnerStatusCheck
-
Run the tool:
./runnerStatusCheck
-
Follow the instructions in the tool's output, as shown in the example below.
[INFO] prepare to check disk
check disk output
Filesystem Size Used Avail Use% Mounted on
/dev/root 99G 59G 36G 62% /
devtmpfs 3.7G 0 3.7G 0% /dev
tmpfs 3.7G 0 3.7G 0% /dev/shm
tmpfs 3.7G 600K 3.7G 1% /run
tmpfs 3.7G 0 3.7G 0% /sys/fs/cgroup
tmpfs 748M 0 748M 0% /run/user/0
[INFO] prepare to check runner status
RunnerService: runner-v0.0.7-be-1fy1ifhg1ylcvqv7rwinqpgh.service status is active running
RunnerService: runner-v0.0.7-be-3n5goas4jkp3clmcaopzeebj.service status is active running
RunnerService: runner-v0.0.7-be-6fcm3z72s2fsnrkcjl7pbpsa.service status is not running , please exeute 'systemctl restart runner-v0.0.7-be-6fcm3z72s2fsnrkcjl7pbpsa.service ' to restart runner service
RunnerService: runner-v0.0.7-be-7srpetjhkcy40btoontpsy8a.service status is not running , please exeute 'systemctl restart runner-v0.0.7-be-7srpetjhkcy40btoontpsy8a.service ' to restart runner service
RunnerService: runner-v0.0.7-be-7ypo76lza21lljapgp5xtopi.service status is active running
RunnerService: runner-v0.0.7-be-9zouh99ta6ycx2dq3iggekk9.service status is not running , please exeute 'systemctl restart runner-v0.0.7-be-9zouh99ta6ycx2dq3iggekk9.service ' to restart runner service
RunnerService: runner-v0.0.7-be-ao81clhgw56bo0jopj5jgwn3.service status is active running
RunnerService: runner-v0.0.7-be-bn9wuh8n4kiba0aa0a8dcirb.service status is active running
RunnerService: runner-v0.0.7-be-fsr2gvdefe5yxl3nyijp36d3.service status is active running
RunnerService: runner-v0.0.7-be-haphfbrhuudbxkzwpcuhn0hk.service status is not running , please exeute 'systemctl restart runner-v0.0.7-be-haphfbrhuudbxkzwpcuhn0hk.service ' to restart runner service
RunnerService: runner-v0.0.7-be-mwkcqcv2hbwozrwmmczopzgm.service status is active running
RunnerService: runner-v0.0.7-be-qdiyjifmn0bbkafobtfpkrst.service status is not running , please exeute 'systemctl restart runner-v0.0.7-be-qdiyjifmn0bbkafobtfpkrst.service ' to restart runner service
RunnerService: runner-v0.0.7-be-qx6gaxojjjpzeblor692hrwk.service status is active running
RunnerService: runner-v0.0.7-be-s5f7tixziynazefazqhls9ob.service status is active running
RunnerService: runner-v0.0.7-be-uwdlmn516wd8emqebhgqwq6x.service status is active running
RunnerService: runner-v0.0.7-be-xjmxrlbzrtksuafv4oyg8qtf.service status is active running
Check runner service status finished.
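When many services are reported as not running, the restart commands suggested by the tool can be collected in one pass. The following is a minimal sketch that assumes the tool writes its report to standard output in the line format shown above, which may differ between tool versions. The function name not_running_services is hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: read runnerStatusCheck output on stdin and print the
# name of every service reported as "not running", one per line.
not_running_services() {
  awk '$1 == "RunnerService:" && /status is not running/ { print $2 }'
}

# Example usage (restarts each service the tool reported as not running):
#   ./runnerStatusCheck | not_running_services | while read -r unit; do
#     systemctl restart "$unit"
#   done
```

Review the printed list before restarting anything, because a host can carry runner services for several tenants.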
Manual troubleshooting
-
Check if your Linux version is supported.
Run the lsb_release -a command to check your Linux distribution. The unified Flow Runner supports the following Linux distributions:
-
CentOS 6+
-
Ubuntu 16.04+
-
Alibaba Cloud Linux 2/3
-
-
Check the Flow Runner service status and logs.
Run the ls -al /etc/systemd/system | grep runner command to find the service name. The service name is typically in the format runner-{version}-{tenant-name}.service.
[root@ecs-for-batch-deploy-1 ~]# lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: CentOS
Description: CentOS Linux release 8.2.2004 (Core)
Release: 8.2.2004
Codename: Core
[root@ecs-for-batch-deploy-1 ~]# ls -al /etc/systemd/system | grep runner
-rw-r--r-- 1 root root 550 Oct 11 10:50 runner-v0.xxx.service
[root@ecs-for-batch-deploy-1 ~]#
Run the systemctl status runner-{version}-{tenant-name}.service command to check the service status. If the status is active (running), the service is operating correctly.
Release: 8.2.2004
Codename: Core
[root@ecs-for-batch-deploy-1 ~]# ls -al /etc/systemd/system | gr
-rw-r--r-- 1 root root 550 Oct 11 10:50 runner-v0.0.5-be-bnyxxx...yt.service
[root@ecs-for-batch-deploy-1 ~]# systemctl status runner-v0.0.5-be-bnyxxx...yt.service
● runner-v0.0.5-be-bnxxx...yt.service - Aliyun yunxiao runner polling jo
Loaded: loaded (/etc/systemd/system/runner-v0.xxx...yt.service)
Active: active (running) since Tue 2022-10-11 10:50:56 CST; 2h ago
Main PID: 1405243 (runner)
Tasks: 10 (limit: 48875)
Memory: 13.6M
CGroup: /system.slice/runner-v0.0.5-be-brxxx...yt.service
└─1405243 /usr/local/share/yunxiao-runner/v0.0.5/runner run --configPath=/r...
Oct 11 11:01:03 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:01:03+08:00]
Oct 11 11:01:03 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:01:03+08:00]
Oct 11 11:01:18 ecs-for-batch-deploy-1 runner[1405243]: WARN[2022-10-11T11:01:18+08:00]
To view real-time execution logs, run the journalctl -u runner-{version}-{tenant-name}.service -a --no-pager --since '5 minutes ago' -f command.
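Because the service name embeds the tenant name, the unit files on a host can be mapped to their tenants and configuration directories. The following is a minimal sketch that assumes the naming format runner-{version}-{tenant-name}.service, a version string without hyphens (such as v0.0.5), and the /root/yunxiao/{tenant-name}/runner/config layout used elsewhere in this topic. All three function names are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: list runner unit files in a systemd unit directory
# (defaults to /etc/systemd/system).
list_runner_units() {
  local unit_dir=${1:-/etc/systemd/system} unit_file
  for unit_file in "$unit_dir"/runner-*.service; do
    if [ -e "$unit_file" ]; then
      basename "$unit_file"
    fi
  done
}

# Hypothetical helper: extract {tenant-name} from a unit name, assuming the
# version part contains no hyphen.
runner_tenant_from_unit() {
  local unit=${1%.service}    # drop the .service suffix
  unit=${unit#runner-}        # drop the runner- prefix
  printf '%s\n' "${unit#*-}"  # drop the version up to the first hyphen
}

# Hypothetical helper: print the assumed configuration directory for a unit.
runner_config_dir() {
  printf '/root/yunxiao/%s/runner/config\n' "$(runner_tenant_from_unit "$1")"
}

# Example usage:
#   for unit in $(list_runner_units); do
#     echo "$unit -> $(runner_config_dir "$unit")"
#     systemctl is-active "$unit"
#   done
```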
ECS instance shows a "deploy channel error" or appears offline
-
On the ECS instance, verify that Cloud Assistant is running correctly. If not, restart it. For more information, see View execution results and troubleshoot common issues.
-
If Cloud Assistant is working properly, check if the ECS instance disk is full. If so, free up disk space.
Host deployment fails to download the latest artifact
Check the logs for messages indicating a full disk. If the server's disk is full, free up space.
Roll back multiple tasks
Rollbacks are performed on a per-task basis. If you have configured three tasks, you must roll back each one individually by selecting the task from the drop-down menu in the deployment history.
Deployment fails and the service does not start
Run the deployment commands directly on the server to debug your deployment script and ensure it is free of errors.
-
Alibaba Cloud DevOps executes the commands you configure in the deployment settings. Copy these commands and run them manually on the server. If the failure persists, debug the deployment script. For example, you can create a new .sh file on the server, paste the commands from your deployment script into it, and then execute the file to debug.
-
If the script runs successfully on the server but fails to start the service when run by Alibaba Cloud DevOps, check for relative paths in your script. Replace them with absolute paths and try again.
Windows host deployment fails with a "deploy channel error"
Alibaba Cloud DevOps does not currently support adding Windows hosts. Try one of the following workarounds:
-
Use a Linux server as an intermediary. Write commands in your deployment script that run on the Linux server to interact with the Windows host.
-
Use an Alibaba Cloud DevOps component to upload the build artifact to your own OSS bucket. The Windows host can then download the artifact from OSS for deployment.
Host deployment fails with a 'not a valid identifier' error
+ export pause_strategy=FirstBatchPause
+ pause_strategy=FirstBatchPause
+ export rootMixActionCode=ENTER
+ rootMixActionCode=ENTER
+ export TIMESTAMP=1607427603743
+ TIMESTAMP=1607427603743
+ export appId=0
+ appId=0
+ export execute_user=root
+ execute_user=root
+ export ENGINE_PIPELINE_CREATOR_ALIYUN_PK=1218858987274713
+ ENGINE_PIPELINE_CREATOR_ALIYUN_PK=1218858987274713
+ export triggerMode=1
+ triggerMode=1
+ export abc=hello world
+ abc=hello
/tmp/rdc_deploy_command_1607427664248_globalParams.sh: line 16: export: 'world': not a valid identifier
[ERROR] The environment variable contains special characters. Please follow the documentation to troubleshoot.
https://thoughts.aliyun.com/sharespace/5e86a419546fd9001aee81f2/docs/5fbca20e550406001d388abc
------Print environment variables------
]
[2020-12-08 19:41:09]Execution result code:[
exit 0
]
DeployCommand execution completed
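This error occurs because an environment variable contains special characters, such as spaces. In the log above, the value hello world is not quoted, so export assigns only hello to abc and rejects world as a variable name. To use these variables, you must configure your pipeline as follows: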
-
In the host deployment task, select the Encode variables checkbox.
-
In your deployment script, decode any environment variable you need to use with Base64. For example, to use the PIPELINE_ID environment variable, add the following line at the beginning of your script:
export PIPELINE_ID=$(echo $PIPELINE_ID | base64 -d)
+ export PIPELINE_NAME=6aKE5Y+RYmFzZTY05rWL6K+V5rWB5rC057q/
+ PIPELINE_NAME=6aKE5Y+RYmFzZTY05rWL6K+V5rWB5rC057q/
+ export BUILD_NUMBER=Ng==
+ BUILD_NUMBER=Ng==
+ export FLOW_ENGINE_ENCODE_GLOBAL_PARAM=dHJ1ZQ==
+ FLOW_ENGINE_ENCODE_GLOBAL_PARAM=dHJ1ZQ==
+ export DATETIME=MjAyMC0xMi0wOC0yMC0yMi01MA==
+ DATETIME=MjAyMC0xMi0wOC0yMC0yMi01MA==
+ export parentMixFlowInstId=MA==
+ parentMixFlowInstId=MA==
+ export GIT_BRANCH=null
+ GIT_BRANCH=null
+ export CI_COMMIT_SHA=MmZhNThkMTk1OWFjYjcwZDQzNDZjZDBlM2E2YzhhYzQyNmE2MTE4NQ==
+ CI_COMMIT_SHA=MmZhNThkMTk1OWFjYjcwZDQzNDZjZDBlM2E2YzhhYzQyNmE2MTE4NQ==
+ export machine_group_id=MTE5MDA=
+ machine_group_id=MTE5MDA=
+ export ENGINE_PIPELINE_INST_ID=MjExNzYxMw==
+ ENGINE_PIPELINE_INST_ID=MjExNzYxMw==
------Print environment variables------
hello world
]
[2020-12-08 20:23:46]Execution result code:[
exit 0
]
DeployCommand execution completed
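If a script uses several encoded variables, the decode step can be wrapped in a helper. The following is a minimal sketch that assumes bash and the GNU base64 utility. The function name decode_var is hypothetical, and the variable names in the usage comment are taken from the sample log above. Decode only variables that were actually encoded: in the sample log, GIT_BRANCH=null is passed through as plain text.

```shell
#!/bin/bash

# Hypothetical helper: replace the Base64-encoded value of the named
# environment variable with its decoded value. Returns non-zero and leaves
# the variable untouched if the value is not valid Base64.
decode_var() {
  local name=$1 decoded
  decoded=$(printf '%s' "${!name}" | base64 -d) || return 1
  export "$name=$decoded"
}

# Example usage at the beginning of a deployment script:
#   for name in PIPELINE_NAME BUILD_NUMBER DATETIME; do
#     decode_var "$name" || echo "could not decode $name" >&2
#   done
```

Quote the decoded variables when you use them later (for example, "$PIPELINE_NAME"), because a decoded value can contain the spaces that caused the original error.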
Host deployment fails with a User.NoPermission error
The deployment details show a specific error code. The following information describes the error and how to troubleshoot it.
User.NoPermission means the user lacks the required API permissions. Verify that the service connection for the deployment group is configured correctly.
Pipeline status is stuck on "Deploying" after successful service start
A host deployment script that remains in the "Deploying" state typically has one of two causes: an incorrectly passed exit code or a child process that fails to exit.
For a sample script, see https://atomgit.com/flow-example/spring-boot/blob/master/deploy.sh.
To diagnose and resolve the issue:
-
Verify the exit code
-
After critical commands in your script, such as the service startup command, add echo $? to ensure the exit code is 0. If a non-zero value is returned, check the error handling logic for that command.
-
Explicitly declare exit 0 at the end of your script to avoid relying on the exit code of the last command.
-
-
Manage child processes
-
If you use nohup to start a background process, such as a Java service, ensure you use the standard format: nohup java -jar ${JAR_NAME} > ${JAVA_OUT} 2>&1 &. The ampersand (&) at the end runs the process in the background and prevents it from blocking the main script.
-
Check other commands that might create child processes, such as systemd or docker run -d, to ensure they detach from the process context correctly.
-
-
Implement a robust timeout mechanism
-
If your service takes a long time to start, add a polling check (such as an HTTP health check) to your script. The script should only exit after the service is ready, instead of relying on a fixed timeout.
-
Example of a corrected script:
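# Start the service in the background. For a command that ends with &, $? only
# reports that the job was launched, not that the service started correctly.
nohup java -jar app.jar > log.txt 2>&1 &
echo "Background launch exit code: $?"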
# Explicitly exit the script with a success code
exit 0
If the issue persists after these optimizations, check the pipeline configuration parameters for the current stage, or check if the Flow Runner service status is abnormal.
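The polling check recommended in the timeout step can be sketched as follows. This is a minimal example under stated assumptions, not a Flow feature: the function name wait_for_ready is hypothetical, and the health check URL in the usage comment is a placeholder for whatever endpoint your service exposes.

```shell
#!/bin/bash

# Hypothetical helper: run a check command repeatedly until it succeeds.
# Usage: wait_for_ready MAX_ATTEMPTS INTERVAL_SECONDS COMMAND [ARGS...]
# Returns 0 as soon as the command succeeds, or 1 after MAX_ATTEMPTS failures.
wait_for_ready() {
  local max_attempts=$1 interval=$2 attempt=1
  shift 2
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    attempt=$((attempt + 1))
    sleep "$interval"
  done
  return 1
}

# Example usage after starting the service in the background
# (the URL is a placeholder):
#   nohup java -jar app.jar > log.txt 2>&1 &
#   if wait_for_ready 30 2 curl -fsS http://127.0.0.1:8080/health; then
#     exit 0
#   else
#     echo "Service did not become ready in time" >&2
#     exit 1
#   fi
```

Bounding the number of attempts means the script always returns an explicit exit code, so the pipeline does not wait indefinitely on a service that never becomes ready.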
"Clean Workspace" task gets stuck during a host deployment
If you manually installed the Docker service, use the following commands to check its status and perform related operations.
# Check Docker status
systemctl status docker
### Key output information:
# Active: active (running): The service is running.
# Active: inactive (dead): The service is not running.
# Loaded: loaded (...): The service configuration has been loaded correctly.
# List running containers. For detailed client and server information, including image and container counts, use docker info.
docker ps
# Restart the Docker service
sudo systemctl restart docker
No logs generated or host is offline during build or deployment
-
Run df -hl to check if the host's disk is full. If it is, free up space and retry.
-
Check the Flow Runner service status. If the status is not active (running), restart the service by following these steps:
-
Get the service name. The format is runner-{version}-{tenant-name}.service.
systemctl | grep "runner-v" | awk '{print $1}'
-
Replace $SERVICE_NAME in the following command with the service name and run it to restart the Flow Runner service.
systemctl restart $SERVICE_NAME
-
-
Check for network connectivity issues.
-
Run systemctl status runner-{version}-{tenant-name}.service to query the Flow Runner service status and find the --configPath parameter, as shown below:
[root@ecs-for-batch-deploy-1 ~]# systemctl status runner-v0.0.5-be-bnyxxx1llzpci0q606yt.service
● runner-v0.0.5-be-xxx...yt.service - Aliyun yunxiao runner polling jobs to execute
Loaded: loaded (/etc/systemd/system/runner-v0.0.5-be-bxxx...yt.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2022-10-11 10:50:56 CST; 41min ago
Main PID: 1405243 (runner)
Tasks: 10 (limit: 48875)
Memory: 13.7M
CGroup: /system.slice/runner-v0.0.5-be-bnyxxx...yt.service
└─1405243 /usr/local/share/yunxiao-runner/v0.0.5/runner run --configPath=/root/yunxiao/be-bnykv60t831llzpci0q606yt/runner/config
Oct 11 11:31:46 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:31:46+08:00] [runner] no new job, skip. runner=9e7073c92b2fa8b35bc0efd3eec051f8
Oct 11 11:31:46 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:31:46+08:00] counter current: 0, limit: 50
Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: WARN[2022-10-11T11:32:01+08:00] Response status code: 204 body data:
Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:32:01+08:00] POST /api/v2/builds/request, time spent 15.07 s
Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:32:01+08:00] [runner] no new job, skip. runner=9e7073c92b2fa8b35bc0efd3eec051f8
Oct 11 11:32:01 ecs-for-batch-deploy-1 runner[1405243]: INFO[2022-10-11T11:32:01+08:00] counter current: 0, limit: 50
Oct 11 11:32:16 ecs-for-batch-deploy-1 runner[1405243]: WARN[2022-10-11T11:32:16+08:00] ...
-
View the URL in the configPath by running cat {PATH_TO_CONFIG}/config.yml | grep url.
[root@ecs-for-batch-deploy-1 config]# cat /root/yunxiao/be-bnykv60t83lllzpci0q606yt/runner/config/config.yml | grep url
url: https://xxx.com
[root@ecs-for-batch-deploy-1 config]#
-
Run the following command to check URL accessibility.
# Note: Replace {url} in the command below with your actual URL.
curl '{url}/api/v2/runner/storage/latest?os=linux&arch=amd64'
-
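The last two network checks can be chained so that the URL is read from the configuration file and tested in one step. The following is a minimal sketch that assumes config.yml contains a single line of the form url: https://..., as in the sample output above. The function names runner_url_from_config and check_runner_endpoint are hypothetical.

```shell
#!/bin/bash

# Hypothetical helper: print the value of the first "url:" entry found in
# the given config.yml file.
runner_url_from_config() {
  awk '$1 == "url:" { print $2; exit }' "$1"
}

# Hypothetical helper: request the storage endpoint named in this topic and
# print the HTTP status code. Fails if no URL is found or curl cannot connect.
check_runner_endpoint() {
  local url
  url=$(runner_url_from_config "$1") || return 1
  [ -n "$url" ] || return 1
  curl -sS -m 10 -o /dev/null -w '%{http_code}\n' \
    "${url}/api/v2/runner/storage/latest?os=linux&arch=amd64"
}

# Example usage (replace {PATH_TO_CONFIG} with the --configPath value):
#   check_runner_endpoint {PATH_TO_CONFIG}/config.yml
```

A connection error or timeout from curl points to a network problem between the host and the service, whereas a printed status code shows that the endpoint is reachable.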