How to troubleshoot Nginx Ingress issues in Container Service for Kubernetes (ACK)

This topic describes the diagnostic workflow, troubleshooting approach, common inspection methods, and solutions for Nginx Ingress issues.

- Diagnostic workflow
- Troubleshooting approach
- Common inspection methods
  - View access logs from the Controller Pod in Simple Log Service (SLS)
  - Check error logs in the Controller Pod
  - Manually access the Ingress and backend pods from the Controller Pod
  - Commands for troubleshooting Nginx Ingress
  - Nginx Ingress status
  - Capture packets
- Common issues and solutions
  - Connection-related issues
    - Unable to access the cluster LoadBalancer external address from within the cluster
    - Unable to access the Ingress Controller itself
    - The default or old TLS certificate is still used after adding or modifying a TLS certificate in the cluster
    - Unable to connect to a gRPC service exposed through Ingress
    - Unable to connect to a backend HTTPS service
    - Unable to preserve the client source IP in the Ingress pod
  - Phased release issues
    - Phased release rules do not take effect
    - Traffic distribution does not match phased release rules, or other traffic enters the phased release service
  - Error-related issues
    - Error "failed calling webhook" when creating an Ingress resource
    - SSL_ERROR_RX_RECORD_TOO_LONG error during HTTPS access
    - Common HTTP error codes appear
    - net::ERR_HTTP2_SERVER_REFUSED_STREAM error appears
    - Error "The param of ServerGroupName is illegal" appears
    - Error "certificate signed by unknown authority" when creating an Ingress
  - Other issues
    - Ingress pod health check fails and causes restarts
    - Add TCP or UDP services
    - Ingress rules do not take effect
    - Some resources fail to load or a blank screen appears after rewriting to the root directory
    - How to fix abnormal SLS log parsing after a version upgrade
    - Error "cannot list/get/update resource" appears
    - Error "configuration file failed" appears
    - Error "Unexpected error validating SSL certificate" appears
    - Issue with many uncleaned configuration files in the Controller
    - Troubleshoot persistent Pending status of pods after Controller upgrade
    - TCP stream mixing issue with multiple CLBs under high concurrency in Flannel CNI + IPVS clusters

Background information

The Kubernetes community officially maintains the Ingress NGINX Controller. ACK's Nginx Ingress Controller uses the version provided by the Kubernetes community and supports all community Annotations configurations.

For Nginx Ingress resources to work properly, you must deploy an Nginx Ingress Controller in the cluster to parse the Ingress forwarding rules. The Nginx Ingress Controller receives requests, matches them against the Ingress rules, and then forwards them to the corresponding backend Service pods for processing. In Kubernetes, the relationship among a Service, an Nginx Ingress, and the Nginx Ingress Controller is as follows:

- A Service is an abstraction of backend services. One Service can represent multiple identical backend services.
- An Nginx Ingress defines reverse proxy rules that specify which Service pods receive HTTP or HTTPS requests. For example, requests are routed to different Service pods based on the Host and URL path in each request.
- The Nginx Ingress Controller is a component in the Kubernetes cluster that parses Nginx Ingress reverse proxy rules. When an Ingress is added, deleted, or modified, the Nginx Ingress Controller immediately updates its forwarding rules. When the controller receives a request, it forwards the request to the appropriate Service pod based on these rules.

The Nginx Ingress Controller retrieves Ingress resource changes from the API Server, dynamically generates configuration files required by the load balancer (such as nginx.conf), and reloads the load balancer (for example, by running nginx -s reload to reload Nginx) to apply new routing rules.

Diagnostic workflow

[Figure: Ingress diagnostic workflow]

1. Check whether the issue is caused by the Ingress, and make sure that the Ingress Controller configuration is correct.
  - Verify that access from the Controller Pod meets expectations. For details, see Manually access the Ingress and backend pods from the Controller Pod.
  - Confirm that you use the Nginx Ingress Controller correctly. For details, see the Nginx Ingress community documentation.
2. Use the Ingress diagnostics feature to check Ingress and component configurations, and make modifications based on the prompts. For details about the Ingress diagnostics feature, see Use the Ingress diagnostics feature.
3. Follow the troubleshooting approach to identify related issues and solutions.
4. If the preceding steps do not resolve the issue, follow these steps:
  - For HTTPS certificate issues:
    1. Check whether WAF or WAF in transparent proxy mode is enabled for the domain name.
      - If enabled, confirm that WAF or transparent WAF does not have a TLS certificate configured.
      - If not enabled, proceed to the next step.
    2. Check whether the SLB uses Layer 7 listeners.
      - If yes, confirm that no TLS certificate is configured on the SLB Layer 7 listener.
      - If no, proceed to the next step.
  - For issues other than HTTPS certificate issues, check the error logs in the Controller Pod. For details, see Check error logs in the Controller Pod.
5. If the preceding steps do not resolve the issue, capture packets in both the Controller Pod and the corresponding backend application pod to identify the problem. For details, see Capture packets.

Troubleshooting approach

Troubleshooting approach	Issue symptoms	Solutions
Access failure	Pods inside the cluster cannot access the Ingress	Unable to access the cluster LoadBalancer external address from within the cluster
	Ingress cannot access itself	Unable to access the Ingress Controller itself
	Unable to access TCP or UDP services	Add TCP or UDP services
HTTPS access issues	Certificate not updated or default certificate returned	The default or old TLS certificate is still used after adding or modifying a TLS certificate in the cluster
HTTPS access issues	Returns `RX_RECORD_TOO_LONG/wrong version number`	SSL_ERROR_RX_RECORD_TOO_LONG error during HTTPS access
Issues when adding Ingress resources	Error "failed calling webhook..."	Error "failed calling webhook" when creating an Ingress resource
Issues when adding Ingress resources	Ingress added but not effective	Ingress rules do not take effect
Access does not meet expectations	Unable to obtain the client source IP	Unable to preserve the client source IP in the Ingress pod
	IP allowlist does not take effect or does not work as expected	Unable to preserve the client source IP in the Ingress pod
	Unable to connect to a gRPC service exposed through Ingress	Unable to connect to a gRPC service exposed through Ingress
	Phased release does not take effect	Phased release rules do not take effect
	Phased release rules are incorrect or affect other traffic	Traffic distribution does not match phased release rules, or other traffic enters the phased release service
	The following error occurs: `The plain HTTP request was sent to HTTPS port`	Unable to connect to a backend HTTPS service
	Errors such as 502, 503, 413, or 499 appear	Common HTTP error codes appear
Some resources fail to load when loading a page	Configured `rewrite-target`, but resources return 404	Some resources fail to load or a blank screen appears after rewriting to the root directory
Some resources fail to load when loading a page	Resource access returns `net::ERR_FAILED` or `net::ERR_HTTP2_SERVER_REFUSED_STREAM`	net::ERR_HTTP2_SERVER_REFUSED_STREAM error appears

Common inspection methods

Use the Ingress diagnostics feature

1. Log on to the ACK console. In the left navigation pane, click Clusters.
2. On the Clusters page, click the name of your cluster. In the left navigation pane, click Inspections and Diagnostics > Diagnostics.
3. On the Diagnostics page, click Ingress diagnostics.
4. In the Ingress diagnostics panel, click Diagnose. Enter the problematic URL, for example, https://www.example.com. Select I acknowledge and agree, and then click Start Diagnosis.
5. After the diagnosis is complete, resolve the issue based on the diagnosis results.

View access logs from the Controller Pod in Simple Log Service (SLS)

You can view the Ingress Controller access log format in the ConfigMap (the default ConfigMap is nginx-configuration in the kube-system namespace).

The default log format for ACK Ingress Controller is:

$remote_addr - [$remote_addr] - $remote_user [$time_local]
    "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length
    $request_time [$proxy_upstream_name] $upstream_addr $upstream_response_length
    $upstream_response_time $upstream_status $req_id $host [$proxy_alternative_upstream_name]

Important

After modifying the log format, you must also update the log collection rules in SLS. Otherwise, correct log information will not appear in the SLS log console. Modify the log format with caution.

The Ingress Controller logs in the Simple Log Service console are shown in the following figure. For details, see Collect container logs from ACK clusters.

[Figure: Ingress Controller logs in the Simple Log Service console]

The field names in the Simple Log Service console differ slightly from the actual log field names. The following table lists these fields and their descriptions.

Field	Description
`remote_addr/client_ip`	The client's originating IP address.
`request/(method+url+version)`	Request information, including the request method, URL, and HTTP version.
`request_time`	Total time for this request, from receiving the client request to sending the complete response. This value may be affected by client network conditions and does not represent the actual request processing speed.
`upstream_addr`	Address of the backend upstream. If the request does not reach the backend, this value is empty. When multiple upstreams are requested due to backend failures, this value is a comma-separated list.
`upstream_status`	HTTP code returned by the backend upstream. If this value is a normal HTTP status code, it is returned by the backend upstream. When no backend is accessible, this value is 502. Multiple values are separated by commas.
`upstream_response_time`	Response time of the backend upstream, in seconds.
`proxy_upstream_name`	Name of the backend upstream. The naming convention is `<namespace>-<service name>-<port>`.
`proxy_alternative_upstream_name`	Name of the alternative backend upstream. This value is not empty when the request is forwarded to an alternative upstream (for example, a phased release service configured with Canary).

To view recent access logs directly from the container, run the following command.

kubectl logs <controller pod name> -n <namespace> | less

Expected output:

42.11.**.** - [42.11.**.**] - - [25/Nov/2021:11:40:30 +0800] "GET / HTTP/1.1" 200 615 "_" "curl/7.64.1" 76 0.001 [default-nginx-svc-80] 172.16.254.208:80 615 0.000 200 46b79dkahflhakjhdhfkah**** 47.11.**.** []
42.11.**.** - [42.11.**.**] - - [25/Nov/2021:11:40:31 +0800] "GET / HTTP/1.1" 200 615 "_" "curl/7.64.1" 76 0.001 [default-nginx-svc-80] 172.16.254.208:80 615 0.000 200 fadgrerthflhakjhdhfkah**** 47.11.**.** []
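
When the command returns many lines, a per-status tally is often easier to scan than the raw log. The following is a minimal sketch, assuming the default log format described in this section. `count_status` is a hypothetical helper name used for illustration and is not part of kubectl or the controller.

```shell
# Hypothetical helper: tally access-log lines by HTTP status code.
# Assumes the default ACK log format, in which $status is the first
# token after the quoted "$request" field.
count_status() {
  awk -F'"' '
    # With a double quote as the separator, field 3 holds " $status $body_bytes_sent ".
    NF >= 7 { split($3, f, " "); if (f[1] ~ /^[0-9][0-9][0-9]$/) n[f[1]]++ }
    END     { for (s in n) print s, n[s] }
  ' | sort
}
```

To use it, pipe the output of `kubectl logs <controller pod name> -n <namespace>` into `count_status`. Lines that do not look like access-log entries are skipped. If you changed the log format in the ConfigMap, the field positions assumed here no longer apply.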

Check error logs in the Controller Pod

Use the logs from the Ingress Controller Pod to narrow down the issue scope. Error logs in the Controller Pod fall into two categories:

Controller error logs: Generated when Ingress configuration errors occur. Run the following command to filter Controller error logs.
```
kubectl logs <controller pod name> -n <namespace> | grep -E '^[WE]'
```
Note
The Ingress Controller generates several Warning-level logs during startup, which is normal. For example, warnings about missing kubeConfig or Ingress Class do not affect the normal operation of the Ingress Controller and can be ignored.
Nginx error logs: Generated when request processing errors occur. Run the following command to filter Nginx error logs.
```
kubectl logs <controller pod name> -n <namespace> | grep error
```

Manually access the Ingress and backend pods from the Controller Pod

Run the following command to enter the Controller Pod.

kubectl exec <controller pod name> -n <namespace> -it -- bash

The Pod has pre-installed tools such as curl and OpenSSL. Use these tools to test connectivity and certificate configuration correctness.
- Run the following command to test access to the backend through Ingress.
```
# Replace your.domain.com with the actual domain name to test.
curl -H "Host: your.domain.com" http://127.0.0.1/ # for http
curl --resolve your.domain.com:443:127.0.0.1 https://your.domain.com/ # for https; the URL must use the domain name so that --resolve and SNI apply
```
- Run the following command to verify certificate information.
```
openssl s_client -servername your.domain.com -connect 127.0.0.1:443
```
- Test access to backend pods.
  Note
  The Ingress Controller accesses backend pods directly by pod IP, not through the Service Cluster IP.
  1. Run the following command to obtain the backend pod IP using kubectl.
```
kubectl get pod -n <namespace> <pod name> -o wide
```
    Expected output:
    NAME                     READY   STATUS    RESTARTS   AGE   IP            NODE                       NOMINATED NODE   READINESS GATES
    nginx-dp-7f5fcc7f-****   1/1     Running   0          23h   10.71.0.146   cn-beijing.192.168.**.**   <none>           <none>
    From the expected output, the backend pod IP is 10.71.0.146.
  2. Run the following command to access the pod from the Controller Pod and confirm connectivity between the Controller Pod and the backend pod.
```
curl http://<your pod ip>:<port>/path
```

Commands for troubleshooting Nginx Ingress

kubectl-plugin
The official Kubernetes Ingress controller was originally based on Nginx but switched to OpenResty starting from version 0.25.0. The controller listens for changes to Ingress resources on the API Server, automatically generates the corresponding Nginx configuration, and reloads the configuration to apply changes. For more information, see the official documentation.
As the number of Ingress resources increases, all configurations are consolidated into a single nginx.conf file, making the file lengthy and difficult to debug. Starting from version 0.14.0, the Upstream section is dynamically generated using Lua-resty-balancer, further increasing debugging difficulty. To simplify debugging, the community contributed a kubectl plugin called Ingress-nginx. For details, see kubectl-plugin.
Run the following command to obtain information about the backend services currently known to the Ingress-nginx controller.
```
kubectl ingress-nginx backends -n ingress-nginx
```

dbg command

Besides using the kubectl-plugin, you can also use the dbg command to view and diagnose related information.

Run the following command to enter the Nginx Ingress container.

kubectl exec -it -n kube-system <nginx-ingress-pod-name> -- bash

Run the /dbg command to see the following output.

nginx-ingress-controller-69f46d8b7-qmt25:/$ /dbg

dbg is a tool for quickly inspecting the state of the nginx instance
Usage:
  dbg [command]
Available Commands:
  backends    Inspect the dynamically-loaded backends information
  certs       Inspect dynamic SSL certificates
  completion  Generate the autocompletion script for the specified shell
  conf        Dump the contents of /etc/nginx/nginx.conf
  general     Output the general dynamic lua state
  help        Help about any command
Flags:
  -h, --help              help for dbg
      --status-port int   Port to use for the lua HTTP endpoint configuration. (default 10246)
Use "dbg [command] --help" for more information about a command.

Check whether a certificate exists for a specific domain name.

/dbg certs get <hostname>

View information about all current backend services.

/dbg backends all

Nginx Ingress status

Nginx includes a self-check module that outputs runtime statistics. In the Nginx Ingress container, use curl to access nginx_status on port 10246 to view request and connection statistics for Nginx.

Run the following command to enter the Nginx Ingress container.
```
kubectl exec -it -n kube-system <nginx-ingress-pod-name> -- bash
```
Run the following command to view the current request and connection statistics for Nginx.
```
nginx-ingress-controller-79c5b4d87f-xxx:/etc/nginx$ curl localhost:10246/nginx_status
Active connections: 12818 
server accepts handled requests
 22717127 22717127 823821421 
Reading: 0 Writing: 382 Waiting: 12483 
```
Since Nginx started, it has accepted 22,717,127 connections and handled all of them (the accepts and handled counters are equal, so no connection was dropped). These connections served 823,821,421 requests, an average of about 36.3 requests per connection.
- Active connections: Total active connections on the Nginx server is 12,818.
- Reading: Number of connections where Nginx is reading the request header is 0.
- Writing: Number of connections where Nginx is sending the response is 382.
- Waiting: Number of keep-alive connections is 12,483.
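
The average number of requests per connection can be derived from the counters instead of being calculated by hand. The following is a minimal sketch, assuming the nginx_status text layout shown in this section. `requests_per_connection` is a hypothetical helper name.

```shell
# Hypothetical helper: derive requests per connection from nginx_status output.
# Reads the status text on stdin. The counters are on the line that follows
# "server accepts handled requests", in the order accepts, handled, requests.
requests_per_connection() {
  awk '
    prev ~ /^server accepts handled requests/ && $2 > 0 { printf "%.1f\n", $3 / $2 }
    { prev = $0 }
  '
}
```

Inside the Nginx Ingress container, pipe `curl -s localhost:10246/nginx_status` into `requests_per_connection`. A low ratio suggests that most clients open a new connection for every request, which is worth checking against the keep-alive settings of the clients or of the SLB in front of the controller.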

Capture packets

When you cannot locate the issue, capture packets for auxiliary diagnosis.

1. Based on preliminary issue identification, determine whether the network issue occurs in the Ingress Pod or the application Pod. If information is insufficient, capture packets from both sides.
2. Log on to the node where the problematic application Pod or Ingress Pod resides.
3. On the ECS instance (outside the container), run the following command to capture Ingress traffic to a file.
```
tcpdump -i any host <application Pod IP or Ingress Pod IP> -C 20 -W 200 -w /tmp/ingress.pcap
```
4. Monitor logs and stop packet capture when the expected error occurs.
5. Combine error logs with business logs to locate the exact packet information at the time of the error.
Note
- Under normal circumstances, packet capture does not affect business operations and only slightly increases CPU load and disk writes.
- The preceding command rotates captured packets, generating up to 200 .pcap files of 20 MB each.

Unable to access the cluster LoadBalancer external address from within the cluster

Issue symptoms

Pods on some nodes in the cluster cannot access backend pods through the external address (SLB instance IP address) of the Nginx Ingress Controller, while pods on other nodes can.

Root cause

This issue is caused by the externalTrafficPolicy configuration of the Service associated with the Controller, which determines how external traffic is handled. Requests from within the cluster to the LoadBalancer external address are treated as external traffic. When the policy is set to Local, such requests succeed only from nodes that run a Controller pod. When the policy is set to Cluster, they succeed from any node.

Solutions

(Recommended) Access from within the Kubernetes cluster using ClusterIP or the Ingress service name. The Ingress service name is nginx-ingress-lb.kube-system.
Run the kubectl edit svc nginx-ingress-lb -n kube-system command to modify the Ingress service. Change externalTrafficPolicy in the LoadBalancer Service to Cluster. If the cluster uses the Flannel container network plugin, the client source IP will be lost. If Terway is used, the source IP is preserved.

Example:

apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/backend-type: eni   # Direct ENI.
  labels:
    app: nginx-ingress-lb
  name: nginx-ingress-lb
  namespace: kube-system
spec:
  externalTrafficPolicy: Cluster

For more information about Service Annotations, see Configure Classic Load Balancer (CLB) using Annotations.
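
Before editing the Service, it can help to confirm the current policy. The following is a minimal sketch. `show_traffic_policy` is a hypothetical helper name, and the defaults are the Service name and namespace mentioned in this section.

```shell
# Hypothetical helper: print the externalTrafficPolicy of a Service.
#   $1 = Service name (default nginx-ingress-lb), $2 = namespace (default kube-system)
show_traffic_policy() {
  svc="${1:-nginx-ingress-lb}"
  ns="${2:-kube-system}"
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.externalTrafficPolicy}'
}
```

Calling `show_traffic_policy` without arguments prints Local or Cluster for the default Ingress Service.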

Unable to access the Ingress Controller itself

Issue symptoms

In Flannel clusters, accessing the Ingress itself from the Ingress Pod using the domain name, SLB IP, or Cluster IP results in partial or complete request failures.

Root cause

The default Flannel configuration does not allow loopback access.

Solutions

(Recommended) If possible, rebuild the cluster and use the Terway network plugin. Migrate existing cluster workloads to the Terway-mode cluster.
If rebuilding the cluster is not feasible, modify the Flannel configuration to enable hairpinMode. After modifying the configuration, recreate the Flannel Pod to apply the changes.
1. Run the following command to edit Flannel.
```
kubectl edit cm kube-flannel-cfg -n kube-system
```
2. In the returned cni-conf.json file, add "hairpinMode": true to the delegate section.
  Example:
```
cni-conf.json: |
    {
      "name": "cb0",
      "cniVersion":"0.3.1",
      "type": "flannel",
      "delegate": {
        "isDefaultGateway": true,
        "hairpinMode": true
      }
    }
```
3. Run the following command to delete and recreate Flannel.
```
kubectl delete pod -n kube-system -l app=flannel   
```

The default or old TLS certificate is still used after adding or modifying a TLS certificate in the cluster

Issue symptoms

After adding or modifying a Secret in the cluster and specifying secretName in the Ingress, access still uses the default certificate (Kubernetes Ingress Controller Fake Certificate) or the old certificate.

Root cause

The certificate is not returned by the Ingress Controller in the cluster.
The certificate is invalid and not correctly loaded by the Controller.
The Ingress Controller returns certificates based on SNI, but the TLS handshake might not carry SNI.

Solutions

Use one of the following methods to confirm whether the SNI field is included during TLS connection establishment:
- Use a newer browser version that supports SNI.
- When testing certificates with the openssl s_client command, include the -servername parameter.
- When using the curl command, add hosts or use the --resolve parameter to map the domain name instead of using IP + Host request header.
Confirm that no TLS certificate is configured on WAF, WAF in transparent proxy mode, or the SLB Layer 7 listener. The TLS certificate should be returned by the Ingress Controller in the cluster.
Use the Ingress diagnostics feature in the intelligent operations console to check for configuration errors and error logs. For details, see Use the Ingress diagnostics feature.
Run the following command to manually view error logs in the Ingress Pod and make modifications based on the prompts in the error logs.
```
kubectl logs <ingress pod name> -n <pod namespace> | grep -E '^[EW]'
```
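
To see which certificate is actually returned for a given SNI host name, the openssl commands can be combined. The following is a minimal sketch. `served_cert` is a hypothetical helper name, and the default address 127.0.0.1 assumes that you run it inside the Controller Pod.

```shell
# Hypothetical helper: show the subject and expiry date of the certificate
# that an endpoint returns for a given SNI host name.
#   $1 = host name to send as SNI, $2 = address to connect to (default 127.0.0.1)
served_cert() {
  host="$1"
  addr="${2:-127.0.0.1}"
  openssl s_client -servername "$host" -connect "$addr:443" </dev/null 2>/dev/null |
    openssl x509 -noout -subject -enddate
}
```

For example, `served_cert your.domain.com` prints the certificate subject and its notAfter date. If the subject is Kubernetes Ingress Controller Fake Certificate, the Controller did not load a certificate for that host name. Running the same check against the SLB address instead of 127.0.0.1 shows whether a different certificate is returned in front of the Controller.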

Unable to connect to a gRPC service exposed through Ingress

Issue symptoms

Unable to access the gRPC service behind the Ingress.

Root cause

The annotation that specifies the backend protocol is not set in the Ingress resource.
The gRPC service can only be accessed through a TLS port.

Solutions

Set the following annotation in the corresponding Ingress resource: nginx.ingress.kubernetes.io/backend-protocol: "GRPC".
Confirm that the client sends requests using the TLS port and encrypts the traffic.
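
The annotation can also be applied from the command line instead of editing the Ingress manifest. The following is a minimal sketch. `set_backend_protocol` is a hypothetical helper name, and the Ingress name and namespace are placeholders that you supply.

```shell
# Hypothetical helper: set the backend-protocol annotation on an Ingress.
#   $1 = Ingress name, $2 = namespace, $3 = protocol (for example GRPC)
set_backend_protocol() {
  kubectl annotate ingress "$1" -n "$2" \
    "nginx.ingress.kubernetes.io/backend-protocol=$3" --overwrite
}
```

For example, `set_backend_protocol <ingress name> <namespace> GRPC`. The `--overwrite` flag replaces an existing value, so review the current annotation first if other tooling manages the Ingress.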

Unable to connect to a backend HTTPS service

Issue symptoms

Unable to access the HTTPS service behind the Ingress.
The response might be 400 with the message The plain HTTP request was sent to HTTPS port.

Root cause

The Ingress Controller sends requests to backend pods using the default HTTP protocol.

Solution

Set the following annotation in the Ingress resource: nginx.ingress.kubernetes.io/backend-protocol: "HTTPS".

Unable to preserve the client source IP in the Ingress pod

Issue symptoms

The Ingress Pod cannot preserve the client's originating IP address and displays the node IP, an IP in the 100.XX.XX.XX CIDR block, or another address.

Root cause

The externalTrafficPolicy of the Service used by Ingress is set to Cluster.
Layer 7 proxy is used on the SLB.
WAF or WAF in transparent proxy mode is used.

Solutions

For cases where externalTrafficPolicy is set to Cluster and a Layer 4 SLB is used on the frontend:
Change externalTrafficPolicy to Local. However, this might cause internal cluster access to the Ingress via SLB IP to fail. For the solution, see Unable to access the cluster LoadBalancer external address from within the cluster.
For cases using Layer 7 proxy (Layer 7 SLB, WAF, or transparent WAF):
1. Ensure that the Layer 7 proxy is used and the X-Forwarded-For request header is enabled.
2. Add enable-real-ip: "true" to the Ingress Controller's ConfigMap (default is nginx-configuration in the kube-system namespace).
3. Check logs to verify whether the source IP can be obtained.
For long chains with multiple forwards (for example, an additional reverse proxy service configured before the Ingress Controller), enable enable-real-ip and observe the value of remote_addr in the logs to determine whether the real IP is passed to the Ingress container via the X-Forwarded-For request header. If not, use X-Forwarded-For or similar methods to carry the client's real IP before the request reaches the Ingress Controller.
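
The enable-real-ip setting can be applied with a merge patch instead of an interactive edit. The following is a minimal sketch. `enable_real_ip` is a hypothetical helper name, and the defaults are the ConfigMap name and namespace mentioned in this section.

```shell
# Hypothetical helper: add enable-real-ip: "true" to the controller ConfigMap.
#   $1 = ConfigMap name (default nginx-configuration), $2 = namespace (default kube-system)
enable_real_ip() {
  cm="${1:-nginx-configuration}"
  ns="${2:-kube-system}"
  kubectl patch configmap "$cm" -n "$ns" --type merge \
    -p '{"data":{"enable-real-ip":"true"}}'
}
```

A merge patch changes only the listed key and leaves the other ConfigMap entries untouched. After applying it, check the remote_addr field in the access logs to confirm that the client IP is now recorded.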

Phased release rules do not take effect

Issue symptoms

Phased release is configured in the cluster, but the phased release rules do not take effect.

Root cause

Possible causes:

When using canary-* related Annotations, nginx.ingress.kubernetes.io/canary: "true" is not set.
When using canary-* related Annotations, Nginx Ingress Controller versions earlier than 0.47.0 require the Host field in the Ingress rule to contain your business domain name and cannot be empty.

Solutions

Modify nginx.ingress.kubernetes.io/canary: "true" or the Host field in the Ingress rule based on the preceding causes. For details, see Routing rules.
If the preceding situations do not apply, see Traffic distribution does not match phased release rules, or other traffic enters the phased release service.

Traffic distribution does not match phased release rules, or other traffic enters the phased release service

Issue symptoms

Phased release rules are configured, but traffic is not distributed according to the rules, or normal Ingress traffic enters the phased release service.

Root cause

In the Nginx Ingress Controller, phased release rules apply to all Ingress resources that use the same Service, not just a single Ingress.

For details about this issue, see Ingress with phased release rules affects all Ingress resources that use the same Service.

Solution

For Ingress resources that require phased release (including those using service-match and canary-* related Annotations), create independent Services (including both production and phased release Services) pointing to the original pods. Then enable phased release for the Ingress. For details, see Implement phased release and blue-green deployment using Nginx Ingress.

Error "failed calling webhook" when creating an Ingress resource

Issue symptoms

When adding an Ingress resource, the error "Internal error occurred: failed calling webhook..." appears, as shown in the following figure.

[Figure: "failed calling webhook" error message]

Root cause

When adding an Ingress resource, the system validates the Ingress resource through a service (default is ingress-nginx-controller-admission). If the validation chain fails (for example, the service is deleted or the Ingress controller is deleted), validation fails and the Ingress resource addition is rejected.

Solutions

Check along the webhook chain to ensure all resources exist and work properly. The chain is ValidatingWebhookConfiguration → Service → Pod.
Confirm that the admission feature of the Ingress Controller Pod is enabled and the Pod can be accessed externally.
If the Ingress Controller has been deleted or webhook functionality is not needed, delete the ValidatingWebhookConfiguration resource directly.
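
The webhook chain can be walked with three kubectl queries. The following is a minimal sketch. `check_webhook_chain` is a hypothetical helper name, the Service name is the default mentioned in this section, and the kube-system namespace is an assumption that you may need to change.

```shell
# Hypothetical helper: walk the admission webhook chain
# (ValidatingWebhookConfiguration -> Service -> Pod endpoints).
#   $1 = admission Service name, $2 = namespace (assumed to be kube-system)
check_webhook_chain() {
  svc="${1:-ingress-nginx-controller-admission}"
  ns="${2:-kube-system}"
  kubectl get validatingwebhookconfigurations
  kubectl get svc "$svc" -n "$ns"
  # An empty ENDPOINTS column means that no ready Pod backs the Service.
  kubectl get endpoints "$svc" -n "$ns"
}
```

A NotFound error from the second or third query, or an empty endpoint list, points to the broken link in the chain.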

SSL_ERROR_RX_RECORD_TOO_LONG error during HTTPS access

Issue symptoms

During HTTPS access, the error SSL_ERROR_RX_RECORD_TOO_LONG or routines:CONNECT_CR_SRVR_HELLO:wrong version number appears.

Root cause

The HTTPS request is sent to a non-HTTPS port, such as an HTTP port.

Common causes:

Port 443 of the SLB is bound to port 80 of the Ingress Pod.
Port 443 of the Service corresponding to the Ingress Controller is mapped to port 80 of the Ingress Pod.

Solution

Modify the SLB or Service settings based on the actual situation to ensure HTTPS requests reach the correct port.
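
To check the Service side of the mapping, list each Service port with its target port. The following is a minimal sketch. `show_port_mapping` is a hypothetical helper name, and the defaults assume the nginx-ingress-lb Service in the kube-system namespace.

```shell
# Hypothetical helper: list "name: port -> targetPort" for each Service port.
#   $1 = Service name (default nginx-ingress-lb), $2 = namespace (default kube-system)
show_port_mapping() {
  svc="${1:-nginx-ingress-lb}"
  ns="${2:-kube-system}"
  kubectl get svc "$svc" -n "$ns" \
    -o jsonpath='{range .spec.ports[*]}{.name}{": "}{.port}{" -> "}{.targetPort}{"\n"}{end}'
}
```

If port 443 maps to target port 80, HTTPS requests reach the HTTP port of the Ingress Pod and produce the error described here. The target port may also be shown as a named port, in which case compare it with the container port names in the Controller Deployment.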

Common HTTP error codes appear

Issue symptoms

Requests return non-2xx and non-3xx errors, such as 502, 503, 413, or 499.

Root cause and solutions

Check logs to determine whether the error is returned by the Ingress Controller. For details, see View access logs from the Controller Pod in Simple Log Service (SLS). If so, refer to the following solutions:

413 error
- Root cause: The Nginx Ingress Controller works normally, but the request payload exceeds the maximum allowed size.
- Solution: Modify the Controller configuration using kubectl edit cm -n kube-system nginx-configuration and adjust the values of nginx.ingress.kubernetes.io/client-max-body-size and nginx.ingress.kubernetes.io/proxy-body-size as needed (default value 20m).
499 error
- Root cause: The client disconnects early for some reason, which is not necessarily a component or backend issue.
- Solutions:
  - If there are few 499 errors, they may be normal depending on the business and can be ignored.
  - If there are many 499 errors, check whether the backend processing time and frontend request timeout meet expectations.
502 error
- Root cause: The Nginx Ingress works normally, but its pod cannot connect to the target backend pod.
- Solutions:
  - Consistent occurrence:
    - Possibly caused by backend Service or pod configuration errors. Check the backend Service port configuration and application code in the container.
  - Intermittent occurrence:
    - Possibly caused by high load on the Nginx Ingress Controller pod. Evaluate the load using request and connection statistics from the associated SLB instance and refer to Configure Nginx Ingress Controller for high-load scenarios to allocate more resources to the Controller.
    - Possibly caused by the backend pod actively closing sessions. The Nginx Ingress Controller enables persistent connections by default. Confirm that the backend persistent connection idle timeout is longer than the Controller's idle timeout (default 900 seconds).
  - If none of the above methods identify the issue, perform packet capture analysis.
503 error
- Root cause: The Ingress Controller cannot find backend pods or all pods are inaccessible.
- Solutions:
  - Intermittent occurrence:
    - Refer to the 502 error solutions.
    - Check the backend application readiness status and configure appropriate health checks.
  - Consistent occurrence:
    - Check whether the backend Service configuration is correct and whether Endpoints exist.
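
For 503 errors, checking whether the backend Service has ready endpoints is a quick first test. The following is a minimal sketch. `ready_endpoints` is a hypothetical helper name, and the Service name and namespace are placeholders that you supply.

```shell
# Hypothetical helper: print the IP addresses of the ready endpoints of a Service.
#   $1 = backend Service name, $2 = namespace
ready_endpoints() {
  kubectl get endpoints "$1" -n "$2" -o jsonpath='{.subsets[*].addresses[*].ip}'
}
```

Empty output means that no ready pod matches the Service selector, which is consistent with the Controller being unable to find backend pods.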

net::ERR_HTTP2_SERVER_REFUSED_STREAM error appears

Issue symptoms

When accessing a webpage, some resources fail to load correctly, and the console shows net::ERR_HTTP2_SERVER_REFUSED_STREAM or net::ERR_FAILED errors.

Root cause

The number of concurrent resource requests exceeds the HTTP/2 maximum stream limit.

Solutions

(Recommended) Adjust http2-max-concurrent-streams to a higher value (default 128) in the ConfigMap based on actual needs. For details, see http2-max-concurrent-streams.
Disable HTTP/2 support directly in the ConfigMap by setting use-http2 to false. For details, see use-http2.

Error "The param of ServerGroupName is illegal" appears

Root cause

The ServerGroupName is generated in the format namespace+svcName+port. The server group name must be 2–128 characters long, start with a letter (uppercase or lowercase) or Chinese character, and can contain digits, periods (.), underscores (_), and hyphens (-).
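
A candidate name can be checked against these rules before you create the resources. The following is a minimal sketch. `valid_server_group_name` is a hypothetical helper name, and as a simplification it accepts only ASCII letters as the first character, so a name that starts with a Chinese character is rejected here even though the rule allows it.

```shell
# Hypothetical helper: check a server group name against the rules above.
# Simplification: only ASCII letters are accepted as the first character.
# Returns 0 when the name is 2-128 characters long, starts with a letter, and
# otherwise contains only letters, digits, periods, underscores, and hyphens.
valid_server_group_name() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z][A-Za-z0-9._-]{1,127}$'
}
```

For example, `valid_server_group_name "<candidate name>" && echo ok` prints ok only for a compliant name. Because the name is built from the namespace, Service name, and port, a failure usually traces back to one of those values.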

Solution

Modify the server group name to comply with the format requirements.

Error "certificate signed by unknown authority" when creating an Ingress

[Figure: "certificate signed by unknown authority" error message]

Root cause

When creating an Ingress, the error shown in the preceding figure appears because multiple Ingress deployments share the same resources (such as Secrets, services, and webhook configurations), causing SSL certificate mismatches during webhook execution and communication with backend services.

Solution

Redeploy the two Ingress sets with non-overlapping resources. For information about resources included in Ingress, see What updates does the system perform when upgrading the Nginx Ingress Controller component in ACK component management?.

Ingress Pod health check fails and causes restarts

Issue symptoms

The Controller Pod restarts due to health check failures.

Root cause

High load on the Ingress Pod or its node causes health check failures.
Kernel parameters tcp_tw_reuse or tcp_timestamps are set on cluster nodes, which may cause health check failures.

Solutions

Scale out the Ingress Pod and observe whether the issue persists. For details, see High-availability deployment of Nginx Ingress Controller.
Disable tcp_tw_reuse or set it to 2, and simultaneously disable tcp_timestamps. Observe whether the issue persists.
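
To see whether a node already matches the recommendation above, you can read the two kernel parameters on that node. The sketch below is one way to do it; the function names and the SYSCTL_ROOT override are hypothetical and exist only to keep the check testable.

```shell
# Read an IPv4 kernel parameter. SYSCTL_ROOT is a hypothetical override;
# it defaults to the real /proc/sys tree on the node.
read_ipv4_param() {
  cat "${SYSCTL_ROOT:-/proc/sys}/net/ipv4/$1"
}

# Return 0 when tcp_tw_reuse is 0 or 2 and tcp_timestamps is 0,
# 1 when a value differs, and 2 when a parameter cannot be read.
node_params_ok() {
  reuse=$(read_ipv4_param tcp_tw_reuse) || return 2
  stamps=$(read_ipv4_param tcp_timestamps) || return 2
  case "$reuse" in
    0|2) ;;
    *) return 1 ;;
  esac
  [ "$stamps" = "0" ]
}

# Example (run on the node itself):
#   node_params_ok || echo "review tcp_tw_reuse and tcp_timestamps"
```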

Add TCP or UDP services

Add corresponding entries to the relevant ConfigMap (default is tcp-services and udp-services in the kube-system namespace).
For example, to map port 8080 of example-go in the default namespace to port 9000, use the following example.
```
apiVersion: v1
kind: ConfigMap
metadata:
  name: tcp-services
  namespace: ingress-nginx  # Use the namespace of your controller's ConfigMap (kube-system by default, as noted above).
data:
  9000: "default/example-go:8080"  # Map port 8080 to port 9000.
```
Add the mapped ports to the Ingress Deployment (default is nginx-ingress-controller in the kube-system namespace).

Add the mapped ports to the Service corresponding to the Ingress.

```
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx
  namespace: ingress-nginx
  labels:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
spec:
  type: LoadBalancer
  ports:
    - name: http
      port: 80
      targetPort: 80
      protocol: TCP
    - name: https
      port: 443
      targetPort: 443
      protocol: TCP
    - name: proxied-tcp-9000
      port: 9000
      targetPort: 9000
      protocol: TCP
  selector:
    app.kubernetes.io/name: ingress-nginx
    app.kubernetes.io/part-of: ingress-nginx
```

For more information about adding TCP and UDP services, see Expose TCP and UDP services.
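
The Deployment step above has no sample, so here is one possible way to add the port with a JSON Patch. This is a sketch under assumptions: the controller is the first container in the pod template, the Deployment is nginx-ingress-controller in kube-system, and the port name tcp-9000 is a hypothetical choice (container port names are limited to 15 characters, so the Service port name proxied-tcp-9000 cannot be reused here).

```shell
# Build a JSON Patch that appends one TCP container port.
# Assumption: the controller is container index 0 in the pod template.
build_port_patch() {
  printf '[{"op":"add","path":"/spec/template/spec/containers/0/ports/-","value":{"name":"tcp-%s","containerPort":%s,"protocol":"TCP"}}]' "$1" "$1"
}

# Hypothetical wrapper around kubectl patch; adjust the Deployment name
# and namespace if yours differ from the defaults.
add_controller_port() {
  kubectl -n kube-system patch deployment nginx-ingress-controller \
    --type json -p "$(build_port_patch "$1")"
}

# Example (not executed here):
#   add_controller_port 9000
```

Patching the Deployment triggers a rolling update of the controller pods, so apply it in a maintenance window if restarts are a concern.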

Ingress rules do not take effect

Issue symptoms

Added or modified Ingress rules do not take effect.

Root cause

The Ingress configuration contains errors that prevent new rules from loading.
The Ingress resource is misconfigured and does not match the configuration you expect.
The Ingress Controller lacks the permissions needed to watch Ingress resource changes.
An old Ingress uses the server-alias annotation to configure a domain name that conflicts with the new Ingress, so the new rules are ignored.

Solutions

Use the Ingress diagnostics tool in the intelligent operations console to diagnose and follow the prompts. For details, see Use the Ingress diagnostics feature.
Check old Ingress configurations for errors or conflicts:
- For non-rewrite-target cases where regular expressions are used in paths, confirm that nginx.ingress.kubernetes.io/use-regex: "true" is configured in the Annotation.
- Check whether PathType matches expectations (ImplementationSpecific defaults to the same behavior as Prefix).
Confirm that the ClusterRole, ClusterRoleBinding, Role, RoleBinding, and ServiceAccount associated with the Ingress Controller exist. The default names are all ingress-nginx.
Enter the Controller Pod container and check the nginx.conf file for added rules.
Run the following command to manually view container logs and identify the issue.
```
kubectl logs <ingress pod name> -n <pod namespace> | grep -E '^[EW]'
```

Some resources fail to load or a blank screen appears after rewriting to the root directory

Issue symptoms

After rewriting access using the Ingress rewrite-target annotation, some page resources fail to load or a blank screen appears.

Root cause

rewrite-target might not be configured using regular expressions.
The application hardcodes resource paths to the root directory.

Solutions

Check whether rewrite-target is used with regular expressions and capturing groups. For details, see Rewrite.
Check whether frontend requests access the correct paths.
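
To see why capturing groups matter, the sketch below imitates the rewrite with sed. It mirrors the common pairing of a path such as /app(/|$)(.*) with a rewrite-target of /$2, where /app is a hypothetical prefix. It only illustrates the substitution and is not how the Controller itself runs. If rewrite-target is a fixed / with no capturing group, every request under the prefix is rewritten to /, so static resources fail to load.

```shell
# Imitate the rewrite: the pattern has two capturing groups and the target
# keeps only the second one, so the remainder of the path is preserved.
# "/app" is a hypothetical path prefix.
simulate_rewrite() {
  printf '%s\n' "$1" | sed -E 's#^/app(/|$)(.*)#/\2#'
}

# Examples:
#   simulate_rewrite /app/static/main.js
#   simulate_rewrite /app
```

If the page still requests resources from absolute paths such as /static/main.js without the prefix, those requests never match the rewritten rule; in that case the application's base path needs to change.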

How to fix abnormal SLS log parsing after a version upgrade

Issue symptoms

The ingress-nginx-controller component has two main versions: 0.20 and 0.30. After upgrading from version 0.20 to 0.30 using the Add-ons feature in the console, the Ingress Dashboard fails to correctly display actual backend service access during phased release or blue-green deployment.

Root cause

Versions 0.20 and 0.30 have different default output formats, causing the Ingress Dashboard to fail to correctly display actual backend service access during phased release or blue-green deployment.

Solution

Perform the following steps to fix the issue by updating the nginx-configuration ConfigMap and the k8s-nginx-ingress configuration.

Update the nginx-configuration ConfigMap.

If you have not modified the nginx-configuration ConfigMap, save the following content as nginx-configuration.yaml and run the kubectl apply -f nginx-configuration.yaml command to deploy it.

```
apiVersion: v1
kind: ConfigMap
data:
  allow-backend-server-header: "true"
  enable-underscores-in-headers: "true"
  generate-request-id: "true"
  ignore-invalid-headers: "true"
  log-format-upstream: $remote_addr - [$remote_addr] - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent" $request_length $request_time [$proxy_upstream_name] $upstream_addr $upstream_response_length $upstream_response_time $upstream_status $req_id $host [$proxy_alternative_upstream_name]
  max-worker-connections: "65536"
  proxy-body-size: 20m
  proxy-connect-timeout: "10"
  reuse-port: "true"
  server-tokens: "false"
  ssl-redirect: "false"
  upstream-keepalive-timeout: "900"
  worker-cpu-affinity: auto
metadata:
  labels:
    app: ingress-nginx
  name: nginx-configuration
  namespace: kube-system
```

If you have modified the nginx-configuration ConfigMap, run the following command to edit it in place so that your changes are not overwritten:
```
kubectl edit configmap nginx-configuration -n kube-system
```

Add [$proxy_alternative_upstream_name] to the end of the log-format-upstream field, then save and exit.

Update the k8s-nginx-ingress configuration.

Save the following content as k8s-nginx-ingress.yaml and run the kubectl apply -f k8s-nginx-ingress.yaml command to deploy it.

```
apiVersion: log.alibabacloud.com/v1alpha1
kind: AliyunLogConfig
metadata:
  namespace: kube-system
  # your config name, must be unique in your k8s cluster
  name: k8s-nginx-ingress
spec:
  # logstore name to upload log
  logstore: nginx-ingress
  # product code, only for k8s nginx ingress
  productCode: k8s-nginx-ingress
  # logtail config detail
  logtailConfig:
    inputType: plugin
    # logtail config name, should be same with [metadata.name]
    configName: k8s-nginx-ingress
    inputDetail:
      plugin:
        inputs:
        - type: service_docker_stdout
          detail:
            IncludeLabel:
              io.kubernetes.container.name: nginx-ingress-controller
            Stderr: false
            Stdout: true
        processors:
        - type: processor_regex
          detail:
            KeepSource: false
            Keys:
            - client_ip
            - x_forward_for
            - remote_user
            - time
            - method
            - url
            - version
            - status
            - body_bytes_sent
            - http_referer
            - http_user_agent
            - request_length
            - request_time
            - proxy_upstream_name
            - upstream_addr
            - upstream_response_length
            - upstream_response_time
            - upstream_status
            - req_id
            - host
            - proxy_alternative_upstream_name
            NoKeyError: true
            NoMatchError: true
            Regex: ^(\S+)\s-\s\[([^]]+)]\s-\s(\S+)\s\[(\S+)\s\S+\s"(\w+)\s(\S+)\s([^"]+)"\s(\d+)\s(\d+)\s"([^"]*)"\s"([^"]*)"\s(\S+)\s(\S+)+\s\[([^]]*)]\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s(\S+)\s*(\S*)\s*\[*([^]]*)\]*.*
            SourceKey: content
```

Error "cannot list/get/update resource" appears

Issue symptoms

Using the method described in Check error logs in the Controller Pod, you find Controller error logs in the Pod similar to the following:

User "system:serviceaccount:kube-system:ingress-nginx" cannot list/get/update resource "xxx" in API group "xxx" at the cluster scope/ in the namespace "kube-system"

Root cause

The Nginx Ingress Controller lacks permissions to update related resources.

Solutions

Confirm from the logs whether the issue stems from a ClusterRole or Role.
- If the log contains at the cluster scope, the issue originates from the ClusterRole (ingress-nginx).
- If the log contains in the namespace "kube-system", the issue originates from the Role (kube-system/ingress-nginx).
Confirm that the corresponding permissions and bindings are complete.
- For ClusterRole:
  - Ensure that the ClusterRole ingress-nginx and ClusterRoleBinding ingress-nginx exist. If they do not exist, consider recreating them, restoring them, or reinstalling the component.
  - Ensure that the ClusterRole ingress-nginx includes the permissions mentioned in the logs (for example, List permission for networking.k8s.io/ingresses). If the permission is missing, add it manually to the ClusterRole.
- For Role:
  - Confirm that the Role kube-system/ingress-nginx and RoleBinding kube-system/ingress-nginx exist. If they do not exist, consider recreating them, restoring them, or reinstalling the component.
  - Confirm that the Role ingress-nginx includes the permissions mentioned in the logs (for example, Update permission for the ConfigMap ingress-controller-leader-nginx). If the permission is missing, add it manually to the Role.
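
The triage above can be sketched in shell. The first function classifies a log line by the scope phrase it contains; the second asks the API server whether the controller's ServiceAccount may perform an action, using kubectl auth can-i with impersonation. Both function names are hypothetical, the ServiceAccount name is taken from the sample log, and your own user needs impersonation rights for the second helper to work.

```shell
# Classify a permission error line by the scope phrase it contains.
rbac_scope_of() {
  case "$1" in
    *"at the cluster scope"*) echo "ClusterRole" ;;
    *"in the namespace"*)     echo "Role" ;;
    *)                        echo "unknown" ;;
  esac
}

# Hypothetical helper: check a verb and resource in kube-system as the
# controller's ServiceAccount (name assumed from the sample log above).
controller_can() {
  kubectl auth can-i "$1" "$2" -n kube-system \
    --as=system:serviceaccount:kube-system:ingress-nginx
}

# Example (not executed here):
#   controller_can update configmaps
```

For permissions reported at the cluster scope, the same check can be repeated with --all-namespaces instead of -n kube-system.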

Error "configuration file failed" appears

Issue symptoms

Using the method described in Check error logs in the Controller Pod, you find Controller error logs in the Pod similar to the following:

requeuing ... nginx: configuration file xxx test failed (multiple lines)

Root cause

Configuration errors cause Nginx configuration reload to fail, usually due to syntax errors in Ingress rules or Snippets inserted in the ConfigMap.

Solutions

Check the error messages in the logs (ignore warn-level messages) to roughly locate the issue. If the error message is unclear, use the line number in the error to enter the Pod and view the corresponding file. In the following example, the file is /tmp/nginx/nginx-cfg2825306115, line 449.
Run the following command to check for configuration errors near the specified line.
```
# Enter the Pod to run commands.
kubectl exec -n <namespace> <controller pod name> -it -- bash
# View the erroneous file with line numbers and check for configuration errors near the specified line.
cat -n /tmp/nginx/nginx-cfg2825306115
```
Based on the error message and configuration file, identify and fix the error according to your actual configuration.
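
Scrolling through the whole file is tedious when it is large. As an alternative, the following sketch prints only the lines around the reported line number. The function name and the default radius of 5 lines are hypothetical choices.

```shell
# Print the lines around a given line number, each prefixed with its number.
# Usage: show_context <file> <line> [radius]
show_context() {
  file=$1
  line=$2
  radius=${3:-5}
  awk -v l="$line" -v r="$radius" \
    'NR >= l - r && NR <= l + r { printf "%d\t%s\n", NR, $0 }' "$file"
}

# Example (not executed here), after copying the file out of the Pod:
#   kubectl exec -n <namespace> <controller pod name> -- \
#     cat /tmp/nginx/nginx-cfg2825306115 > cfg.txt
#   show_context cfg.txt 449
```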

Error "Unexpected error validating SSL certificate" appears

Issue symptoms

Using the method described in Check error logs in the Controller Pod, you find Controller error logs in the Pod similar to the following:

Unexpected error validating SSL certificate "xxx" for server "xxx"

Root cause

The certificate is misconfigured. A common cause is that the domain names in the certificate do not match those configured in the Ingress. Some Warning-level logs (such as a missing SAN extension in the certificate) do not affect normal certificate use, so decide whether a real issue exists based on your situation.

Solution

Check certificate issues in the cluster based on the error content.

Verify that the cert and key formats and contents are correct.
Verify that the domain names in the certificate match those configured in the Ingress.
Check whether the certificate has expired.
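
Two of these checks can be scripted with openssl once the certificate and key are saved locally. The sketch below assumes PEM files and an openssl binary on your workstation; the function names are hypothetical.

```shell
# Return 0 if the certificate has not expired yet.
cert_not_expired() {
  openssl x509 -in "$1" -noout -checkend 0 >/dev/null 2>&1
}

# Return 0 if the certificate and the private key carry the same public key,
# 1 if they differ, and 2 if either file cannot be parsed.
cert_matches_key() {
  cert_pub=$(openssl x509 -in "$1" -noout -pubkey 2>/dev/null) || return 2
  key_pub=$(openssl pkey -in "$2" -pubout 2>/dev/null) || return 2
  [ -n "$cert_pub" ] && [ "$cert_pub" = "$key_pub" ]
}

# Example (not executed here), with files extracted from the TLS Secret:
#   cert_not_expired tls.crt && cert_matches_key tls.crt tls.key
```

To compare domain names, openssl x509 -in tls.crt -noout -text prints the subject and any Subject Alternative Name entries, which you can check against the hosts in the Ingress.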

Issue with many uncleaned configuration files in the Controller

Issue symptoms

In Nginx Ingress Controller versions earlier than 1.10, a known bug exists. Normally, generated nginx-cfg files should be cleaned up promptly. However, when Ingress configuration errors cause the final rendered nginx.conf to be invalid, these erroneous configuration files are not cleaned up as expected, leading to gradual accumulation of nginx-cfgxxx files and excessive disk space consumption.

Root cause

The cleanup logic is flawed. Although correctly generated configuration files are properly cleaned up, the mechanism fails to work for invalid configuration files, causing them to remain in the system. For details, see Community GitHub Issue #11568.

Solutions

To resolve this issue, consider the following options.

Upgrade the Nginx Ingress Controller: Upgrade to version 1.10 or later. For details, see Upgrade the Nginx Ingress Controller component.
Manually clean up old files: Periodically delete uncleaned nginx-cfgxxx files. You can write a script to automate this process and reduce manual effort.
Check configuration errors: Carefully verify the correctness of new Ingress configurations before applying them to avoid generating invalid configuration files.
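
For the manual cleanup option, a script along the following lines might be enough. It is a sketch, not a vetted tool: it assumes the files live in /tmp/nginx (the directory shown in the earlier example), that anything older than the chosen age is safe to delete, and that the find implementation supports -maxdepth, -mmin, and -delete. The function name and the 60-minute default are hypothetical.

```shell
# Delete nginx-cfg* files in a directory that were last modified more than
# the given number of minutes ago, printing each deleted path.
# Usage: clean_stale_cfg [dir] [minutes]
clean_stale_cfg() {
  dir=${1:-/tmp/nginx}
  minutes=${2:-60}
  find "$dir" -maxdepth 1 -type f -name 'nginx-cfg*' \
    -mmin +"$minutes" -print -delete
}

# Example (run inside the Controller Pod):
#   clean_stale_cfg /tmp/nginx 60
```

The age threshold is there so that a file the Controller is currently testing is left alone.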

Troubleshoot persistent Pending status of pods after Controller upgrade

Issue symptoms

During Nginx Ingress Controller upgrades, pods may fail to schedule and remain in the Pending state for an extended period.

Root cause

During Nginx Ingress Controller upgrades, the default node affinity and pod anti-affinity rules can prevent new pods from being scheduled when the cluster does not have enough nodes or resources that satisfy them.

Run the following commands to view the specific cause:

```
kubectl -n kube-system describe pod <pending-pod-name>
kubectl -n kube-system get events
```

Solutions

To resolve this issue, consider the following options.

Scale out cluster resources: Add new nodes to meet affinity rule requirements. For details, see Manually scale node pools.

Adjust affinity rules: In resource-constrained scenarios, run the kubectl edit deploy nginx-ingress-controller -n kube-system command to relax anti-affinity requirements, allowing pods to schedule on the same node. This approach may reduce component high availability.

Configuration example:

```
      affinity:
        podAntiAffinity:
          # Replace with preferredDuringSchedulingIgnoredDuringExecution. Each preferred
          # entry also needs a weight and nests the selector under podAffinityTerm.
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - ingress-nginx
              topologyKey: "kubernetes.io/hostname"
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              # virtual nodes have this label
              - key: type
                operator: NotIn
                values:
                - virtual-kubelet
          preferredDuringSchedulingIgnoredDuringExecution:
          - preference:
              matchExpressions:
              # autoscaled nodes have this label
              - key: k8s.aliyun.com
                operator: NotIn
                values:
                - "true"
            weight: 100
```

TCP stream mixing issue with multiple CLBs under high concurrency in Flannel CNI + IPVS clusters

Issue symptoms

In ACK clusters using Flannel CNI and IPVS network mode, if the Nginx Ingress Controller is bound to multiple load balancers (CLBs), TCP stream mixing may occur under high concurrency. Packet capture reveals the following anomalies.

Message retransmission
Abnormal TCP connection resets

Root cause

In ACK clusters configured with the Flannel network plugin, CLBs forward traffic to the NodePort of the node where the Nginx Ingress Controller resides. However, when multiple Services use different NodePorts, IPVS session conflicts may occur under high concurrency.

Solutions

Single load balancer strategy: Create only one LoadBalancer Service for the Nginx Ingress Controller. Bind other CLBs manually to the node's NodePort to reduce conflict likelihood.
Avoid multiple active NodePorts: On the same node, avoid having multiple NodePorts active simultaneously to reduce IPVS session conflict risks.
