This topic describes the diagnostic workflow, troubleshooting approaches, common inspection methods, and solutions for DNS resolution issues.
Diagnostic workflow
Important notes
DNS troubleshooting in Kubernetes is complex because the network architecture is multilayered and dynamic. DNS best practices provide recommendations for different scenarios. Following these recommendations helps you configure DNS more effectively and reduces the likelihood of DNS-related issues.
In addition to errors in the CoreDNS and NodeLocal DNSCache components, the following factors can also cause DNS resolution failures.
-
Multilayered network architecture
The DNS resolution path involves multiple components, including CoreDNS/kube-dns, kube-proxy, and CNI plug-ins. A failure at any layer can cause issues. Therefore, troubleshoot each component step by step to identify the root cause. For the complete troubleshooting path, see Troubleshoot other components in the DNS path.
-
Service discovery mechanism and namespace obscurity
-
FQDN dependency: Accessing services across namespaces requires using the full domain name (for example, service.namespace.svc.cluster.local). If the namespace is not specified in the domain name, DNS searches only within the current namespace. Cross-namespace access fails without clear error messages.
-
Headless Services behavior: Headless Services return pod IPs directly. Improper configuration can result in incomplete or missing DNS records.
-
-
Network policy restrictions
-
Implicit blocking: If the Pod NetworkPolicy does not allow traffic on DNS ports (default UDP and TCP port 53), the pod cannot communicate with CoreDNS.
-
VPC security group interference: Internal firewall rules or security group rules, especially VPC security group configurations, might drop DNS traffic.
-
Troubleshooting approach: Check network connectivity for CoreDNS pods in the kube-system namespace and verify that policies allow inbound and outbound traffic.
-
-
Limitations of debugging tools and logs
-
Missing tools: Container images do not include dig or nslookup by default. Install them manually or use a temporary debug container.
-
Scattered logs: To view DNS query logs, you must manually enable Debug mode in CoreDNS by adding the log plug-in. The logs are distributed across multiple CoreDNS replicas, so check each replica.
-
Debugging tip: Run quick DNS tests from a temporary pod:
kubectl run -it --rm debug --image=nicolaka/netshoot -- dig <target-domain>
or, in place of dig, use:
nslookup <target-domain>
-
Terms
-
Cluster-internal domain names: CoreDNS exposes services in the cluster as cluster-internal domain names, which end with .cluster.local by default. CoreDNS resolves these domain names using its internal cache and does not query upstream DNS servers.
-
Cluster-external domain names: Domain names whose authoritative records are registered with third-party DNS providers, Alibaba Cloud DNS (Cloud DNS), PrivateZone, or similar products. Upstream DNS servers resolve these domain names, and CoreDNS only forwards the resolution requests.
-
Application pod: A container pod you deploy in a Kubernetes cluster, excluding Kubernetes system component containers.
-
Application pod connected to CoreDNS: An application pod whose DNS server points to CoreDNS.
-
Application pod connected to NodeLocal DNSCache: An application pod into which DNSConfig is injected, automatically or manually, after the NodeLocal DNSCache plug-in is installed in the cluster. These pods query the local cache component first for domain name resolution. If the local cache component is unreachable, they fall back to the kube-dns service provided by CoreDNS.
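As a rough illustration of the last two terms, the following sketch classifies a pod by the first nameserver entry in its /etc/resolv.conf. The addresses 169.254.20.10 (local cache) and 172.21.0.10 (kube-dns service) are taken from the dnsConfig example in this topic and are assumptions for any other cluster. The function names are hypothetical.

```python
# Assumed addresses; replace them with the values used in your cluster.
NODELOCAL_DNS_IP = "169.254.20.10"  # NodeLocal DNSCache address (assumption)
KUBE_DNS_SVC_IP = "172.21.0.10"     # kube-dns Service ClusterIP (assumption)


def parse_nameservers(resolv_conf: str) -> list[str]:
    """Return the nameserver addresses in the order they appear."""
    servers = []
    for line in resolv_conf.splitlines():
        # Drop trailing comments, then split on whitespace.
        fields = line.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def classify_pod(resolv_conf: str) -> str:
    """Classify a pod by the first DNS server it queries."""
    servers = parse_nameservers(resolv_conf)
    if not servers:
        return "no nameserver configured"
    if servers[0] == NODELOCAL_DNS_IP:
        return "connected to NodeLocal DNSCache"
    if servers[0] == KUBE_DNS_SVC_IP:
        return "connected to CoreDNS"
    return "uses another DNS server"
```

Pass the output of cat /etc/resolv.conf from inside the pod to classify_pod.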
CoreDNS and NodeLocal DNSCache troubleshooting workflow

-
Identify the current cause of the issue. For details, see Common client errors.
-
If the error indicates that the domain name does not exist, see "Troubleshoot by domain name type" in Troubleshooting approaches.
-
If the error indicates that the DNS server is unreachable, see "Troubleshoot by issue frequency" in Troubleshooting approaches.
-
-
If the preceding steps yield no results, follow these steps.
-
Check whether the application pod’s DNS configuration connects to CoreDNS. For details, see Check the DNS configuration of application pods.
-
If the pod does not connect to CoreDNS, the issue might be caused by client-side load or a full Conntrack table. For details, see Resolution failures caused by client-side load and Full Conntrack table.
-
If the pod connects to CoreDNS, follow these steps to troubleshoot.
-
Diagnose by checking the CoreDNS pod status. For details, see Check CoreDNS pod status and Abnormal CoreDNS pod status.
-
Diagnose by checking CoreDNS operational logs. For details, see Check CoreDNS operational logs and Cluster-external domain name resolution issues.
-
Determine whether the issue can be consistently reproduced.
-
If the issue consistently reproduces, see Check CoreDNS DNS query request logs and Check network connectivity between application pods and CoreDNS.
-
If the issue does not consistently reproduce, see Capture packets.
-
-
-
-
If you use NodeLocal DNSCache, see NodeLocal DNSCache not taking effect and PrivateZone domain name resolution issues.
-
Common client errors
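| Client | Error log | Possible issue |
| ping | | The domain name does not exist or the DNS server is unreachable. If resolution latency exceeds 5 seconds, the DNS server is likely unreachable. |
| curl | | The domain name does not exist or the DNS server is unreachable. If resolution latency exceeds 5 seconds, the DNS server is likely unreachable. |
| PHP HTTP client | | The domain name does not exist or the DNS server is unreachable. If resolution latency exceeds 5 seconds, the DNS server is likely unreachable. |
| Golang HTTP client | | The domain name does not exist. |
| dig | | The domain name does not exist. |
| Golang HTTP client | | The DNS server is unreachable. |
| dig | | The DNS server is unreachable. |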
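The mapping from client error to possible issue can be sketched as a small classifier. The message fragments below are typical strings printed by these clients and are assumptions, not an exhaustive or authoritative list. The 5-second threshold comes from the table above. The function name is hypothetical.

```python
from typing import Optional

# Typical message fragments (assumptions, not an exhaustive list).
NOT_FOUND_MARKERS = ("no such host", "nxdomain")
UNREACHABLE_MARKERS = ("i/o timeout", "no servers could be reached")
AMBIGUOUS_MARKERS = ("name or service not known", "could not resolve host")


def classify_client_error(message: str, latency_seconds: Optional[float] = None) -> str:
    """Map a client error message to the possible issue it suggests."""
    lowered = message.lower()
    if any(marker in lowered for marker in NOT_FOUND_MARKERS):
        return "domain name does not exist"
    if any(marker in lowered for marker in UNREACHABLE_MARKERS):
        return "DNS server is unreachable"
    if any(marker in lowered for marker in AMBIGUOUS_MARKERS):
        # ping, curl, and similar clients do not distinguish the two causes,
        # so fall back to the latency heuristic from the table.
        if latency_seconds is not None and latency_seconds > 5:
            return "DNS server is likely unreachable"
        return "domain name does not exist or DNS server is unreachable"
    return "unknown"
```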
Troubleshoot other components in the DNS path
The following diagram shows the overall DNS resolution path. Components other than CoreDNS and NodeLocal DNSCache can also cause DNS resolution failures:
-
DNS Resolver: Programming languages such as Go and libraries such as glibc and musl may contain defects in their DNS resolution implementations, leading to occasional DNS resolution failures.
-
/etc/resolv.conf file: The DNS configuration file in containers contains DNS server IPs and DNS search domains. Incorrect configuration of this file causes DNS resolution failures.
-
kube-proxy: kube-proxy uses IPVS or iptables to forward requests. If kube-proxy does not update its forwarding rules promptly when the CoreDNS configuration or pods change, CoreDNS becomes inaccessible, causing intermittent DNS resolution failures.
-
Upstream DNS Servers: CoreDNS resolves only cluster-internal domain names. For domain names that do not match the clusterDomain, CoreDNS queries upstream DNS servers, such as the VPC internal DNS servers. Misconfiguration of upstream DNS servers causes pods to fail when accessing non-cluster domain names.
Troubleshooting approaches
| Troubleshooting approach | Basis for troubleshooting | Issues and solutions |
| Troubleshoot by domain name type | Both cluster-internal and cluster-external domain names fail | |
| Troubleshoot by domain name type | Only cluster-external domain names fail | |
| Troubleshoot by domain name type | Only PrivateZone or vpc-proxy domain names fail | |
| Troubleshoot by domain name type | Only Headless service domain names fail | |
| Troubleshoot by issue frequency | Complete resolution failure | |
| Troubleshoot by issue frequency | Issues occur only during peak business hours | |
| Troubleshoot by issue frequency | Issues occur very frequently | |
| Troubleshoot by issue frequency | Issues occur very infrequently | |
| Troubleshoot by issue frequency | Issues occur only during node scaling or CoreDNS scale-in | DNS resolution failures after CoreDNS pod anomalies in IPVS mode |
Common inspection methods
Check the DNS configuration of application pods
-
Command
# View the YAML configuration of the foo pod and confirm that the dnsPolicy field meets expectations.
kubectl get pod foo -o yaml
# If dnsPolicy meets expectations, enter the pod container to check the effective DNS configuration.
# Enter the foo container using bash. If bash is unavailable, use sh instead.
kubectl exec -it foo -- bash
# After entering the container, view the DNS configuration. The nameserver entry shows the DNS server address.
cat /etc/resolv.conf
-
DNS Policy configuration description
The following examples show DNS Policy configurations. Choose the appropriate configuration based on your scenario:
Example 1: DNS Policy configuration for default scenarios
apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
  namespace: <pod-namespace>
spec:
  containers:
  - image: <container-image>
    name: <container-name>
  dnsPolicy: ClusterFirst
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
Example 2: DNS Policy configuration when using NodeLocal DNSCache
apiVersion: v1
kind: Pod
metadata:
  name: <pod-name>
  namespace: <pod-namespace>
spec:
  containers:
  - image: <container-image>
    name: <container-name>
  dnsPolicy: None
  dnsConfig:
    nameservers:
    - 169.254.20.10
    - 172.21.0.10
    options:
    - name: ndots
      value: "3"
    - name: timeout
      value: "1"
    - name: attempts
      value: "2"
    searches:
    - default.svc.cluster.local
    - svc.cluster.local
    - cluster.local
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
| DNSPolicy value | DNS server used |
| Default | Applies only to scenarios where cluster-internal services are not accessed. When the pod is created, it inherits the DNS server list from the /etc/resolv.conf file of the ECS node. |
| ClusterFirst | This is the default DNSPolicy value. The pod uses the kube-dns service IP provided by CoreDNS as the DNS server. Pods with HostNetwork enabled behave like Default mode when using ClusterFirst. |
| ClusterFirstWithHostNet | Pods with HostNetwork enabled behave like ClusterFirst when using ClusterFirstWithHostNet. |
| None | Use with DNSConfig to customize DNS servers and parameters. When NodeLocal DNSCache injection is enabled, DNSConfig points the DNS server to the local cache IP and the kube-dns service IP provided by CoreDNS. |
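The ndots and searches settings in Example 2 determine which names a pod actually sends to the DNS server. The following sketch approximates the glibc lookup order and is not an exact reimplementation. Other resolvers, such as musl, behave differently. The function name is hypothetical.

```python
def candidate_names(name: str, searches: list[str], ndots: int) -> list[str]:
    """Return the fully qualified names tried for `name`, in order."""
    # A trailing dot marks the name as absolute: no search suffix is appended.
    if name.endswith("."):
        return [name]
    expanded = [f"{name}.{domain}." for domain in searches]
    absolute = name + "."
    # With at least `ndots` dots the name is tried as-is first;
    # otherwise the search domains are tried first.
    if name.count(".") >= ndots:
        return [absolute] + expanded
    return expanded + [absolute]


if __name__ == "__main__":
    searches = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
    for query in ("redis-master", "www.aliyun.com", "www.aliyun.com."):
        print(query, "->", candidate_names(query, searches, ndots=3))
```

With ndots set to 3, a cluster-external name such as www.aliyun.com has only two dots, so it is first tried with each search suffix. This is why names with search suffixes appear in CoreDNS query logs with NXDOMAIN before the final lookup succeeds.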
Check CoreDNS pod status
Command
-
Run the following command to view pod information.
kubectl -n kube-system get pod -o wide -l k8s-app=kube-dns
Expected output:
NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE
coredns-xxxxxxxxx-xxxxx   1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198
-
Run the following command to view real-time resource usage of pods.
kubectl -n kube-system top pod -l k8s-app=kube-dns
Expected output:
NAME                      CPU(cores)   MEMORY(bytes)
coredns-xxxxxxxxx-xxxxx   3m           18Mi
-
If the pod is not in the Running state, run kubectl -n kube-system describe pod <CoreDNS Pod name> to identify the issue.
Check CoreDNS operational logs
Command
Run the following command to check CoreDNS operational logs.
kubectl -n kube-system logs -f --tail=500 --timestamps coredns-xxxxxxxxx-xxxxx
| Parameter | Description |
| -f | Continuous output. |
| --tail=500 | Output the last 500 lines of logs. |
| --timestamps | Display timestamps alongside logs. |
| coredns-xxxxxxxxx-xxxxx | Name of the CoreDNS pod replica. |
Check CoreDNS DNS query request logs
Command
DNS query request logs appear in container logs only after enabling the Log plug-in in CoreDNS. For instructions on enabling the Log plug-in, see Unmanaged CoreDNS configuration.
The command is the same as for checking CoreDNS operational logs. See Check CoreDNS operational logs.
Check CoreDNS pod network connectivity
You can use the console or command line to check CoreDNS pod network connectivity.
Console
Use the network diagnostics capabilities provided by the cluster.
-
Log on to the ACK console. In the navigation pane on the left, click Clusters.
-
On the Clusters page, click the name of your target cluster. In the navigation pane on the left, open the Diagnostics page.
-
On the Diagnostics page, click the Network diagnostics tab, and then click Diagnose in the upper-left corner.
-
On the Network diagnostics page, click Diagnose. In the Access Information panel, fill in the diagnostic parameters as follows:
-
Source address: Enter the CoreDNS pod IP.
-
Destination address: Enter the upstream DNS server address. The defaults are 100.100.2.136 and 100.100.2.138.
-
Port: 53
-
Protocol: udp
After filling in the parameters, carefully read the notes, select I acknowledge and agree, then click Start Diagnosis.
-
-
On the Diagnosis Results page, view the network diagnosis results. In the Access Overview section, the full access path of this diagnosis is displayed.
In this example, the diagnosis result states, "No obvious issues found. Please further analyze based on diagnostic items or submit a ticket." The access path is kube-system/coredns Pod → ECS node (cn-hangzhou.172.xxx.xxx.240) → target DNS server (100.100.2.136).
Command line
Procedure
-
Log on to the cluster node where the CoreDNS pod resides.
-
Run ps aux | grep coredns to query the CoreDNS process ID.
-
Run nsenter -t <pid> -n -- <command> to run a command in the container network namespace where CoreDNS resides. Replace <pid> with the coredns process ID obtained in the previous step.
-
Test network connectivity.
-
Run telnet <apiserver_clusterip> 6443 to test connectivity to the Kubernetes API Server, where <apiserver_clusterip> is the IP address of the kubernetes Service in the default namespace.
-
Run dig <domain> @<upstream_dns_server_ip> to test connectivity from the CoreDNS pod to the upstream DNS server. Replace <domain> with the test domain name and <upstream_dns_server_ip> with the upstream DNS server address. The default addresses are 100.100.2.136 and 100.100.2.138.
-
Common issues
| Symptom | Cause | Solution |
| CoreDNS cannot connect to the Kubernetes API Server | API Server anomalies, high machine load, or kube-proxy not running properly. | Submit a ticket for troubleshooting. |
| CoreDNS cannot connect to the upstream DNS server | High machine load, CoreDNS misconfiguration, or leased line routing issues. | Submit a ticket for troubleshooting. |
Check network connectivity between application pods and CoreDNS
You can use the console or command line to check network connectivity between application pods and CoreDNS.
Console
-
Log on to the ACK console. In the navigation pane on the left, click Clusters.
-
On the Clusters page, click the name of your target cluster. In the navigation pane on the left, open the Diagnostics page.
-
On the Diagnostics page, click the Network diagnostics tab, and then click Diagnose in the upper-left corner.
-
On the Network diagnostics page, click Diagnose. In the Access Information panel, fill in the diagnostic parameters as follows:
-
Source address: Enter the application pod IP.
-
Destination address: Enter the CoreDNS instance PodIP or ClusterIP.
-
Port: 53
-
Protocol: udp
After filling in the parameters, carefully read the notes, select I acknowledge and agree, then click Start Diagnosis.
-
-
On the Diagnosis Results page, view the network diagnosis results. In the Access Overview section, the full access path of this diagnosis is displayed.
In this example, the diagnosis result shows a FATAL record. The node is cn-hangzhou.172.xx.0.240, and the diagnosis content is an invalid route:
invalid route "0.0.0.0/0 dev eth1 via 172.16.3.253 scope universe type unicast" for packet (src=172.16.1.45, dst=172.16.1.3). The expected route is dev: calibb5fee8d7c0 scope: link type: unicast.
The network topology in the Access Overview section shows the access path from the nginx pod through the faulty node cn-hangzhou.172.xx.0.240 (highlighted in red) to two coredns pods, clearly indicating the FATAL fault location.
Command line
Procedure
-
Choose one of the following methods to enter the client pod container network.
-
Method 1: Use the kubectl exec command.
-
Method 2:
-
Log on to the cluster node where the application pod resides.
-
Run ps aux | grep <application-process-name> to query the application container process ID.
-
Run nsenter -t <pid> -n bash to enter the container network namespace where the application pod resides. Replace <pid> with the process ID obtained in the previous step.
-
-
Method 3: If frequent restarts occur, follow these steps.
-
Log on to the cluster node where the application pod resides.
-
Run docker ps -a | grep <application-container-name> to find the sandbox container whose name starts with k8s_POD_ and record its container ID.
-
Run docker inspect <sandbox-container-ID> | grep netns to find the container network namespace path, such as /var/run/docker/netns/xxxx.
-
Run nsenter -n<netns-path> bash to enter the container network namespace. Replace <netns-path> with the path obtained in the previous step.
Note: Do not add a space between -n and <netns-path>.
-
-
-
Test network connectivity.
-
Run dig <domain> @<kube_dns_svc_ip> to test DNS resolution from the application pod through the CoreDNS kube-dns service. Replace <domain> with the test domain name and <kube_dns_svc_ip> with the kube-dns service IP in the kube-system namespace.
-
Run ping <coredns_pod_ip> to test connectivity from the application pod to the CoreDNS pod replica. Replace <coredns_pod_ip> with the CoreDNS pod IP in the kube-system namespace.
-
Run dig <domain> @<coredns_pod_ip> to test DNS resolution from the application pod through the CoreDNS pod replica. Replace <domain> with the test domain name and <coredns_pod_ip> with the CoreDNS pod IP in the kube-system namespace.
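When dig is not available in the container but a Python interpreter is, a DNS query over UDP can be sent with the standard library alone. The following is a minimal sketch under that assumption. It builds a single A query, reads only the header of the reply, and does not parse answers. All function names are hypothetical, and the server address you pass in is whatever kube-dns service IP or CoreDNS pod IP you are testing.

```python
import socket
import struct

RCODE_NAMES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}


def build_query(domain: str, query_id: int = 0x1234) -> bytes:
    """Build a DNS query for one A record with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = domain.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def parse_flags(response: bytes) -> tuple[str, bool]:
    """Return (rcode name, truncated flag) from a DNS response header."""
    flags = struct.unpack("!H", response[2:4])[0]
    rcode = flags & 0x000F
    truncated = bool(flags & 0x0200)  # the tc bit
    return RCODE_NAMES.get(rcode, str(rcode)), truncated


def query_udp(server: str, domain: str, timeout: float = 2.0) -> tuple[str, bool]:
    """Send the query to server:53 over UDP; raises socket.timeout if no reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(domain), (server, 53))
        data, _ = sock.recvfrom(4096)
    return parse_flags(data)
```

A timeout from query_udp suggests that the DNS server is unreachable, while a returned NXDOMAIN or SERVFAIL means that the server answered and the problem lies in resolution itself.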
-
Common issues
| Symptom | Cause | Solution |
| Application pod cannot resolve through CoreDNS kube-dns service | High machine load, kube-proxy not running properly, or security group not allowing UDP port 53. | Check whether the security group allows UDP port 53. If it does, submit a ticket for troubleshooting. |
| Application pod cannot connect to CoreDNS pod replica | Container network issues or security group not allowing ICMP. | Check whether the security group allows ICMP. If it does, submit a ticket for troubleshooting. |
| Application pod cannot resolve through CoreDNS pod replica | High machine load or security group not allowing UDP port 53. | Check whether the security group allows UDP port 53. If it does, submit a ticket for troubleshooting. |
Capture packets
When you cannot locate the issue, capture packets for auxiliary diagnosis.
-
Log on to the node where the problematic application pod or CoreDNS pod resides.
-
On the ECS instance (outside the container), run the following command to capture all port 53 traffic into a file.
tcpdump -i any port 53 -C 20 -W 200 -w /tmp/client_dns.pcap
-
Locate the exact packet information corresponding to the error time in the application logs.
Note
-
Under normal conditions, packet capture has no impact on business operations and only slightly increases CPU load and disk writes.
-
The preceding command rotates captured packets, writing up to 200 files of 20 MB each (.pcap files).
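Before starting a long capture, it can help to confirm that the target directory has room for the full rotation. The following sketch does the arithmetic for the -C and -W values. tcpdump counts -C in units of 1,000,000 bytes. The function names are hypothetical.

```python
import shutil


def capture_budget_bytes(file_size_millions: int, file_count: int) -> int:
    """Maximum disk space used by `tcpdump -C <size> -W <count>`."""
    # tcpdump measures -C in millions of bytes (1,000,000), not MiB.
    return file_size_millions * 1_000_000 * file_count


def fits_on_disk(directory: str, needed_bytes: int) -> bool:
    """Check whether `directory` has at least `needed_bytes` free."""
    return shutil.disk_usage(directory).free >= needed_bytes


if __name__ == "__main__":
    needed = capture_budget_bytes(20, 200)
    print("capture needs up to", needed, "bytes")
    print("fits in /tmp:", fits_on_disk("/tmp", needed))
```

For the command above, the rotation can occupy up to 4,000,000,000 bytes (about 4 GB).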
-
Cluster-external domain name resolution issues
Issue description
Application pods can resolve cluster-internal domain names normally but cannot resolve certain cluster-external domain names.
Root cause
The upstream server returns abnormal DNS resolution responses.
Solution
Check CoreDNS DNS query request logs.
Common request logs
CoreDNS logs a line after receiving a request and replying to the client. Example:
# The status code RCODE NOERROR indicates successful resolution.
[INFO] 172.20.2.25:44525 - 36259 "A IN redis-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000116946s
Common RCODE return codes
For details on RCODE definitions, see the specification.
| Return Code (RCODE) | Meaning | Cause |
| NXDOMAIN | Domain name does not exist | Inside containers, requested domain names are appended with search suffixes. If the resulting domain name does not exist, this RCODE appears. If the requested domain name in the logs exists, an anomaly is present. |
| SERVFAIL | Upstream server anomaly | Commonly occurs when the upstream DNS server is unreachable. |
| REFUSED | Response denied | Commonly occurs when the upstream DNS server configured in CoreDNS or the cluster node’s /etc/resolv.conf file cannot handle the domain name. Check the CoreDNS configuration file. |
When CoreDNS DNS query request logs show NXDOMAIN, SERVFAIL, or REFUSED for cluster-external domain names, the upstream DNS server is returning abnormal responses.
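To find such entries in a large log, the query log lines can be parsed and filtered. The sketch below assumes the log line layout shown in the example above and a cluster domain of cluster.local. The regular expression and the function names are hypothetical helpers, not part of CoreDNS.

```python
import re
from typing import Optional

# Matches the query log layout shown in the example above (assumed format).
LOG_PATTERN = re.compile(
    r'\[INFO\] (?P<client>\S+) - (?P<id>\d+) '
    r'"(?P<qtype>\S+) (?P<qclass>\S+) (?P<name>\S+) (?P<proto>\S+) '
    r'(?P<size>\d+) (?P<do>\S+) (?P<bufsize>\d+)" '
    r'(?P<rcode>\S+) (?P<flags>\S+) (?P<rsize>\d+) (?P<duration>[\d.]+)s'
)
ABNORMAL_RCODES = {"NXDOMAIN", "SERVFAIL", "REFUSED"}


def parse_query_log(line: str) -> Optional[dict]:
    """Return the fields of one query log line, or None if it does not match."""
    match = LOG_PATTERN.search(line)
    return match.groupdict() if match else None


def is_abnormal_external(entry: dict, cluster_domain: str = "cluster.local") -> bool:
    """True for an abnormal RCODE on a name outside the cluster domain."""
    name = entry["name"].rstrip(".")
    internal = name == cluster_domain or name.endswith("." + cluster_domain)
    return entry["rcode"] in ABNORMAL_RCODES and not internal
```

Names that carry a search suffix, such as www.aliyun.com.default.svc.cluster.local., end with the cluster domain and are therefore ignored, which filters out the expected NXDOMAIN noise described in the table.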
By default, the upstream DNS servers for CoreDNS in the cluster are the VPC-provided DNS servers (100.100.2.136 and 100.100.2.138). You can submit a ticket to Elastic Compute Service (ECS). Include the following information when submitting the ticket.
| Field | Description | Example |
| Affected domain name | Cluster-external domain name with abnormal RCODE in CoreDNS logs | www.aliyun.com |
| Return code (RCODE) | Specific resolution error (NXDOMAIN, SERVFAIL, REFUSED) | NXDOMAIN |
| Affected time | Log timestamp (accurate to the second) | 2022-12-22 20:00:03 |
| Affected ECS instances | ECS instance IDs where CoreDNS pod replicas reside | i-xxxxx i-yyyyy |
Newly added Headless domain names cannot be resolved
Issue description
Application pods connected to CoreDNS cannot resolve newly added Headless domain names.
Root cause
CoreDNS versions earlier than 1.7.0 exit abnormally during API Server jitter, causing Headless domain names to stop updating.
Solution
Upgrade CoreDNS to version 1.7.0 or later. For details, see [Component upgrade] CoreDNS upgrade announcement.
Headless domain name resolution failures
Issue description
Application pods connected to CoreDNS cannot resolve Headless domain names. When you use dig for resolution, the response has the tc (truncated) flag set, indicating that the response message is too large.
Root cause
When a Headless domain name corresponds to too many IP entries, the DNS response may exceed the size limit for DNS messages over UDP. The response is then truncated, causing resolution failures.
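As a back-of-the-envelope estimate, the number of A records that fit in one UDP response can be computed from the DNS wire format. The sketch below assumes that answer names are compressed to a 2-byte pointer and that the response carries no additional records, so real limits are somewhat lower. The 512-byte default is the classic limit for DNS over UDP without EDNS. The function names are hypothetical.

```python
DNS_HEADER_BYTES = 12
# Compressed name pointer (2) + type (2) + class (2) + TTL (4)
# + rdlength (2) + IPv4 address (4).
A_RECORD_BYTES = 16


def question_bytes(name: str) -> int:
    """Size of the question section for `name`."""
    # Encoded name = one length byte per label + label bytes + root byte,
    # which equals len(name) + 2 for a name without a trailing dot.
    return len(name.rstrip(".")) + 2 + 4  # + QTYPE (2) + QCLASS (2)


def max_a_records(name: str, limit: int = 512) -> int:
    """Approximate number of A records that fit in `limit` bytes."""
    available = limit - DNS_HEADER_BYTES - question_bytes(name)
    return max(available // A_RECORD_BYTES, 0)


if __name__ == "__main__":
    print(max_a_records("headless-svc.ns.svc.cluster.local"))
```

Under these assumptions, a Headless service with more than a few dozen ready pods can already exceed the 512-byte limit.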
Solution
To avoid resolution failures, adjust your client application to use TCP for DNS queries. CoreDNS supports both TCP and UDP queries. Modify your application based on the following scenarios:
-
glibc-based resolvers
If your client application uses a glibc-based resolver, add the use-vc option to dnsConfig to use TCP for DNS queries. This setting maps to the corresponding options entry in /etc/resolv.conf. For details on options, see Linux man pages.
dnsConfig:
  options:
  - name: use-vc
-
Golang application logic
If you develop with Golang, refer to the following code to use TCP for DNS queries.
package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	resolver := &net.Resolver{
		// Use the pure-Go resolver so that the custom Dial function is honored.
		PreferGo: true,
		// Ignore the requested network and always dial the DNS server over TCP.
		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "tcp", address)
		},
	}
	addrs, err := resolver.LookupHost(context.TODO(), "example.com")
	if err != nil {
		fmt.Println("Error:", err)
		return
	}
	fmt.Println("Addresses:", addrs)
}
Headless domain names cannot be resolved after CoreDNS upgrade
Issue description
Some older open-source components (such as older versions of etcd, Nacos, and Kafka) do not work properly in environments with Kubernetes 1.20 or later and CoreDNS 1.8.4 or later.
Root cause
CoreDNS 1.8.4 and later prioritize the EndpointSlice API to synchronize Kubernetes service IP information. Some open-source components use the annotation service.alpha.kubernetes.io/tolerate-unready-endpoints from the Endpoint API to publish services that are not ready during initialization. This annotation is deprecated in the EndpointSlice API and replaced by publishNotReadyAddresses. After upgrading CoreDNS, unready services are not published, causing these components to fail at service discovery.
Solution
Check whether the YAML or Helm Chart of the open-source component contains the annotation service.alpha.kubernetes.io/tolerate-unready-endpoints. If it does, the component may not work properly. Upgrade the open-source component or consult its community.
StatefulSets pod domain names cannot be resolved
Issue description
Pod domain names of a StatefulSet behind a Headless service cannot be resolved.
Root cause
In the StatefulSet YAML, the serviceName field must match the name of the Headless service that exposes the pods. Otherwise, pod domain names (for example, pod.headless-svc.ns.svc.cluster.local) cannot be accessed, and only service domain names (for example, headless-svc.ns.svc.cluster.local) are accessible.
Solution
Set the serviceName field in the StatefulSet YAML to the name of the Headless service.
Incorrect security group or vSwitch ACL configuration
Issue description
Application pods connected to CoreDNS on some or all nodes consistently fail to resolve domain names.
Root cause
The security group (or vSwitch ACL) used by the ECS instances or containers was modified and now blocks communication on UDP port 53.
Solution
Restore the security group and vSwitch ACL configurations to allow UDP communication on port 53.
Container network connectivity issues
Issue description
Application pods connected to CoreDNS on some or all nodes consistently fail to resolve domain names.
Root cause
Container network issues or other causes lead to persistent UDP port 53 unavailability.
Solution
You can use network diagnostics to diagnose network connectivity between application pods and CoreDNS addresses.
High CoreDNS pod load
Issue description
-
Application pods connected to CoreDNS on some or all nodes experience increased resolution latency and probabilistic or consistent failures.
-
Checking CoreDNS pod status shows CPU and memory usage of replicas approaching their resource limits.
Root cause
Insufficient CoreDNS replicas or high business request volume causes high CoreDNS load.
Solution
-
Consider using NodeLocal DNSCache to improve DNS resolution performance and reduce CoreDNS load. For details, see Use NodeLocal DNSCache.
-
Scale out CoreDNS replicas appropriately so that peak CPU usage per pod remains below the node’s idle CPU capacity.
CoreDNS pod load imbalance
Issue description
-
Some application pods connected to CoreDNS experience increased resolution latency and probabilistic or consistent failures.
-
Checking CoreDNS pod status shows uneven CPU usage across replicas.
-
Fewer than two CoreDNS replicas exist, or multiple replicas reside on the same node.
Root cause
Uneven CoreDNS replica scheduling or Service affinity settings cause CoreDNS pod load imbalance.
Solution
-
Scale out and distribute CoreDNS replicas across different nodes.
-
When load imbalance occurs, disable the affinity property of the kube-dns service. For details, see Unmanaged CoreDNS automatic upgrade.
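The replica placement conditions in the issue description can be checked from the output of kubectl -n kube-system get pod -l k8s-app=kube-dns -o json. The following sketch counts replicas per node from that JSON. The function names are hypothetical.

```python
import json
from collections import Counter


def replicas_per_node(pod_list_json: str) -> Counter:
    """Count CoreDNS replicas per node from `kubectl get pod -o json` output."""
    pods = json.loads(pod_list_json)["items"]
    return Counter(pod["spec"].get("nodeName", "<unscheduled>") for pod in pods)


def scheduling_warnings(counts: Counter) -> list[str]:
    """Report the placement problems named in the issue description."""
    warnings = []
    if sum(counts.values()) < 2:
        warnings.append("fewer than two CoreDNS replicas")
    for node, count in sorted(counts.items()):
        if count > 1:
            warnings.append(f"{count} replicas on node {node}")
    return warnings
```

An empty list from scheduling_warnings means that at least two replicas exist and no node hosts more than one of them.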
Abnormal CoreDNS pod status
Issue description
-
Some application pods connected to CoreDNS experience increased resolution latency and probabilistic or consistent failures.
-
CoreDNS replica status is not Running, or the RESTARTS count keeps increasing.
-
CoreDNS operational logs show anomalies.
Root cause
CoreDNS YAML templates or configuration files cause CoreDNS to run abnormally.
Solution
Check CoreDNS pod status and operational logs.
Common abnormal logs and solutions
|
Log message |
Cause |
Solution |
|
|
The configuration file is incompatible with CoreDNS. The |
Remove the ready plug-in from the CoreDNS configuration item in the kube-system namespace. Apply the same approach to resolve similar errors. |
|
|
The API server was unavailable during the time period shown in the log. |
If the log timestamp does not match the time of the abnormal event, rule out this cause. Otherwise, check network connectivity for the CoreDNS pod. For more information, see Check network connectivity for the CoreDNS pod. |
|
|
CoreDNS could not connect to the upstream DNS server during the time period shown in the log. |
Resolution failures caused by client-side load
Issue description
Resolution failures occur sporadically during peak business hours or suddenly. ECS monitoring shows abnormal NIC retransmission rates and CPU load.
Root cause
The ECS instance hosting the application pod connected to CoreDNS reaches 100% load, causing UDP packet loss.
Solution
We recommend using NodeLocal DNSCache to improve DNS resolution performance and reduce CoreDNS load. For details, see Use NodeLocal DNSCache.
Full Conntrack table
Issue description
-
Application pods connected to CoreDNS on some or all nodes experience massive domain resolution failures during peak business hours, which disappear after the peak.
-
Running dmesg -H and scrolling to the issue period shows log entries containing conntrack full.
Root cause
The Linux Conntrack table holds a limited number of entries. When the table is full, new UDP or TCP connections cannot be tracked and their packets are dropped.
Solution
Increase the Conntrack table limit. For details, see How to increase the Linux connection tracking (Conntrack) limit?.
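To see how close a node is to the limit before failures start, the current entry count can be compared with the maximum. The sketch below reads the two kernel files that expose these values when the nf_conntrack module is loaded. The function names are hypothetical, and any alert threshold is left to you.

```python
from pathlib import Path

COUNT_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_count")
MAX_PATH = Path("/proc/sys/net/netfilter/nf_conntrack_max")


def conntrack_usage(count_text: str, max_text: str) -> float:
    """Return the fraction of the Conntrack table in use (0.0 to 1.0)."""
    count = int(count_text.strip())
    limit = int(max_text.strip())
    if limit <= 0:
        raise ValueError("nf_conntrack_max must be positive")
    return count / limit


def read_conntrack_usage() -> float:
    """Read the current usage from /proc on the node."""
    return conntrack_usage(COUNT_PATH.read_text(), MAX_PATH.read_text())
```

Run read_conntrack_usage on the node during peak hours. A value that approaches 1.0 is consistent with the symptoms described above.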
AutoPath plug-in issues
Issue description
-
Resolving cluster-external domain names probabilistically fails or resolves to incorrect IP addresses. Cluster-internal domain name resolution works normally.
-
During high-frequency container creation, cluster-internal service domain names resolve to incorrect IP addresses.
Root cause
A CoreDNS processing defect causes AutoPath to malfunction.
Solution
Follow these steps to disable the AutoPath plug-in.
-
Run kubectl -n kube-system edit configmap coredns to open the CoreDNS configuration file.
-
Delete the line autopath @kubernetes and save the changes.
-
Check CoreDNS pod status and operational logs. The appearance of reload in the logs indicates that the modification took effect.
Concurrent A and AAAA record resolution issues
Issue description
-
Application pods connected to CoreDNS probabilistically fail to resolve domain names.
-
Packet capture or CoreDNS DNS query request logs show A and AAAA requests occurring simultaneously with identical source ports.
Root cause
-
Concurrent A and AAAA DNS requests trigger a defect in the Linux kernel Conntrack module, causing UDP packet loss.
-
Older libc versions (<2.33) on ARM architectures have concurrency issues when initiating simultaneous A and AAAA requests, causing request timeouts and retransmissions. See GLIBC#26600.
Solution
-
Consider using NodeLocal DNSCache to improve DNS resolution performance and reduce CoreDNS load. For details, see Use NodeLocal DNSCache.
-
For base images using libc (such as CentOS and Ubuntu), upgrade libc to version 2.33 or later to avoid concurrent A and AAAA resolution issues.
-
For base images like CentOS and Ubuntu, optimize using parameters such as options timeout:2 attempts:3 rotate single-request-reopen.
-
If your container image is based on Alpine, consider switching to a different base image. For more information, see Alpine.
-
PHP applications often face short-connection resolution issues. If using PHP Curl, use the CURL_IPRESOLVE_V4 parameter to send IPv4-only resolution requests. For more information, see Function reference.
DNS resolution failures after CoreDNS pod anomalies in IPVS mode
Issue description
When kube-proxy runs in IPVS mode, the cluster may experience probabilistic DNS resolution failures after CoreDNS pods become abnormal under specific conditions. The failures typically last about five minutes.
Root cause
Under specific conditions, DNS resolution requests are sent to CoreDNS pods in an abnormal state, causing resolution failures.
For example, when a node hosting a CoreDNS pod is removed, node resources are immediately released, and the pod stops working. However, the cluster takes about one minute to detect the node status update and mark it as NotReady. Before the node status updates, the pod is still considered healthy and accepts DNS resolution requests, causing probabilistic DNS resolution failures in the cluster.
After the node is marked NotReady, its CoreDNS pods are immediately removed from the CoreDNS Service backend and stop accepting new connections. However, if the cluster’s kube-proxy load balancing mode is IPVS, the IPVS UDP session persistence policy causes some DNS requests to continue being sent to the pod until the UDP timeout period ends, leading to prolonged DNS resolution failures in the cluster.
This issue may occur on CentOS and Alibaba Cloud Linux 2 nodes with kernel versions earlier than 4.19.91-25.1.al7.x86_64.
Solution
-
Consider using NodeLocal DNSCache, which tolerates IPVS packet loss. For details, see Use NodeLocal DNSCache.
-
Optimize IPVS UDP timeout duration. For details, see Configure UDP timeout for IPVS clusters.
NodeLocal DNSCache not taking effect
Issue description
No traffic enters NodeLocal DNSCache, and all requests still go to CoreDNS.
Root cause
-
DNSConfig injection is not configured, so application pods still use the CoreDNS kube-dns service IP as the DNS server address.
-
Application pods use Alpine as the base image. Alpine concurrently requests all nameservers, including the local cache and CoreDNS.
Solution
-
Configure automatic DNSConfig injection. For details, see Use NodeLocal DNSCache.
-
If your container image is based on Alpine, consider switching to a different base image. For more information, see Alpine.
PrivateZone domain name resolution issues
Issue description
For applications connected to NodeLocal DNSCache, pods cannot resolve domain names registered in PrivateZone, cannot resolve Alibaba Cloud product API domain names containing vpc-proxy, or resolve them incorrectly.
Root cause
PrivateZone does not support TCP protocol and requires UDP access.
Solution
Configure prefer_udp in CoreDNS. For details, see Unmanaged CoreDNS configuration.
DNS resolution issues caused by sudden traffic spikes
Issue description
After a sudden traffic surge, some DNS requests fail to resolve.
Root cause
Sudden traffic spikes cause a surge in DNS requests, leading to excessive inbound and outbound traffic to CoreDNS. This may throttle CoreDNS CPU usage and cause resolution anomalies. Verify this scenario as follows:
-
Check on the node where CoreDNS pods reside.
Run the following command on the node.
nsenter -t <coredns-pid> -n -- netstat -su
Check for send or recv buffer error messages. If present, UDP packet loss exists. Example:
Udp:
    1090421 packets received
    850 packets to unknown port received
    15662 packet receive errors
    5607627 packets sent
    15662 receive buffer errors
    0 send buffer errors
-
Check CoreDNS pod CPU throttling metrics.
If CoreDNS CPU is throttled, intermittent DNS resolution failures or increased DNS response latency may occur. Combine this with the first point to confirm packet loss.
Note: Due to CPU usage sampling and calculation cycles (15 seconds), CPU throttling may occur even when CPU usage appears low. For more information, see Enable CPU Burst performance optimization.
In the Prometheus monitoring page, choose Application Monitoring > Cluster Pod Monitoring. Filter by Namespace kube-system, select the corresponding CoreDNS pod, and check the CPU Throttled Percent line chart in the CPU Resource section. If this metric is close to 0%, no CPU throttling occurs.
-
Whether you use ARMS Prometheus or a self-managed Prometheus solution, always collect CoreDNS metrics and use the CoreDNS dashboard to check for anomalies and identify the issue timeframe. Log on to the Container Service for Kubernetes console, open the Prometheus monitoring page of the cluster, and select the Network Monitoring tab to find CoreDNS.
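The first check in the list above can be scripted by extracting the buffer error counters from the netstat -su output. The sketch below assumes the output layout shown in the example. The function name is hypothetical.

```python
import re


def udp_buffer_errors(netstat_output: str) -> dict[str, int]:
    """Return the receive and send buffer error counts from the Udp section."""
    errors = {"receive": 0, "send": 0}
    in_udp_section = False
    for line in netstat_output.splitlines():
        stripped = line.strip()
        # Section headers such as "Udp:" or "UdpLite:" end with a colon.
        if stripped.endswith(":"):
            in_udp_section = stripped == "Udp:"
            continue
        if not in_udp_section:
            continue
        match = re.match(r"(\d+) (receive|send) buffer errors", stripped)
        if match:
            errors[match.group(2)] = int(match.group(1))
    return errors
```

Because the counters are cumulative, run the check twice a few seconds apart. A counter that keeps increasing indicates ongoing UDP packet loss.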

Solution
-
Configure NodeSelector to schedule CoreDNS pods to a dedicated node pool, enabling dedicated node deployment without CPU resource limits.
-
If CoreDNS pods have CPU resource limits configured, we recommend also enabling CPU Burst performance optimization.
-
Install the NodeLocal DNSCache component and configure pods to enable DNS caching. For details, see Use NodeLocal DNSCache.
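To judge whether the CPU limit recommendations above apply, throttling can also be estimated directly from the cgroup counters of the CoreDNS container. The sketch below parses the contents of a cpu.stat file, which exposes nr_periods and nr_throttled. The file location varies with the cgroup version and is an assumption to verify on your nodes. The result is a cumulative ratio since the container started, so it may differ from the CPU Throttled Percent chart. The function name is hypothetical.

```python
def throttled_percent(cpu_stat: str) -> float:
    """Percentage of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat.splitlines():
        fields = line.split()
        # Each line has the form "<counter name> <integer value>".
        if len(fields) == 2 and fields[1].isdigit():
            stats[fields[0]] = int(fields[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        # No CPU limit is enforced or no period has elapsed yet.
        return 0.0
    return 100.0 * stats.get("nr_throttled", 0) / periods
```

A value well above 0 while average CPU usage looks low matches the short-burst throttling described in the note above.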