Troubleshoot DNS resolution issues


This topic describes the diagnostic workflow, troubleshooting approaches, common inspection methods, and solutions for DNS resolution issues.

Diagnostic workflow

Important notes

DNS troubleshooting in Kubernetes is complex due to the multilayered and dynamic nature of network architectures. In addition to errors in CoreDNS and NodeLocal DNSCache components, the following factors can also cause DNS resolution failures.

Important

DNS best practices provide recommendations for different scenarios. Following these recommendations helps you configure DNS more effectively and reduces the likelihood of encountering DNS-related issues.

  • Multilayered network architecture

    The DNS resolution path involves multiple components, including CoreDNS/kube-dns, kube-proxy, and CNI plug-ins. A failure at any layer can cause issues. Therefore, troubleshoot each component step by step to identify the root cause. For the complete troubleshooting path, see Troubleshoot other components in the DNS path.

  • Service discovery mechanism and namespace obscurity

    • FQDN dependency: Accessing a service in another namespace requires a domain name that includes the namespace (for example, service.namespace or the full service.namespace.svc.cluster.local). If the namespace is not specified, the short name is resolved only within the pod's own namespace, and cross-namespace access fails without a clear error message.

    • Headless Services behavior: Headless Services return pod IPs directly. Improper configuration can result in incomplete or missing DNS records.

  • Network policy restrictions

    • Implicit blocking: If the Pod NetworkPolicy does not allow traffic on DNS ports (default UDP and TCP port 53), the pod cannot communicate with CoreDNS.

    • VPC security group interference: Firewall rules inside the cluster or VPC security group rules might drop DNS traffic.

    • Troubleshooting approach: Check network connectivity for CoreDNS pods in the kube-system namespace and verify that policies allow inbound and outbound traffic.

  • Limitations of debugging tools and logs

    • Missing tools: Many container images do not include dig or nslookup. Install them manually or use a temporary debug container.

    • Scattered logs: CoreDNS does not log DNS queries by default. Add the log plug-in to the CoreDNS configuration to record them. The logs are distributed across multiple replicas, so check every replica.

    • Debugging tip: Run quick DNS tests from a temporary pod:

      kubectl run -it --rm debug --image=nicolaka/netshoot -- dig <target-domain>

      or use:

      kubectl run -it --rm debug --image=nicolaka/netshoot -- nslookup <target-domain>

Terms

  • Cluster-internal domain names: CoreDNS exposes services in the cluster as cluster-internal domain names, which end with .cluster.local by default. CoreDNS resolves these domain names using its internal cache and does not query upstream DNS servers.

  • Cluster-external domain names: Domain names whose authoritative records are hosted by third-party DNS providers, Alibaba Cloud DNS (Cloud DNS), PrivateZone, and similar products. Upstream DNS servers resolve these domain names, and CoreDNS only forwards the resolution requests.

  • Application pod: A container pod you deploy in a Kubernetes cluster, excluding Kubernetes system component containers.

  • Application pod connected to CoreDNS: An application pod whose DNS server points to CoreDNS.

  • Application pod connected to NodeLocal DNSCache: After the NodeLocal DNSCache plug-in is installed in the cluster, DNSConfig is injected into application pods automatically or manually. These pods query the local cache component first for domain name resolution. If the local cache component is unreachable, they fall back to the kube-dns service provided by CoreDNS.

CoreDNS and NodeLocal DNSCache troubleshooting workflow

[Figure: CoreDNS and NodeLocal DNSCache troubleshooting flowchart]

  1. Identify the cause of the issue from the client error message. For details, see Common client errors.

    • If the error indicates that the domain name does not exist, see Troubleshoot by domain name type in Troubleshooting approaches.

    • If the error indicates that the DNS server is unreachable, see Troubleshoot by issue frequency in Troubleshooting approaches.

  2. If the preceding steps do not identify the cause, work through the checks in Common inspection methods.

Common client errors

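| Client | Error log | Possible issue |
| --- | --- | --- |
| ping | ping: xxx.yyy.zzz: Name or service not known | The domain name does not exist or the DNS server is unreachable. If resolution latency exceeds 5 seconds, the DNS server is likely unreachable. |
| curl | curl: (6) Could not resolve host: xxx.yyy.zzz | The domain name does not exist or the DNS server is unreachable. If resolution latency exceeds 5 seconds, the DNS server is likely unreachable. |
| PHP HTTP client | php_network_getaddresses: getaddrinfo failed: Name or service not known in xxx.php on line yyy | The domain name does not exist or the DNS server is unreachable. If resolution latency exceeds 5 seconds, the DNS server is likely unreachable. |
| Golang HTTP client | dial tcp: lookup xxx.yyy.zzz on 100.100.2.136:53: no such host | The domain name does not exist. |
| dig | ;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: xxxxx | The domain name does not exist. |
| Golang HTTP client | dial tcp: lookup xxx.yyy.zzz on 100.100.2.139:53: read udp 192.168.0.100:42922->100.100.2.139:53: i/o timeout | The DNS server is unreachable. |
| dig | ;; connection timed out; no servers could be reached | The DNS server is unreachable. |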

Troubleshoot other components in the DNS path

The following diagram shows the overall DNS resolution path. Components other than CoreDNS and NodeLocal DNSCache can also cause DNS resolution failures:

  • DNS Resolver: Programming languages such as Go and libraries such as glibc and musl may contain defects in their DNS resolution implementations, leading to occasional DNS resolution failures.

  • /etc/resolv.conf file: The DNS configuration file in containers contains DNS server IPs and DNS search domains. Incorrect configuration of this file causes DNS resolution failures.

  • kube-proxy: kube-proxy uses IPVS or iptables to forward requests to CoreDNS. If kube-proxy does not update its forwarding rules promptly when CoreDNS pods change, some requests are sent to backends that no longer exist, causing intermittent DNS resolution failures.

  • Upstream DNS Servers: CoreDNS resolves only cluster-internal domain names. For domain names that do not match the clusterDomain, CoreDNS queries higher-level DNS servers, such as VPC internal DNS. Misconfiguration of upstream DNS servers causes pods to fail when accessing non-cluster domain names.

Troubleshooting approaches

| Troubleshooting approach | Basis for troubleshooting | Issues and solutions |
| --- | --- | --- |
| Troubleshoot by domain name type | Both cluster-internal and cluster-external domain names fail | |
| Troubleshoot by domain name type | Only cluster-external domain names fail | Cluster-external domain name resolution issues |
| Troubleshoot by domain name type | Only PrivateZone or vpc-proxy domain names fail | PrivateZone domain name resolution issues |
| Troubleshoot by domain name type | Only Headless service domain names fail | |
| Troubleshoot by issue frequency | Complete resolution failure | |
| Troubleshoot by issue frequency | Issues occur only during peak business hours | |
| Troubleshoot by issue frequency | Issues occur very frequently | |
| Troubleshoot by issue frequency | Issues occur very infrequently | |
| Troubleshoot by issue frequency | Issues occur only during node scaling or CoreDNS scale-in | DNS resolution failures after CoreDNS pod anomalies in IPVS mode |

Common inspection methods

Check the DNS configuration of application pods

  • Command

    # View the YAML configuration of the foo pod and confirm that the dnsPolicy field meets expectations.
    kubectl get pod foo -o yaml
    # If dnsPolicy meets expectations, enter the pod container to check the effective DNS configuration.
    # Enter the foo container using bash. If bash is unavailable, use sh instead.
    kubectl exec -it foo -- bash
    # After entering the container, view the DNS configuration. The nameserver entry shows the DNS server address.
    cat /etc/resolv.conf

  • DNS Policy configuration description

    The following examples show DNS Policy configurations. Choose the appropriate configuration based on your scenario:

    Example 1: DNS Policy configuration for default scenarios

    apiVersion: v1
    kind: Pod
    metadata:
      name: <pod-name>
      namespace: <pod-namespace>
    spec:
      containers:
      - image: <container-image>
        name: <container-name>
      dnsPolicy: ClusterFirst
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30

    Example 2: DNS Policy configuration when using NodeLocal DNSCache

    apiVersion: v1
    kind: Pod
    metadata:
      name: <pod-name>
      namespace: <pod-namespace>
    spec:
      containers:
      - image: <container-image>
        name: <container-name>
      dnsPolicy: None
      dnsConfig:
        nameservers:
        - 169.254.20.10
        - 172.21.0.10
        options:
        - name: ndots
          value: "3"
        - name: timeout
          value: "1"
        - name: attempts
          value: "2"
        searches:
        - default.svc.cluster.local
        - svc.cluster.local
        - cluster.local
      securityContext: {}
      serviceAccount: default
      serviceAccountName: default
      terminationGracePeriodSeconds: 30

    | DNSPolicy value | DNS server used |
    | --- | --- |
    | Default | Use only when the pod does not access cluster-internal services. When the pod is created, it inherits the DNS server list from the /etc/resolv.conf file of the ECS node. |
    | ClusterFirst | This is the default DNSPolicy value. The pod uses the kube-dns service IP provided by CoreDNS as the DNS server. For pods with HostNetwork enabled, ClusterFirst behaves like Default. |
    | ClusterFirstWithHostNet | For pods with HostNetwork enabled, ClusterFirstWithHostNet behaves like ClusterFirst. |
    | None | Use with DNSConfig to customize DNS servers and parameters. When NodeLocal DNSCache injection is enabled, DNSConfig points the DNS server to the local cache IP and the kube-dns service IP provided by CoreDNS. |

Check CoreDNS pod status

Command

  • Run the following command to view pod information.

    kubectl -n kube-system get pod -o wide -l k8s-app=kube-dns

    Expected output:

    NAME                      READY   STATUS    RESTARTS   AGE   IP            NODE
    coredns-xxxxxxxxx-xxxxx   1/1     Running   0          25h   172.20.6.53   cn-hangzhou.192.168.0.198
  • Run the following command to view real-time resource usage of pods.

    kubectl -n kube-system top pod -l k8s-app=kube-dns

    Expected output:

    NAME                      CPU(cores)   MEMORY(bytes)
    coredns-xxxxxxxxx-xxxxx   3m           18Mi
  • If the pod is not in Running state, run kubectl -n kube-system describe pod <CoreDNS Pod name> to identify the issue.

Check CoreDNS operational logs

Command

Run the following command to check CoreDNS operational logs.

kubectl -n kube-system logs -f --tail=500 --timestamps coredns-xxxxxxxxx-xxxxx

| Parameter | Description |
| --- | --- |
| -f | Stream the log output continuously. |
| --tail=500 | Output the last 500 lines of logs. |
| --timestamps | Display timestamps alongside logs. |
| coredns-xxxxxxxxx-xxxxx | Name of the CoreDNS pod replica. |

Check CoreDNS DNS query request logs

Command

DNS query request logs appear in the container logs only after you enable the log plug-in in CoreDNS. For instructions on enabling the log plug-in, see Unmanaged CoreDNS configuration.

The command is the same as for checking CoreDNS operational logs. See Check CoreDNS operational logs.

Check CoreDNS pod network connectivity

You can use the console or command line to check CoreDNS pod network connectivity.

Console

Use the network diagnostics capabilities provided by the cluster.

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, click the name of your target cluster. In the navigation pane on the left, choose Inspections and Diagnostics > Diagnostics.

  3. On the Diagnostics page, click the Network diagnostics tab, and then click Diagnose in the upper-left corner.

  4. On the Network diagnostics page, click Diagnose. In the Access Information panel, fill in the diagnostic parameters as follows:

    • Source address: Enter the CoreDNS pod IP.

    • Destination address: Enter the upstream DNS server address. Default options are 100.100.2.136 or 100.100.2.138.

    • Port: 53

    • Protocol: udp

    After filling in the parameters, carefully read the notes, select I acknowledge and agree, then click Start Diagnosis.

  5. On the Diagnosis Results page, view the network diagnosis results. In the Access Overview section, the full access path of this diagnosis is displayed.

    In this example, the diagnosis result states, "No obvious issues found. Please further analyze based on diagnostic items or submit a ticket." The access path is kube-system/coredns Pod → ECS node (cn-hangzhou.172.xxx.xxx.240) → target DNS server (100.100.2.136).

Command line

Procedure

  1. Log on to the cluster node where the CoreDNS pod resides.

  2. Run ps aux | grep coredns to query the CoreDNS process ID.

  3. Run nsenter -t <pid> -n -- <command> to run a command in the network namespace of the CoreDNS container. Replace <pid> with the CoreDNS process ID obtained in the previous step and <command> with the command to run.

  4. Test network connectivity.

    1. Run telnet <apiserver_clusterip> 6443 to test connectivity to the Kubernetes API Server.

      Replace <apiserver_clusterip> with the cluster IP address of the kubernetes Service in the default namespace.

    2. Run dig <domain> @<upstream_dns_server_ip> to test connectivity from the CoreDNS pod to the upstream DNS server.

      Replace domain with the test domain name and upstream_dns_server_ip with the upstream DNS server address. Default addresses are 100.100.2.136 and 100.100.2.138.

Common issues

| Symptom | Cause | Solution |
| --- | --- | --- |
| CoreDNS cannot connect to the Kubernetes API Server | API Server anomalies, high machine load, or kube-proxy not running properly. | Submit a ticket for troubleshooting. |
| CoreDNS cannot connect to the upstream DNS server | High machine load, CoreDNS misconfiguration, or leased line routing issues. | Submit a ticket for troubleshooting. |

Check network connectivity between application pods and CoreDNS

You can use the console or command line to check network connectivity between application pods and CoreDNS.

Console

  1. Log on to the ACK console. In the navigation pane on the left, click Clusters.

  2. On the Clusters page, click the name of your target cluster. In the navigation pane on the left, choose Inspections and Diagnostics > Diagnostics.

  3. On the Diagnostics page, click the Network diagnostics tab, and then click Diagnose in the upper-left corner.

  4. On the Network diagnostics page, click Diagnose. In the Access Information panel, fill in the diagnostic parameters as follows:

    • Source address: Enter the application pod IP.

    • Destination address: Enter the CoreDNS instance PodIP or ClusterIP.

    • Port: 53

    • Protocol: udp

    After filling in the parameters, carefully read the notes, select I acknowledge and agree, then click Start Diagnosis.

  5. On the Diagnosis Results page, view the network diagnosis results. In the Access Overview section, the full access path of this diagnosis is displayed.

    The diagnosis result shows a FATAL record. The node is cn-hangzhou.172.xx.0.240, and the diagnosis content is invalid route: invalid route "0.0.0.0/0 dev eth1 via 172.16.3.253 scope universe type unicast" for packet (src=172.16.1.45, dst=172.16.1.3). The expected route is dev: calibb5fee8d7c0 scope: link type: unicast. The network topology in the Access Overview section shows the access path from the nginx pod through the faulty node cn-hangzhou.172.xx.0.240 (highlighted in red) to two coredns pods, clearly indicating the FATAL fault location.

Command line

Procedure

  1. Choose one of the following methods to enter the client pod container network.

    • Method 1: Use the kubectl exec command.

    • Method 2:

      1. Log on to the cluster node where the application pod resides.

      2. Run ps aux | grep <application-process-name> to query the application container process ID.

      3. Run nsenter -t <pid> -n bash to enter the container network namespace where the application pod resides.

        Replace pid with the process ID obtained in the previous step.

    • Method 3: If frequent restarts occur, follow these steps.

      1. Log on to the cluster node where the application pod resides.

      2. Run docker ps -a | grep <application-container-name> to find the sandbox container starting with k8s_POD_ and record its container ID.

      3. Run docker inspect <sandbox-container-ID> | grep netns to find the container network namespace path, such as /var/run/docker/netns/xxxx.

      4. Run nsenter -n<netns-path> bash to enter the container network namespace.

        Replace netns-path with the path obtained in the previous step.

        Note

        Do not add a space between -n and <netns-path>.

  2. Test network connectivity.

    1. Run dig <domain> @<kube_dns_svc_ip> to test connectivity for DNS resolution queries from the application pod to the CoreDNS kube-dns service.

      Replace <domain> with the test domain name and <kube_dns_svc_ip> with the kube-dns service IP in the kube-system namespace.

    2. Run ping <coredns_pod_ip> to test connectivity from the application pod to the CoreDNS pod replica.

      Replace <coredns_pod_ip> with the CoreDNS pod IP in the kube-system namespace.

    3. Run dig <domain> @<coredns_pod_ip> to test connectivity for DNS resolution queries from the application pod to the CoreDNS pod replica.

      Replace <domain> with the test domain name and <coredns_pod_ip> with the CoreDNS pod IP in the kube-system namespace.
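When dig is not available in the application image, the same check (send a query to one specific DNS server, as in dig <domain> @<server>) can be approximated from a Go program. The following is a sketch under stated assumptions: resolverFor is an illustrative helper, not a library function, and it uses the pure-Go resolver in the standard net package with a custom Dial function.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

// resolverFor returns a resolver that sends every query to server (host:port)
// over the given network ("udp" or "tcp"), ignoring the nameserver entries
// in /etc/resolv.conf.
func resolverFor(network, server string) *net.Resolver {
	return &net.Resolver{
		// The custom Dial function is honored only by the pure-Go resolver.
		PreferGo: true,
		Dial: func(ctx context.Context, _, _ string) (net.Conn, error) {
			d := net.Dialer{Timeout: 2 * time.Second}
			return d.DialContext(ctx, network, server)
		},
	}
}

func main() {
	if len(os.Args) < 3 {
		fmt.Println("usage: dnscheck <domain> <dns-server-ip>")
		return
	}
	domain, server := os.Args[1], net.JoinHostPort(os.Args[2], "53")

	// Query over UDP and TCP separately to see whether only one protocol fails.
	for _, network := range []string{"udp", "tcp"} {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		start := time.Now()
		addrs, err := resolverFor(network, server).LookupHost(ctx, domain)
		cancel()
		fmt.Printf("%s via %s: addrs=%v err=%v elapsed=%s\n",
			domain, network, addrs, err, time.Since(start).Round(time.Millisecond))
	}
}
```

Run it once with the kube-dns service IP and once with each CoreDNS pod IP. Pass the domain with a trailing dot (for example, www.aliyun.com.) to keep the search domains in /etc/resolv.conf from being applied.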

Common issues

| Symptom | Cause | Solution |
| --- | --- | --- |
| Application pod cannot resolve through CoreDNS kube-dns service | High machine load, kube-proxy not running properly, or security group not allowing UDP port 53. | Check if the security group allows UDP port 53. If it does, submit a ticket for troubleshooting. |
| Application pod cannot connect to CoreDNS pod replica | Container network issues or security group not allowing ICMP. | Check if the security group allows ICMP. If it does, submit a ticket for troubleshooting. |
| Application pod cannot resolve through CoreDNS pod replica | High machine load or security group not allowing UDP port 53. | Check if the security group allows UDP port 53. If it does, submit a ticket for troubleshooting. |

Capture packets

When you cannot locate the issue, capture packets for auxiliary diagnosis.

  1. Log on to the node where the problematic application pod or CoreDNS pod resides.

  2. On the ECS instance (outside the container), run the following command to capture all port 53 traffic into a file.

    tcpdump -i any port 53 -C 20 -W 200 -w /tmp/client_dns.pcap

  3. Use the error timestamps in the application logs to locate the corresponding packets in the capture files.

    Note
    • Under normal conditions, packet capture has no impact on business operations and only slightly increases CPU load and disk writes.

    • The preceding command rotates captured packets, writing up to 200 files of 20 MB each (.pcap files).

Cluster-external domain name resolution issues

Issue description

Application pods can resolve cluster-internal domain names normally but cannot resolve certain cluster-external domain names.

Root cause

The upstream server returns abnormal DNS resolution responses.

Solution

Check CoreDNS DNS query request logs.

Common request logs

CoreDNS logs a line after receiving a request and replying to the client. Example:

# The status code RCODE NOERROR indicates successful resolution.
[INFO] 172.20.2.25:44525 - 36259 "A IN redis-master.default.svc.cluster.local. udp 56 false 512" NOERROR qr,aa,rd 110 0.000116946s

Common RCODE return codes

For details on RCODE definitions, see the specification.

| Return code (RCODE) | Meaning | Cause |
| --- | --- | --- |
| NXDOMAIN | Domain name does not exist | Inside containers, search suffixes are appended to requested domain names. If the resulting domain name does not exist, this RCODE appears and is expected. If the domain name shown in the log is known to exist but NXDOMAIN is still returned, the response is abnormal. |
| SERVFAIL | Upstream server anomaly | Commonly occurs when the upstream DNS server is unreachable. |
| REFUSED | Response denied | Commonly occurs when the upstream DNS server configured in CoreDNS or in the cluster node's /etc/resolv.conf file cannot handle the domain name. Check the CoreDNS configuration file. |

When CoreDNS DNS query request logs show NXDOMAIN, SERVFAIL, or REFUSED for cluster-external domain names, the upstream DNS server returns abnormal responses.

By default, the upstream DNS servers for CoreDNS in the cluster are the VPC-provided DNS servers (100.100.2.136 and 100.100.2.138). You can submit a ticket to Elastic Compute Service (ECS). Include the following information when submitting the ticket.

| Field | Description | Example |
| --- | --- | --- |
| Affected domain name | Cluster-external domain name with abnormal RCODE in CoreDNS logs | www.aliyun.com |
| Return code (RCODE) | Specific resolution error (NXDOMAIN, SERVFAIL, REFUSED) | NXDOMAIN |
| Affected time | Log timestamp (accurate to the second) | 2022-12-22 20:00:03 |
| Affected ECS instances | ECS instance IDs where CoreDNS pod replicas reside | i-xxxxx i-yyyyy |

Newly added Headless domain names cannot be resolved

Issue description

Application pods connected to CoreDNS cannot resolve newly added Headless domain names.

Root cause

CoreDNS versions earlier than 1.7.0 may exit abnormally when the connection to the API Server is unstable, causing Headless domain names to stop updating.

Solution

Upgrade CoreDNS to version 1.7.0 or later. For details, see [Component upgrade] CoreDNS upgrade announcement.

Headless domain name resolution failures

Issue description

Application pods connected to CoreDNS cannot resolve Headless domain names. When using dig for resolution, the response shows the tc flag, indicating the response message is too large.

Root cause

When a Headless domain name corresponds to too many IP addresses, the DNS response may exceed the size limit for DNS messages sent over UDP. The response is then truncated (the tc flag is set), and clients that do not retry over TCP fail to resolve the name.

Solution

To avoid resolution failures, adjust your client application to use TCP for DNS queries. CoreDNS supports both TCP and UDP queries. Modify your application based on the following scenarios:

  • glibc-based resolvers

    If your client application uses a glibc-based resolver, add the use-vc option in dnsConfig to use TCP for DNS queries. This setting maps to the corresponding options entry in /etc/resolv.conf. For details on options configuration, see Linux man pages.

    dnsConfig:
      options:
      - name: use-vc
  • Golang application logic

    If you develop with Golang, refer to the following code to use TCP for DNS queries.

    package main

    import (
    	"context"
    	"fmt"
    	"net"
    )

    func main() {
    	resolver := &net.Resolver{
    		// Use the pure-Go resolver so that the custom Dial function is honored.
    		PreferGo: true,
    		// Ignore the requested network (normally "udp") and always dial TCP.
    		Dial: func(ctx context.Context, network, address string) (net.Conn, error) {
    			var d net.Dialer
    			return d.DialContext(ctx, "tcp", address)
    		},
    	}
    	addrs, err := resolver.LookupHost(context.TODO(), "example.com")
    	if err != nil {
    		fmt.Println("Error:", err)
    		return
    	}
    	fmt.Println("Addresses:", addrs)
    }

Headless domain names cannot be resolved after CoreDNS upgrade

Issue description

Some older open-source components (such as older versions of etcd, Nacos, and Kafka) do not work properly in environments with Kubernetes 1.20 or later and CoreDNS 1.8.4 or later.

Root cause

CoreDNS 1.8.4 and later prioritize the EndpointSlice API to synchronize Kubernetes service IP information. Some open-source components use the annotation service.alpha.kubernetes.io/tolerate-unready-endpoints from the Endpoint API to publish services that are not ready during initialization. This annotation is deprecated in the EndpointSlice API and replaced by publishNotReadyAddresses. After upgrading CoreDNS, unready services are not published, causing these components to fail at service discovery.

Solution

Check whether the YAML or Helm Chart of the open-source component contains the annotation service.alpha.kubernetes.io/tolerate-unready-endpoints. If it does, the component may not work properly. Upgrade the open-source component or consult its community.

StatefulSets pod domain names cannot be resolved

Issue description

Pod domain names under a Headless Service cannot be resolved.

Root cause

In the StatefulSet YAML, the serviceName field must match the name of the Headless Service that exposes the pods. Otherwise, pod domain names (for example, pod.headless-svc.ns.svc.cluster.local) cannot be resolved, and only the service domain name (for example, headless-svc.ns.svc.cluster.local) is resolvable.

Solution

Set serviceName in the StatefulSet YAML to the name of the Headless Service.

Incorrect security group or vSwitch ACL configuration

Issue description

Application pods connected to CoreDNS on some or all nodes consistently fail to resolve domain names.

Root cause

The security group (or vSwitch ACL) used by the ECS instances or containers was modified and now blocks communication on UDP port 53.

Solution

Restore the security group and vSwitch ACL configurations to allow UDP communication on port 53.

Container network connectivity issues

Issue description

Application pods connected to CoreDNS on some or all nodes consistently fail to resolve domain names.

Root cause

Container network issues or other causes lead to persistent UDP port 53 unavailability.

Solution

You can use network diagnostics to diagnose network connectivity between application pods and CoreDNS addresses.

High CoreDNS pod load

Issue description

  • Application pods connected to CoreDNS on some or all nodes experience increased resolution latency and intermittent or persistent failures.

  • Checking CoreDNS pod status shows CPU and memory usage of replicas approaching their resource limits.

Root cause

Insufficient CoreDNS replicas or high business request volume causes high CoreDNS load.

Solution

  • Consider using NodeLocal DNSCache to improve DNS resolution performance and reduce CoreDNS load. For details, see Use NodeLocal DNSCache.

  • Scale out CoreDNS replicas appropriately so that peak CPU usage per pod remains below the node’s idle CPU capacity.

CoreDNS pod load imbalance

Issue description

  • Some application pods connected to CoreDNS experience increased resolution latency and intermittent or persistent failures.

  • Checking CoreDNS pod status shows uneven CPU usage across replicas.

  • Fewer than two CoreDNS replicas exist, or multiple replicas reside on the same node.

Root cause

Uneven CoreDNS replica scheduling or Service affinity settings cause CoreDNS pod load imbalance.

Solution

  • Scale out and distribute CoreDNS replicas across different nodes.

  • When load imbalance occurs, disable the affinity property of the kube-dns service. For details, see Unmanaged CoreDNS automatic upgrade.

Abnormal CoreDNS pod status

Issue description

  • Some application pods connected to CoreDNS experience increased resolution latency and intermittent or persistent failures.

  • CoreDNS replica status is not Running, or the RESTARTS count keeps increasing.

  • CoreDNS operational logs show anomalies.

Root cause

Errors in the CoreDNS YAML template or configuration file cause CoreDNS to run abnormally.

Solution

Check CoreDNS pod status and operational logs.

Common abnormal logs and solutions

| Log message | Cause | Solution |
| --- | --- | --- |
| /etc/coredns/Corefile:4 - Error during parsing: Unknown directive 'ready' | The configuration file is incompatible with CoreDNS. The Unknown directive error indicates that the current CoreDNS version does not support the ready plug-in. | Remove the ready plug-in from the coredns ConfigMap in the kube-system namespace. Apply the same approach to resolve similar errors. |
| pkg/mod/k8s.io/client-go@v0.18.3/tools/cache/reflector.go:125: Failed to watch *v1.Pod: Get "https://192.168.0.1:443/api/v1/": dial tcp 192.168.0.1:443: connect: connection refused | The API server was unavailable during the time period shown in the log. | If the log timestamp does not match the time of the abnormal event, rule out this cause. Otherwise, check network connectivity for the CoreDNS pod. For more information, see Check CoreDNS pod network connectivity. |
| [ERROR] plugin/errors: 2 www.aliyun.com. A: read udp 172.20.6.53:58814->100.100.2.136:53: i/o timeout | CoreDNS could not connect to the upstream DNS server during the time period shown in the log. | If the log timestamp does not match the time of the abnormal event, rule out this cause. Otherwise, check network connectivity for the CoreDNS pod. For more information, see Check CoreDNS pod network connectivity. |

Resolution failures caused by client-side load

Issue description

Resolution failures occur during peak business hours or in sudden, sporadic bursts. ECS monitoring shows abnormal NIC retransmission rates and CPU load.

Root cause

The ECS instance hosting the application pod connected to CoreDNS reaches 100% load, causing UDP packet loss.

Solution

We recommend using NodeLocal DNSCache to improve DNS resolution performance and reduce CoreDNS load. For details, see Use NodeLocal DNSCache.

Full Conntrack table

Issue description

  • Application pods connected to CoreDNS on some or all nodes experience massive domain resolution failures during peak business hours, which disappear after the peak.

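  • Running dmesg -H and scrolling to the issue period shows log entries indicating that the Conntrack table is full (for example, nf_conntrack: table full, dropping packet).

Root cause

The Linux Conntrack table holds a limited number of entries. When the table is full, packets for new UDP or TCP connections are dropped.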

Solution

Increase the Conntrack table limit. For details, see How to increase the Linux connection tracking (Conntrack) limit?.

AutoPath plug-in issues

Issue description

  • Resolving cluster-external domain names probabilistically fails or resolves to incorrect IP addresses. Cluster-internal domain name resolution works normally.

  • During high-frequency container creation, cluster-internal service domain names resolve to incorrect IP addresses.

Root cause

A CoreDNS processing defect causes AutoPath to malfunction.

Solution

Follow these steps to disable the AutoPath plug-in.

  1. Run kubectl -n kube-system edit configmap coredns to open the CoreDNS configuration file.

  2. Delete the line autopath @kubernetes and save the changes.

  3. Check CoreDNS pod status and operational logs. The appearance of reload in the logs indicates successful modification.

Concurrent A and AAAA record resolution issues

Issue description

  • Application pods connected to CoreDNS probabilistically fail to resolve domain names.

  • Packet capture or CoreDNS DNS query request logs show A and AAAA requests occurring simultaneously with identical source ports.

Root cause

  • Concurrent A and AAAA DNS requests trigger a defect in the Linux kernel Conntrack module, causing UDP packet loss.

  • Older libc versions (<2.33) on ARM architectures have concurrency issues when initiating simultaneous A and AAAA requests, causing request timeouts and retransmissions. See GLIBC#26600.

Solution

  • Consider using NodeLocal DNSCache to improve DNS resolution performance and reduce CoreDNS load. For details, see Use NodeLocal DNSCache.

  • For base images that use glibc (such as CentOS and Ubuntu), upgrade glibc to version 2.33 or later to avoid concurrent A and AAAA resolution issues.

  • For base images such as CentOS and Ubuntu, add resolver options such as options timeout:2 attempts:3 rotate single-request-reopen to /etc/resolv.conf.

  • If your container image is based on Alpine, consider switching to a different base image. For more information, see Alpine.

  • PHP applications often face short-connection resolution issues. If using PHP Curl, use the CURL_IPRESOLVE_V4 parameter to send IPv4-only resolution requests. For more information, see Function reference.

DNS resolution failures after CoreDNS pod anomalies in IPVS mode

Issue description

When kube-proxy runs in IPVS mode, the cluster may experience intermittent DNS resolution failures after a CoreDNS pod becomes abnormal. The failures typically last about five minutes.

Root cause

Under specific conditions, DNS resolution requests are sent to CoreDNS pods in an abnormal state, causing resolution failures.

For example, when a node hosting a CoreDNS pod is removed, node resources are immediately released, and the pod stops working. However, the cluster takes about one minute to detect the node status update and mark it as NotReady. Before the node status updates, the pod is still considered healthy and accepts DNS resolution requests, causing probabilistic DNS resolution failures in the cluster.

After the node is marked NotReady, its CoreDNS pods are immediately removed from the CoreDNS Service backend and stop accepting new connections. However, if the cluster’s kube-proxy load balancing mode is IPVS, the IPVS UDP session persistence policy causes some DNS requests to continue being sent to the pod until the UDP timeout period ends, leading to prolonged DNS resolution failures in the cluster.

Note

This issue may occur on CentOS and Alibaba Cloud Linux 2 nodes with kernel versions earlier than 4.19.91-25.1.al7.x86_64.

Solution

NodeLocal DNSCache not taking effect

Issue description

No traffic enters NodeLocal DNSCache, and all requests still go to CoreDNS.

Root cause

  • DNSConfig injection is not configured, so application pods still use the CoreDNS kube-dns service IP as the DNS server address.

  • Application pods use Alpine as the base image. Alpine concurrently requests all nameservers, including the local cache and CoreDNS.

Solution

  • Configure automatic DNSConfig injection. For details, see Use NodeLocal DNSCache.

  • If your container image is based on Alpine, consider switching to a different base image. For more information, see Alpine.

PrivateZone domain name resolution issues

Issue description

For applications connected to NodeLocal DNSCache, pods cannot resolve domain names registered in PrivateZone, cannot resolve Alibaba Cloud product API domain names containing vpc-proxy, or resolve them incorrectly.

Root cause

PrivateZone does not support TCP protocol and requires UDP access.

Solution

Configure prefer_udp in CoreDNS. For details, see Unmanaged CoreDNS configuration.

DNS resolution issues caused by sudden traffic spikes

Issue description

After a sudden traffic surge, some DNS requests fail to resolve.

Root cause

Sudden traffic spikes cause a surge in DNS requests, leading to excessive inbound and outbound traffic to CoreDNS. This may throttle CoreDNS CPU usage and cause resolution anomalies. Verify this scenario as follows:

  1. Check for UDP packet loss on the node where the CoreDNS pods reside.

    Run the following command on the node.

    nsenter -t <coredns-pid> -n -- netstat -su

    Check the receive buffer errors and send buffer errors counters. Nonzero values indicate UDP packet loss. Example:

    Udp:
        1090421 packets received
        850 packets to unknown port received
        15662 packet receive errors
        5607627 packets sent
        15662 receive buffer errors
        0 send buffer errors
  2. Check CoreDNS pod CPU throttling metrics.

    If CoreDNS CPU is throttled, intermittent DNS resolution failures or increased DNS response latency may occur. Combine this with the result of step 1 to confirm packet loss.

    Note

    Due to CPU usage sampling and calculation cycles (15 seconds), CPU throttling may occur even when CPU usage appears low. For more information, see Enable CPU Burst performance optimization.

    In the Prometheus monitoring page, choose Application Monitoring > Cluster Pod Monitoring. Filter by Namespace kube-system, select the corresponding CoreDNS pod, and check the CPU Throttled Percent line chart in the CPU Resource section. If this metric is close to 0%, no CPU throttling occurs.

  3. Whether you use ARMS Prometheus or a self-managed Prometheus solution, always collect CoreDNS metrics and use the CoreDNS dashboard to check for anomalies and identify the issue timeframe. Log on to the Container Service for Kubernetes console, then navigate to Operations > Prometheus Monitoring and select the Network Monitoring tab to find CoreDNS.


Solution