Troubleshoot consumer issues

Diagnose and resolve common consumer-side problems in ApsaraMQ for Kafka, including connection failures, frequent rebalances, slow message pulls, message accumulation, and authorization errors.

First-time connection failures

When a client fails to connect to ApsaraMQ for Kafka for the first time, work through these checks in order.

Network connectivity

Most first-time connection failures are network-related:

  • VPC mismatch: The client's Elastic Compute Service (ECS) instance must reside in the same virtual private cloud (VPC) as the ApsaraMQ for Kafka instance. For setup instructions, see Purchase and deploy a VPC-connected instance.

  • Wrong access method: A VPC-connected ApsaraMQ for Kafka instance cannot be accessed over the Internet. Use the VPC endpoint, not the public endpoint.

  • Missing whitelist entry: Add the client's IP address to the instance whitelist. See Configure whitelists.

Client version

Use a client version that matches the ApsaraMQ for Kafka instance version. For example, if the instance runs version 0.10.2.2, use a client of the same version. For client demos, see aliware-kafka-demos.

Note

The instance version takes precedence. Always match the client version to it.

Endpoint and permissions

Confirm these settings:

  • The default endpoint is valid and points to the correct access method (VPC or Internet).

  • The Resource Access Management (RAM) user has the required permissions.

For a complete setup walkthrough, see Getting started overview.

Frequent consumer rebalances

Frequent rebalances usually point to one of three causes: slow message processing, misconfigured consumer parameters, or an outdated client version.

How rebalances are triggered

The trigger mechanism depends on the client version:

| Client version | Trigger mechanism |
| --- | --- |
| Earlier than 0.10.2 | No separate heartbeat thread. Heartbeats are sent through the poll() method. If message processing takes too long, the heartbeat times out and triggers a rebalance. |
| 0.10.2 or later | A separate heartbeat thread exists. However, if poll() is not called within the max.poll.interval.ms interval (default: 5 minutes), the client leaves the consumer group and triggers a rebalance. |

Key parameters

| Parameter | Applies to | Description |
| --- | --- | --- |
| session.timeout.ms | All versions | Session timeout. If the broker receives no heartbeat within this period, it removes the consumer from the group. A higher value tolerates longer pauses but delays detection of crashed consumers. |
| max.poll.records | All versions | Maximum number of messages returned per poll() call. |
| max.poll.interval.ms | 0.10.2 or later | Maximum allowed interval between consecutive poll() calls. Exceeding this causes the consumer to leave the group. |

Solution

1. Tune parameters

Configure these parameters based on your client version:

session.timeout.ms

| Client version | Recommended value |
| --- | --- |
| Earlier than 0.10.2 | Larger than the time to process one batch of messages, but no greater than 30 seconds. 25 seconds works well for most cases. |
| 0.10.2 or later | Keep the default (10 seconds). |

Note

A larger session.timeout.ms value tolerates longer processing times and network hiccups, but it also means the broker takes longer to detect a crashed consumer and reassign its partitions.

max.poll.records

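Set this value well below the number of messages that all processing threads can handle within one max.poll.interval.ms window. The rate is per second and the interval is in milliseconds, so divide the interval by 1000:

max.poll.records << messages_per_thread_per_second * number_of_threads * (max.poll.interval.ms / 1000)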

max.poll.interval.ms (0.10.2 or later only)

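Set this value above the time needed to process one full batch. The rate is per second and the parameter is in milliseconds, so multiply by 1000:

max.poll.interval.ms > 1000 * max.poll.records / (messages_per_thread_per_second * number_of_threads)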
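
The two sizing rules above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names and the workload figures (4 threads, 50 messages per thread per second) are hypothetical, and it assumes the processing rate is measured per second while max.poll.interval.ms is in milliseconds, so it converts between the two.

```python
import math


def min_poll_interval_ms(max_poll_records, msgs_per_thread_per_sec, threads):
    """Smallest max.poll.interval.ms (in ms) that covers one full batch."""
    throughput = msgs_per_thread_per_sec * threads  # messages per second
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    # Seconds needed for one batch, converted to milliseconds.
    return math.ceil(max_poll_records / throughput * 1000)


def max_poll_records_ceiling(msgs_per_thread_per_sec, threads, max_poll_interval_ms):
    """Messages that can be processed in one interval; stay well below this."""
    return msgs_per_thread_per_sec * threads * max_poll_interval_ms / 1000


# Hypothetical workload: 4 threads, each handling 50 messages per second.
print(min_poll_interval_ms(500, 50, 4))          # 2500
print(max_poll_records_ceiling(50, 4, 300_000))  # 60000.0
```

With these example numbers, a batch of 500 records needs at least 2.5 seconds, so the 5-minute default interval leaves a wide margin.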

2. Offload message processing to a separate thread

Move business logic to a dedicated processing thread so that slow operations do not block heartbeats or delay poll() calls.
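
A minimal sketch of this pattern, using only the Python standard library, is shown here. It is not tied to any Kafka client: poll and handle are hypothetical callables standing in for the client's poll() call and your business logic, and a return value of None from poll is an assumed end-of-stream signal used only to make the example terminate.

```python
import queue
import threading


def run_consumer(poll, handle, max_pending=100):
    """Call poll() in this thread and run handle() in a worker thread."""
    pending = queue.Queue(maxsize=max_pending)
    done = object()  # sentinel that tells the worker to stop

    def worker():
        while True:
            record = pending.get()
            if record is done:
                break
            handle(record)  # slow business logic runs here

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    while True:
        batch = poll()
        if batch is None:  # hypothetical end-of-stream signal
            break
        for record in batch:
            pending.put(record)  # blocks if the worker falls far behind

    pending.put(done)
    thread.join()


# Usage with stand-in data instead of a real consumer.
batches = iter([["m1", "m2"], ["m3"], None])
run_consumer(lambda: next(batches), print)
```

The bounded queue is a design choice: it caps memory use, but when it fills up the polling thread blocks again. With a real client you would typically stop fetching new data while the backlog drains and commit offsets only after the worker has finished the corresponding records.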

3. Limit topics per consumer group

Subscribe to no more than five topics per consumer group. For best stability, subscribe to one topic per group.

4. Upgrade to version 0.10.2 or later

Versions 0.10.2 and later use a separate heartbeat thread, which prevents processing delays from causing heartbeat timeouts.

Slow or failed message pulls

If a consumer pulls messages slowly or not at all, even though the subscribed topics contain messages and the consumer has not caught up to the latest offset, the cause is almost always bandwidth. This occurs more frequently over Internet connections.

Possible causes

  • Consumption traffic has saturated the network bandwidth.

  • A single message is too large for the available bandwidth.

  • The batch of messages pulled in one poll() call exceeds the available bandwidth.

Note

Three parameters control how much data a consumer pulls per poll() call:

  • max.poll.records: maximum number of messages per poll.

  • fetch.max.bytes: maximum total bytes per poll.

  • max.partition.fetch.bytes: maximum bytes per partition per poll.

Diagnose and fix

  1. Log in to the ApsaraMQ for Kafka console and query messages for the topic. If messages are returned, proceed to the next step.

  2. In the left-side navigation pane of the Instances page, choose Observability > CloudMonitor. On the Monitoring Chart tab, check the instance_internet_rx.rate(bit/s) chart to determine whether consumption traffic has hit the bandwidth cap. If it has, increase the network bandwidth.

  3. Check whether individual messages are too large for the available bandwidth. If so, increase the bandwidth or reduce the message size.

  4. Check whether the total size of messages pulled per poll() call exceeds the bandwidth. If so, adjust these parameters:

    • fetch.max.bytes: set to a value smaller than the network bandwidth.

    • max.partition.fetch.bytes: set to a value smaller than network bandwidth / number of subscribed partitions.

Important

  • For VPC access, "network bandwidth" refers to the maximum write traffic of the instance's elastic network interfaces (ENIs).

  • For Internet access, "network bandwidth" refers to the Internet bandwidth allocated to the instance.
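
As a rough illustration of step 4, the sketch below derives candidate values for the two fetch parameters from a bandwidth figure. It rests on several assumptions that are not part of the product documentation: the bandwidth is given in bit/s (as in the console chart), "smaller than the network bandwidth" is read as "smaller than the bytes transferable in one second", and an arbitrary 80% headroom factor is applied. The function name fetch_limits is hypothetical.

```python
def fetch_limits(bandwidth_bits_per_sec, subscribed_partitions, headroom_percent=80):
    """Return candidate (fetch.max.bytes, max.partition.fetch.bytes) in bytes."""
    if subscribed_partitions < 1:
        raise ValueError("at least one subscribed partition is required")
    # The monitoring chart reports bit/s; the fetch parameters are in bytes.
    bandwidth_bytes = bandwidth_bits_per_sec // 8
    # Keep both values below the limit by an assumed headroom factor.
    fetch_max_bytes = bandwidth_bytes * headroom_percent // 100
    per_partition = (bandwidth_bytes // subscribed_partitions) * headroom_percent // 100
    return fetch_max_bytes, per_partition


# Hypothetical instance: 8 Mbit/s of bandwidth, 12 subscribed partitions.
print(fetch_limits(8_000_000, 12))  # (800000, 66666)
```

Treat the output as a starting point and confirm it against the traffic charts after the change.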

Known issues with the Sarama Go client

The Sarama library for Go has known compatibility problems with ApsaraMQ for Kafka:

  • Partition detection: After a partition is added to a topic, the client may not detect it. A restart is required to consume from the new partition.

  • Protocol compliance: Several Sarama protocols deviate from Apache Kafka community standards. During broker exceptions, this can cause:

    • The OutOfRange mechanism to trigger. If the offset reset policy is set to Oldest (earliest), the consumer re-reads all messages from the earliest offset.

    • The client to remain stuck in a Rebalance state.

Recommended action

Replace the Sarama client with the Confluent Go client (confluent-kafka-go). For a demo, see kafka-confluent-go-demo.

Interim workarounds

If you cannot replace the client immediately:

  • In production, set the offset reset policy to Newest (latest). Use Oldest (earliest) only in debugging environments where duplicate messages are acceptable.

  • If an offset reset causes a large message backlog, reset the consumer offset to a specific point in time through the ApsaraMQ for Kafka console. This avoids code changes and consumer group changes. See Reset consumer offsets.

Message accumulation

Message accumulation occurs when a consumer group's committed offset falls behind the broker's high-water mark (the latest produced offset). The gap between these two values is the accumulated message count.

A large count alone does not indicate a problem. What matters is the trend: is the gap stable, growing, or caused by uncommitted offsets?

How consumption works

Each consumer follows a two-phase cycle:

  1. Pull: Fetch messages from the broker.

  2. Process: Run business logic on each message, then commit the consumer offset back to the broker.

[Figure: Consumption cycle]

The accumulated message count equals: high-water mark - committed offset.
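
The formula applies per partition, and the group-level figure is the sum over partitions. The sketch below shows the arithmetic with made-up offsets. The function name and the dictionary layout (partition number mapped to offset) are assumptions for illustration, not a console or client API.

```python
def accumulated_messages(high_water_marks, committed_offsets):
    """Per-partition backlog: high-water mark minus committed offset."""
    backlog = {}
    for partition, hwm in high_water_marks.items():
        committed = committed_offsets[partition]
        # Clamp at zero in case the two values were sampled at different times.
        backlog[partition] = max(hwm - committed, 0)
    return backlog


# Hypothetical offsets for a two-partition topic.
per_partition = accumulated_messages({0: 1500, 1: 980}, {0: 1200, 1: 980})
print(per_partition)                # {0: 300, 1: 0}
print(sum(per_partition.values()))  # 300
```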

Diagnose accumulation

Check consumer group metrics in the ApsaraMQ for Kafka console:

  1. Log in to the ApsaraMQ for Kafka console.

  2. In the top navigation bar, select the region of your instance.

  3. In the left-side navigation pane, click Instances.

  4. Click the name of your instance.

  5. In the left-side navigation pane, click Groups.

  6. Find the target group and choose More > Consumer Status in the Actions column.

  7. Check the Last Consumed At, Accumulated Messages, and Consumer Offset values.

Note

These values refresh at 1-minute intervals. Click Details to view per-partition consumer offsets.

Use the following table to interpret what you see:

| Symptom | Diagnosis | Action |
| --- | --- | --- |
| Last Consumed At is close to the current time, and Accumulated Messages fluctuates within a stable range. | Normal. The consumer is keeping pace with production. | No action required. |
| Accumulated Messages keeps increasing, and Consumer Offset stays unchanged. | Abnormal. The consumer thread is blocked and has stopped committing offsets. | See Resolve abnormal accumulation. |
| Accumulated Messages keeps increasing, but Consumer Offset is advancing. | Abnormal. The consumer is processing messages, but slower than the production rate. The bottleneck is in the processing phase. | See Resolve abnormal accumulation. |
| Messages appear accumulated, but downstream processing is working fine. | Likely a false positive. If the consumer uses the assign mode, offsets are managed manually and may not be committed even though messages have been consumed. | Commit offsets manually to clear the reported accumulation. |

Note

The displayed accumulation count depends on the production rate and offset commit frequency. For example, if a topic receives 10,000 messages per second and offsets are committed once per second, a fluctuation around 10,000 is normal.
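
The first three rows of the table can be expressed as a small decision rule over successive readings. The sketch below is one possible reading of the table, not an official diagnostic: the function name, the sample format, and the choice to treat "keeps increasing" as strictly increasing across every sample are all assumptions.

```python
def diagnose(samples):
    """Classify readings of (accumulated_messages, consumer_offset), oldest first."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to see a trend")
    backlog = [accumulated for accumulated, _ in samples]
    offsets = [offset for _, offset in samples]
    # "Keeps increasing" is interpreted as strictly increasing between samples.
    growing = all(later > earlier for earlier, later in zip(backlog, backlog[1:]))
    if not growing:
        return "normal"
    if offsets[-1] == offsets[0]:
        return "blocked"  # backlog grows while the offset stays unchanged
    return "slow"         # backlog grows although the offset advances


# Hypothetical readings taken one minute apart.
print(diagnose([(100, 500), (200, 500), (300, 500)]))  # blocked
print(diagnose([(100, 500), (250, 600), (400, 700)]))  # slow
print(diagnose([(100, 500), (90, 700), (110, 900)]))   # normal
```

Because the console values refresh once per minute, samples taken more often than that add no information.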

Resolve abnormal accumulation

Identify the bottleneck

  • Blocked thread: If Consumer Offset is not advancing, the consumer thread is likely stuck. Use jstack (for Java applications) to capture a thread dump and locate the blocking point. See jstack - Stack Trace.

  • Slow processing: If Consumer Offset is advancing but falling behind production, profile the message-processing logic in your application. Look for slow I/O calls, database writes, or blocking operations.

Increase the consumption rate

  • Add consumers: Add more consumer instances to the same consumer group, either as additional threads in an existing process or as separate processes. Each consumer handles one or more partitions. If the number of consumers already equals or exceeds the number of partitions, additional consumers remain idle.

  • Use multi-threaded consumption: Process messages in parallel within each consumer. For implementation details, see the "Increase consumption rate" section in Best practices for consumers.
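
To see why consumers beyond the partition count stay idle, consider a simplified round-robin distribution. This is an illustration only: it does not reproduce the assignment strategy that the broker or client actually uses, and the function and consumer names are hypothetical.

```python
def assign_round_robin(partitions, consumers):
    """Distribute partitions over consumers one at a time (simplified model)."""
    if not consumers:
        raise ValueError("at least one consumer is required")
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


# Three partitions shared by four consumers: the fourth gets nothing.
result = assign_round_robin([0, 1, 2], ["c1", "c2", "c3", "c4"])
print(result)  # {'c1': [0], 'c2': [1], 'c3': [2], 'c4': []}
idle = [consumer for consumer, owned in result.items() if not owned]
print(idle)    # ['c4']
```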

Note

In most cases, abnormal accumulation stems from slow processing or a blocked thread. Avoid long timeout values in consumption logic, because a single slow call can then stall the consumer thread for the full timeout.

Check for rebalances

If messages are accumulating and consumer status appears abnormal in the console, the consumer group may be rebalancing. No messages are consumed during a rebalance.

Frequent rebalances are typically caused by consumers repeatedly joining and leaving the group. For troubleshooting steps, see Frequent consumer rebalances in this guide.

"Not authorized to access group" error

This error means the consumer group does not exist. ApsaraMQ for Kafka requires consumer groups to be created before use, unlike open-source Apache Kafka where groups are created automatically.

Create the consumer group using either method:

References