Diagnose and resolve common consumer-side problems in ApsaraMQ for Kafka, including connection failures, frequent rebalances, slow message pulls, message accumulation, and authorization errors.
First-time connection failures
When a client fails to connect to ApsaraMQ for Kafka for the first time, work through these checks in order.
Network connectivity
Most first-time connection failures are network-related:
VPC mismatch: The client's Elastic Compute Service (ECS) instance must reside in the same virtual private cloud (VPC) as the ApsaraMQ for Kafka instance. For setup instructions, see Purchase and deploy a VPC-connected instance.
Wrong access method: A VPC-connected ApsaraMQ for Kafka instance cannot be accessed over the Internet. Use the VPC endpoint, not the public endpoint.
Missing whitelist entry: Add the client's IP address to the instance whitelist. See Configure whitelists.
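Before changing client settings, it can help to confirm that the endpoint is reachable at the TCP level from the client machine. The following is a minimal sketch, not an official diagnostic tool. The function name can_reach is illustrative, and the host and port must be taken from the endpoint shown in the console.

```python
import socket


def can_reach(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

A False result suggests a VPC, access-method, or whitelist problem rather than a client configuration problem. A True result only shows that the port is open. It does not verify authentication or protocol compatibility.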
Client version
Use a client version that matches the ApsaraMQ for Kafka instance version. For example, if the instance runs version 0.10.2.2, use a client of the same version. For client demos, see aliware-kafka-demos.
The instance version takes precedence. Always match the client version to it.
Endpoint and permissions
Confirm these settings:
The default endpoint is valid and points to the correct access method (VPC or Internet).
The Resource Access Management (RAM) user has the required permissions.
For a complete setup walkthrough, see Getting started overview.
Frequent consumer rebalances
Frequent rebalances usually point to one of three causes: slow message processing, misconfigured consumer parameters, or an outdated client version.
How rebalances are triggered
The trigger mechanism depends on the client version:
| Client version | Trigger mechanism |
|---|---|
| Earlier than 0.10.2 | No separate heartbeat thread. Heartbeats are sent through the poll() method. If message processing takes too long, the heartbeat times out and triggers a rebalance. |
| 0.10.2 or later | A separate heartbeat thread exists. However, if poll() is not called within the max.poll.interval.ms interval (default: 5 minutes), the client leaves the consumer group and triggers a rebalance. |
Key parameters
| Parameter | Applies to | Description |
|---|---|---|
| session.timeout.ms | All versions | Session timeout. If the broker receives no heartbeat within this period, it removes the consumer from the group. A higher value tolerates longer pauses but delays detection of crashed consumers. |
| max.poll.records | All versions | Maximum number of messages returned per poll() call. |
| max.poll.interval.ms | 0.10.2 or later | Maximum allowed interval between consecutive poll() calls. Exceeding this causes the consumer to leave the group. |
Solution
1. Tune parameters
Configure these parameters based on your client version:
session.timeout.ms
| Client version | Recommended value |
|---|---|
| Earlier than 0.10.2 | Larger than the time to process one batch of messages, but no greater than 30 seconds. 25 seconds works well for most cases. |
| 0.10.2 or later | Keep the default (10 seconds). |
A larger session.timeout.ms value tolerates longer processing times and network hiccups, but it also means the broker takes longer to detect a crashed consumer and reassign its partitions.
max.poll.records
Set this value well below the result of:
max.poll.records << messages_per_thread_per_second * number_of_threads * max.poll.interval.ms
max.poll.interval.ms (0.10.2 or later only)
Set this value above the result of:
max.poll.interval.ms > max.poll.records / (messages_per_thread_per_second * number_of_threads)
In both formulas, express max.poll.interval.ms in seconds so that it matches the per-second processing rate.
2. Offload message processing to a separate thread
Move business logic to a dedicated processing thread so that slow operations do not block heartbeats or delay poll() calls.
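The pattern can be sketched as follows. This is an illustration under stated assumptions, not a client implementation: poll and handle are hypothetical stand-ins for the client's poll() call and for the business logic, and the sketch omits offset commits entirely.

```python
import queue
import threading


def run_consumer(poll, handle, max_batches, queue_size=100):
    """Poll in the calling thread and run handle() on a worker thread.

    poll and handle are hypothetical stand-ins for the Kafka client's
    poll() call and for the application's business logic.
    """
    work = queue.Queue(maxsize=queue_size)  # bounded so polling cannot run far ahead
    stop = object()  # sentinel that tells the worker to exit

    def worker():
        while True:
            record = work.get()
            if record is stop:
                break
            handle(record)

    thread = threading.Thread(target=worker)
    thread.start()
    for _ in range(max_batches):
        for record in poll():
            work.put(record)  # blocks only while the queue is full
    work.put(stop)
    thread.join()
```

The bounded queue is a design choice: it keeps memory use predictable, but a full queue still delays the next poll() call. A real consumer would also need to commit offsets only after the worker has finished the corresponding messages.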
3. Limit topics per consumer group
Subscribe to no more than five topics per consumer group. For best stability, subscribe to one topic per group.
4. Upgrade to version 0.10.2 or later
Versions 0.10.2 and later use a separate heartbeat thread, which prevents processing delays from causing heartbeat timeouts.
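The sizing rules for max.poll.records and max.poll.interval.ms in step 1 can be checked with a small calculation. The sketch below assumes integer inputs and a measured per-thread processing rate. The function names are illustrative and are not part of any Kafka client.

```python
import math


def max_records_ceiling(msgs_per_thread_per_s, threads, max_poll_interval_ms):
    """Messages the application can process within one max.poll.interval.ms.

    max.poll.records should stay well below this value.
    """
    return msgs_per_thread_per_s * threads * max_poll_interval_ms // 1000


def min_poll_interval_ms(max_poll_records, msgs_per_thread_per_s, threads):
    """Milliseconds needed to process one full batch of max.poll.records.

    max.poll.interval.ms should stay above this value.
    """
    return math.ceil(max_poll_records * 1000 / (msgs_per_thread_per_s * threads))
```

For example, with 4 threads that each process 50 messages per second and the default interval of 300,000 ms, the ceiling is 60,000 messages, and a batch of 500 messages needs at least 2,500 ms.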
Slow or failed message pulls
If a consumer pulls messages slowly or fails to pull them entirely -- even though the subscribed topics contain messages and the consumer has not caught up to the latest offset -- the issue is almost always bandwidth-related. This occurs more frequently over Internet connections.
Possible causes
Consumption traffic has saturated the network bandwidth.
A single message exceeds the available bandwidth.
The batch of messages pulled in one poll() call exceeds the available bandwidth.
Three parameters control how much data a consumer pulls per poll() call:
max.poll.records: maximum number of messages per poll.
fetch.max.bytes: maximum total bytes per poll.
max.partition.fetch.bytes: maximum bytes per partition per poll.
Diagnose and fix
Log in to the ApsaraMQ for Kafka console and query messages for the topic. If messages are returned, proceed to the next step.
In the left-side navigation pane of the Instances page, choose Observability > CloudMonitor. On the Monitoring Chart tab, check the instance_internet_rx.rate(bit/s) chart to determine whether consumption traffic has hit the bandwidth cap. If it has, increase the network bandwidth.
Check whether individual messages are too large for the available bandwidth. If so, increase the bandwidth or reduce the message size.
Check whether the total size of messages pulled per poll() call exceeds the bandwidth. If so, adjust these parameters:
fetch.max.bytes: set to a value smaller than the network bandwidth.
max.partition.fetch.bytes: set to a value smaller than network bandwidth / number of subscribed partitions.
For VPC access, "network bandwidth" refers to the maximum write traffic of the instance's elastic network interfaces (ENIs).
For Internet access, "network bandwidth" refers to the Internet bandwidth allocated to the instance.
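As a rough illustration of the two limits above, the following sketch derives candidate values from a bandwidth figure. It rests on assumptions that the source does not state: the bandwidth is given in bit/s (the unit of the monitoring chart), one second of bandwidth is treated as the budget for one poll, and the 80% headroom is an arbitrary safety margin. The function name is hypothetical.

```python
def suggest_fetch_limits(bandwidth_bits_per_s, subscribed_partitions, headroom_percent=80):
    """Return (fetch.max.bytes, max.partition.fetch.bytes) candidate values.

    headroom_percent is an assumed safety margin, not a documented value.
    """
    bytes_per_s = bandwidth_bits_per_s // 8  # convert bits to bytes
    fetch_max_bytes = bytes_per_s * headroom_percent // 100
    max_partition_fetch_bytes = fetch_max_bytes // subscribed_partitions
    return fetch_max_bytes, max_partition_fetch_bytes
```

With 8 Mbit/s of bandwidth and 4 subscribed partitions, this yields 800,000 bytes for fetch.max.bytes and 200,000 bytes for max.partition.fetch.bytes. Treat the results as starting points to validate against the monitoring chart.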
Known issues with the Sarama Go client
The Sarama library for Go has known compatibility problems with ApsaraMQ for Kafka:
Partition detection: After a partition is added to a topic, the client may not detect it. A restart is required to consume from the new partition.
Protocol compliance: Several Sarama protocols deviate from Apache Kafka community standards. During broker exceptions, this can cause:
The OutOfRange mechanism to trigger. If the offset reset policy is set to Oldest (earliest), the consumer re-reads all messages from the earliest offset.
The client to remain stuck in a Rebalance state.
Recommended action
Replace the Sarama client with the Confluent Go client (confluent-kafka-go). For a demo, see kafka-confluent-go-demo.
Interim workarounds
If you cannot replace the client immediately:
In production, set the offset reset policy to Newest (latest). Use Oldest (earliest) only in debugging environments where duplicate messages are acceptable.
If an offset reset causes a large message backlog, reset the consumer offset to a specific point in time through the ApsaraMQ for Kafka console. This avoids code changes and consumer group changes. See Reset consumer offsets.
Message accumulation
Message accumulation occurs when a consumer group's committed offset falls behind the broker's high-water mark (the latest produced offset). The gap between these two values is the accumulated message count.
A large count alone does not indicate a problem. What matters is the trend: is the gap stable, growing, or caused by uncommitted offsets?
How consumption works
Each consumer follows a two-phase cycle:
Pull: Fetch messages from the broker.
Process: Run business logic on each message, then commit the consumer offset back to the broker.
The accumulated message count equals: high-water mark - committed offset.
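The formula applies per partition, and the group total is the sum across partitions. A minimal sketch, assuming both values are available as dictionaries keyed by partition number (the function name and the input shape are illustrative):

```python
def accumulated_messages(high_water_marks, committed_offsets):
    """Per-partition gap between the high-water mark and the committed offset."""
    return {
        partition: hwm - committed_offsets[partition]
        for partition, hwm in high_water_marks.items()
    }
```

For example, high-water marks of {0: 1500, 1: 900} and committed offsets of {0: 1200, 1: 900} give {0: 300, 1: 0}, for a total of 300 accumulated messages.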
Diagnose accumulation
Check consumer group metrics in the ApsaraMQ for Kafka console:
Log in to the ApsaraMQ for Kafka console.
In the top navigation bar, select the region of your instance.
In the left-side navigation pane, click Instances.
Click the name of your instance.
In the left-side navigation pane, click Groups.
Find the target group and choose More > Consumer Status in the Actions column.
Check the Last Consumed At, Accumulated Messages, and Consumer Offset values.
These values refresh at 1-minute intervals. Click Details to view per-partition consumer offsets.
Use the following table to interpret what you see:
| Symptom | Diagnosis | Action |
|---|---|---|
| Last Consumed At is close to the current time, and Accumulated Messages fluctuates within a stable range. | Normal. The consumer is keeping pace with production. | No action required. |
| Accumulated Messages keeps increasing, and Consumer Offset stays unchanged. | Abnormal. The consumer thread is blocked and has stopped committing offsets. | See Resolve abnormal accumulation. |
| Accumulated Messages keeps increasing, but Consumer Offset is advancing. | Abnormal. The consumer is processing messages, but slower than the production rate. The bottleneck is in the processing phase. | See Resolve abnormal accumulation. |
| Messages appear accumulated, but downstream processing is working fine. | Likely a false positive. If the consumer uses the assign mode, offsets are managed manually and may not be committed even though messages have been consumed. | Commit offsets manually to clear the reported accumulation. |
The displayed accumulation count depends on the production rate and offset commit frequency. For example, if a topic receives 10,000 messages per second and offsets are committed once per second, a fluctuation around 10,000 is normal.
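The first three rows of the table can be expressed as a simple check over several consecutive console readings. This is a sketch only. The input shape, the function name, and the returned labels are assumptions, and the assign-mode false positive in the last row cannot be detected from these two values alone.

```python
def classify(samples):
    """Classify readings of (accumulated_messages, consumer_offset), oldest first."""
    if len(samples) < 2:
        return "not enough data"
    # "Keeps increasing" is interpreted here as strictly growing across all readings.
    growing = all(later[0] > earlier[0] for earlier, later in zip(samples, samples[1:]))
    if not growing:
        return "stable or fluctuating"
    advancing = samples[-1][1] > samples[0][1]
    return "slow processing" if advancing else "blocked consumer"
```

Because the console values refresh at 1-minute intervals, readings taken at least a minute apart are needed for the comparison to be meaningful.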
Resolve abnormal accumulation
Identify the bottleneck
Blocked thread: If Consumer Offset is not advancing, the consumer thread is likely stuck. Use jstack (for Java applications) to capture a thread dump and locate the blocking point. See jstack - Stack Trace.
Slow processing: If Consumer Offset is advancing but falling behind production, profile the message-processing logic in your application. Look for slow I/O calls, database writes, or blocking operations.
Increase the consumption rate
Add consumers: Add more consumer instances to the same consumer group -- either as additional threads in an existing process or as separate processes. Each consumer handles one or more partitions. If the number of consumers already equals or exceeds the number of partitions, additional consumers remain idle.
Use multi-threaded consumption: Process messages in parallel within each consumer. For implementation details, see the "Increase consumption rate" section in Best practices for consumers.
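To see why consumers beyond the partition count stay idle, consider how partitions are divided within a group. The sketch below assumes an even split. The actual assignment depends on the partition assignment strategy the client is configured with, and the function name is illustrative.

```python
def spread_partitions(partitions, consumers):
    """Return how many partitions each consumer would own under an even split."""
    base, extra = divmod(partitions, consumers)
    return [base + 1 if i < extra else base for i in range(consumers)]
```

With 6 partitions, 4 consumers would own [2, 2, 1, 1] partitions, while 8 consumers would own [1, 1, 1, 1, 1, 1, 0, 0], leaving two consumers with nothing to read.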
In most cases, abnormal accumulation stems from slow processing or a blocked thread. Avoid setting long durations for timeout-related parameters in consumption logic.
Check for rebalances
If messages are accumulating and consumer status appears abnormal in the console, the consumer group may be rebalancing. No messages are consumed during a rebalance.
Frequent rebalances are typically caused by consumers repeatedly joining and leaving the group. For troubleshooting steps, see Frequent consumer rebalances in this guide.
"Not authorized to access group" error
This error means the consumer group does not exist. ApsaraMQ for Kafka requires consumer groups to be created before use, unlike open-source Apache Kafka where groups are created automatically.
Create the consumer group using either method:
Console: Follow the "Step 2: Create a group" section in Step 3: Create resources.
API: Call the CreateConsumerGroup operation.