Kafka Interview Essentials: the acks Mechanism, Consumption Failures, and Duplicate Messages, Explained in One Go

Originally published 2026-04-23 11:24:08

This content is licensed under CC 4.0 BY-SA

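Interviewer: About consuming from Kafka: what do you do when consumption fails?

Me: Our Kafka is set up for at-least-once delivery, so if consumption fails the message gets redelivered. As long as the downstream is idempotent, that is fine.

Interviewer: Are you sure the first thing that happens in Kafka isn't the consumer retrying after its first failed attempt?

Me: Yes, but when the consumer receives a message from the broker it sends an ack. If the downstream ack fails, the message goes into a retry queue and is redelivered.

Interviewer: The redelivery you are describing is the producer's business; retrying is the consumer's!

Clearly I had conflated two concepts: the producer resending a message to the broker, and the consumer re-consuming a message from the broker.

Searching online, I could not find anyone who explained this part clearly, so I wrote this article in the hope that it helps.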

Kafka architecture diagram

The following is quoted from the official Kafka documentation: Design | Apache Kafka

It’s worth noting that this breaks down into two problems:
the durability guarantees for publishing a message
the guarantees when consuming a message.

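Message reliability splits into two parts:

Durability guarantees when the producer publishes a message
Processing guarantees when the consumer consumes a message

Durability guarantees when the producer publishes a message

Producer send acknowledgment (ACK)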

Kafka’s semantics are straightforward. When publishing a message we have a notion of the message being “committed” to the log. A message is considered committed only when all replicas in the in-sync replicas (ISR) for that partition have applied it to their log. Once a published message is committed, it will not be lost as long as one broker that replicates the partition to which this message was written remains “alive”.

In plain terms:

After the producer sends a message to the broker, the partition leader writes it to its log and the follower replicas copy it. Once every replica in the ISR has the message, it counts as committed, and only then does the broker return an ACK to the producer. This is the behavior of acks=all (equivalent to -1) described below.

Not all use cases require such strong guarantees. For use cases which are latency-sensitive, we allow the producer to specify the durability level it desires. If the producer specifies that it wants to wait on the message being committed, this can take on the order of 10 ms. However the producer can also specify that it wants to perform the send completely asynchronously or that it wants to wait only until the leader (but not necessarily the followers) have the message.

In plain terms: not every scenario needs this level of reliability. If low latency matters more, the producer does not have to wait for every in-sync replica: it can wait for the leader only (acks=1 below) or not wait at all (acks=0 below).

The acks settings in detail:

acks=0 If set to zero then the producer will not wait for any acknowledgment from the server at all. The record will be immediately added to the socket buffer and considered sent. No guarantee can be made that the server has received the record in this case, and the retries configuration will not take effect (as the client won't generally know of any failures). The offset given back for each record will always be set to -1.
acks=1 This will mean the leader will write the record to its local log but will respond without awaiting full acknowledgement from all followers. In this case should the leader fail immediately after acknowledging the record but before the followers have replicated it then the record will be lost.
acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.

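In short:

acks=0: the producer does not wait for any acknowledgment. The record counts as sent once it is in the socket buffer, the retries setting has no effect, and the returned offset is always -1.
acks=1: the leader acknowledges after writing to its own log, without waiting for the followers. If the leader fails right after acknowledging and before the followers have replicated the record, the record is lost.
acks=all: the leader waits for the full set of in-sync replicas to acknowledge. The record survives as long as at least one in-sync replica stays alive. This is the strongest guarantee and is equivalent to acks=-1.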

Kafka 3.0 and later (3.0 was released in September 2021): default acks=all (equivalent to -1)
Kafka 2.x and earlier (0.8.x through 2.8): default acks=1
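To make the difference between the acks levels concrete, here is a minimal sketch of a toy model, not Kafka client code. The function names `produce` and `records_after_leader_failure` and the `followers_caught_up` flag are made up for illustration; the flag stands in for whether the in-sync followers have already replicated the record when the leader responds.

```python
def produce(leader_log, follower_logs, record, acks, followers_caught_up):
    """Toy model of one produce request; returns True if the producer is acked."""
    leader_log.append(record)              # the leader always writes first
    if followers_caught_up:
        for log in follower_logs:
            log.append(record)             # every in-sync follower copies the record
    if acks == 0:
        return False                       # the producer never waits for a response
    if acks == 1:
        return True                        # the leader's own write is enough
    if acks in ("all", -1):
        return followers_caught_up         # acked only once the whole ISR has it
    raise ValueError(f"unsupported acks value: {acks!r}")


def records_after_leader_failure(follower_logs):
    """If the leader dies, a follower takes over; only that follower's log survives."""
    return list(follower_logs[0])


# acks=1: the producer is acked, but the record is gone if the leader dies now.
followers = [[]]
acked = produce([], followers, "m1", acks=1, followers_caught_up=False)
lost = records_after_leader_failure(followers) == []
```

With acks="all" and followers that have not caught up, the same call returns False in this model, so the producer knows the record is not yet safe and can retry.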

Producer resend on send failure

Retries work at batch granularity: when a batch fails, the whole batch is retried, not just some of the records in it.

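Prior to 0.11.0.0, if a producer failed to receive a response indicating that a message was committed, it had little choice but to resend the message. This provides at-least-once delivery semantics since the message may be written to the log again during resending if the original request had in fact succeeded.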

retries：Setting a value greater than zero will cause the client to resend any record whose send fails with a potentially transient error. Note that this retry is no different than if the client resent the record upon receiving the error.

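Retries target records whose send failed, but the batch is the smallest unit of sending and retrying. The broker generally does not report success or failure per record, so the whole batch is retried.

How long does the producer wait for an ACK before retrying? What is the timeout?

Official setting: request.timeout.ms

Default: 30 seconds

How many retries? How long between them?

Number of retries: retries. Default: 0 (no retries) up to 2.0; 2147483647 from 2.1 onward. In practice retrying is not endless: the total time for a send, including retries, is capped by delivery.timeout.ms (default 2 minutes).

Retry interval: retry.backoff.ms; default: 100 ms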
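How these settings interact can be sketched with a simplified timing model. This is an assumption-laden illustration, not the real client's algorithm: every failed attempt is treated as a full request.timeout.ms wait, batching and linger time are ignored, and `send_with_retries` and its `attempt` callback are hypothetical names. The default values mirror the documented defaults of request.timeout.ms, retry.backoff.ms, and delivery.timeout.ms.

```python
def send_with_retries(attempt, retries, retry_backoff_ms=100,
                      request_timeout_ms=30_000, delivery_timeout_ms=120_000):
    """Simulate one send. attempt(n) returns True if attempt number n is acked.

    Returns (outcome, attempts_made, elapsed_ms) on a simulated clock.
    """
    elapsed_ms = 0
    attempts = 0
    while True:
        attempts += 1
        if attempt(attempts):
            return "acked", attempts, elapsed_ms
        elapsed_ms += request_timeout_ms       # waited the full timeout for an ACK
        if attempts > retries:
            return "retries_exhausted", attempts, elapsed_ms
        if elapsed_ms + retry_backoff_ms >= delivery_timeout_ms:
            return "delivery_timeout", attempts, elapsed_ms
        elapsed_ms += retry_backoff_ms         # pause before the next attempt


# A send that is only acknowledged on the third attempt.
outcome = send_with_retries(lambda n: n == 3, retries=5)
```

In this model, even with retries set to 2147483647, a broker that never answers makes the send give up after four attempts, because the delivery timeout is reached first.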

Since 0.11.0.0, the Kafka producer also supports an idempotent delivery option which guarantees that resending will not result in duplicate entries in the log. To achieve this, the broker assigns each producer an ID and deduplicates messages using a sequence number that is sent by the producer along with every message. Also beginning with 0.11.0.0, the producer supports the ability to send messages atomically to multiple topic partitions using transactions, so that either all messages are successfully written or none of them are.

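In plain terms: the broker assigns every producer a producer ID (PID), and each message carries that PID plus a sequence number, so the broker can deduplicate resends from that producer. Transactions additionally let a producer write to multiple partitions so that the writes either all succeed or all fail. Deduplication requires enable.idempotence=true (the default since Kafka 3.0); transactions additionally require a transactional.id.

So, to correct a misconception: duplicate messages can come from network trouble on either hop, producer to broker or broker to consumer.
With enable.idempotence=true, producer retries (within a single producer session) no longer write duplicates to the log, so any remaining duplicates do not come from the producer-to-broker hop.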
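The deduplication by PID and sequence number described above can be sketched as a toy broker. This is a deliberate simplification: the real broker tracks sequence numbers per producer and partition and works on batches, and `ToyBroker` with its methods is a hypothetical class written only for this illustration.

```python
class ToyBroker:
    """Toy broker that deduplicates resends by (producer ID, sequence number)."""

    def __init__(self):
        self.log = []
        self.next_pid = 0
        self.last_seq = {}                 # pid -> last sequence number appended

    def register_producer(self):
        """Hand out a fresh producer ID, as the broker does for each producer."""
        pid = self.next_pid
        self.next_pid += 1
        self.last_seq[pid] = -1
        return pid

    def append(self, pid, seq, value):
        """Append value unless this (pid, seq) was already written."""
        if seq <= self.last_seq[pid]:
            return "duplicate"             # already in the log: ack again, do not append
        if seq != self.last_seq[pid] + 1:
            raise ValueError("out-of-order sequence number")
        self.log.append(value)
        self.last_seq[pid] = seq
        return "appended"


broker = ToyBroker()
pid = broker.register_producer()
broker.append(pid, 0, "order-created")     # first send; suppose the ACK is lost
broker.append(pid, 0, "order-created")     # the retry carries the same sequence number
broker.append(pid, 1, "order-paid")
```

After these three calls the log holds each message once, even though the first one was sent twice.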

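Processing guarantees when the consumer consumes a message

Background: an offset is a position in a partition's log. The committed offset, which Kafka stores on the broker side, records how far a consumer group has consumed.

You may want to compare this with my article on the consumer-side message acknowledgment mechanism in RocketMQ.

Three modes: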

Now that we understand a little about how producers and consumers work, let’s discuss the semantic guarantees Kafka provides between producer and consumer. Clearly there are multiple possible message delivery guarantees that could be provided:
At most once – Messages may be lost but are never redelivered.
At least once – Messages are never lost but may be redelivered.
Exactly once – Each message is processed once and only once.

These are the three delivery guarantees Kafka can offer. On the consumer side, what separates them is when the consumer commits its position (offset) relative to processing.

The timing of the offset commit decides whether messages can be lost, can be repeated, or are processed exactly once.

At Most Once

It can read the messages, then save its position in the log, and finally process the messages. In this case there is a possibility that the consumer process crashes after saving its position but before saving the output of its message processing. In this case the process that took over processing will start at the saved position even though a few messages prior to that position had not been processed. This corresponds to “at-most-once” semantics as in the case of a consumer failure messages may not be processed.

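In other words, the consumer reads the messages, commits its offset, and only then processes them. If it crashes after the commit but before processing finishes, the process that takes over starts from the committed position, and the messages before that position that were not yet processed never will be. On consumer failure, messages can be lost but are never repeated.

Order of operations: poll messages → commit offset → process
Risk: the consumer crashes right after committing the offset, before the business logic has run.
Result: the offset has already advanced, so Kafka treats the messages as consumed and will not deliver them again.
Characteristics: no duplicates, but messages can be lost.
Suitable for: log collection, monitoring metrics, and similar cases that tolerate a small amount of loss and want the lowest possible latency.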
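The commit-then-process order can be simulated in a few lines. This is a minimal sketch over an in-memory list rather than a real consumer; `run_at_most_once` and its `crash_before_processing` parameter are hypothetical names used to inject a crash at a chosen offset.

```python
def run_at_most_once(log, committed_offset, crash_before_processing=None):
    """Commit first, then process. Returns (committed_offset, processed)."""
    processed = []
    offset = committed_offset
    while offset < len(log):
        committed_offset = offset + 1          # 1. commit the position first
        if offset == crash_before_processing:
            break                              # crash: committed but never processed
        processed.append(log[offset])          # 2. only then run the business logic
        offset += 1
    return committed_offset, processed


log = ["m0", "m1", "m2"]
# First run crashes right after committing past m1.
committed, first_run = run_at_most_once(log, 0, crash_before_processing=1)
# The replacement process resumes from the committed offset.
committed, second_run = run_at_most_once(log, committed)
```

Across both runs only m0 and m2 are processed: m1 was committed but never handled, which is exactly the loss that at-most-once accepts.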

At Least Once

It can read the messages, process the messages, and finally save its position. In this case there is a possibility that the consumer process crashes after processing messages but before saving its position. In this case when the new process takes over the first few messages it receives will already have been processed. This corresponds to the “at-least-once” semantics in the case of a consumer failure.

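In other words, the consumer reads the messages, processes them, and only then commits its offset. If it crashes after processing but before the commit, the process that takes over receives a few messages that were already processed. On consumer failure, messages are never lost but can be delivered again.

The same applies when processing fails: consumption fails → the offset is not committed → after a restart or rebalance, consumption resumes from the last committed offset → the message is consumed again, and this repeats for as long as processing keeps failing.

Order of operations: poll messages → process → commit offset
Risk: the business logic succeeds, but the consumer crashes before it can commit the offset.
Result: the message has been processed but the offset was not updated, so after a restart it is fetched and processed again.
Characteristics: messages are never lost, but can be repeated.
Deduplication: make the business side idempotent (unique keys, distributed locks, state checks that act as guards, Redis markers, and so on).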
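The process-then-commit order, together with business-side idempotency, can be sketched the same way. This is an in-memory illustration, not a real consumer: `run_at_least_once` and `IdempotentHandler` are hypothetical names, and the in-process set of seen IDs stands in for a unique key or a Redis marker.

```python
def run_at_least_once(log, committed_offset, handle, crash_before_commit=None):
    """Process first, then commit. Returns the committed offset."""
    offset = committed_offset
    while offset < len(log):
        handle(log[offset])                    # 1. run the business logic first
        if offset == crash_before_commit:
            break                              # crash: processed but not committed
        committed_offset = offset + 1          # 2. only then commit the position
        offset += 1
    return committed_offset


class IdempotentHandler:
    """Applies each message ID once, however many times it is delivered."""

    def __init__(self):
        self.seen = set()
        self.applied = []
        self.deliveries = 0

    def __call__(self, message):
        self.deliveries += 1
        msg_id, payload = message
        if msg_id in self.seen:
            return                             # duplicate delivery: skip the side effect
        self.seen.add(msg_id)
        self.applied.append(payload)


log = [("id-0", "a"), ("id-1", "b"), ("id-2", "c")]
handler = IdempotentHandler()
committed = run_at_least_once(log, 0, handler, crash_before_commit=1)
committed = run_at_least_once(log, committed, handler)   # restart from committed offset
```

The handler sees four deliveries for three messages, because id-1 arrives twice, but its effect is applied only once.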

Exactly Once

So what about exactly-once semantics? When consuming from a Kafka topic and producing to another topic (as in a Kafka Streams application), we can leverage the new transactional producer capabilities in 0.11.0.0... The consumer’s position is stored as a message in an internal topic, so we can write the offset to Kafka in the same transaction as the output topics...
In the default “read_uncommitted” isolation level, all messages are visible to consumers even if they were part of an aborted transaction, but in “read_committed” isolation level, the consumer will only return messages from transactions which were committed.

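In plain terms: when an application consumes from one Kafka topic and writes its results to another (a Kafka Streams application, for example), it can use the transactional producer introduced in 0.11.0.0. The consumer's offset is itself stored as a message in an internal topic, so the offset commit can go into the same transaction as the writes to the output topics.

Under the default read_uncommitted isolation level, consumers see all messages, including those from aborted transactions; under read_committed, they only get messages from committed transactions.

How it works: the consumed offsets and the produced results are committed atomically. Writing the output and committing the offset either both succeed or both fail, with no intermediate state.
Scope: mainly Kafka → Kafka pipelines (Kafka Streams, transactional producer).
What about external systems? According to the official documentation, when writing to an external system the consumer can store its offset in the same place as its output, so that the two are updated atomically. For example, store the business data and the offset in MySQL and commit both in one transaction. This requires a storage layer that supports transactions.
Key settings: set transactional.id on the producer to enable transactions; set isolation.level=read_committed on the consumer.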
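For the external-system case, storing the offset next to the business data can be sketched with SQLite from the Python standard library standing in for MySQL. The table names, `open_store`, `process`, `resume_offset`, and the `fail` flag are all assumptions made for this illustration. A real consumer would also need to turn off automatic offset commits and seek to the stored offset on startup.

```python
import sqlite3


def open_store():
    """Create an in-memory database holding both results and offsets."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE results (msg_offset INTEGER PRIMARY KEY, value TEXT)")
    conn.execute("CREATE TABLE offsets (partition_id INTEGER PRIMARY KEY, next_offset INTEGER)")
    conn.commit()
    return conn


def process(conn, partition_id, msg_offset, value, fail=False):
    """Write the result and the new offset in one database transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO results VALUES (?, ?)", (msg_offset, value.upper()))
        if fail:
            raise RuntimeError("simulated crash before commit")
        conn.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                     (partition_id, msg_offset + 1))


def resume_offset(conn, partition_id):
    """Offset to resume from after a restart (0 if nothing was stored yet)."""
    row = conn.execute("SELECT next_offset FROM offsets WHERE partition_id = ?",
                       (partition_id,)).fetchone()
    return row[0] if row else 0


conn = open_store()
process(conn, 0, 0, "a")
try:
    process(conn, 0, 1, "b", fail=True)    # result and offset roll back together
except RuntimeError:
    pass
process(conn, 0, resume_offset(conn, 0), "b")   # redelivery after the "restart"
```

Because the failed attempt rolls back the result row along with the offset, the redelivered message is applied exactly once.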

Official summary

Otherwise, Kafka guarantees at-least-once delivery by default, and allows the user to implement at-most-once delivery by disabling retries on the producer and committing offsets in the consumer prior to processing a batch of messages.

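In other words, Kafka gives you at-least-once delivery by default. To get at-most-once, disable retries on the producer and commit offsets in the consumer before processing each batch of messages.

The three consumer-side semantics essentially come down to one variable: when the offset is committed relative to processing (before it, after it, or atomically with it).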