Usage and Differences of CheckPoint, Cache, and Persist in Spark

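This article introduces the RDD persistence mechanisms in Spark and covers how cache, persist, and checkpoint are used and how they differ. cache and persist keep data in memory or on disk and offer several storage levels, whereas checkpoint writes the RDD to reliable storage such as HDFS and truncates its lineage, which suits long computation chains and durable storage. Persistence speeds up iterative algorithms and interactive applications, and it also provides fault tolerance.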

CheckPoint, Cache, and Persist in Spark

Hi everyone, I'm the tough guy who can smash an A-pillar with a single punch.

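Over the past few days I have been watching the video series 《尚硅谷2021迎新版大数据Spark从入门到精通》, which includes a part on checkpoints (CheckPoint), so I am writing up a recap for you here. The order is: how Spark describes persistence, using Cache, using Persist, and using CheckPoint. Along the way I will explain how the three relate to one another.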

1. How Spark Describes Persistence

On the official Spark website, the following is everything I could find about RDD persistence:

RDD Persistence

One of the most important capabilities in Spark is persisting (or caching) a dataset in memory across operations. When you persist an RDD, each node stores any partitions of it that it computes in memory and reuses them in other actions on that dataset (or datasets derived from it). This allows future actions to be much faster (often by more than 10x). Caching is a key tool for iterative algorithms and fast interactive use.

You can mark an RDD to be persisted using the persist() or cache() methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. Spark’s cache is fault-tolerant – if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

In addition, each persisted RDD can be stored using a different storage level, allowing you, for example, to persist the dataset on disk, persist it in memory but as serialized Java objects (to save space), replicate it across nodes. These levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is a shorthand for using the default storage level, which is StorageLevel.MEMORY_ONLY (store deserialized objects in memory). The full set of storage levels is:

Storage Level	Meaning
MEMORY_ONLY	Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they’re needed. This is the default level.
MEMORY_AND_DISK	Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don’t fit on disk, and read them from there when they’re needed.
MEMORY_ONLY_SER (Java and Scala)	Store RDD as serialized Java objects (one byte array per partition). This is generally more space-efficient than deserialized objects, especially when using a fast serializer, but more CPU-intensive to read.
MEMORY_AND_DISK_SER (Java and Scala)	Similar to MEMORY_ONLY_SER, but spill partitions that don’t fit in memory to disk instead of recomputing them on the fly each time they’re needed.
DISK_ONLY	Store the RDD partitions only on disk.
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc.	Same as the levels above, but replicate each partition on two cluster nodes.
OFF_HEAP (experimental)	Similar to MEMORY_ONLY_SER, but store the data in off-heap memory. This requires off-heap memory to be enabled.

Note: In Python, stored objects will always be serialized with the Pickle library, so it does not matter whether you choose a serialized level. The available storage levels in Python include MEMORY_ONLY, MEMORY_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2, DISK_ONLY, and DISK_ONLY_2.

Spark also automatically persists some intermediate data in shuffle operations (e.g. reduceByKey), even without users calling persist. This is done to avoid recomputing the entire input if a node fails during the shuffle. We still recommend users call persist on the resulting RDD if they plan to reuse it.

The gist of this passage can be summarized in the following points:

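When an RDD is persisted, each node keeps the partitions it computes so that later computations can reuse them.
You can persist an RDD with cache() or persist().
Persisted data can be kept in a node's memory, on a node's disk, or replicated across nodes (in memory or on disk). All in all, the storage level table above has seven rows for developers to choose from.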
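To make these points concrete, here is a minimal sketch of marking RDDs with cache() and with persist() plus an explicit storage level. The object name, the sample data, and the local master are my own assumptions for illustration, not something from the video or the Spark docs.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical object name; not part of the original article.
object StorageLevelSketch {

  /** Marks two RDDs for persistence and returns the storage levels Spark recorded. */
  def recordedLevels(sc: SparkContext): (StorageLevel, StorageLevel) = {
    val nums = sc.parallelize(1 to 100, numSlices = 4)

    // cache() is shorthand for persist(StorageLevel.MEMORY_ONLY).
    val inMemory = nums.map(_ * 2).cache()

    // persist() accepts any storage level from the table, here a serialized
    // level that spills partitions to disk when they do not fit in memory.
    val spillable = nums.map(_ + 1).persist(StorageLevel.MEMORY_AND_DISK_SER)

    // Both calls only mark the RDDs; nothing is stored until an action runs.
    inMemory.count()
    spillable.count()

    (inMemory.getStorageLevel, spillable.getStorageLevel)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("StorageLevelSketch")
    val sc = new SparkContext(conf)
    try {
      val (cacheLevel, persistLevel) = recordedLevels(sc)
      println(s"cache():   ${cacheLevel.description}")
      println(s"persist(): ${persistLevel.description}")
    } finally {
      sc.stop()
    }
  }
}
```

One thing to keep in mind: Spark refuses to change the storage level of an RDD that already has one, so pick the level before the first persist() call, or call unpersist() first.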

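Note: an RDD is a fault-tolerant dataset, and this applies to persistence too. If persisted data is lost or corrupted, Spark can recompute it along the RDD's lineage and persist it again.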
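The note above can be roughly illustrated as follows. This is only a sketch under my own assumptions: unpersist() stands in for "the cached data is gone", and an accumulator (a hypothetical counter I added) records how many times the map function actually runs. In local mode without task retries the counter should be exact; on a real cluster, retried tasks could inflate it.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; not part of the original article.
object LineageRecomputeSketch {

  /** Returns the number of map evaluations observed after each of three actions. */
  def evaluationCounts(sc: SparkContext): (Long, Long, Long) = {
    // Counts how often the map function below is really executed.
    val evaluations = sc.longAccumulator("evaluations")

    val squared = sc.parallelize(Seq(1, 2, 3, 4), numSlices = 2).map { x =>
      evaluations.add(1)
      x * x
    }.cache()

    squared.count()                    // first action: computes and caches
    val afterFirst = evaluations.sum

    squared.count()                    // second action: served from the cache
    val afterSecond = evaluations.sum

    squared.unpersist(blocking = true) // simulate losing the cached partitions
    squared.count()                    // recomputed from the lineage
    val afterThird = evaluations.sum

    (afterFirst, afterSecond, afterThird)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[2]").setAppName("LineageRecomputeSketch")
    val sc = new SparkContext(conf)
    try {
      val (first, second, third) = evaluationCounts(sc)
      println(s"evaluations after 1st action: $first")
      println(s"evaluations after 2nd action: $second")
      println(s"evaluations after unpersist and 3rd action: $third")
    } finally {
      sc.stop()
    }
  }
}
```

The lineage is what makes the third count possible: the cached copy is gone, yet the RDD still knows it is a map over the parallelized sequence and simply runs that again.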

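I am not sure whether I simply failed to find more or whether Spark documents the rest in the RDD API, but the passage above is what my search on persistence turned up. It shows that RDD persistence already handles storage levels, error recovery, and data loss thoroughly, so from here on we focus mainly on efficiency.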

2. Using Cache

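In English, "cache" refers to a high-speed buffer memory, and in Spark cache() is shorthand for persisting at the default storage level, MEMORY_ONLY. In other words, the method caches data in memory (note: no shuffle takes place here; each node caches the partitions it holds in its own memory). Here is Cache used in the wordCount example: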

def main(args: Array[String]): Unit = {
    val conf = new SparkConf(
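The listing above is cut off in the source, so what follows is not the original code. It is a minimal sketch of how a word count might use cache(), with the object name, the sample sentences, and the local master all being my own assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical object name; not the listing from the original article.
object CachedWordCountSketch {

  /** Counts words, caches the result, then runs two actions on the cached RDD. */
  def wordCounts(sc: SparkContext, lines: Seq[String]): (Map[String, Int], Long) = {
    val counts = sc.parallelize(lines)
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache() // mark for caching at the default MEMORY_ONLY level

    // First action: computes the word counts and fills the cache.
    val asMap = counts.collect().toMap

    // Second action: reuses the cached partitions instead of recomputing them.
    val distinctWords = counts.count()

    (asMap, distinctWords)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local master, so the sketch runs without a cluster.
    val conf = new SparkConf().setMaster("local[*]").setAppName("CachedWordCountSketch")
    val sc = new SparkContext(conf)
    try {
      val (counts, distinctWords) = wordCounts(sc, Seq("hello spark", "hello scala"))
      counts.foreach { case (word, n) => println(s"$word -> $n") }
      println(s"distinct words: $distinctWords")
    } finally {
      sc.stop()
    }
  }
}
```

Without the cache() call, the second action would walk the whole chain again from parallelize onward; with it, only the first action pays for the flatMap, map, and reduceByKey steps.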
