Spark Source Code Analysis: takeSample

This content is licensed under CC 4.0 BY-SA.

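This article walks through Spark's takeSample method: its parameters, its execution flow, and the details of its source code. It explains what the withReplacement parameter means and notes that the method is only suitable when the result array is small, because the whole result is loaded into driver memory. It also covers how count obtains the number of elements in the RDD and how the sampling rate is computed for each sampling mode, using PoissonBounds and BinomialBounds, before the actual sampling takes place. The summary stresses that takeSample and sample both sample by probability under the hood.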

1. Parameters

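withReplacement: whether to sample with replacement. In ordinary sampling (without replacement), an element that has already been drawn cannot be drawn again; with replacement, a previously drawn element can be drawn again.

num: the number of elements to sample.

seed: the seed for the random number generator; as the signature below shows, it defaults to Utils.random.nextLong.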
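
To make the two sampling modes concrete, here is a minimal plain-Scala sketch on a local Vector. It is only an illustration of the concept: the object and method names are hypothetical, and this is not how Spark implements either mode.

```scala
import scala.util.Random

// Hypothetical illustration of the two modes on a local collection (not Spark's implementation).
object ReplacementSketch {
  // Without replacement: every position can be chosen at most once.
  def withoutReplacement[T](data: Vector[T], num: Int, rng: Random): Vector[T] =
    rng.shuffle(data).take(num)

  // With replacement: every draw picks from the full collection again,
  // so the same element can appear several times.
  def withReplacement[T](data: Vector[T], num: Int, rng: Random): Vector[T] =
    Vector.fill(num)(data(rng.nextInt(data.length)))

  def main(args: Array[String]): Unit = {
    val rng = new Random(7L)
    val data = Vector(1, 2, 3)
    println(withoutReplacement(data, 2, rng)) // two distinct elements
    println(withReplacement(data, 10, rng))   // ten draws from three elements, so repeats are unavoidable
  }
}
```

With replacement, num may exceed the size of the data; without replacement, at most data.length elements can come back.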

2. Execution flow in the source

The execution flow centers on the takeSample method of RDD. Here is the code of that method:

def takeSample(
    withReplacement: Boolean,
    num: Int,
    seed: Long = Utils.random.nextLong): Array[T] = withScope {
  val numStDev = 10.0

  require(num >= 0, "Negative number of elements requested")
  require(num <= (Int.MaxValue - (numStDev * math.sqrt(Int.MaxValue)).toInt),
    "Cannot support a sample size > Int.MaxValue - " +
    s"$numStDev * math.sqrt(Int.MaxValue)")

  if (num == 0) {
    new Array[T](0)
  } else {
    val initialCount = this.count()  // total number of elements in the RDD
    if (initialCount == 0) {
      new Array[T](0)
    } else {
      val rand = new Random(seed)
      if (!withReplacement && num >= initialCount) { // without replacement, and num is at least the total count
        Utils.randomizeInPlace(this.collect(), rand)
      } else {
        val fraction = SamplingUtils.computeFractionForSampleSize(num, initialCount,
          withReplacement)
        var samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()

        // If the first sample didn't turn out large enough, keep trying to take samples;
        // this shouldn't happen often because we use a big multiplier for the initial size
        var numIters = 0
        while (samples.length < num) {
          logWarning(s"Needed to re-sample due to insufficient sample size. Repeat #$numIters")
          samples = this.sample(withReplacement, fraction, rand.nextInt()).collect()
          numIters += 1
        }
        Utils.randomizeInPlace(samples, rand).take(num)
      }
    }
  }
}

Let's first look at the comment at the top of this function:

this method should only be used if the resulting array is expected to be small, as
all the data is loaded into the driver's memory.

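The method is only suitable when the resulting array is expected to be small, because all of the sampled data is loaded into the driver's memory. Driver memory is limited, so it cannot hold a large amount of data.

Next comes val initialCount = this.count(), which is implemented as follows: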

/**
 * Return the number of elements in the RDD.
 */
def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum

The comment above the method says that it returns the number of elements in the RDD, so initialCount is the number of elements in the RDD.
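
The one-line count implementation runs a job that computes the size of each partition and then sums those sizes on the driver. A minimal local sketch of that idea follows; the nested Seq is a hypothetical stand-in for an RDD's partitions, and iteratorSize is a hand-written helper assumed to behave like Utils.getIteratorSize.

```scala
// Hypothetical local stand-in: each inner Seq plays the role of one partition.
object CountSketch {
  // Assumed equivalent of Utils.getIteratorSize: walk the iterator and count its elements.
  def iteratorSize[T](it: Iterator[T]): Long = {
    var n = 0L
    while (it.hasNext) {
      it.next()
      n += 1L
    }
    n
  }

  // One size per "partition", summed afterwards, like sc.runJob(...).sum.
  def count[T](partitions: Seq[Seq[T]]): Long =
    partitions.map(p => iteratorSize(p.iterator)).sum

  def main(args: Array[String]): Unit = {
    val partitions = Seq(Seq(1, 2, 3), Seq.empty[Int], Seq(4, 5))
    println(count(partitions)) // 3 + 0 + 2 = 5
  }
}
```

Note that takeSample therefore runs a full counting job before any sampling starts.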

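The condition if (!withReplacement && num >= initialCount) means that sampling is without replacement and the requested sample size is at least the number of elements in the RDD. In that case the code returns all of the data, but not in its original order: the collected array is shuffled first. (This may look redundant, but the result is supposed to be a random sample, so its order should presumably be random as well.) In the else branch, the code first computes a fraction. For sampling without replacement this is a sampling rate in (0, 1]; the naive choice would be the requested number of elements divided by the total, but the code picks a somewhat larger rate so that a single pass almost always returns enough elements. For sampling with replacement the value is the expected number of times each element is drawn, so it can exceed 1. Let's look at the implementation of SamplingUtils.computeFractionForSampleSize(num, initialCount, withReplacement):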

/**
 * @param sampleSizeLowerBound sample size
 * @param total size of RDD
 * @param withReplacement whether sampling with replacement
 * @return a sampling rate that guarantees sufficient sample size with 99.99% success rate
 */
def computeFractionForSampleSize(sampleSizeLowerBound: Int, total: Long,
    withReplacement: Boolean): Double = {
  if (withReplacement) {
    PoissonBounds.getUpperBound(sampleSizeLowerBound) / total
  } else {
    val fraction = sampleSizeLowerBound.toDouble / total
    BinomialBounds.getUpperBound(1e-4, total, fraction)
  }
}

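According to the @return description, the function returns a sampling rate that yields a sufficiently large sample with a 99.99% success rate.

In the function body, sampling with replacement calls getUpperBound on PoissonBounds (the Poisson bound), while sampling without replacement calls getUpperBound on BinomialBounds (the binomial bound). Let's look at both methods.

The PoissonBounds.getUpperBound method: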

def getUpperBound(s: Double): Double = {
  math.max(s + numStd(s) * math.sqrt(s), 1e-10)
}

private def numStd(s: Double): Double = {
  // TODO: Make it tighter.
  if (s < 6.0) {
    12.0
  } else if (s < 16.0) {
    9.0
  } else {
    6.0
  }
}

The BinomialBounds.getUpperBound method:

/**
 * Returns a threshold `p` such that if we conduct n Bernoulli trials with success rate = `p`,
 * it is very unlikely to have less than `fraction * n` successes.
 */
def getUpperBound(delta: Double, n: Long, fraction: Double): Double = {
  val gamma = - math.log(delta) / n
  math.min(1,
    math.max(minSamplingRate, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)))
}

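Note the difference between the two: BinomialBounds.getUpperBound returns a sampling rate directly, whereas PoissonBounds.getUpperBound returns an inflated sample count, which computeFractionForSampleSize then divides by total to obtain a rate.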
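
To get a feel for the numbers, the sketch below re-types the two bounds from the listings in plain Scala and evaluates them for a few inputs. The object and method names are hypothetical, and minSamplingRate is referenced in the listing but not shown there, so the value 1e-10 is an assumption.

```scala
// Local re-typing of the bounds shown in the listings, for experimentation only.
object FractionSketch {
  def numStd(s: Double): Double =
    if (s < 6.0) 12.0 else if (s < 16.0) 9.0 else 6.0

  def poissonUpperBound(s: Double): Double =
    math.max(s + numStd(s) * math.sqrt(s), 1e-10)

  // Assumption: not shown in the listing; chosen here as a tiny positive floor.
  val minSamplingRate = 1e-10

  def binomialUpperBound(delta: Double, n: Long, fraction: Double): Double = {
    val gamma = -math.log(delta) / n
    math.min(1,
      math.max(minSamplingRate, fraction + gamma + math.sqrt(gamma * gamma + 2 * gamma * fraction)))
  }

  def computeFraction(num: Int, total: Long, withReplacement: Boolean): Double =
    if (withReplacement) poissonUpperBound(num) / total
    else binomialUpperBound(1e-4, total, num.toDouble / total)

  def main(args: Array[String]): Unit = {
    // 100 out of 10000: the naive rate would be 0.01.
    println(computeFraction(100, 10000L, withReplacement = true))  // (100 + 6 * 10) / 10000, about 0.016
    println(computeFraction(100, 10000L, withReplacement = false)) // about 0.015
    // With replacement, asking for more elements than exist gives a value above 1.
    println(computeFraction(20, 10L, withReplacement = true))
  }
}
```

In both modes the result is noticeably larger than num / total. That oversampling is what makes a second pass over the data rare.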

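The next step passes this rate to sample to draw the data. The implementation of sample is covered in a separate article; it draws elements using Poisson sampling (with replacement) or Bernoulli sampling (without replacement). If the result contains fewer than num elements, the sampling is repeated until at least num elements come back. The result is then shuffled and truncated to exactly num elements with take(num).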
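
The control flow of the without-replacement path (oversample, retry if too few elements came back, shuffle, truncate) can be sketched on a local collection as follows. This is a simplified illustration under stated assumptions, not Spark's code: the names are hypothetical, the fraction is passed in by the caller instead of being computed, and a plain Bernoulli filter stands in for sample.

```scala
import scala.util.Random

// Hypothetical local analogue of the without-replacement path of takeSample.
object TakeSampleSketch {
  // Bernoulli pass: keep each element independently with probability `fraction`.
  def bernoulliSample[T](data: Seq[T], fraction: Double, rng: Random): Vector[T] =
    data.filter(_ => rng.nextDouble() < fraction).toVector

  def takeSampleLocal[T](data: Seq[T], num: Int, fraction: Double, seed: Long): Vector[T] = {
    require(num >= 0, "Negative number of elements requested")
    require(fraction > 0.0 && fraction <= 1.0, "fraction must be in (0, 1]")
    val rng = new Random(seed)
    if (num == 0 || data.isEmpty) {
      Vector.empty
    } else if (num >= data.size) {
      // Asked for everything: return all elements in random order.
      rng.shuffle(data.toVector)
    } else {
      var samples = bernoulliSample(data, fraction, rng)
      // Retry until the pass returns at least `num` elements.
      while (samples.length < num) {
        samples = bernoulliSample(data, fraction, rng)
      }
      // A pass usually returns more than `num`, so shuffle and truncate.
      rng.shuffle(samples).take(num)
    }
  }

  def main(args: Array[String]): Unit = {
    val data = (1 to 10000).toVector
    val out = takeSampleLocal(data, 100, 0.0153, seed = 42L)
    println(out.length) // 100
  }
}
```

The loop exits as soon as a pass returns at least num elements, not exactly num; take(num) is what trims the result to the requested size.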

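3. Summary

The analysis shows that takeSample uses the requested sample size and the total number of elements to compute a sampling rate, and then uses that rate for Bernoulli sampling (the binomial case, without replacement) or Poisson sampling (with replacement). So sample and takeSample rest on the same underlying principle: both sample by probability.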