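When I first started learning the aggregateByKey operator I was completely lost, so today I am writing down what I have worked out. Consider the following example: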
package com.chy.rdd.transformation;
import com.chy.util.SparkUtil;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;
import java.util.Arrays;
import java.util.List;
/**
* @Title: sparkAggregateByKey
* @Description: the aggregateByKey operator
* @author chy
* @date 2018/5/17 16:20
*/
public class sparkAggregateByKey {
    public static void main(String[] arg) {
        JavaSparkContext sc = SparkUtil.getJavaSparkContext();
        List<String> list = Arrays.asList("you,jump", "he,jump", "he");
        // 3 slices put each input line in its own partition;
        // the result traced below depends on this layout
        JavaRDD<String> listRDD = sc.parallelize(list, 3);
        // flatMap: split each line into words
        listRDD.flatMap(line -> Arrays.asList(line.split(",")).iterator())
                // form (k, v) pairs
                .mapToPair(word -> new Tuple2<>(word, 1))
                /**
                 * Pairs after mapToPair, assuming each input line sits in its own partition:
                 *   partition 0: (you,1) (jump,1)
                 *   partition 1: (he,1) (jump,1)
                 *   partition 2: (he,1)
                 * -----seqFunc (x = accumulator, starting at zeroValue = 1; y = value)-----
                 *   partition 0: you  -> 1 + 1 -> (you,2)
                 *   partition 0: jump -> 1 + 1 -> (jump,2)
                 *   partition 1: he   -> 1 + 1 -> (he,2)
                 *   partition 1: jump -> 1 + 1 -> (jump,2)
                 *   partition 2: he   -> 1 + 1 -> (he,2)
                 *
                 * -----combFunc (merges per-partition results of the same key)-----
                 *   (jump,2) + (jump,2) = (jump,4)
                 *   (he,2) + (he,2) = (he,4)
                 *
                 * -----result-----
                 *   (you,2)
                 *   (jump,4)
                 *   (he,4)
                 *
                 * zeroValue is added per key per partition, so another partitioning
                 * changes the numbers: a single partition gives (you,2) (jump,3) (he,3).
                 */
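                .aggregateByKey(1, (x, y) -> {
                    // seqFunc: x is the accumulator, y is the next value of the key
                    System.out.println("x:" + x + ",y:" + y);
                    return x + y;
                }, (m, n) -> {
                    // combFunc: runs only when a key has more than one partial result; merges them
                    System.out.println("m:" + m + ",n:" + n);
                    return m + n;
                })
                .foreach(tuple -> System.out.println(tuple._1 + "->" + tuple._2));
    }
}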
Now let's analyze it.
def aggregateByKey[U](zeroValue : U,
seqFunc : org.apache.spark.api.java.function.Function2[U, V, U],
combFunc : org.apache.spark.api.java.function.Function2[U, U, U]) :
org.apache.spark.api.java.JavaPairRDD[K, U] = { /* compiled code */ }
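zeroValue: the initial accumulator value. seqFunc starts from it for each key within each partition. Spark's documentation describes it as a neutral "zero value" (for example 0 for addition).
seqFunc: the grouping function. Within a partition, it folds each value of a key into that key's accumulator.
combFunc: the aggregation function. It merges the accumulators that different partitions produced for the same key.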
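To make the three parameters concrete, here is a minimal plain-Java sketch of the semantics. It is not Spark's implementation: the class `AggregateByKeySketch`, its `aggregateByKey` helper, and the explicit list-of-partitions input are all made up for illustration, `Map.Entry` stands in for `Tuple2`, and shuffling and serialization are ignored. It assumes each input line of the example sits in its own partition.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.BiFunction;
import java.util.function.BinaryOperator;

public class AggregateByKeySketch {

    /** Plain-Java model of aggregateByKey over explicitly listed partitions (hypothetical helper). */
    public static <K extends Comparable<K>, V, U> Map<K, U> aggregateByKey(
            List<List<Map.Entry<K, V>>> partitions,
            U zeroValue,
            BiFunction<U, V, U> seqFunc,
            BinaryOperator<U> combFunc) {
        Map<K, U> merged = new TreeMap<>();
        for (List<Map.Entry<K, V>> partition : partitions) {
            // Step 1: inside one partition, each key starts from zeroValue
            // and seqFunc folds that key's values in one at a time.
            Map<K, U> local = new TreeMap<>();
            for (Map.Entry<K, V> pair : partition) {
                U acc = local.containsKey(pair.getKey()) ? local.get(pair.getKey()) : zeroValue;
                local.put(pair.getKey(), seqFunc.apply(acc, pair.getValue()));
            }
            // Step 2: combFunc merges this partition's accumulator into the
            // one already collected for the same key, if there is one.
            for (Map.Entry<K, U> e : local.entrySet()) {
                merged.merge(e.getKey(), e.getValue(), combFunc);
            }
        }
        return merged;
    }

    /** Builds a (word, 1) pair. */
    public static Map.Entry<String, Integer> pair(String word) {
        return new SimpleEntry<>(word, 1);
    }

    /** The article's data, assuming one input line per partition. */
    public static Map<String, Integer> threePartitions() {
        List<List<Map.Entry<String, Integer>>> partitions = Arrays.asList(
                Arrays.asList(pair("you"), pair("jump")),
                Arrays.asList(pair("he"), pair("jump")),
                Arrays.asList(pair("he")));
        return aggregateByKey(partitions, 1, (x, y) -> x + y, (m, n) -> m + n);
    }

    public static void main(String[] args) {
        System.out.println(threePartitions()); // {he=4, jump=4, you=2}
    }
}
```

The output is sorted by key only because the sketch uses a `TreeMap`; a real RDD gives no such ordering guarantee.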
Data source
List<String> list = Arrays.asList("you,jump", "he,jump","he");
Split on commas
line -> Arrays.asList(line.split(",")).iterator()
Form (k, v) pairs
.mapToPair(word -> new Tuple2<>(word,1))
(you,1) (jump,1) (he,1) (jump,1) (he,1)
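The split and pairing steps can be tried without a cluster. The following is a small sketch that uses `java.util.stream` in place of the RDD API; `WordPairsSketch` and `toPairs` are hypothetical names, and `Map.Entry` stands in for `Tuple2`.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordPairsSketch {

    /** Splits each line on commas and pairs every word with the count 1. */
    public static List<Map.Entry<String, Integer>> toPairs(List<String> lines) {
        return lines.stream()
                // same job as the flatMap step: one line -> many words
                .flatMap(line -> Arrays.stream(line.split(",")))
                // same job as the mapToPair step: word -> (word, 1)
                .<Map.Entry<String, Integer>>map(word -> new SimpleEntry<>(word, 1))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> list = Arrays.asList("you,jump", "he,jump", "he");
        System.out.println(toPairs(list)); // [you=1, jump=1, he=1, jump=1, he=1]
    }
}
```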
Grouping (seqFunc)
.aggregateByKey(1,(x,y)->{
System.out.println("x:"+x+",y:"+y);
return x+y;
}
-----seqFunc-----
x is the accumulator, which starts at zeroValue = 1 for each key in each partition; y is the next value of that key. Assuming each input line sits in its own partition:
(you,1): x=1, y=1 -> x+y -> (you,2)
(jump,1): x=1, y=1 -> x+y -> (jump,2)
(he,1): x=1, y=1 -> x+y -> (he,2)
(jump,1): x=1, y=1 -> x+y -> (jump,2)
(he,1): x=1, y=1 -> x+y -> (he,2)
Aggregation (combFunc)
(m,n) ->{
// runs only when a key has more than one partial result; merges them
System.out.println("m:"+m+",n:"+n);
return m+n;
}
-----combFunc-----
(jump,2) + (jump,2) -> m+n -> (jump,4)
(he,2) + (he,2) -> m+n -> (he,4)
(you,2) has only one partial result, so combFunc is not called for it.
Final result
-----result-----
(you,2)
(jump,4)
(he,4)
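These numbers hold only for one partition layout, the one in which each input line sits in its own partition. For the summing seqFunc and combFunc used here, and assuming zeroValue is applied once per key per partition, the final value of a key is its number of occurrences plus zeroValue times the number of partitions that contain it. The sketch below checks that arithmetic in plain Java without Spark; `ZeroValueEffectSketch` and its methods are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

public class ZeroValueEffectSketch {

    /** occurrences(key) + zeroValue * (number of partitions containing key). */
    public static Map<String, Integer> expected(List<List<String>> partitions, int zeroValue) {
        Map<String, Integer> result = new TreeMap<>();
        for (List<String> partition : partitions) {
            // every occurrence of a word contributes its value, 1
            for (String word : partition) {
                result.merge(word, 1, Integer::sum);
            }
            // every partition that holds the word contributes zeroValue once
            for (String word : new TreeSet<>(partition)) {
                result.merge(word, zeroValue, Integer::sum);
            }
        }
        return result;
    }

    /** One input line per partition. */
    public static List<List<String>> threePartitions() {
        return Arrays.asList(
                Arrays.asList("you", "jump"),
                Arrays.asList("he", "jump"),
                Arrays.asList("he"));
    }

    /** All words in a single partition. */
    public static List<List<String>> onePartition() {
        return Arrays.asList(Arrays.asList("you", "jump", "he", "jump", "he"));
    }

    public static void main(String[] args) {
        System.out.println(expected(threePartitions(), 1)); // {he=4, jump=4, you=2}
        System.out.println(expected(onePartition(), 1));    // {he=3, jump=3, you=2}
        System.out.println(expected(threePartitions(), 0)); // {he=2, jump=2, you=1}
        System.out.println(expected(onePartition(), 0));    // {he=2, jump=2, you=1}
    }
}
```

With zeroValue = 0 the result is the plain word count under either layout, which is why a neutral zero value is the usual choice for a sum.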
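Through a worked example, this article explains how Spark's aggregateByKey operator works, shows how to use it for grouped aggregation, and explains the roles of seqFunc and combFunc.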




