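Environment
Preparing the environment for Spark on YARN:
Spark: 0.9.1 release. Be sure to pick the release version (not an incubating one); you will hit fewer pitfalls. Download page: http://spark.apache.org/downloads.html
Hadoop: 2.0.0-cdh4.2.1, MRv2 (YARN)
Shell: cygwin (Git console also works)
Build machine: enough memory (2 GB or more)
If you have plenty of bandwidth, you are 30% of the way there. If you also have an SSD, you are 80% of the way there.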
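Preparation
After downloading, unpack Spark and make a few changes to SparkBuild.scala. SparkBuild.scala is the configuration file for sbt, the build system Spark uses; it plays the same role as a pom file does in Maven.
Edit project/SparkBuild.scala:
1) Force the protobuf dependency: this is the version that Hadoop 2.0.0-cdh4.2.1 uses. Leaving the version unpinned causes problems.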
libraryDependencies ++= Seq(
"com.google.protobuf" % "protobuf-java" % "2.4.0a" force(),
2) Change the Maven repos. Search for resolvers ++= Seq; if you have a local Maven repo, point it at the local repo.
3) Set the local Maven repo path. If you have used Maven before, you can specify its local repo:
resolvers ++= Seq(Resolver.file("Local Maven Repo", file("D:/repository"))),
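Since the build takes a long time, it can be worth confirming that both edits are really in the file before starting. The following is only a sketch: check_build_edits is a made-up helper name, and it simply searches for the two strings shown in the edits above.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the build file contains both the
# forced protobuf dependency and a "Local Maven Repo" resolver.
check_build_edits() {
    build_file=$1
    # -F treats the pattern as a fixed string, so %, ( and ) are literal.
    grep -qF '"protobuf-java" % "2.4.0a" force()' "$build_file" || {
        echo "missing forced protobuf-java 2.4.0a" >&2
        return 1
    }
    grep -qF 'Resolver.file("Local Maven Repo"' "$build_file" || {
        echo "missing local Maven resolver" >&2
        return 1
    }
    echo "build edits present"
}
```

From the Spark home directory it would be called as check_build_edits project/SparkBuild.scala.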
Build
Go to the Spark home directory and run:
SPARK_HADOOP_VERSION=2.0.0-cdh4.2.1 SPARK_YARN=true sbt/sbt assembly
The build is slow, about an hour. If dependencies have to be downloaded, it is hard to say how long it will take.
Tip: if downloads fail, try a proxy: export HTTP_PROXY=http://127.0.0.1:port
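A small wrapper can run the build and then check that the assembly jar actually exists. This is a sketch under assumptions: the function names are invented, and the jar naming pattern is inferred from the jar file name that appears in this article, not taken from Spark documentation.

```shell
#!/bin/sh
# Hypothetical helper: expected location of the assembly jar, relative to
# the Spark home directory. The pattern is an assumption based on the
# file name this article lists for Spark 0.9.1 / Scala 2.10.
assembly_jar_path() {
    spark_version=$1
    hadoop_version=$2
    echo "assembly/target/scala-2.10/spark-assembly-${spark_version}-hadoop${hadoop_version}.jar"
}

# Hypothetical helper: run the sbt assembly build for YARN and fail
# loudly if the expected jar was not produced.
build_spark() {
    hadoop_version=$1
    SPARK_HADOOP_VERSION=$hadoop_version SPARK_YARN=true sbt/sbt assembly || return 1
    jar=$(assembly_jar_path 0.9.1 "$hadoop_version")
    [ -f "$jar" ] || { echo "build finished but $jar is missing" >&2; return 1; }
    echo "built $jar"
}
```

It would be invoked from the Spark home directory as build_spark 2.0.0-cdh4.2.1.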
Copy
Copy what is needed to the Hadoop cluster, for example as a spark directory. That directory contains the following:
All files under bin and conf:
./bin/run-example
./bin/spark-shell
./bin/pyspark
./bin/compute-classpath.sh
./bin/spark-class
./conf/fair-scheduler.xml
./conf/metrics.properties.template
./conf/spark-env.sh
./conf/fairscheduler.xml.template
./conf/log4j.properties
./conf/slaves
and the Spark and examples assembly jars:
./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar
./examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar
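The file list above can be gathered into a staging directory with a few commands. The sketch below is one possible way to do it; stage_spark is a made-up name, and the jar paths are the two listed above.

```shell
#!/bin/sh
# Hypothetical helper: copy bin/, conf/ and the two assembly jars from a
# Spark build tree into a staging directory, keeping relative paths.
stage_spark() {
    spark_home=$1
    stage_dir=$2
    mkdir -p "$stage_dir" || return 1
    cp -r "$spark_home/bin" "$spark_home/conf" "$stage_dir/" || return 1
    for jar in \
        assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar \
        examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar
    do
        # Recreate the jar's parent directories under the staging dir.
        mkdir -p "$stage_dir/$(dirname "$jar")" || return 1
        cp "$spark_home/$jar" "$stage_dir/$jar" || return 1
    done
}
```

For example, stage_spark . ~/spark would fill ~/spark from the current build tree. Keeping the relative paths means the run command can use the same jar paths on the cluster.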
Distribute
The following script can serve as a reference for distribution:
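#!/bin/sh
# Author: Meng Zang <mzang@ctrip.com>
# Date: 2014-04-11
# Push the local Spark directory to every node listed in the nodes file.
NODES_ADDRESS=~/spark/nodes   # one hostname per line
TMP_SPARK=~/spark             # staging directory (same path on every node)
SPARK=/usr/local/spark        # install directory on each node

sync()
{
    # Local install (disabled):
    # sudo mkdir -p ${SPARK}
    # sudo cp -r $TMP_SPARK/* $SPARK
    for node in $(cat $NODES_ADDRESS)
    do
        ssh -p 1022 $node "mkdir -p $TMP_SPARK"
        ssh -p 1022 $node "sudo mkdir -p $SPARK"
        rsync -vaz --delete -e 'ssh -p 1022' $TMP_SPARK/ $node:$TMP_SPARK
        # -r is needed so that bin/, conf/ and the jar directories are copied too
        ssh -p 1022 $node "sudo cp -r $TMP_SPARK/* $SPARK"
    done
}

case "$1" in
    'sync')
        sync
        ;;
    *)
        echo "Usage: $0 {sync}"
        exit 1
esac
exit 0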
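Before touching a whole cluster with sudo, it may help to print what would be run on each node. The following dry-run sketch is not part of the original script; plan_sync is a made-up name, the port and default paths are the ones used above, and the copy step is shown with cp -r so that subdirectories are included.

```shell
#!/bin/sh
# Dry-run sketch: print the per-node commands instead of executing them.
# plan_sync is a hypothetical name; TMP_SPARK and SPARK default to the
# paths used in this article and can be overridden.
TMP_SPARK=${TMP_SPARK:-$HOME/spark}
SPARK=${SPARK:-/usr/local/spark}

plan_sync() {
    nodes_file=$1
    # "|| [ -n "$node" ]" keeps a last line that has no trailing newline.
    while read -r node || [ -n "$node" ]; do
        [ -n "$node" ] || continue    # skip blank lines
        echo "ssh -p 1022 $node mkdir -p $TMP_SPARK"
        echo "ssh -p 1022 $node sudo mkdir -p $SPARK"
        echo "rsync -vaz --delete -e 'ssh -p 1022' $TMP_SPARK/ $node:$TMP_SPARK"
        echo "ssh -p 1022 $node sudo cp -r $TMP_SPARK/* $SPARK"
    done < "$nodes_file"
}
```

Running plan_sync ~/spark/nodes prints four lines per node, which can be reviewed before the real sync.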
Run the example
# step into spark dir
export HADOOP_CONF_DIR=/etc/hadoop/conf
export YARN_CONF_DIR=/etc/hadoop/conf
SPARK_JAR=./assembly/target/scala-2.10/spark-assembly-0.9.1-hadoop2.0.0-cdh4.2.1.jar \
./bin/spark-class org.apache.spark.deploy.yarn.Client \
--jar examples/target/scala-2.10/spark-examples-assembly-0.9.1.jar \
--class org.apache.spark.examples.SparkPi \
--args yarn-standalone \
--num-workers 3 \
--master-memory 4g \
--worker-memory 2g \
--worker-cores 1
While the job runs, you can watch it in the MRv2 web console. The client also prints application reports to the console:
14/04/11 16:43:44 INFO Client: Application report from ASM: application identifier: application_1397119957695_0037 appId: 37 clientToken: null appDiagnostics: appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com appQueue: default appMasterRpcPort: 0 appStartTime: 1397205809221 yarnAppState: RUNNING distributedFinalState: UNDEFINED appTrackingUrl: SVR2368HP360.hadoop.test.sh.ctriptravel.com:8088/proxy/application_1397119957695_0037/ appUser: op1
14/04/11 16:43:45 INFO Client: Application report from ASM: application identifier: application_1397119957695_0037 appId: 37 clientToken: null appDiagnostics: appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com appQueue: default appMasterRpcPort: 0 appStartTime: 1397205809221 yarnAppState: RUNNING distributedFinalState: UNDEFINED appTrackingUrl: SVR2368HP360.hadoop.test.sh.ctriptravel.com:8088/proxy/application_1397119957695_0037/ appUser: op1
14/04/11 16:43:46 INFO Client: Application report from ASM: application identifier: application_1397119957695_0037 appId: 37 clientToken: null appDiagnostics: appMasterHost: SVR2370HP360.hadoop.test.sh.ctriptravel.com appQueue: default appMasterRpcPort: 0 appStartTime: 1397205809221 yarnAppState: FINISHED distributedFinalState: SUCCEEDED appTrackingUrl: appUser: op1
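The interesting fields in these reports are yarnAppState and distributedFinalState. If the client output has been saved to a file, a small filter can pull the latest value out. This is only a sketch: report_field is a made-up helper, and it assumes the "name: value" layout shown in the report above.

```shell
#!/bin/sh
# Hypothetical helper: read client output on stdin and print the value
# that follows "<field>: " in the last application report.
# Not suitable for fields whose value is empty (e.g. appTrackingUrl
# in the final report), since the next field name would be captured.
report_field() {
    field=$1
    sed -n "s/.* ${field}: \([^ ]*\).*/\1/p" | tail -n 1
}
```

For example, with the output saved as client.log (an assumed file name), report_field distributedFinalState < client.log prints the final state of the most recent report.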
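This article describes how to deploy the Spark 0.9.1 release on YARN, covering environment preparation, the build steps, a distribution script, and running an example.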