Original article: http://blog.csdn.net/nsrainbow/article/details/43735737 (see the original author's blog for the latest version of this course)
Notice
- This article is based on CentOS 6.x + CDH 5.x
What is Spark
Apache Spark is a fast, general-purpose cluster computing engine. Unlike MapReduce, it can keep intermediate results in memory, which makes it well suited to iterative jobs and interactive analysis.
Installing Spark
In CDH, Spark is split into the following packages:
- spark-core: the core Spark package
- spark-worker: service scripts for the Spark worker
- spark-master: service scripts for the Spark master
- spark-python: the Python client for Spark (PySpark)
- spark-history-server: the job history service
Install the packages
On host1, which acts as the master (and also runs a worker):
sudo yum install spark-core spark-master spark-worker spark-python
On host2, which runs the history server and a worker:
sudo yum install spark-core spark-worker spark-history-server spark-python
Configuring Spark
Spark can be deployed in one of two modes:
- Standalone mode: Spark uses its own Master service to schedule jobs.
- YARN mode: the YARN ResourceManager takes the place of the Spark Master, and jobs are still run by the NodeManagers. YARN mode is more complex to set up, but it supports security and cooperates better with a YARN cluster.
This tutorial uses standalone mode; the sketch right after this paragraph shows how the two modes differ from an application's point of view.
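From an application's point of view, the mode is just the master URL the SparkContext connects to. Below is a minimal sketch of my own (not from the original tutorial; it assumes the standalone master listens on its default port 7077):

import org.apache.spark.{SparkConf, SparkContext}

object MasterUrlSketch {
  def main(args: Array[String]): Unit = {
    // Standalone mode: connect to the Master service that this tutorial runs on host1.
    // In YARN mode (Spark 1.x syntax) the master URL would be "yarn-client" or
    // "yarn-cluster" instead, and the YARN ResourceManager would schedule the job.
    val conf = new SparkConf()
      .setAppName("master-url-sketch")
      .setMaster("spark://host1:7077")
    val sc = new SparkContext(conf)
    println("Connected to master: " + sc.master)
    sc.stop()
  }
}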
On every machine where Spark is installed, edit /etc/spark/conf/spark-env.sh and set the master's host name; in this tutorial that is host1:
### === IMPORTANT ===
### Change the following to specify a real cluster's Master host
export STANDALONE_SPARK_MASTER_HOST='host1'
Note: the characters wrapping host1 must also be replaced with plain single quotes.
Create the HDFS directories used for application event logs (the sticky bit set by chmod 1777 lets every application write its own logs there):
$ sudo -u hdfs hadoop fs -mkdir /user/spark
$ sudo -u hdfs hadoop fs -mkdir /user/spark/applicationHistory
$ sudo -u hdfs hadoop fs -chown -R spark:spark /user/spark
$ sudo -u hdfs hadoop fs -chmod 1777 /user/spark/applicationHistory
On the Spark client, host2 in this example, create a new configuration file from the template:
cp /etc/spark/conf/spark-defaults.conf.template /etc/spark/conf/spark-defaults.conf
Then add the following two lines to /etc/spark/conf/spark-defaults.conf:
spark.eventLog.dir=/user/spark/applicationHistory
spark.eventLog.enabled=true
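As an aside, the same two properties can also be set per application through SparkConf instead of the cluster-wide defaults file; a small sketch of mine (the property names are exactly the ones used above):

import org.apache.spark.SparkConf

// Per-application equivalent of the two spark-defaults.conf lines above.
val conf = new SparkConf()
  .setAppName("wordcount")
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "/user/spark/applicationHistory")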
On every machine, copy hdfs-site.xml into /etc/spark/conf:
cp /etc/hadoop/conf/hdfs-site.xml /etc/spark/conf/
Starting Spark
On the master node, host1 in this tutorial, start the master service:
sudo service spark-master start
Start the worker service on the worker nodes, which in this tutorial are host1 and host2:
sudo service spark-worker start
On host2, start the history server:
sudo service spark-history-server start
The startup order is:
- master
- worker
- history-server
Using Spark
Enter the Spark shell with the spark-shell command:
[root@host1 impala]# spark-shell
2015-02-10 09:02:07,059 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing view acls to: root
2015-02-10 09:02:07,069 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - Changing modify acls to: root
2015-02-10 09:02:07,070 INFO [main] spark.SecurityManager (Logging.scala:logInfo(59)) - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
2015-02-10 09:02:07,072 INFO [main] spark.HttpServer (Logging.scala:logInfo(59)) - Starting HTTP Server
2015-02-10 09:02:07,217 INFO [main] server.Server (Server.java:doStart(272)) - jetty-8.y.z-SNAPSHOT
2015-02-10 09:02:07,350 INFO [main] server.AbstractConnector (AbstractConnector.java:doStart(338)) - Started SocketConnector@0.0.0.0:59058
2015-02-10 09:02:07,352 INFO [main] util.Utils (Logging.scala:logInfo(59)) - Successfully started service 'HTTP class server' on port 59058.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.2.0
      /_/

Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
...
2015-02-10 09:02:21,572 INFO [main] storage.BlockManagerMaster (Logging.scala:logInfo(59)) - Registered BlockManager
2015-02-10 09:02:22,472 INFO [main] scheduler.EventLoggingListener (Logging.scala:logInfo(59)) - Logging events to file:/user/spark/applicationHistory/local-1423530140986
2015-02-10 09:02:22,672 INFO [main] repl.SparkILoop (Logging.scala:logInfo(59)) - Created spark context..
Spark context available as sc.

scala>
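The shell has already created a SparkContext and bound it to sc. As a quick smoke test (my addition, not part of the original transcript), you can run a tiny job; summing 1 to 100 should return 5050:

scala> sc.parallelize(1 to 100).reduce(_ + _)
res0: Int = 5050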
Now let's try Spark out. We will redo the wordcount job we previously ran on YARN and see how Spark handles the same task.
STEP1. Prepare the input files and put them into HDFS:
$ echo "Hello World Bye World" > file0
$ echo "Hello Hadoop Goodbye Hadoop" > file1
$ hdfs dfs -mkdir -p /user/spark/wordcount/input
$ hdfs dfs -put file* /user/spark/wordcount/input
STEP2. Run the word count in spark-shell:
// Load every file under the input directory as an RDD of lines
val file = sc.textFile("hdfs://mycluster/user/spark/wordcount/input")
// Split lines into words, map each word to (word, 1), then sum the counts per word
val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
// Write the (word, count) pairs back to HDFS, one part file per partition
counts.saveAsTextFile("hdfs://mycluster/user/spark/wordcount/output")
No Java code to write this time, which makes things much simpler; the snippet above is Scala.
Spark supports three languages, Java, Scala, and Python, with Scala having the most complete support. It is fine to start out writing Java, but it is worth getting familiar with Scala later on.
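If you just want to eyeball the result, you do not have to write it back to HDFS at all. Reusing the counts RDD from STEP2 (my addition; fine here because the result is tiny, but avoid collect() on large datasets):

// Pull all (word, count) pairs back to the driver and print them
counts.collect().foreach(println)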
STEP3. Check the output; the listing below uses the Pig grunt shell to browse HDFS:
grunt> ls
hdfs://mycluster/user/spark/wordcount/input <dir>
hdfs://mycluster/user/spark/wordcount/output <dir>
grunt> cd output
grunt> ls
hdfs://mycluster/user/spark/wordcount/output/_SUCCESS<r 2> 0
hdfs://mycluster/user/spark/wordcount/output/part-00000<r 2> 8
hdfs://mycluster/user/spark/wordcount/output/part-00001<r 2> 10
hdfs://mycluster/user/spark/wordcount/output/part-00002<r 2> 33
grunt> cat part-00000
(Bye,1)
grunt> cat part-00001
(World,2)
grunt> cat part-00002
(Goodbye,1)
(Hello,2)
(Hadoop,2)
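The same check can be done from spark-shell itself instead of the grunt shell; a one-liner of my own:

// Read the output directory back as text and print its contents
sc.textFile("hdfs://mycluster/user/spark/wordcount/output").collect().foreach(println)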
For more in-depth study, see the Spark Programming Guide; it is a genuinely well-written manual.