Chapter 1: Installing a Spark Cluster, Testing It, and Running the WordCount Example


1.1 Download Spark

地址:https://mirrors.tuna.tsinghua.edu.cn/apache/spark/spark-2.3.0/


1.2 Upload and extract it

xiaoye@ubuntu:~/Downloads$ ls
apache-activemq-5.15.3-bin.tar.gz  hive-0.13.1-cdh5.2.0.tar.gz
flume-ng-1.6.0-cdh5.10.1.tar.gz    spark-2.3.0-bin-hadoop2.7.tgz
hadoop-2.5.0-cdh5.2.0.tar.gz       zookeeper-3.4.5-cdh5.10.2.tar.gz

xiaoye@ubuntu:~/Downloads$ tar -zxvf spark-2.3.0-bin-hadoop2.7.tgz -C ../

1.3 Create a symbolic link to the unpacked directory:

xiaoye@ubuntu:~$ ln -s spark-2.3.0-bin-hadoop2.7/ spark

1.4 Go into spark/conf and edit the configuration files

xiaoye@ubuntu:~$ cd spark
xiaoye@ubuntu:~/spark$ cd conf
xiaoye@ubuntu:~/spark/conf$ ls
docker.properties.template  metrics.properties.template   spark-env.sh.template
fairscheduler.xml.template  slaves.template
log4j.properties.template   spark-defaults.conf.template

xiaoye@ubuntu:~/spark/conf$ 

Copy spark-env.sh.template to spark-env.sh and add the following at the end of the file:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk-amd64
export HADOOP_HOME=/home/xiaoye/hadoop
export HADOOP_CONF_DIR=/home/xiaoye/hadoop/etc/hadoop
export SPARK_WORKER_MEMORY=500m
export SPARK_WORKER_CORES=1
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=ubuntu:2181,ubuntu2:2181,ubuntu3:2181 -Dspark.deploy.zookeeper.dir=/spark"

Explanation:

The Spark settings used here may differ from what you would use on a real cluster; they are sized for a personal machine, since an overly large memory setting makes the machine run very slowly. 
Notes: 
-Dspark.deploy.recoveryMode=ZOOKEEPER    # the whole cluster state is maintained, and recovered, through ZooKeeper. In other words this is Spark's HA setup on ZooKeeper: if the Active Master goes down, the Standby Master reads the entire cluster state from ZooKeeper, restores the state of all Workers, Drivers, and Applications, and then takes over as the Active Master; 
-Dspark.deploy.zookeeper.url=ubuntu:2181,ubuntu2:2181,ubuntu3:2181 

# list every machine that runs ZooKeeper and could act as the (Active) Master; I used three machines, so three are listed 

-Dspark.deploy.zookeeper.dir=/spark 
How does this dir differ from dataDir in ZooKeeper's zoo.cfg? 
-Dspark.deploy.zookeeper.dir holds Spark's own metadata, i.e. the running state of Spark jobs; 
ZooKeeper keeps all of the cluster's state, including every Worker, Application, and Driver, so that if the Active Master fails, the Standby Master can restore the whole cluster from it.

Copy slaves.template to slaves:

xiaoye@ubuntu:~/spark/conf$ cp slaves.template  slaves

Edit slaves:

xiaoye@ubuntu:~/spark/conf$ vi slaves

ubuntu
ubuntu2
ubuntu3

1.5 After one machine is configured, distribute the installation to the other nodes

xiaoye@ubuntu:~$ scp -r spark-2.3.0-bin-hadoop2.7/ xiaoye@192.168.26.141:/home/xiaoye

xiaoye@ubuntu:~$ scp -r spark-2.3.0-bin-hadoop2.7/ xiaoye@192.168.26.142:/home/xiaoye

1.6 Create the symbolic link on each machine

xiaoye@ubuntu2:~$ ln -s spark-2.3.0-bin-hadoop2.7/ spark

xiaoye@ubuntu3:~$ ln -s spark-2.3.0-bin-hadoop2.7/ spark

1.7 Configure environment variables on each machine

xiaoye@ubuntu:~$ vim .bashrc

#Spark
export SPARK_HOME=/home/xiaoye/spark
export PATH=$PATH:$SPARK_HOME/bin

Make the changes take effect immediately:

xiaoye@ubuntu:~$ source .bashrc

2. Try starting the Spark cluster

2.1 Start the ZooKeeper cluster:

Run the following on every machine:

xiaoye@ubuntu:~$ ./zookeeper/sbin/zkServer.sh  start
JMX enabled by default
Using config: /home/xiaoye/zookeeper/sbin/../conf/zoo.cfg
Starting zookeeper ... STARTED
xiaoye@ubuntu:~$ ./zookeeper/sbin/zkServer.sh  status
JMX enabled by default
Using config: /home/xiaoye/zookeeper/sbin/../conf/zoo.cfg

Mode: leader

2.2 Start HDFS:

It is enough to run the following on one node:

xiaoye@ubuntu:~$ ./hadoop/sbin/start-dfs.sh 
18/06/09 21:36:31 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [ubuntu ubuntu2]
ubuntu: Warning: Permanently added 'ubuntu,192.168.26.140' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.26.141' (ECDSA) to the list of known hosts.
ubuntu2: starting namenode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-namenode-ubuntu2.out
ubuntu: starting namenode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-namenode-ubuntu.out
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.26.141' (ECDSA) to the list of known hosts.
ubuntu: Warning: Permanently added 'ubuntu,192.168.26.140' (ECDSA) to the list of known hosts.
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.26.142' (ECDSA) to the list of known hosts.
ubuntu: starting datanode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-datanode-ubuntu.out
ubuntu2: starting datanode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-datanode-ubuntu2.out
ubuntu3: starting datanode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-datanode-ubuntu3.out
Starting journal nodes [ubuntu ubuntu2 ubuntu3]
ubuntu3: Warning: Permanently added 'ubuntu3,192.168.26.142' (ECDSA) to the list of known hosts.
ubuntu: Warning: Permanently added 'ubuntu,192.168.26.140' (ECDSA) to the list of known hosts.
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.26.141' (ECDSA) to the list of known hosts.
ubuntu: starting journalnode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-journalnode-ubuntu.out
ubuntu3: starting journalnode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-journalnode-ubuntu3.out
ubuntu2: starting journalnode, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-journalnode-ubuntu2.out
18/06/09 21:36:58 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting ZK Failover Controllers on NN hosts [ubuntu ubuntu2]
ubuntu2: Warning: Permanently added 'ubuntu2,192.168.26.141' (ECDSA) to the list of known hosts.
ubuntu: Warning: Permanently added 'ubuntu,192.168.26.140' (ECDSA) to the list of known hosts.
ubuntu2: starting zkfc, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-zkfc-ubuntu2.out
ubuntu: starting zkfc, logging to /home/xiaoye/hadoop/logs/hadoop-xiaoye-zkfc-ubuntu.out
xiaoye@ubuntu:~$ jps
2746 NameNode
3252 DFSZKFailoverController
2573 QuorumPeerMain
2861 DataNode
3324 Jps
3060 JournalNode

2.3 Start Spark

Run the following command on each of the three nodes:

xiaoye@ubuntu:~$ ./spark/sbin/start-all.sh 
org.apache.spark.deploy.master.Master running as process 3359.  Stop it first.
localhost: Warning: Permanently added 'localhost' (ECDSA) to the list of known hosts.
localhost: starting org.apache.spark.deploy.worker.Worker, logging to /home/xiaoye/spark/logs/spark-xiaoye-org.apache.spark.deploy.worker.Worker-1-ubuntu.out

Afterwards, run jps on each node to check the processes that started.

On the leader node:

xiaoye@ubuntu:~$ jps
2746 NameNode
3359 Master
3252 DFSZKFailoverController
3550 Worker
2573 QuorumPeerMain
2861 DataNode
3583 Jps
3060 JournalNode

On the other two nodes:

xiaoye@ubuntu2:~$ jps
2488 JournalNode
2748 Worker
2784 Jps
2654 Master
2229 QuorumPeerMain
2377 DataNode

xiaoye@ubuntu3:~$ jps
2510 Master
2582 Worker
2298 DataNode
2402 JournalNode
2665 Jps
2198 QuorumPeerMain

3. Verify that the startup succeeded

Check in the web UI:

(Screenshots of the Spark master web UI.)

4. Verify HA (high availability)

Manually kill the Master process on ubuntu and watch whether a standby takes over automatically:

xiaoye@ubuntu:~$ jps
3656 Jps
2746 NameNode
3359 Master
3252 DFSZKFailoverController
2573 QuorumPeerMain
2861 DataNode
3060 JournalNode
xiaoye@ubuntu:~$ kill -9 3359
xiaoye@ubuntu:~$ jps
2746 NameNode
3252 DFSZKFailoverController
2573 QuorumPeerMain
2861 DataNode
3666 Jps
3060 JournalNode

Check the web UI:

(Screenshot of the master web UI on ubuntu after the kill.)

Then check whether one of the other two nodes has successfully taken over as the live master:

(Screenshot: the master web UI on ubuntu3 showing status ALIVE.)

ubuntu3 automatically became the live master, so the failover works.

5. Running Spark programs on standalone

5.1 Run a first Spark program, using an example that ships with Spark

Run this on the master:

xiaoye@ubuntu3:~$ /home/xiaoye/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://ubuntu3:7077 --executor-memory 500m --total-executor-cores 1 /home/xiaoye/spark/examples/jars/spark-examples_2.11-2.3.0.jar  100


The value --master spark://ubuntu3:7077 is taken from the spot shown in the figure below:

(Screenshot: where the spark://ubuntu3:7077 URL appears in the master web UI.)

The result of the run:

(Screenshot of the run's output.)

We will come back to what the parameters mean later.
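As a side note on what SparkPi actually computes: it is a Monte Carlo estimate of pi. The lines below are a simplified sketch of the idea, not the exact source that ships with Spark, and can be pasted into the spark shell:

// Simplified sketch of the SparkPi idea (not the exact example source):
// throw random points into the square [-1,1] x [-1,1] and count how many
// land inside the unit circle; that fraction approximates pi/4.
val slices = 100                          // plays the role of the "100" argument above
val n = 100000 * slices                   // total number of sample points
val inside = sc.parallelize(1 until n, slices).map { _ =>
  val x = math.random * 2 - 1
  val y = math.random * 2 - 1
  if (x * x + y * y <= 1) 1 else 0
}.reduce(_ + _)
println(s"Pi is roughly ${4.0 * inside / n}")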

5.2 Start the Spark shell

xiaoye@ubuntu3:~$ /home/xiaoye/spark/bin/spark-shell \
> --master spark://ubuntu3:7077 \
> --executor-memory 500m \
> --total-executor-cores 1 

Parameter notes (these can also be verified from inside the shell, as shown after the startup log below):

--master spark://ubuntu3:7077: the address of the Master

--executor-memory 500m: gives each executor 500 MB of memory

--total-executor-cores 1: the application uses one CPU core across the whole cluster

xiaoye@ubuntu3:~$ /home/xiaoye/spark/bin/spark-shell \
> --master spark://ubuntu3:7077 \
> --executor-memory 500m
2018-06-09 23:21:53 WARN  Utils:66 - Your hostname, ubuntu3 resolves to a loopback address: 127.0.1.1; using 192.168.26.142 instead (on interface ens33)
2018-06-09 23:21:53 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2018-06-09 23:21:55 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context Web UI available at http://ubuntu3:4040
Spark context available as 'sc' (master = spark://ubuntu3:7077, app id = app-20180609232223-0001).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 2.3.0
      /_/
         
Using Scala version 2.11.8 (OpenJDK 64-Bit Server VM, Java 1.8.0_162)
Type in expressions to have them evaluated.
Type :help for more information.

scala> 
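Once you are at the scala> prompt, you can check that the submit options above were actually applied. These are the standard Spark properties the command-line flags map to (a default string is supplied so the lookup does not fail if a flag was omitted):

scala> sc.master                                          // should print spark://ubuntu3:7077
scala> sc.getConf.get("spark.executor.memory", "not set") // set by --executor-memory
scala> sc.getConf.get("spark.cores.max", "not set")       // set by --total-executor-cores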

Note:

If you start the Spark shell without specifying a master address, it still starts and can run programs, but you are actually in Spark's local mode: a single process is launched on the local machine, with no connection to the cluster.

The Spark shell has already created a SparkContext and bound it to the object sc; code that needs it can use sc directly.

Likewise, a SparkSession has already been created and bound to the object spark; code that needs it can use spark directly.
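A quick illustration of both objects from the scala> prompt (just a sanity check, unrelated to the word count that follows):

scala> sc.master                      // spark://ubuntu3:7077 here; local[*] in local mode
scala> sc.parallelize(1 to 10).sum()  // build a small RDD with sc and sum it: 55.0
scala> spark.range(5).show()          // use the SparkSession to build a tiny Dataset (0 to 4)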

5.3 Write the word count program in the Spark shell

Create a file and put some arbitrary content in it:

xiaoye@ubuntu3:~$ cat classes/aa.txt 
hello world!
aa aa
d
d 
dg
g

Then upload it to HDFS:

xiaoye@ubuntu3:~/hadoop$ ./hadoop/ fs -mkdir -p /spark
-bash: ./hadoop/: Is a directory
xiaoye@ubuntu3:~/hadoop$ hadoop fs -mkdir -p /spark
18/06/09 23:32:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
xiaoye@ubuntu3:~/hadoop$ hadoop fs -ls /spark
18/06/09 23:33:12 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
xiaoye@ubuntu3:~/hadoop$ hadoop fs -put classes/aa.txt /spark
18/06/09 23:34:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
put: `classes/aa.txt': No such file or directory
xiaoye@ubuntu3:~/hadoop$ ls
bin             etc                  hadoop   lib      sbin   tmp
bin-mapreduce1  examples             include  libexec  share
cloudera        examples-mapreduce1  journal  logs     src
xiaoye@ubuntu3:~/hadoop$ cd ..
xiaoye@ubuntu3:~$ hadoop fs -put classes/aa.txt /spark
18/06/09 23:34:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
xiaoye@ubuntu3:~$ hadoop fs -ls /spark
18/06/09 23:34:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 1 items
-rw-r--r--   2 xiaoye supergroup         29 2018-06-09 23:34 /spark/aa.txt
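Before running the job, a quick way to confirm Spark can actually read the file is to inspect it from the spark shell (assuming the shell started in 5.2 is still open):

scala> sc.textFile("/spark/aa.txt").count()   // should print 6, the number of lines in aa.txt
scala> sc.textFile("/spark/aa.txt").first()   // should print: hello world!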

5.4 Write the Spark program in Scala in the Spark shell, splitting the data on spaces

scala> sc.textFile("/spark/aa.txt").flatMap(_.split(" ")).map((_,1)).reduceByKey(_+_).saveAsTextFile("/spark/out")

Explanation (a step-by-step equivalent follows these notes):

sc is the SparkContext object, the entry point for submitting Spark programs

textFile("/spark/aa.txt") reads the data from HDFS

flatMap(_.split(" ")) maps each line to its words and then flattens the result

map((_,1)) turns each word into a (word, 1) tuple

reduceByKey(_+_) reduces by key, adding up the values

saveAsTextFile("/spark/out") writes the result to HDFS
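The same pipeline written step by step, so each transformation is visible (equivalent to the one-liner above; collect() prints to the driver instead of writing to HDFS):

scala> val lines  = sc.textFile("/spark/aa.txt")    // RDD[String], one element per line
scala> val words  = lines.flatMap(_.split(" "))     // split each line on spaces and flatten
scala> val pairs  = words.map(word => (word, 1))    // pair every word with the count 1
scala> val counts = pairs.reduceByKey(_ + _)        // add up the counts for each word
scala> counts.collect().foreach(println)            // e.g. (hello,1), (aa,2), ...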

After it finishes, check the result in HDFS:

xiaoye@ubuntu3:~$ hadoop fs -cat /spark/out/p*
18/06/09 23:52:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
(d,2)
(aa,2)
(hello,1)
(world!,1)
(dg,1)
(g,1)

The result is exactly what we wanted.
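If you want the words ordered by frequency, a small extra step (not part of the original run) is to sort before printing:

scala> sc.textFile("/spark/aa.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).sortBy(_._2, ascending = false).collect().foreach(println)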


6. Running Spark programs on YARN

First, the ZooKeeper cluster, the HDFS cluster, and the YARN cluster must all be running.

Start the Spark shell on YARN:

xiaoye@ubuntu3:~$ spark-shell --master yarn --deploy-mode client

Startup takes a while; once it is up:

(Screenshot of the Spark shell running on YARN.)

The processes now running are:

xiaoye@ubuntu3:~$ jps
2298 DataNode
5150 ExecutorLauncher
3950 ResourceManager
2402 JournalNode
5375 Jps
5275 CoarseGrainedExecutorBackend
5274 CoarseGrainedExecutorBackend
4092 NodeManager
2198 QuorumPeerMain
4424 SparkSubmit
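Inside this YARN-backed shell, sc works exactly as in standalone mode, so the word count from 5.4 runs unchanged. For example (the output path /spark/out-yarn is just an arbitrary new directory and must not exist yet):

scala> sc.master    // should print: yarn
scala> sc.textFile("/spark/aa.txt").flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).saveAsTextFile("/spark/out-yarn")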
