Docker + Hadoop + Spark: A Simple Environment Setup

Docker

Installation

sudo dnf -y install dnf-plugins-core
sudo dnf config-manager --add-repo https://download.docker.com/linux/fedora/docker-ce.repo
sudo dnf install docker-ce docker-ce-cli containerd.io

# start the Docker service
service docker start
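
As a quick sanity check, the standard hello-world image can be run once the service is up:

# should print a "Hello from Docker!" message if the installation works
sudo docker run hello-world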

Get the image

This setup uses the Docker image from NUS (National University of Singapore), which contains:

  • ubuntu
  • jdk 1.8.0_191 (/usr/java)
  • Hadoop 2.8.5 (/usr/local/hadoop)
  • Spark 2.2.0 (/usr/local/spark)

Create the containers

docker run -it -h master --name master nusbigdatacs4225/ubuntu-with-hadoop-spark
docker run -it -h slave01 --name slave01 nusbigdatacs4225/ubuntu-with-hadoop-spark
docker run -it -h slave02 --name slave02 nusbigdatacs4225/ubuntu-with-hadoop-spark

Other operations

  • Exit a container: exit
  • List containers: sudo docker ps [-a]
  • Restart a container: sudo docker container start [name]
  • Enter a container's command line: sudo docker attach [name]
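
For example, if the containers have been stopped (say, after a host reboot), a minimal sketch for bringing all three back up with the commands above:

for c in master slave01 slave02; do sudo docker container start "$c"; done
# then attach to whichever one you need, e.g.
sudo docker attach master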

Hadoop

Check the container IPs

Run ifconfig inside each of the three containers to find its IP (they can also be read from the host, as shown after the list). Suppose the IPs are:

  • master: 172.17.0.2
  • slave01:172.17.0.3
  • slave02:172.17.0.4
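
A sketch for reading the same addresses from the host with docker inspect (the Go template just prints each container's bridge-network IP):

for c in master slave01 slave02; do
  sudo docker inspect -f "$c {{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}" "$c"
done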

Configure the container IPs

Fill the corresponding IPs into the /etc/hosts file of each container:

vi /etc/hosts
# append at the end of the file
172.17.0.2 master
172.17.0.3 slave01
172.17.0.4 slave02
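
A quick way to confirm the names resolve is to ping the other nodes from each container (assuming ping is available in the image; otherwise install iputils-ping first):

ping -c 1 slave01
ping -c 1 slave02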

Then edit /usr/local/hadoop/etc/hadoop/slaves (vi /usr/local/hadoop/etc/hadoop/slaves) and add slave01 and slave02, as sketched below.
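
After editing, the slaves file simply lists the worker hostnames, one per line:

# /usr/local/hadoop/etc/hadoop/slaves
slave01
slave02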

Run HDFS

Format HDFS and start it:

cd /usr/local/hadoop
bin/hdfs namenode -format
sbin/start-all.sh
# hdfs commands
/usr/local/hadoop/bin/hdfs dfs -[command] [arguments]
# to stop everything, use
sbin/stop-all.sh
# to check which daemons are running, use
jps
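
Besides jps, the cluster state can be checked from the master with dfsadmin (it should report two live datanodes if slave01 and slave02 joined correctly):

/usr/local/hadoop/bin/hdfs dfsadmin -report
# a simple hdfs dfs example: list the root of the filesystem
/usr/local/hadoop/bin/hdfs dfs -ls /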

Spark

Configure the Hadoop and Java paths:

vi /usr/local/spark/conf/spark-env.sh
# >>> append at the end of the file >>>
export JAVA_HOME=/usr/java/jdk1.8.0_191
export HADOOP_HOME=/usr/local/hadoop
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export SPARK_MASTER_IP=172.17.0.2
export SPARK_DIST_CLASSPATH=$(/usr/local/hadoop/bin/hadoop classpath)

Edit /usr/local/spark/conf/slaves (vi /usr/local/spark/conf/slaves) and configure the worker nodes: localhost, slave01, slave02 (one per line).
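
With spark-env.sh and slaves in place, the standalone cluster can be started from the master (a sketch, assuming SSH between the containers is already set up in the image; the wordcount script below runs in local mode, so this is only needed if you want to use the cluster):

/usr/local/spark/sbin/start-all.sh
# jps should now additionally show Master on the master and Worker on the slaves
jps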

WordCount

Running with MapReduce

  1. Create the user directory: /usr/local/hadoop/bin/hdfs dfs -mkdir -p /user/
  2. Upload input (first put the files you want to word-count into a local directory named input): /usr/local/hadoop/bin/hdfs dfs -put input /user/
  3. Run the wordcount example (a quick check of the output is shown below): /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.5.jar wordcount /user/input /user/output
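
The result can then be read back from HDFS (MapReduce writes its output as part-r-* files):

/usr/local/hadoop/bin/hdfs dfs -ls /user/output
/usr/local/hadoop/bin/hdfs dfs -cat /user/output/part-r-00000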

Running with Spark

  1. HDFS must already be running, with the input files uploaded as above.

  2. Install pyspark inside the master container:

apt-get update
apt-get install python3-pip
pip3 install pyspark
  3. Create a symlink named python pointing to python3:
cd /usr/bin
ln -s python3 python
  4. Implement wordcount in Python, saved as wordcount.py:
from pyspark import SparkContext
from time import time

start = time()
# run in local mode with the application name "wordcount"
sc = SparkContext('local', 'wordcount')

# read every file under /user/input on HDFS
text_file = sc.textFile("hdfs://master:9000/user/input")

# split lines into words, emit (word, 1) pairs, and sum the counts per word
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)

# write the result back to HDFS (the target directory must not exist yet)
counts.saveAsTextFile("hdfs://master:9000/user/output")

elapsed = time() - start
print("Time used:", int(elapsed * 1000), "ms")
  5. Run it: python wordcount.py
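
One caveat: saveAsTextFile refuses to write into a directory that already exists, so if the MapReduce example above has already produced /user/output, remove it (or change the output path in wordcount.py) before running the script; afterwards the result can be read back from HDFS:

# before running: clear the old output directory if it exists
/usr/local/hadoop/bin/hdfs dfs -rm -r /user/output
# after running: inspect the Spark output
/usr/local/hadoop/bin/hdfs dfs -cat '/user/output/part-*'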