Hadoop
[1] BigData-Notes/Hadoop集群环境搭建.md at master · heibaiying/BigData-Notes (github.com)
Installation follows the official guide: Apache Hadoop 3.3.3 – Hadoop: Setting up a Single Node Cluster.
How it works
An HDFS read proceeds as follows:
- The client issues a request to read a file.
- The NameNode returns the file's block information and the location of every block, including the locations of all replicas of each block (i.e., the addresses of the DataNodes holding each replica).
- With the block information in hand, the client talks directly to the DataNodes that hold the blocks and reads them in parallel.
Once the client has the NameNode's information about each block, it chooses, based on network topology, the closest DataNode to read that block from. If communication with a DataNode fails, it falls back to another nearby DataNode, marks the failed DataNode so it is not contacted again, and reports the failed node to the NameNode.
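As a concrete way to see this block and replica metadata, the following hedged sketch uses `hdfs fsck` (the file path is hypothetical) to print each block of a file and the DataNodes holding its replicas:

```bash
# Hypothetical file path; prints the block IDs and the DataNode location of every replica
hdfs fsck /user/hadoop/input/data.txt -files -blocks -locations
```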
Prerequisites
Set the following environment variables:
JAVA_HOME={path_to_jdk_dir}
HADOOP_HOME={path_to_hadoop_home}
Passwordless SSH must work between all machines:
- Change each server's hostname.
- :star: In `/etc/hosts`, map the IP address of every machine in the cluster to its hostname (use the real IP for the local machine rather than localhost).
- Generate a key pair under `~/.ssh` and copy the public key to the other machines with `ssh-copy-id` (see the sketch after this list).
- Verify that you can log in to the other hosts without a password using `ssh {hostname}`.
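A minimal sketch of these steps, assuming the same user exists on every host; the hostnames are illustrative, standing in for the names registered in /etc/hosts:

```bash
# On every node: generate a key pair if one does not exist yet
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to every other node (hostnames are illustrative)
ssh-copy-id hadoop-master
ssh-copy-id hadoop-slave1
ssh-copy-id hadoop-slave2

# Verify passwordless login
ssh hadoop-slave1 hostname
```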
Configuration
Edit the configuration files under `$HADOOP_HOME/etc/hadoop` as follows.
1. core-site.xml
The `fs.defaultFS` setting makes HDFS a file abstraction over a cluster, so that its root is not the same as the local system's.
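A minimal core-site.xml for this setup might look like the following sketch; the NameNode hostname and port are assumptions (matching the `hadoop-master` name used later), not necessarily the values from the original post:

```xml
<configuration>
    <!-- Assumed values: point the default filesystem at the NameNode -->
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://hadoop-master:9000</value>
    </property>
</configuration>
```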
2. hdfs-site.xml
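As a hedged example, hdfs-site.xml typically sets the replication factor and the local storage directories; the values and paths below are illustrative, not the original configuration:

```xml
<configuration>
    <!-- Illustrative values: number of replicas and local directories for metadata/block storage -->
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/dfs/name</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/dfs/data</value>
    </property>
</configuration>
```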
3. hadoop-env.sh
```bash
export JAVA_HOME=/opt/jdk
```
4. yarn-site.xml (optional; can be skipped if you do not use distributed computation)
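A hedged sketch of yarn-site.xml: the ResourceManager hostname is an assumption (again reusing `hadoop-master`), and the shuffle auxiliary service is the standard setting for running MapReduce on YARN:

```xml
<configuration>
    <!-- Assumed values: where the ResourceManager runs and the shuffle service NodeManagers provide -->
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>hadoop-master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
</configuration>
```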
5. mapred-site.xml (optional; can be skipped if you do not use MapReduce)
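A hedged sketch of mapred-site.xml; the single property below is the standard way to tell MapReduce to run on YARN, though the original post may have set more:

```xml
<configuration>
    <!-- Run MapReduce jobs on YARN instead of the default local runner -->
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
</configuration>
```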
6. workers
In the `workers` file in the same directory, list the hostnames of all DataNode machines.
```
hadoop-master
```
After editing the configuration files, format the NameNode's filesystem on the NameNode (i.e., the master) with `hdfs namenode -format`; this command does not need to be run on the slave nodes. Then start the services with `sbin/start-all.sh`.
Because this is a cluster, the start script brings up the corresponding daemons on all machines.
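A hedged sketch of these two steps plus a quick health check, assuming HADOOP_HOME is set on the master:

```bash
# Run on the master only
$HADOOP_HOME/bin/hdfs namenode -format     # one-time format of the NameNode's filesystem
$HADOOP_HOME/sbin/start-all.sh             # starts the HDFS and YARN daemons on all nodes

# Quick checks
jps                                        # the master should list NameNode and ResourceManager
$HADOOP_HOME/bin/hdfs dfsadmin -report     # shows the live DataNodes
```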
Typically one machine in the cluster is designated as the NameNode and another machine as the ResourceManager, exclusively. These are the masters. Other services (such as Web App Proxy Server and MapReduce Job History server) are usually run either on dedicated hardware or on shared infrastructure, depending upon the load.
The rest of the machines in the cluster act as both DataNode and NodeManager. These are the workers.
[hadoop slaves_猎人在吃肉的博客-CSDN博客](https://blog.csdn.net/xiaojin21cen/article/details/42421781)
The slaves file (the workers file in newer releases): typically you choose exactly one machine in the cluster as the NameNode and one as the ResourceManager; these are the masters. The remaining machines act as DataNodes and NodeManagers; these are the slaves. List every slave's hostname or IP address, one per line, in the etc/hadoop/slaves file under your Hadoop directory.
Spark
RDD Programming Guide - Spark 3.2.1 Documentation (apache.org)
To run Spark applications in Python without pip installing PySpark, use the `bin/spark-submit` script located in the Spark directory. This script will load Spark's Java/Scala libraries and allow you to submit applications to a cluster. You can also use `bin/pyspark` to launch an interactive Python shell.
Applications are submitted to the cluster with `spark-submit`; connecting to the cluster requires some initialization first.
Connecting to the cluster
The first thing a Spark program must do is to create a SparkContext object, which tells Spark how to access a cluster. To create a `SparkContext` you first need to build a `SparkConf` object that contains information about your application.
```python
conf = SparkConf().setAppName(appName).setMaster(master)
sc = SparkContext(conf=conf)
```
The `appName` parameter is a name for your application to show on the cluster UI. `master` is a Spark, Mesos or YARN cluster URL, or a special "local" string to run in local mode. In practice, when running on a cluster, you will not want to hardcode `master` in the program, but rather launch the application with `spark-submit` and receive it there. However, for local testing and unit tests, you can pass "local" to run Spark in-process.
Connecting to a cluster means creating a SparkContext object that describes how to reach the cluster; the application name and the master (roughly, the type of cluster) are passed in through the configuration.
```bash
./bin/spark-submit \
    ...
```
Unlike other cluster managers supported by Spark in which the master's address is specified in the `--master` parameter, in YARN mode the ResourceManager's address is picked up from the Hadoop configuration. Thus, the `--master` parameter is `yarn`.
When running on Hadoop YARN, the master passed here can simply be `yarn` rather than a `spark://...` URL; see the official documentation for the full list of script options.
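As a hedged illustration (the deploy mode and the example script are assumptions, not the original post's command; `pi.py` ships with the Spark distribution), a submission to YARN might look like this:

```bash
# Assumes HADOOP_CONF_DIR or YARN_CONF_DIR points at the Hadoop configuration directory
./bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  examples/src/main/python/pi.py 10
```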
PySpark applications start with initializing `SparkSession`, which is the entry point of PySpark, as below. In case of running it in the PySpark shell via the pyspark executable, the shell automatically creates the session in the variable `spark` for users.
A Spark application is started through a SparkSession; in the interactive `bin/pyspark` shell the session has already been created.
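A minimal sketch of creating the session in a standalone script (the application name is arbitrary):

```python
from pyspark.sql import SparkSession

# Create (or reuse) the SparkSession; in the pyspark shell this already exists as `spark`
spark = SparkSession.builder \
    .appName("example-app") \
    .getOrCreate()

print(spark.version)
spark.stop()
```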
Running on a cluster (Cluster Manager Types)
Running Spark on YARN - Spark 3.2.1 Documentation (apache.org)
The system currently supports several cluster managers:
- Standalone – a simple cluster manager included with Spark that makes it easy to set up a cluster.
- Apache Mesos – a general cluster manager that can also run Hadoop MapReduce and service applications. (Deprecated)
- Hadoop YARN – the resource manager in Hadoop 2.
- Kubernetes – an open-source system for automating deployment, scaling, and management of containerized applications.
Spark supports the four cluster managers above. Apart from local testing, this tutorial runs Spark on Hadoop YARN.
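A minimal sketch of such a notebook-style cell (the `# %%` line is just a cell marker; the job itself, a trivial RDD sum, is an assumption), assuming HADOOP_CONF_DIR points at the cluster's Hadoop configuration so that `yarn` can be used as the master:

```python
# %%
from pyspark import SparkConf, SparkContext

# Connect through YARN; the ResourceManager address is read from the Hadoop configuration
conf = SparkConf().setAppName("yarn-example").setMaster("yarn")
sc = SparkContext(conf=conf)

# Trivial job to confirm the executors are running on the cluster
print(sc.parallelize(range(100)).sum())
sc.stop()
```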
- Permalink: https://blog.charjin.top/2022/07/03/linux/hadoop-spark-tutorial/
- License: unless otherwise stated, all posts on this blog are licensed under CC BY-NC-SA 3.0 CN. Please credit the source when reposting!