This article describes how to create a Spark Scala project and use the Spark client (spark-submit) to run big-data jobs on a Spark cluster.
Prerequisites
- Install Java 8 and Maven 3.5.4
- Copy the company's settings.xml into the ~/.m2 directory (create the directory if it does not exist)
- Install IntelliJ IDEA
- Install Scala SDK 2.11.8; see the detailed installation guide here
Create a Java Maven project
Create a Java Maven project in IntelliJ IDEA. It is recommended to add a child module and write the actual code in that module rather than in the parent project; the parent project inherits from spring-boot:
<parent>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-parent</artifactId>
  <version>2.1.6.RELEASE</version>
</parent>
The Spark module needs the maven-scala-plugin to compile the Scala sources, plus the maven-assembly-plugin to build a jar-with-dependencies for submission. The Spark dependencies are declared with provided scope, since the cluster already supplies Spark at runtime:
<properties>
  <spark.edition>2.11</spark.edition>
  <spark.version>2.1.1</spark.version>
</properties>

<build>
  <plugins>
    <plugin>
      <groupId>org.scala-tools</groupId>
      <artifactId>maven-scala-plugin</artifactId>
      <version>2.15.2</version>
      <executions>
        <execution>
          <id>scala-compile-first</id>
          <phase>process-resources</phase>
          <goals>
            <goal>add-source</goal>
            <goal>compile</goal>
          </goals>
        </execution>
        <execution>
          <id>scala-test-compile</id>
          <phase>process-test-resources</phase>
          <goals>
            <goal>testCompile</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-assembly-plugin</artifactId>
      <configuration>
        <finalName>${project.artifactId}-${project.version}</finalName>
        <descriptorRefs>
          <descriptorRef>jar-with-dependencies</descriptorRef>
        </descriptorRefs>
      </configuration>
      <executions>
        <execution>
          <phase>package</phase>
          <goals>
            <goal>single</goal>
          </goals>
        </execution>
      </executions>
    </plugin>
  </plugins>
</build>

<dependencies>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-hive_${spark.edition}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
  <dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_${spark.edition}</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
  </dependency>
</dependencies>
Create HelloWorld.scala
The code is as follows:
object HelloWorld {
  def main(args: Array[String]): Unit = {
    println("Hello world!")
  }
}
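HelloWorld only verifies that the project compiles and packages. A real job would normally create a SparkSession and read/write data on the cluster; the sketch below is only illustrative (the object name and the input/output paths are assumptions, not part of the original project):

import org.apache.spark.sql.SparkSession

object WordCountJob {
  def main(args: Array[String]): Unit = {
    // enableHiveSupport matches the spark-hive dependency declared above
    val spark = SparkSession.builder()
      .appName("WordCountJob")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._

    // Hypothetical HDFS paths; replace with real input/output locations
    val lines = spark.read.textFile("hdfs:///tmp/wordcount/input.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))   // split each line into words
      .groupBy("value")           // a Dataset[String] exposes its column as "value"
      .count()

    counts.write.mode("overwrite").parquet("hdfs:///tmp/wordcount/output")
    spark.stop()
  }
}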
Package and run
mvn -DskipTests clean package
Then upload the xxx-jar-with-dependencies.jar produced under the module's target/ directory to ip-10-54-241-217 (you may need to request access to the host first).
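One way to copy the jar up, assuming you have SSH access to the host (the user name and remote path below are illustrative):

scp target/xxx-jar-with-dependencies.jar your-user@ip-10-54-241-217:~/

Then log in to the host and switch to the hadoop user: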
sudo su - hadoop
Run the jar
spark-submit \
--queue offline \
--num-executors 800 \
--executor-cores 4 \
--executor-memory 12G \
--driver-memory 4G \
--conf spark.executor.memoryOverhead=2G \
--conf spark.driver.maxResultSize=10G \
--conf spark.sql.shuffle.partitions=1000 \
--conf spark.default.parallelism=1000 \
--conf spark.speculation=true \
--conf spark.speculation.interval=10000ms \
--conf spark.speculation.multiplier=4 \
--conf spark.speculation.quantile=0.75 \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=1024 \
--conf spark.driver.extraJavaOptions=-XX:+UseG1GC \
--conf spark.executor.extraJavaOptions=-XX:+UseG1GC \
--conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2 \
--class $main_class_path \
xxx-jar-with-dependencies.jar
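Here $main_class_path is the fully qualified name of the object whose main method should run. For the HelloWorld example above, which declares no package, it would simply be:

main_class_path=HelloWorld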