This post shows how to write a Spark application in IntelliJ IDEA that reads a file from HDFS and computes word frequencies. The project is built with Maven.
I. Install the Scala plugin
Go to File -> Settings -> Plugins, search for "Scala" in the Marketplace and install it, then restart IDEA when prompted.
II. Create a Maven project
1. File -> New -> Project, select Maven, and click Next.
2. Enter a project name and set a GroupId if you want (it can also be left as the default), then click Finish.
III. Add the pom dependencies
1. On the server, run ./bin/spark-shell from the Spark installation directory to check the Spark and Scala versions.
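The startup banner prints both versions; you can also query them from inside the shell. A minimal check in the Scala REPL (sc is the SparkContext that spark-shell creates for you):

sc.version                    // Spark version, e.g. 2.1.0
util.Properties.versionString // Scala version, e.g. "version 2.11.8"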
2. Add the following to pom.xml, changing spark.version and scala.version to match your own setup. Note that scala.version here must be the Scala binary version (for example 2.11, not 2.11.8), because it is appended to the Spark artifact ids. Only spark-core is strictly required for this word count; the remaining Spark modules are optional extras.
<properties>
    <spark.version>2.1.0</spark.version>
    <scala.version>2.11</scala.version>
</properties>
<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-hive_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.version}</artifactId>
        <version>${spark.version}</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.scala-tools</groupId>
            <artifactId>maven-scala-plugin</artifactId>
            <version>2.15.2</version>
            <executions>
                <execution>
                    <!-- Bind the Scala compiler to the build so that mvn package
                         actually compiles the .scala sources into the jar -->
                    <goals>
                        <goal>compile</goal>
                        <goal>testCompile</goal>
                    </goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
With the pom modified as above, click the Reload Maven button in the Maven panel so IDEA picks up the new dependencies.
IV. Write the program
1. Go to File -> Project Structure -> Global Libraries and add the Scala SDK.
2. Right-click the project, choose Add Framework Support from the menu, find Scala in the list of checkboxes on the left, and tick it.
3. Rename the project's java folder to scala and delete the resources folder. Then create the package com.spark.wordcount and, inside it, a new Scala class, choosing the Object kind.
The resulting project structure is shown below.
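A sketch of the expected layout (assuming the project is named spark-test1, matching the jar name used in the packaging step below):

spark-test1
├── pom.xml
└── src
    └── main
        └── scala
            └── com
                └── spark
                    └── wordcount
                        └── WordCount.scala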
4. Write the code in WordCount.scala. Note the package declaration, which must match the --class argument passed to spark-submit later:
package com.spark.wordcount

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext

object WordCount {
  def main(args: Array[String]): Unit = {
    // Input file on HDFS; replace the host, port, and path with your own
    val inputFile = "hdfs://39.106.229.57:9000/home/spark/users.txt"
    val conf = new SparkConf().setAppName("WordCount").setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(inputFile)
    // Split each line on spaces, map every word to (word, 1), then sum the counts per word
    val wordCount = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
    wordCount.foreach(println)
    sc.stop() // release the SparkContext when done
  }
}
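One caveat about the last transformation: foreach(println) runs on the executors, and it only happens to print in your console here because the master is local. On a cluster you would bring the result to the driver or write it back to HDFS instead; a minimal sketch, with a hypothetical output path:

// Small result sets only: collect() brings the whole RDD to the driver first
wordCount.collect().foreach(println)
// Or write the counts back to HDFS (the output directory must not exist yet)
wordCount.saveAsTextFile("hdfs://39.106.229.57:9000/home/spark/wordcount-output")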
V. Package the jar and run it on the server
1. First right-click WordCount.scala and run it once. A successful run prints red INFO messages; you can stop it after a few seconds. If you skip this step, running the jar on the server fails (see the error at the end of this section).
2. Open the Maven panel on the right and double-click package under Lifecycle to build the jar (equivalently, run mvn package from the project root).
3. The finished jar ends up in the spark-test1/target directory. Upload it to the server and run it with the command below.
/usr/hdp/spark/bin/spark-submit --class com.spark.wordcount.WordCount /root/codeTest/spark-test1-1.0-SNAPSHOT.jar

What each part means:
/usr/hdp/spark/bin/spark-submit: the spark-submit script inside the Spark installation directory.
--class com.spark.wordcount.WordCount: the fully qualified name of the main object, i.e. the package declaration (com.spark.wordcount) plus the object name (WordCount) in WordCount.scala.
/root/codeTest/spark-test1-1.0-SNAPSHOT.jar: the location of the jar on the server.
If you skip step 1 of this section, running the jar fails with the following error, meaning spark-submit cannot find com.spark.wordcount.WordCount inside the jar (binding the maven-scala-plugin's compile goals as in the pom above is meant to prevent the same problem, since it makes mvn package compile the Scala sources):
java.lang.ClassNotFoundException: com.spark.wordcount.WordCount
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:229)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:700)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:187)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
After performing step 1, the job runs successfully. The output is quite verbose; scroll down through the logs to find the word counts.
Copyright notice: this is an original post by the author, released under the CC 4.0 BY-SA license. Please include the original source link and this notice when reposting.
Original link: https://blog.csdn.net/Wing_kin666/article/details/111246201