The Classic Spark WordCount Example
This is the classic Spark WordCount example. It demonstrates several aspects of Spark's distributed datasets, as well as the map/reduce style of computation.
- Example 1
import org.apache.spark.sql.{DataFrame, SparkSession}

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("WordCount").getOrCreate()
    // textFile is an RDD[String]; each line of the txt file is one element
    val textFile = spark.sparkContext.textFile("file:///tmp/WordCount.txt")
    val words = textFile
      .flatMap(line => line.split(" "))
      .map((_, 1))        // map(word => (word, 1)): turn each word into a (word, 1) pair
      .reduceByKey(_ + _) // sum the counts per key; the result is an RDD of (word, count) pairs
    words.saveAsTextFile("file:///tmp/saved_word_count") // the RDD is saved in partitions
    import spark.implicits._
    val df: DataFrame = words.toDF // with the spark implicits, the RDD converts easily to a DataFrame
    df.show()
    spark.stop()
  }
}
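A note on the toDF call above: without arguments, a tuple RDD produces the default column names _1 and _2, which is what df.show prints. A minimal sketch of naming the columns explicitly, assuming the same words RDD (the value name named is made up here):

    // hypothetical variant of the conversion above: give the tuple
    // fields real column names instead of the default _1 / _2
    val named: DataFrame = words.toDF("word", "count")
    named.show()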
Open the /tmp/saved_word_count folder; it holds the data in 2 partitions.
user@/tmp/saved_word_count $ ls
part-00000 part-00001 _SUCCESS
user@/tmp/saved_word_count $ cat part-00000
(Africa,3)
(birds,1)
(means,1)
(surrounding,1)
(is,1)
(have,1)
(region,1)
(one,1)
(jealousy,1)
...
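The number of part files follows the RDD's partitioning. If a single output file is wanted, the RDD can be coalesced to one partition before saving; a minimal sketch, assuming the words RDD from Example 1 (the output path here is hypothetical):

    // coalesce(1) pulls everything into one partition, so a single
    // part-00000 file is written; fine for small results, a bottleneck for large ones
    words.coalesce(1).saveAsTextFile("file:///tmp/saved_word_count_single")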
- Example 2
import org.apache.spark.sql.SparkSession

object WordCount2 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("QuickStart").getOrCreate()
  import spark.implicits._
  val textFile = spark.read.textFile("file:///tmp/WordCount.txt")
  // a Dataset of (word, count) pairs; .groupByKey(identity) groups equal words
  // together, and .count() then yields the pairs directly
  val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
  wordCounts.toDF.show(200)
  spark.stop()
}
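For comparison, the same count can also be written purely with the DataFrame API, using explode and split instead of groupByKey. A minimal sketch reading the same input file; the object name WordCount3 and the column alias word are assumptions for illustration:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

object WordCount3 extends App {
  val spark = SparkSession.builder().master("local[*]").appName("WordCount3").getOrCreate()
  val lines = spark.read.textFile("file:///tmp/WordCount.txt")
  // split each line on spaces, explode the array into one word per row,
  // then count the rows in each word group
  val wordCounts = lines
    .select(explode(split(lines("value"), " ")).as("word"))
    .groupBy("word")
    .count()
  wordCounts.show(200)
  spark.stop()
}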