This is the classic Spark WordCount example. It demonstrates many aspects of Spark's distributed datasets, as well as the use of Map/Reduce.

  • Example 1
import org.apache.spark.sql.{DataFrame, SparkSession}

object WordCount {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("WordCount").getOrCreate()

    // textFile is an RDD[String]; each line of the text file is one element
    val textFile = spark.sparkContext.textFile("file:///tmp/WordCount.txt")

    val words = textFile
      .flatMap(line => line.split(" "))
      .map((_, 1))        // map(word => (word, 1)): map each word to a (word, 1) pair
      .reduceByKey(_ + _) // reduce by key; the result is an RDD of (word, count) pairs

    words.saveAsTextFile("file:///tmp/saved_word_count")  // the RDD is saved in partitions


    import spark.implicits._
    val df: DataFrame = words.toDF   // with spark implicits, the RDD can easily be converted to a DataFrame

    df.show

    spark.stop
  }
}
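In Example 1, calling toDF without arguments gives the DataFrame the default column names _1 and _2. A minimal sketch of naming the columns explicitly (it reuses the words RDD and spark.implicits._ from above; the column names here are an assumption, not part of the original):

// toDF can take explicit column names instead of the default _1 / _2
val named: DataFrame = words.toDF("word", "count")
named.show()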

Open the /tmp/saved_word_count folder; it contains the data stored in 2 partitions.

user@/tmp/saved_word_count $ ls
part-00000  part-00001  _SUCCESS

user@/tmp/saved_word_count $ cat part-00000
(Africa,3)
(birds,1)
(means,1)
(surrounding,1)
(is,1)
(have,1)
(region,1)
(one,1)
(jealousy,1)
...
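The number of part files matches the number of partitions of the RDD. If a single output file is preferred, a minimal sketch (the output path below is an assumption) is to coalesce the words RDD from Example 1 before saving:

// coalesce merges the RDD's partitions into one, so only one part file is written
words
  .coalesce(1)
  .saveAsTextFile("file:///tmp/saved_word_count_single")   // produces a single part-00000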
  • Example 2
import org.apache.spark.sql.SparkSession

object WordCount2 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("QuickStart").getOrCreate()
  import spark.implicits._

  val textFile = spark.read.textFile("file:///tmp/WordCount.txt")

  val wordCounts = textFile.flatMap(line => line.split(" ")).groupByKey(identity).count()
  // wordCounts is a Dataset of (word, count) pairs: groupByKey(identity) groups the words by their own value, and count() counts each group

  wordCounts.toDF.show(200)

  spark.stop
}
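For comparison, here is a minimal sketch (WordCount3, and everything in it, is an assumption rather than part of the original examples) of the same count done purely with the DataFrame API, using split and explode from org.apache.spark.sql.functions:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, split}

object WordCount3 extends App {
  val spark: SparkSession = SparkSession.builder().master("local[*]").appName("WordCount3").getOrCreate()

  // Dataset[String] with a single "value" column
  val lines = spark.read.textFile("file:///tmp/WordCount.txt")

  // split each line into words, explode into one row per word, then group and count
  val wordCounts = lines
    .select(explode(split(lines("value"), " ")).as("word"))
    .groupBy("word")
    .count()

  wordCounts.show(200)

  spark.stop
}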