Apache Spark doesn’t have to feel intimidating. This free meetup provided a clear, practical entry point into distributed data processing, focusing on core concepts and real examples in Scala. It was designed to help you take your first steps with Spark.
The Video
The talk was recorded, and the video was uploaded to YouTube. You can find it on the life michael YouTube channel.
The Slides
The slides were uploaded to SlideShare, where you can also find the slides of every other talk I have delivered over the past years.
Code Samples
During the meetup, we reviewed several code samples, most of which appear in the slides. The first one is a kind of Hello World. The following is the content of the Program.scala file. It is a simple program that just creates a new SparkSession object.
import org.apache.spark.sql.SparkSession

object Program {
  def main(args: Array[String]): Unit = {
    println("Hello! Scala is Working!")
    val spark = SparkSession
      .builder()
      .appName("Hello Spark!")
      .master("local[*]")
      .getOrCreate()
    println("Spark Session was created!")
    println("Go to http://localhost:4040 to see Spark UI")
    println("Press Enter to exit...")
    scala.io.StdIn.readLine() // Waits for user input
  }
}
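To compile and run these samples you need the Spark SQL library on the classpath. The following build.sbt is a minimal sketch; the Scala and Spark version numbers are assumptions you should adjust to the versions installed on your machine.
// build.sbt - a minimal sketch; the version numbers below are assumptions,
// adjust them to the Scala and Spark versions you actually use
ThisBuild / scalaVersion := "2.13.12"

libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.5.1"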
The second code sample we reviewed shows how to work with the RDD API directly.
import org.apache.spark.sql.SparkSession

object WordCountDemo {
  def main(args: Array[String]): Unit = {
    // Create a Spark session
    val spark = SparkSession.builder
      .appName("Word Count Example")
      .master("local[*]") // Use all available cores
      .getOrCreate()
    // The SparkContext is needed for creating RDDs directly
    val sc = spark.sparkContext
    val lines = sc.parallelize(Seq(
      "hello world",
      "hello spark",
      "hello scala"
    ))
    println(lines)
    // Word count logic
    val words = lines.flatMap(_.split("\\s+"))
    println(words)
    words.collect().foreach {
      word => println(s"word=$word")
    }
    val wordsCountOfOne = words.map(word => (word, 1))
    println(wordsCountOfOne)
    wordsCountOfOne.collect().foreach {
      case (word, count) => println(s"word=$word count=$count")
    }
    val wordsCount = wordsCountOfOne.reduceByKey(_ + _)
    println(wordsCount)
    wordsCount.collect().foreach {
      case (word, count) => println(s"$word: $count")
    }
    // Stop Spark
    // spark.stop()
  }
}
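The same word count can also start from a text file instead of an in-memory sequence. The following is a minimal sketch of that variation; the input.txt path and the object name are hypothetical, so adjust them to your own project.
import org.apache.spark.sql.SparkSession

object WordCountFromFileDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("Word Count From File")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext
    // "input.txt" is a hypothetical file; each line holds free text
    val lines = sc.textFile("input.txt")
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.collect().foreach { case (word, count) => println(s"$word: $count") }
    spark.stop()
  }
}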
The third code sample shows how to work with the DataFrame. It uses the employees.csv file you can find below.
name,department,salary
Alice,Engineering,80000
Bob,HR,60000
Charlie,Engineering,90000
Rona,Engineering,95000
David,HR,65000
Tal,HR,25000
Mosh,HR,15000
Eve,Marketing,70000
The following Scala code reads the data from that CSV file and works on it.
import org.apache.spark.sql.{SparkSession, functions => F}

object AverageSalaryPerDepartment {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Average Salary Per Department")
      .master("local[*]")
      .getOrCreate()
    // Read the CSV file into a DataFrame
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("employees.csv")
    // Compute average salary per department
    val avgSalaryDF = df
      .groupBy("department")
      .agg(F.avg("salary").alias("average_salary"))
    // Show the result
    avgSalaryDF.show()
    spark.stop()
  }
}
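The same aggregation can also be expressed as an SQL query once the DataFrame is registered as a temporary view. The following snippet is a sketch of that approach; it assumes the df and spark values from the sample above and belongs before the call to spark.stop().
// Register the DataFrame as a temporary view and run the same query with SQL
df.createOrReplaceTempView("employees")
val avgSalarySQL = spark.sql(
  "SELECT department, AVG(salary) AS average_salary " +
  "FROM employees GROUP BY department")
avgSalarySQL.show()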
The last code sample we reviewed in the meetup shows how to use the Dataset. A DataFrame is actually a Dataset of Row objects. This code sample uses the following CSV file.
category,value
A,10
B,20
A,30
C,15
B,5
The following code reads the data from this file, calculates the average value for each category, and writes its output to a new file.
import org.apache.spark.sql
import org.apache.spark.sql.functions._
import org.apache.spark.sql.{Dataset, SparkSession}

object Main {
  case class Record(category: String, value: Double)

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession.builder()
      .appName("Functional Spark Project")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._ // Must come after spark is defined
    // Load and parse data
    val rawDF: sql.DataFrame = spark.read
      .option("header", "true")
      .csv("data/sample.csv")
    val ds: Dataset[Record] = rawDF
      .filter($"value".isNotNull)
      .map(row => Record(row.getString(0), row.getString(1).toDouble))
    // Compute average value per category
    val result = ds.groupBy("category")
      .agg(avg($"value").as("avg_value"))
      .orderBy($"avg_value".desc)
    result.show()
    result.write.mode("overwrite").csv("data/output")
    spark.stop()
  }
}
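Since a DataFrame is a Dataset of Row objects, it can also be turned into a typed Dataset without the manual map used above. The following is a minimal sketch of that alternative; it assumes the Record case class and the spark session from the sample, and casts the value column explicitly so it matches the Double field.
import spark.implicits._

// A DataFrame is a Dataset[Row]; with the value column cast to Double,
// its rows can be mapped onto the Record case class directly
val typedDS: Dataset[Record] = spark.read
  .option("header", "true")
  .csv("data/sample.csv")
  .withColumn("value", $"value".cast("double"))
  .as[Record]
typedDS.show()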
Our meetups usually take place on the first Tuesday of every month. You can find more information about them and join for free at https://www.meetup.com/lifemichael.