Apache Spark

Duration: 3 days


Apache Spark has become a popular computational framework for processing big data and streaming information. With its rich set of libraries and highly optimized computational model, companies can process massive amounts of information and derive insights in record time using Spark's machine learning and graph libraries.

In this course, we’ll bring you up to speed on Apache Spark and its libraries (such as MLlib and GraphX).


Objectives

  • Learn how Apache Spark achieves near-linear horizontal scale

  • Learn the fundamental principles of assembling a distributed algorithm

  • Learn about Spark’s RDDs, DataFrames, and Datasets

  • Learn Spark Streaming (as well as Structured Streaming)

  • Learn how to use Spark MLlib to build machine learning algorithms

  • Learn how to use GraphX to build graph algorithms


Audience

  • Aspiring Spark Programmers
  • Information Architects
  • Data Analysts
  • Data Engineers
  • Data Scientists


The fundamentals

  • What is Big Data?

  • Why horizontal scaling?

  • The fundamental problems, theories, and solutions in distributed computing

  • What is Spark?

  • Why Spark?

Resilient Distributed Dataset

  • Functional programming in a nutshell

  • Programming with RDDs

  • Building distributed algorithms using RDDs

  • How and why it works

DataFrames and Datasets

  • What are DataFrames?

  • What are Datasets?

  • Spark SQL

  • Building distributed algorithms with DataFrames and Datasets

Streaming in Spark

  • What is streaming?

  • How does Spark handle streaming?

  • Structured Streaming vs Spark Streaming

  • Streaming from Kafka

  • Other streaming platforms

  • Distributed algorithms and Spark Streaming


Machine Learning with MLlib

  • An introduction to Machine Learning

  • MLlib

  • Machine learning use cases

  • Building machine learning pipelines


Graph Processing with GraphX

  • Graph Theory and Algorithms

  • What is GraphX?

  • Some common graph problems

  • Examples of GraphX solutions