Apache Spark
Duration: 3 days

Description:
Apache Spark has become one of the most popular computational frameworks for processing big data and streaming information. With its rich set of libraries and highly optimized computational model, companies can now process massive amounts of information and derive insights in record time using Spark's machine learning and graph libraries.
In this course, we’ll bring you up to speed on Apache Spark and its libraries (such as MLlib and GraphX).
Objectives:
- Learn how Apache Spark achieves near-linear horizontal scale
- Learn the fundamental principles of assembling a distributed algorithm
- Learn about Spark’s RDDs, DataFrames, and Datasets
- Learn Spark Streaming (as well as Structured Streaming)
- Learn how to use Spark MLlib to build machine learning algorithms
- Learn how to use GraphX to build graph algorithms
Audience
- Aspiring Spark Programmers
- Information Architects
- Data Analysts
- Data Engineers
- Data Scientists
Outline
The fundamentals
- What is Big Data?
- Why horizontal scaling?
- The fundamental problems, theories, and solutions in distributed computing
- What is Spark?
- Why Spark?
Resilient Distributed Datasets (RDDs)
- Functional programming in a nutshell
- Programming with RDDs
- Building distributed algorithms using RDDs
- How and why it works
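The RDD topics above center on the map/reduce style of distributed computation. Below is a minimal sketch of the classic word-count pattern (flatMap, map, reduceByKey) simulated in plain Python over hand-made "partitions"; no Spark API or cluster is assumed, and the data is invented for illustration:

```python
from collections import defaultdict

# Simulated partitions of an input dataset, as an RDD would hold them
# across the cluster (sample data invented for illustration)
partitions = [
    ["spark makes big data simple", "big data at scale"],
    ["spark streaming and spark sql"],
]

# flatMap: split every line into words
words = [w for part in partitions for line in part for w in line.split()]

# map: pair each word with an initial count of 1
pairs = [(w, 1) for w in words]

# reduceByKey: sum counts per word (Spark would shuffle by key first)
counts = defaultdict(int)
for word, n in pairs:
    counts[word] += n

print(dict(counts))
```

In real Spark, each stage runs in parallel on the partition that holds the data, and only the reduce step moves records between machines, which is what makes the pattern scale horizontally.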
DataFrames and Datasets
- What are DataFrames?
- What are Datasets?
- Spark SQL
- Building distributed algorithms with DataFrames and Datasets
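A DataFrame is structured data with named columns, queried declaratively. The sketch below mimics a typical filter-then-aggregate query in plain Python over a list of rows; the comments show the corresponding DataFrame API calls, and the table of employees is invented for illustration:

```python
# Rows with named columns, as a DataFrame would hold them
# (sample data invented for illustration)
rows = [
    {"name": "Alice", "dept": "eng", "salary": 100},
    {"name": "Bob",   "dept": "eng", "salary": 90},
    {"name": "Cara",  "dept": "ops", "salary": 80},
]

# Roughly df.filter(col("dept") == "eng") in the DataFrame API
eng = [r for r in rows if r["dept"] == "eng"]

# Roughly df.groupBy("dept").avg("salary"), for the one remaining group
avg_salary = sum(r["salary"] for r in eng) / len(eng)

print(avg_salary)  # 95.0
```

The declarative form matters: because Spark SQL sees the whole query rather than opaque functions, its optimizer can reorder filters, prune columns, and pick join strategies before anything runs.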
Streaming in Spark
- What is streaming?
- How does Spark handle streaming?
- Structured Streaming vs. Spark Streaming
- Streaming from Kafka
- Other streaming platforms
- Distributed algorithms and Spark Streaming
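Spark's original streaming model processes data as a sequence of micro-batches while maintaining state across them. A minimal plain-Python sketch of that idea, with invented event batches standing in for what Spark would deliver on each interval:

```python
# Incoming micro-batches, as Spark would deliver them on a fixed interval
# (event names invented for illustration)
batches = [
    ["click", "view", "click"],
    ["view", "view"],
    ["click"],
]

# Running state carried across batches, much like a stateful streaming
# aggregation keeps totals as new micro-batches arrive
state = {}
for batch in batches:
    for event in batch:
        state[event] = state.get(event, 0) + 1
    # after each iteration, 'state' holds the totals seen so far

print(state)
```

Structured Streaming keeps this incremental-state idea but expresses the computation as a query over an unbounded table, so the same DataFrame code can describe both batch and streaming jobs.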
MLlib
- An introduction to machine learning
- An overview of MLlib
- Machine learning use cases
- Building machine learning pipelines
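The pipeline idea above is simply a chain of transform stages applied in order, the way MLlib composes stages such as a tokenizer and a hashed term-frequency step before a model. A plain-Python sketch of that composition, with both stages written as ordinary functions (the bucket count and sample texts are invented for illustration):

```python
def tokenize(texts):
    # split each document into words, like a Tokenizer stage
    return [t.split() for t in texts]

def hash_features(token_lists, num_buckets=8):
    # hashed term-frequency vectors, in the spirit of a HashingTF stage
    vecs = []
    for tokens in token_lists:
        vec = [0] * num_buckets
        for tok in tokens:
            vec[hash(tok) % num_buckets] += 1
        vecs.append(vec)
    return vecs

# A pipeline is just the stages in order; fitting a model would be the
# final stage in a real workflow
pipeline = [tokenize, hash_features]

data = ["spark is fast", "ml pipelines compose stages"]
result = data
for stage in pipeline:
    result = stage(result)

print(result)  # one fixed-length feature vector per input document
```

Packaging the stages as one pipeline object is what lets the same preprocessing be fitted on training data and replayed identically at prediction time.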
GraphX
- Graph theory and algorithms
- What is GraphX?
- Some common graph problems
- Examples of GraphX solutions
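One of the common graph problems above is finding connected components, which GraphX provides out of the box. A minimal single-machine sketch in plain Python, using a breadth-first search over a tiny invented edge list (no GraphX API assumed):

```python
from collections import deque

# Undirected edges forming two components: {1, 2, 3} and {4, 5}
# (edge list invented for illustration)
edges = [(1, 2), (2, 3), (4, 5)]

# Build an adjacency list
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

def component_of(start):
    # breadth-first search collecting every vertex reachable from start
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

print(component_of(1))  # {1, 2, 3}
print(component_of(4))  # {4, 5}
```

On a graph too large for one machine, GraphX expresses the same idea as iterative message passing between vertices, so each superstep runs in parallel across the cluster.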