Apache Spark with Scala for High-Performance Big Data Processing
Understanding Apache Spark is essential to a successful career in data science, and this course helps solidify that understanding.
Big Data eLearning presents Apache Spark with Scala: Become a Spark Expert in 21 Days. The objective of this course is to provide comprehensive training on Apache Spark. If you are an absolute beginner, this Apache Spark tutorial is perfect for your needs; if you already understand the basics of Apache Spark and would like to deepen your knowledge, the course is an equally apt choice.
Note that this course doesn't teach the Scala programming language in depth; however, a basic understanding of Scala is helpful.
From the fundamentals of RDDs to advanced features like Spark SQL and Streaming, you will learn everything you need to know about Apache Spark in 21 days.
Course Objectives
In this course, we cover a wide variety of Spark topics, including:

- An introduction to Spark and Spark installation
- RDD programming and key-value pair RDDs
- Partitioning in Spark
- Spark SQL tutorial
- The latest Spark abstractions: DataFrame and Dataset
- An exploration of the file systems Spark supports
- Spark UDFs
- An understanding of shared variables: accumulators and broadcast variables
- How to tune and debug Spark applications
- Various concepts related to Structured Streaming
- An introduction to machine learning and how the MLlib library can be used for machine learning use cases
Who Should Take This Course?
The course is designed to provide a solid foundation in Apache Spark for those who aspire to make a career move into big data. Anyone looking to become a Spark expert or simply deepen their understanding of Apache Spark will also benefit from this course.
The course is ideal for:
- Spark/big data aspirants
- Big data engineers
- Big data developers
- Big data architects
- Data scientists
- Analytics professionals

Prerequisites for This Course
A basic understanding of Scala is a plus. However, it is not mandatory.
Why Learn Apache Spark?
More and more companies are using big data technologies such as Apache Hive, Spark, and Pig to process ever-growing amounts of data, derive valuable insights, and make better business decisions.
Spark can process huge volumes of data with ease: there's no need to write complex MapReduce programs, and development time is significantly reduced. Spark can also handle the full range of ETL tasks, making it a versatile tool for big data.
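To make that concrete, here is a minimal word-count sketch in Scala (an illustrative example rather than course material; it assumes Spark 2.x and a hypothetical local input file):

```scala
import org.apache.spark.sql.SparkSession

object WordCount {
  def main(args: Array[String]): Unit = {
    // Local session for experimentation; in production you'd submit to a cluster
    val spark = SparkSession.builder()
      .appName("WordCount")
      .master("local[*]")
      .getOrCreate()

    // The classic word count in a few lines -- no MapReduce boilerplate required
    val lines = spark.sparkContext.textFile("input.txt") // hypothetical input path
    val counts = lines
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}
```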

Curriculum
Apache Spark 2 with Scala
- Apache Spark Introduction
- How to Install Spark on Windows
- Getting Familiar with the Scala and Python Shells
- Persistence and Storage Levels
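By the end of this module, code like the following sketch should feel natural (it assumes the spark-shell, where the SparkContext `sc` is already created):

```scala
import org.apache.spark.storage.StorageLevel

// Create an RDD and transform it
val nums = sc.parallelize(1 to 1000000)
val squares = nums.map(n => n.toLong * n)

// Cache across actions; spills to disk if the partitions don't fit in memory
squares.persist(StorageLevel.MEMORY_AND_DISK)

println(squares.count()) // first action computes and caches the RDD
println(squares.sum())   // second action reads from the cache
```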
Key-Value Pair RDDs
- What Are Pair RDDs?
- How to Create Pair RDDs
- Transformations of Single Pair RDDs
- The combineByKey Pair RDD Transformation
- Transformations on Multi-Pair RDDs
- Actions on Pair RDDs
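A short sketch of what this module covers (again assuming the spark-shell's pre-created `sc`; the data is a made-up example):

```scala
// Pair RDDs are simply RDDs of (key, value) tuples
val sales = sc.parallelize(Seq(("apples", 3), ("oranges", 2), ("apples", 4)))

// Transformation on a single pair RDD: aggregate values per key
val totals = sales.reduceByKey(_ + _) // ("apples", 7), ("oranges", 2)

// Transformation across two pair RDDs: join on the key
val prices = sc.parallelize(Seq(("apples", 0.5), ("oranges", 0.75)))
val joined = totals.join(prices)      // ("apples", (7, 0.5)), ...

// Action on a pair RDD
println(totals.collectAsMap())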
Partitioning in Spark - RDD Partitioning
- What Is RDD Partitioning?
- Why Should We Partition the Data in Spark?
- How to Create Partitions in an RDD
- Determining the Number of Partitions
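As a taste of the partitioning material, here is a minimal sketch (assuming the spark-shell; the partition counts are arbitrary):

```scala
import org.apache.spark.HashPartitioner

// Ask for 4 partitions up front
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)), numSlices = 4)
println(pairs.getNumPartitions) // 4

// Repartition by key so records with the same key land in the same partition;
// caching avoids reshuffling every time the RDD is reused
val byKey = pairs.partitionBy(new HashPartitioner(8)).cache()
println(byKey.getNumPartitions) // 8
```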
Spark SQL
- What Is Spark SQL?
- What Is a DataFrame?
- How to Programmatically Specify a Schema
- DataFrame Operations
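For a flavor of this module, here is a sketch that builds a DataFrame from a programmatically specified schema and queries it two ways (the data and app name are hypothetical):

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val spark = SparkSession.builder().appName("SparkSqlSketch").master("local[*]").getOrCreate()
import spark.implicits._

// Programmatically specified schema
val schema = StructType(Seq(
  StructField("name", StringType, nullable = false),
  StructField("age",  IntegerType, nullable = true)
))
val rows = spark.sparkContext.parallelize(Seq(Row("Alice", 34), Row("Bob", 28)))
val people = spark.createDataFrame(rows, schema)

// DataFrame operations via the DSL and via SQL
people.filter($"age" > 30).select("name").show()
people.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```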
Loading and Saving Your Data
- How to Create a DataFrame from a JSON File
- How to Create a DataFrame from an RDBMS
- How to Create a DataFrame from Elasticsearch
- How to Create a DataFrame from a Parquet File
- How to Create a DataFrame from a Text File
- How to Create a DataFrame from a CSV File
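The unified read/write API makes most of these one-liners, as in this sketch (it assumes an existing SparkSession named `spark`; all paths and connection settings are hypothetical placeholders, and the Elasticsearch case additionally needs the separate elasticsearch-hadoop connector):

```scala
val fromJson    = spark.read.json("people.json")
val fromCsv     = spark.read.option("header", "true").option("inferSchema", "true").csv("people.csv")
val fromParquet = spark.read.parquet("people.parquet")
val fromText    = spark.read.textFile("people.txt") // Dataset[String], one line per row

// RDBMS via JDBC (the JDBC driver jar must be on the classpath)
val fromJdbc = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/mydb")
  .option("dbtable", "people")
  .option("user", "dbuser")
  .option("password", "dbpass")
  .load()

// Saving uses the mirror-image write API
fromJson.write.mode("overwrite").parquet("people_out.parquet")
```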
Programming with Datasets
- What Is a Dataset?
- How to Create a Dataset: Four Ways
- RDD vs. DataFrame vs. Dataset
- Joining Datasets: How to Join Two Datasets
- How to Remove the Duplicate Column When Joining Datasets
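Here is a small sketch of the Dataset join material (assuming an existing SparkSession named `spark`; the case classes and data are made up):

```scala
case class Person(id: Long, name: String)
case class Purchase(id: Long, amount: Double)

import spark.implicits._

// Two of the ways to create a Dataset: from local Seqs via toDS()
val people    = Seq(Person(1, "Alice"), Person(2, "Bob")).toDS()
val purchases = Seq(Purchase(1, 9.99), Purchase(1, 5.00), Purchase(2, 3.50)).toDS()

// Joining with a Seq of column names keeps a single "id" column in the result,
// which is one way to avoid the duplicate-column problem
val joined = people.join(purchases, Seq("id"))
joined.show()
```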
File Systems Supported by Apache Spark
- Most Commonly-Used File Systems
- Which File System to Use: HDFS or Amazon S3?
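One reason the choice of file system is flexible: the read API stays the same and only the URI scheme changes, as in this sketch (assuming an existing SparkSession named `spark`; all paths are hypothetical):

```scala
val fromLocal = spark.read.parquet("file:///tmp/events")
val fromHdfs  = spark.read.parquet("hdfs://namenode:8020/data/events")
val fromS3    = spark.read.parquet("s3a://my-bucket/data/events") // needs the hadoop-aws package and credentials
```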
Spark User-Defined Functions
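A minimal UDF sketch covering both the DataFrame API and SQL registration (assuming an existing SparkSession named `spark`; the function and column names are made up):

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

val df = Seq(("alice", 34), ("bob", 28)).toDF("name", "age")

// UDF for the DataFrame API
val initialCap = udf((s: String) => if (s == null) null else s.capitalize)
df.withColumn("display_name", initialCap($"name")).show()

// UDF registered for use in SQL
spark.udf.register("initial_cap", (s: String) => if (s == null) null else s.capitalize)
df.createOrReplaceTempView("users")
spark.sql("SELECT initial_cap(name) AS display_name FROM users").show()
```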
Accumulators and Broadcast Variables
- Spark Accumulators
- Spark Broadcast Variables
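Both shared-variable types appear in this short sketch (assuming the spark-shell's pre-created `sc`; the lookup data is a toy example):

```scala
// Accumulator: a write-only counter that executors add to and the driver reads
val badRecords = sc.longAccumulator("badRecords")

// Broadcast variable: a read-only lookup table shipped once to each executor
val countryCodes = sc.broadcast(Map("US" -> "United States", "DE" -> "Germany"))

val resolved = sc.parallelize(Seq("US", "DE", "XX")).map { code =>
  if (!countryCodes.value.contains(code)) badRecords.add(1)
  countryCodes.value.getOrElse(code, "unknown")
}

resolved.count()          // an action must run before the accumulator is populated
println(badRecords.value) // 1
```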
Tuning and Debugging Spark
- Spark-Submit Commands and Flags
- How to Set Executor Cores, Number of Executors and Executor Memory for Maximum Performance
- 5 Strategies to Improve Spark Application Performance
- How to Improve Spark Performance: Tuning the Levels of Parallelism
- How to Improve Spark Performance: Use Kryo Serializer
- How to Improve Spark Performance: Tweaking Memory Parameters
- How to Improve Spark Performance: Improving Cache Policy
- How to Improve Spark Performance: Tweaking Cluster Sizing Parameters
- How to Debug a Spark Application
- Spark Web UI
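The module works with settings along these lines; the sketch below shows them on a SparkConf, with the equivalent spark-submit flags in comments (all of the specific values are illustrative starting points, not recommendations):

```scala
import org.apache.spark.SparkConf

// Equivalent flags: --executor-memory 4g --executor-cores 4 --num-executors 10
val conf = new SparkConf()
  .setAppName("TunedApp")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "4")
  .set("spark.executor.instances", "10")
  // Kryo is usually faster and more compact than Java serialization
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Tuning the level of parallelism: a common rule of thumb is 2-3 tasks per core
  .set("spark.default.parallelism", "80")
```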
Spark Structured Streaming
- What Is Spark Structured Streaming?
- DStreams vs. Streaming Datasets
- How to Create a Streaming Dataset from a Socket Source
- How to Create a Streaming Dataset from a Continuously-Updated Directory
- How to Create a Streaming DataFrame from an S3 File
- How to Write the Contents of a Streaming Dataset onto a Console
- How to Perform Window Operations on a Streaming DataFrame/Dataset
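The socket-source lesson builds toward something like this streaming word-count sketch (assuming Spark 2.x; host and port are hypothetical, and you can feed it lines with `nc -lk 9999`):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SocketWordCount").master("local[*]").getOrCreate()
import spark.implicits._

// Streaming Dataset from a socket source
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Running word counts over the unbounded input
val counts = lines.as[String]
  .flatMap(_.split(" "))
  .groupBy("value")
  .count()

// Write the full updated result to the console after every trigger
val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
```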
Machine Learning in Spark Using MLlib
- What is Machine Learning?
- How to Perform Spam Email Classification Using the MLlib Library in Spark
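A sketch of the spam-classification idea using the spark.ml Pipeline API (assuming an existing SparkSession named `spark`; the hand-labeled toy dataset is hypothetical, and real training data would be far larger):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Toy labeled data: 1.0 = spam, 0.0 = ham
val training = spark.createDataFrame(Seq(
  ("win a free prize now", 1.0),
  ("cheap meds online", 1.0),
  ("meeting moved to 3pm", 0.0),
  ("lunch tomorrow?", 0.0)
)).toDF("text", "label")

// Tokenize -> hash term frequencies -> logistic regression, chained as a Pipeline
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr        = new LogisticRegression().setMaxIter(10)
val model     = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)

// Classify unseen text
val test = spark.createDataFrame(Seq(Tuple1("free prize waiting"))).toDF("text")
model.transform(test).select("text", "prediction").show()
```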