Apache Spark with Scala for High-Performance Big Data Processing

To embark on a successful career in data science, an understanding of Apache Spark is essential. This course helps solidify that understanding.

Enroll Now

Big Data eLearning Apache Spark with Scala: Become a Spark Expert in 21 Days. The objective of this course is to provide comprehensive training on Apache Spark. If you are an absolute beginner, this Apache Spark tutorial is perfect for your needs; if you already understand the basics of Apache Spark and would like to deepen your knowledge, this course is also an apt choice.

Note that this course does not teach the Scala programming language in depth; however, a basic understanding of Scala is useful for this course.

From the fundamentals of RDDs to advanced features like Spark SQL and Streaming, you will learn everything you need to know about Apache Spark in 21 days. 

Enroll Now

Course Objectives

In this course, we will cover a wide variety of Spark topics, including:

  • An introduction to Spark and Spark installation
  • RDD programming and key-value pair RDDs
  • Partitioning in Spark
  • Spark SQL tutorial
  • The latest Spark abstractions, DataFrame and Dataset
  • An exploration of which file systems Spark supports
  • Spark UDFs
  • An understanding of shared variables: accumulators and broadcast variables
  • How to tune and debug Spark applications
  • Various concepts related to structured streaming
  • An introduction to machine learning and how the MLlib library can be used for machine learning use cases

Who Should Take This Course?

The course is designed to provide a solid foundation in Apache Spark for those who aspire to make a career move to big data. Anyone looking to become a Spark expert or deepen their understanding of Apache Spark will also benefit from this course.

The course is ideal for:

  • Spark/Big Data aspirants
  • Big data engineers
  • Big data developers
  • Big data architects
  • Data scientists
  • Analytics professionals

Pre-Requisites for This Course

A basic understanding of Scala is a plus. However, it is not mandatory.

Enroll Now

Why Learn Apache Spark?

More and more companies are adopting big data technologies such as Apache Hive, Spark, and Pig to process ever-growing volumes of data, derive valuable insights, and make better business decisions.

Spark processes huge volumes of data with ease: there is no need to write complex MapReduce programs, and development time is significantly reduced, as the word count sketch below shows. Spark can also handle the full range of ETL tasks, making it a versatile tool for big data.
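
For a sense of how concise Spark code can be compared to classic MapReduce, here is the canonical word count example in Scala. This is a minimal sketch; the input path is a placeholder.

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("WordCount")
          .master("local[*]")
          .getOrCreate()

        // The entire word count fits in one short pipeline
        val counts = spark.sparkContext.textFile("input.txt")
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }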

Curriculum

Apache Spark 2 with Scala

  • Apache Spark Introduction 
  • How to Install Spark on Windows 
  • Getting Familiar with the Scala and Python Shells 
  • Persistence and Storage Levels 
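
As a quick taste of the shell and persistence topics above, here is a minimal sketch you could run in the Scala shell (spark-shell), which provides a ready-made session named `spark`:

    import org.apache.spark.storage.StorageLevel

    val nums = spark.sparkContext.parallelize(1 to 1000000)

    // Cache in memory, spilling to disk if the data does not fit
    nums.persist(StorageLevel.MEMORY_AND_DISK)

    println(nums.sum())   // the first action computes and caches the RDD
    println(nums.max())   // later actions reuse the cached data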

Key-Value Pair RDDs

  • What Are Pair RDDs?
  • How to Create Pair RDDs 
  • Transformations of Single Pair RDDs 
  • Combine by Key Pair RDD Transformation 
  • Transformations on Multi-Pair RDDs 
  • Actions on Pair RDDs
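
To give a flavor of this module, here is a minimal pair RDD sketch. The data is illustrative, and `sc` is the SparkContext the shell provides:

    val sales = sc.parallelize(Seq(("apples", 3), ("oranges", 5), ("apples", 2)))

    // Transformation on a single pair RDD: aggregate values by key
    val totals = sales.reduceByKey(_ + _)

    // Transformation on two pair RDDs: join on the key
    val prices = sc.parallelize(Seq(("apples", 1.5), ("oranges", 2.0)))
    val joined = totals.join(prices)   // (key, (quantity, price))

    // Action on a pair RDD
    joined.collectAsMap().foreach(println)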

Partitioning in Spark - RDD Partitioning  

  • What Is RDD Partitioning? 
  • Why Should We Partition the Data in Spark? 
  • How to Create Partitions in RDD 
  • Determining the Number of Partitions
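
For illustration, here is a minimal sketch of repartitioning a pair RDD by key (the data and partition count are illustrative):

    import org.apache.spark.HashPartitioner

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))

    // Hash-partition the RDD by key into four partitions
    val partitioned = pairs.partitionBy(new HashPartitioner(4))

    println(partitioned.getNumPartitions)   // 4
    println(partitioned.partitioner)        // Some(org.apache.spark.HashPartitioner@...)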

Spark SQL

  • What Is Spark SQL?
  • What Is a DataFrame?
  • How to Programmatically Specify a Schema 
  • DataFrame Operations 
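
Here is a minimal sketch of specifying a schema programmatically and running a DataFrame operation, assuming an existing SparkSession named `spark` (the data is illustrative):

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("name", StringType, nullable = true),
      StructField("age",  IntegerType, nullable = true)
    ))

    val rows = spark.sparkContext.parallelize(Seq(Row("Alice", 34), Row("Bob", 28)))
    val people = spark.createDataFrame(rows, schema)

    people.filter(col("age") > 30).show()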

Loading and Saving Your Data

  • How to Create a DataFrame from a JSON File 
  • How to Create a DataFrame from an RDBMS 
  • How to Create a DataFrame from Elasticsearch 
  • How to Create a DataFrame from a Parquet File 
  • How to Create a DataFrame from a Text File 
  • How to Create a DataFrame from a CSV File
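
Spark's reader API keeps all of these sources behind a uniform interface. A minimal sketch, assuming a session named `spark` (paths and connection details are placeholders):

    val fromJson    = spark.read.json("people.json")
    val fromCsv     = spark.read.option("header", "true").csv("people.csv")
    val fromParquet = spark.read.parquet("people.parquet")
    val fromText    = spark.read.textFile("notes.txt")   // returns a Dataset[String]

    // Reading from an RDBMS over JDBC (the connection details are illustrative)
    val fromJdbc = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://localhost:3306/shop")
      .option("dbtable", "orders")
      .option("user", "reader")
      .option("password", "secret")
      .load()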

Programming with Datasets 

  • What Is a Dataset? 
  • How to Create a Dataset: Four Ways 
  • RDD vs DataFrame vs Dataset 
  • How to Join Two Datasets 
  • How to Remove the Duplicate Column When Joining Datasets
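
A minimal sketch of creating Datasets from case classes and joining them without a duplicate key column (names and data are illustrative; assumes a session named `spark`):

    import spark.implicits._

    case class Employee(id: Int, name: String)
    case class Salary(id: Int, amount: Double)

    val employees = Seq(Employee(1, "Alice"), Employee(2, "Bob")).toDS()
    val salaries  = Seq(Salary(1, 90000.0), Salary(2, 75000.0)).toDS()

    // Joining on a Seq of column names keeps a single copy of the join key,
    // avoiding the duplicate "id" column a plain join condition would produce
    employees.join(salaries, Seq("id")).show()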

File Systems Supported by Apache Spark

  • Most Commonly Used File Systems 
  • Which File System to Use: HDFS or Amazon S3? 
  • Spark User-Defined Functions 
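
As a quick illustration of the UDF topic above, here is a minimal sketch (the function and data are illustrative, and a session named `spark` is assumed). Note that switching between file systems is largely a matter of the path scheme:

    import org.apache.spark.sql.functions.udf
    import spark.implicits._

    // A user-defined function that upper-cases a column value
    val toUpper = udf((s: String) => s.toUpperCase)

    val df = Seq("hdfs", "s3").toDF("fs")
    df.select(toUpper($"fs").alias("fs_upper")).show()

    // File system choice shows up in the path scheme (paths are placeholders):
    // spark.read.text("hdfs:///data/logs/")
    // spark.read.text("s3a://my-bucket/logs/")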

Accumulators and Broadcast Variables

  • Spark Accumulators 
  • Spark Broadcast Variables
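
A minimal sketch showing both shared-variable types together (the lookup data is illustrative; assumes a session named `spark`):

    val sc = spark.sparkContext

    // Accumulator: a write-only counter that tasks update and the driver reads
    val badRecords = sc.longAccumulator("badRecords")

    // Broadcast variable: a read-only lookup table shipped once per executor
    val countryNames = sc.broadcast(Map("US" -> "United States", "IN" -> "India"))

    val resolved = sc.parallelize(Seq("US", "IN", "??")).flatMap { code =>
      val name = countryNames.value.get(code)
      if (name.isEmpty) badRecords.add(1)
      name
    }

    resolved.collect().foreach(println)
    println(s"Unresolved records: ${badRecords.value}")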

Tuning and Debugging Spark

  • Spark-Submit Commands and Flags 
  • How to Set Executor Cores, the Number of Executors, and Executor Memory for Maximum Performance 
  • 5 Strategies to Improve Spark Application Performance 
  • How to Improve Spark Performance: Tuning the Levels of Parallelism 
  • How to Improve Spark Performance: Use Kryo Serializer 
  • How to Improve Spark Performance: Tweaking Memory Parameters 
  • How to Improve Spark Performance: Improving Cache Policy 
  • How to Improve Spark Performance: Tweaking Cluster Sizing Parameters 
  • How to Debug a Spark Application 
  • Spark Web UI 
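
Several of these tuning levers are plain configuration settings. A minimal sketch; the values shown are illustrative, not recommendations:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("TunedApp")
      // Use the Kryo serializer instead of default Java serialization
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Tune the level of parallelism for shuffles
      .config("spark.sql.shuffle.partitions", "200")
      // Tweak executor memory (also settable via spark-submit flags)
      .config("spark.executor.memory", "4g")
      .getOrCreate()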

Spark Structured Streaming

  • What Is Spark Structured Streaming? 
  • DStreams vs Streaming Datasets 
  • How to Create a Streaming Dataset from a Socket Source 
  • How to Create a Streaming Dataset from a Continuously-Updated Directory 
  • How to Create a Streaming DataFrame from an S3 File 
  • How to Write the Contents of a Streaming Dataset onto a Console 
  • How to Perform Window Operations on a Streaming DataFrame/Dataset 
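
A minimal sketch of a streaming word count over a socket source, written to the console sink (assumes a session named `spark`; run `nc -lk 9999` locally to feed it):

    import spark.implicits._

    // Streaming Dataset from a socket source
    val lines = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()

    // Aggregate and write the running results to the console
    val query = lines.as[String]
      .flatMap(_.split(" "))
      .groupBy("value")
      .count()
      .writeStream
      .outputMode("complete")
      .format("console")
      .start()

    query.awaitTermination()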

Machine Learning in Spark Using MLlib

  • What Is Machine Learning?
  • How to Perform Spam Email Classification Using the MLlib Library in Spark
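
A minimal sketch of the kind of pipeline involved in spam classification with MLlib (the training data and feature choices are illustrative; assumes a session named `spark`):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import spark.implicits._

    val training = Seq(
      ("win a free prize now", 1.0),
      ("meeting at 10am tomorrow", 0.0)
    ).toDF("text", "label")

    // Tokenize the text, hash tokens into feature vectors, then fit a model
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression()

    val model = new Pipeline().setStages(Array(tokenizer, hashingTF, lr)).fit(training)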

Enroll now and gain the skills you need to excel in the world of big data, making yourself a valuable asset to any company. Sign up for our free Apache Spark course today by clicking the link below:

Enroll Now