Supercharge your data with Apache Spark, a big data platform well-suited to the iterative algorithms required by graph analytics and machine learning. In this training course, you will learn to apply Spark best practices, develop solutions that run on the Apache Spark platform, and take advantage of Spark's efficient use of memory and powerful programming model.
You Will Learn How To
- Develop applications with Spark
- Work with the libraries for SQL, Streaming, and Machine Learning
- Map real-world problems to parallel algorithms
- Build business applications that integrate with Spark
Prerequisites
- Professional experience in programming at the level of:
  - the course Java Programming Introduction, or
  - the course C# Programming
- Three to six months of experience in an object-oriented programming language
Course Outline
Introduction to Spark
- Defining Big Data and Big Computation
- What is Spark?
- What are the benefits of Spark?
The Challenge of Parallelising Applications
Scaling out applications
- Identifying the performance limitations of a modern CPU
- Scaling traditional parallel processing models
Designing parallel algorithms
- Fostering parallelism through functional programming
- Mapping real-world problems to effective parallel algorithms
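A recurring idea in this unit is that pure, associative operations parallelise naturally. A small plain-Scala illustration (the data and tax rate are made up):

    // Pure functions over immutable data leave no shared state to coordinate.
    val orders = List(12.50, 8.99, 23.10, 5.00)

    // map applies an independent function to every element...
    val withTax = orders.map(amount => amount * 1.08)

    // ...and reduce combines results with an associative operation, so
    // partial results can be computed on separate workers and merged.
    val total = withTax.reduce(_ + _)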
Defining the Spark Architecture
Parallelising data structures
- Partitioning data across the cluster using Resilient Distributed Datasets (RDDs) and DataFrames
- Apportioning task execution across multiple nodes
- Running applications with the Spark execution model
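A minimal sketch of these ideas (local mode; the names and partition counts are illustrative):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PartitioningDemo")
      .master("local[4]")   // four local worker threads stand in for a cluster
      .getOrCreate()

    // An RDD explicitly split into eight partitions; each partition is
    // processed as an independent task under the Spark execution model.
    val rdd = spark.sparkContext.parallelize(1 to 1000000, numSlices = 8)
    println(rdd.getNumPartitions)   // 8

    // DataFrames are partitioned across the cluster in the same way.
    val df = spark.range(0, 1000000).toDF("id")
    println(df.rdd.getNumPartitions)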
The anatomy of a Spark cluster
- Creating resilient and fault-tolerant clusters
- Achieving scalable distributed storage
Managing the cluster
- Monitoring and administering Spark applications
- Visualising execution plans and results
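For example, execution plans can be inspected straight from code, and every running application serves a monitoring UI (port 4040 by default):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("PlanDemo").master("local[*]").getOrCreate()

    // Print the parsed, analysed, optimised and physical plans.
    val evens = spark.range(0, 1000).filter("id % 2 = 0")
    evens.explain(true)

    // While the application runs, http://localhost:4040 visualises
    // its jobs, stages, storage and SQL execution plans.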
Developing Spark Applications
Selecting the development environment
- Performing exploratory programming via the Spark shell
- Building stand-alone Spark applications
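An exploratory session might look like this; the shell pre-defines the `spark` session for you (the numbers are illustrative):

    $ spark-shell
    scala> val nums = spark.range(1, 1000000)
    scala> nums.filter($"id" % 2 === 0).count()
    res0: Long = 499999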
Working with the Spark APIs
- Programming with Scala and other supported languages
- Building applications with the core APIs
- Enriching applications with the bundled libraries
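The same style of logic, packaged as a stand-alone application built on the core RDD API and launched with spark-submit (a sketch; the input path arrives as a command-line argument):

    import org.apache.spark.sql.SparkSession

    object WordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("WordCount").getOrCreate()

        // Core API: read, transform, aggregate.
        val counts = spark.sparkContext.textFile(args(0))
          .flatMap(_.split("\\s+"))
          .map(word => (word, 1))
          .reduceByKey(_ + _)

        counts.take(10).foreach(println)
        spark.stop()
      }
    }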
Manipulating Structured Data with Spark SQL
Querying structured data
- Processing queries with DataFrames and embedded SQL
- Extending SQL with User-Defined Functions (UDFs)
- Exploiting Parquet and JSON formatted data sets
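A sketch tying these pieces together (the file paths, column names and UDF are hypothetical):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder()
      .appName("SqlDemo").master("local[*]").getOrCreate()
    import spark.implicits._

    // Load a JSON data set as a DataFrame and expose it to SQL.
    val people = spark.read.json("people.json")
    people.createOrReplaceTempView("people")

    // Extend SQL with a user-defined function.
    spark.udf.register("shout", (s: String) => s.toUpperCase)
    spark.sql("SELECT shout(name) FROM people WHERE age > 21").show()

    // The same UDF, used through the DataFrame API.
    val shout = udf((s: String) => s.toUpperCase)
    people.select(shout($"name")).show()

    // Persist the result in Parquet format.
    people.write.parquet("people.parquet")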
Integrating with external systems
- Connecting to databases with JDBC
- Executing Hive queries in external applications
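For instance, loading a table over JDBC (reusing the `spark` session from the previous sketch; the connection details are placeholders for a real database):

    // Read a database table as a DataFrame via JDBC.
    val dbOrders = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "reporting")
      .option("password", sys.env("DB_PASSWORD"))
      .load()

    dbOrders.groupBy("region").count().show()

Hive queries work the same way through spark.sql once the session is built with enableHiveSupport().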
Processing Streaming Data in Spark
What is streaming?
- Implementing sliding window operations
- Determining state from continuous data
- Processing simultaneous streams
- Improving performance and reliability
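A sliding-window sketch using the streaming API (the socket source, intervals and checkpoint path are illustrative):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("WindowDemo").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batches
    ssc.checkpoint("checkpoint/")   // supports fault-tolerant stateful processing

    // Count words over a 30-second window that slides every 10 seconds.
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    val counts = words.map((_, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))

    counts.print()
    ssc.start()
    ssc.awaitTermination()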
Streaming data sources
- Streaming from built-in sources (e.g., log files, sockets, Twitter, Kinesis, Kafka)
- Developing custom receivers
- Processing with the streaming API and Spark SQL
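As one example, consuming a Kafka topic via the spark-streaming-kafka-0-10 connector (the broker address, topic and group id are placeholders):

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010.KafkaUtils
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    val ssc = new StreamingContext(
      new SparkConf().setAppName("KafkaDemo").setMaster("local[2]"), Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers"  -> "kafka:9092",
      "key.deserializer"   -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id"           -> "spark-course-demo")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("events"), kafkaParams))

    stream.map(_.value).print()
    ssc.start()
    ssc.awaitTermination()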
Performing Machine Learning with Spark
Classifying observations
- Predicting outcomes with supervised learning
- Building a decision tree classifier
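A sketch with spark.ml (assumes a SparkSession `spark` as in earlier sketches; the LIBSVM file path is hypothetical):

    import org.apache.spark.ml.classification.DecisionTreeClassifier

    // A labelled training set in LIBSVM format.
    val training = spark.read.format("libsvm").load("training.libsvm")

    val tree = new DecisionTreeClassifier()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxDepth(5)

    val model = tree.fit(training)
    model.transform(training).select("label", "prediction").show(5)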
Identifying patterns
- Grouping data using unsupervised learning
- Clustering with the k-means method
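And the unsupervised counterpart, reusing the feature vectors loaded in the sketch above (k = 3 is illustrative):

    import org.apache.spark.ml.clustering.KMeans

    val kmeans = new KMeans().setK(3).setSeed(42L)
    val clusters = kmeans.fit(training)

    clusters.clusterCenters.foreach(println)
    clusters.transform(training).select("features", "prediction").show(5)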
Creating Real-World Applications
Building Spark-based business applications
- Exposing Spark via a RESTful web service
- Generating Spark-based dashboards
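One way to expose a Spark computation over HTTP, sketched with the JDK's built-in server to stay dependency-free (the endpoint, port and Parquet path are made up; assumes a SparkSession `spark`):

    import java.net.InetSocketAddress
    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}

    val server = HttpServer.create(new InetSocketAddress(8080), 0)
    server.createContext("/rowcount", new HttpHandler {
      override def handle(exchange: HttpExchange): Unit = {
        // Run a Spark query per request and return the result as JSON.
        val count = spark.read.parquet("people.parquet").count()
        val body = s"""{"rows": $count}""".getBytes("UTF-8")
        exchange.getResponseHeaders.add("Content-Type", "application/json")
        exchange.sendResponseHeaders(200, body.length.toLong)
        exchange.getResponseBody.write(body)
        exchange.close()
      }
    })
    server.start()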
Spark as a service
- Cloud vs. on-premises
- Choosing a service provider (e.g., AWS, Azure, Databricks)
The Future of Spark
- Scaling to massive cluster sizes
- Enhancing security on multi-tenant clusters
- Tracking the ongoing commercialisation of Spark
- Project Tungsten: pushing performance closer to the limits of modern hardware
- Working with existing projects powered by Spark
- Re-architecting Spark for mobile platforms