
Building Big Data Analytics Solutions in the Hadoop Ecosystem

Hadoop is more popular than ever and is generating data-driven business value across every industry. This course gives attendees the essential skills to build Big Data applications using Hadoop technologies such as HDFS, YARN, Apache Kafka, Apache Hive and Apache Spark, in an analytical ecosystem with Teradata components such as Teradata Database, Teradata Viewpoint, and Teradata QueryGrid.

In this course, students will have access to their own cluster to gain hands-on experience. Students will use Hadoop's distributed file system (HDFS) and process distributed datasets with Hive. In addition, students will develop applications in Spark using Scala and Python via RDDs and DataFrames.

Students will write applications using Hive and Spark and learn about common issues encountered when processing vast datasets in distributed systems.

A discussion of additional tools, Hadoop distributions, and the opportunity to ask questions of experts in Hadoop technology make this popular course an essential grounding for companies looking to implement Hadoop effectively within their enterprise.


Hive Developers, Spark Developers, Hadoop Developers, Data Scientists, Business Analysts/Data Analysts, and Data Engineers


Learning Objectives

After successfully completing this course, you will be able to:

  • Describe the issues of 'Big Data' and how they are remedied using Hadoop
  • Describe the Hadoop architecture and its core components (HDFS, YARN)
  • Load data into Hadoop from various sources (Flume, Sqoop, Kafka)
  • Use Hive to analyze unstructured and structured data at a large scale
  • Explain the importance of the Hive Metastore
  • Write applications with Spark using RDDs and Spark SQL using DataFrames
  • Use Spark SQL to analyze datasets from Hive using Hive Metastore
  • Use Spark Streaming and Structured Streaming for near-real-time analysis
  • Integrate Hadoop with Teradata (Teradata Unified Data Architecture, Teradata Viewpoint, Teradata QueryGrid)


To get the most out of this training, you should have the following knowledge or experience:

  • Some prior programming experience and the ability to use basic Linux commands
  • Experience in SQL, Scala and Python will be a distinct advantage
  • Prior Hadoop experience is a bonus

Course Content

Module 0. Introduction and Setup:

  • Introduction to the course; set up and connect to the cloud lab environment
  • Fire up Hadoop
  • Open PuTTY terminal, Firefox, and WinSCP

Module 1. Hadoop Basics:

  • Why Hadoop was developed and problems it solves
  • Architecture (HDFS and YARN)
  • Introduction to common Hadoop components
  • Hadoop components

Module 2. Ingesting Data into Hadoop: How to load data into Hadoop using several popular ingest utilities

  • Flume
  • Sqoop
  • WebHDFS
  • Kafka

Module 3. Hive Basics:

  • Introduction to how Hive works
  • What Hive is (and isn’t)
  • The Hive Metastore
  • Creating tables, loading data, schema-on-read
  • Storage formats and SerDes
  • Hive with unstructured data
  • Logically partitioning tables
  • Complex Data Types (arrays, maps, structs)
  • UDFs, Joins, Explain
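
The schema-on-read topic above can be previewed without a cluster. The following is a minimal plain-Python sketch, not Hive code: the raw text is stored untouched and a schema is applied only when the data is read, with values that do not fit the declared type becoming None (Hive similarly returns NULL in that situation). The names RAW, SCHEMA, and read_with_schema, and the sample data, are hypothetical and chosen only for illustration.

```python
import csv
import io

# Raw data is kept exactly as it arrived; nothing is validated on write.
RAW = "1,alice,34.5\n2,bob,not_a_number\n3,carol,12.0\n"

# The schema lives apart from the data and is applied only at read time.
SCHEMA = (("id", int), ("name", str), ("amount", float))


def read_with_schema(raw, schema):
    """Parse raw delimited text, casting each field per the schema."""
    rows = []
    for fields in csv.reader(io.StringIO(raw)):
        row = {}
        for (name, cast), value in zip(schema, fields):
            try:
                row[name] = cast(value)
            except ValueError:
                # A value that does not fit the declared type becomes None
                # instead of rejecting the whole record.
                row[name] = None
        rows.append(row)
    return rows


rows = read_with_schema(RAW, SCHEMA)
for row in rows:
    print(row)
```

Changing SCHEMA and calling read_with_schema again reinterprets the same raw text without rewriting it, which is the practical appeal of schema-on-read.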

Module 4. Spark Architecture and Concepts:

  • Architecture (Spark versus MapReduce, Spark Building Blocks, Component Location, Execution Speed)
  • Why use Spark
  • Deployment options (Spark on Hadoop versus Spark standalone)
  • Terms and nomenclature

Module 5. Spark Core (Scala and Python): 

  • How to use Spark with Scala or Python languages
  • About Spark (Spark session, Spark shell and Zeppelin, Spark logs)
  • Setting up lab environments
  • Scala/Python for Spark (Immutables, Anonymous functions)
  • Resilient Distributed Datasets (RDDs, RDD Creation, RDD Operations, RDD Persistence)
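
As a rough preview of the immutables, anonymous functions, and RDD operations listed above, here is a minimal sketch in plain Python rather than Spark. It assumes only the standard library. The analogy is loose: Python's lazy map() and filter() stand in for RDD transformations, and reduce() stands in for an action that forces evaluation. The variable names and sample values are hypothetical.

```python
from functools import reduce

# The input is a tuple: immutable, so every step builds a new value
# instead of modifying the original, much as RDDs are never changed in place.
readings = (3, 8, 15, 4, 23, 42)

# "Transformations": map() and filter() are lazy in Python 3. They only
# describe the work; nothing is computed yet.
doubled = map(lambda x: x * 2, readings)
large = filter(lambda x: x > 10, doubled)

# "Action": reduce() consumes the pipeline and triggers the computation.
total = reduce(lambda a, b: a + b, large)
print(total)
```

In the course labs the same shape of program would be written against the Spark API on a real cluster; this sketch only illustrates the programming style.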

Module 6. Spark SQL and DataFrames:

  • About Spark SQL (SQLContext and HiveContext)
  • Spark DataFrames (DFs, DF Creation, DFs API)
  • Spark SQL (querying Hive, querying DF, spark-sql shell)
  • Speed versus ease of use

Module 7. Spark Streaming:

  • About Spark Streaming (Introduction, fundamental concepts, components, streaming sources)
  • Unstructured Streaming (DStream, Streaming Program)
  • Structured Streaming (Datasets/DataFrames, Streaming Program)
  • Structured versus Unstructured
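
The DStream model listed above treats a stream as a series of small batches. The following is a minimal plain-Python sketch of that idea, not Spark code. It simplifies in one important way: Spark Streaming forms batches by time interval, whereas this sketch batches by a fixed number of events. The function micro_batches and the sample events are hypothetical.

```python
from collections import Counter
from itertools import islice


def micro_batches(events, batch_size):
    """Yield successive lists of at most batch_size events."""
    it = iter(events)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


events = ["error", "ok", "ok", "error", "ok", "warn", "ok"]

running = Counter()   # state carried across batches
snapshots = []        # running counts as seen after each batch

for batch in micro_batches(events, 3):
    running.update(batch)            # process one micro-batch
    snapshots.append(dict(running))  # copy the current running totals

for snapshot in snapshots:
    print(snapshot)
```

Each snapshot is the result a consumer would see after one more batch arrives, which is why this style gives near-real-time rather than per-event results.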

Module 8. Hackathon:

  • Hands-on labs developing Hadoop applications from scratch

Module 9. Integrating Teradata with Hadoop Applications – optional

  • Introduction to Teradata Unified Data Architecture
  • Introduction to Teradata Viewpoint
  • Introduction to Teradata QueryGrid

How to set up Teradata QueryGrid links for:

  • Teradata-Hive
  • Teradata-Spark

How to query from:

  • Teradata-to-Hive
  • Hive-to-Teradata
  • Teradata-to-Spark
  • Spark-to-Teradata


Virtual Classroom

Virtual classrooms provide all the benefits of attending a classroom course without the need to arrange travel and accommodation. Please note that virtual courses are attended in real time, commencing on a specified date.
