Hadoop: Hive, Sqoop and Spark

taught by Jenny Kim

Aim of Course:

In this online course, "Hadoop: Hive, Sqoop and Spark," you will expand on the topics from the "Introduction to Analytics using Hadoop" course and learn how to use higher-order tools in the Hadoop ecosystem and the Spark computing platform to perform data analysis and implement machine learning patterns on data at scale. In this course, you will learn about:

  • The software components of the Hadoop Ecosystem
  • Data loading, warehousing and manipulation with HBase, Hive, and Sqoop
  • Data aggregation and designing data workflows with Pig and Spark
  • Machine learning and data mining with Spark's MLlib library

This course may be taken individually (one-off) or as part of a certificate program.

Here's an excellent introduction to Spark, the newest component in the Hadoop ecosystem.

Course Program:

WEEK 1: The Hadoop Ecosystem and Data Warehousing and Manipulation pt. 1

  • Review basic installation and configuration of Hadoop in single-node, pseudo-distributed mode
  • Structured data querying and warehousing with Hive

WEEK 2: Data Warehousing and Manipulation pt. 2

  • Working with Hadoop’s NoSQL database HBase
  • Accessing Relational Data with Sqoop

WEEK 3: Higher Order Hadoop Programming

  • Data processing flows with Pig
  • Fast, in-memory big-data processing with Spark's Python API

WEEK 4: Machine Learning and Data Mining

  • Introduction to Data Mining and Machine Learning
  • Building a Machine Learning system with Spark's MLlib



The homework in this course consists of short answer questions to test concepts, guided exercises in writing code and guided data analysis problems using software.

This course also has example code, supplemental readings available online, and video lectures each week.

Who Should Take This Course:

Data scientists and statisticians who are familiar with Hadoop fundamentals, have programming experience, and want to learn how to process and analyze large data sets with Hadoop's distributed computing capabilities and ecosystem components.



Prerequisites:

  1. Big Data Computing with Hadoop or equivalent familiarity with Hadoop and its core components

  2. Strong understanding of MapReduce and the MapReduce API

  3. Intermediate familiarity with Python preferred

  4. “SQL and R: Introduction to Database Queries” or the equivalent familiarity with SQL and query languages

  5. Basic knowledge of operating systems (UNIX/Linux)

Organization of the Course:

This course takes place online at the Institute for 4 weeks. During each course week, you participate at times of your own choosing - there are no set times when you must be online. Course participants will be given access to a private discussion board. In class discussions led by the instructor, you can post questions, seek clarification, and interact with your fellow students and the instructor.

At the beginning of each week, you receive the relevant material, in addition to answers to exercises from the previous session. During the week, you are expected to go over the course materials, work through exercises, and submit answers. Discussion among participants is encouraged. The instructor will provide answers and comments, and at the end of the week, you will receive individual feedback on your homework answers.

Time Requirement:
About 15 hours per week, at times of your choosing.

Students come to the Institute for a variety of reasons. As you begin the course, you will be asked to specify your category:

  1. You may be interested only in learning the material presented, and not be concerned with grades or a record of completion.
  2. You may be enrolled in PASS (Programs in Analytics and Statistical Studies) that requires demonstration of proficiency in the subject, in which case your work will be assessed for a grade.
  3. You may require a "Record of Course Completion," along with professional development credit in the form of Continuing Education Units (CEUs). For those who successfully complete the course, CEUs and a record of course completion will be issued by The Institute upon request.

Course Text:

Required readings will be provided as PDF documents in the course.

Recommended texts:

Hadoop: The Definitive Guide, 3rd ed., by Tom White (O'Reilly Media).  Optional readings will be assigned from this reference.

Java Resources:

Head First Java, 2nd ed., by Kathy Sierra and Bert Bates (O’Reilly Media).  Good introductory book on Java.

Effective Java, 2nd ed., by Joshua Bloch (Addison-Wesley).  Excellent book for those familiar with Java but looking for insights into best practices and effective Java patterns.


The required software is Apache Hadoop and Java JDK 7.  Familiarity with Linux is required.  IMPORTANT:  Please continue reading below for configuration information.

Hadoop and Virtual Machines

Hadoop developers often use a “Single Node Cluster” for development tasks. This is often a virtual machine running a virtual server environment, which runs the various Hadoop daemons. You can access this VM with SSH from your main development box, just as you would access a Hadoop cluster. To create a virtual environment, you need virtualization software such as VirtualBox, VMware, or Parallels.

VirtualBox with an Ubuntu VM is used in the examples within the course material.

The installation instructions discuss how to set up an Ubuntu x64 virtual machine, and the course provides a preconfigured one for use with VMware or VirtualBox. If you’d like to use the preconfigured virtual machine instead of setting up your own, you will be able to download it from the Resources section in the course. Note that you will need a 64-bit machine in either case.

Software Development

The native API for Hadoop is written in Java, so you will need a tool to develop and compile Java. The best known are Eclipse and NetBeans, as well as IntelliJ IDEA, a popular professional IDE.

SSH on Windows

If you’re on Windows, you’ll need an SSH client such as PuTTY to connect to your VM; on Mac or Linux, you can use SSH from the terminal. Note that this class does not cover command line usage, SSH, or virtual machine setup. The best place to ask for help on these topics is the forums, and if you’re an expert on these topics, please help your fellow classmates as well!

To be Scheduled

Course Fee: $549