Assignment 4: Hadoop

Background

The goal of this assignment is to learn how to evaluate the performance and limitations of distributed batch processing systems.

Motivation

Batch processing frameworks generally provides an easy to use framework for developing and executing tasks in parallell on a single computer, and in a network of computers. MapReduce is a batch processing framework, and Apache Hadoop implements this framework as an open source project. The motivation for this assignment is to familiarize yourself with the advantages and limitations of batch processing frameworks through practical use of Apache Hadoop MapReduce.

Task

Evaluate the performace of Apache Hadoop MapReduce for the following workloads:

  • A workload where each step depends on the output of the previous step, at least use Fibonacci Numbers.
  • A parallelizable workload with multiple barriers, at least use Shearsort.
  • A workload that is easy to parallelize, at least use the wordcount example from the Apache Hadoop Mapreduce getting started guide.

You should test the workloads in a Hadoop configurations on a single machine, and how the workloads scale with one to (at least) ten slaves.

 

Getting started with Apache Hadoop (MapReduce)

Visit http://hadoop.apache.org/index.html#Getting+Started and read their getting started guide.

De som ?nsker ? lese om MapReduce fra et litt annet perspektiv kan ta en titt p? MapReduce tutorial p? Google Code University.

Machines

For this assignment you can use your own machines, or machines at IFI.

Machines at IFI are usually marked with their name. So you should be able to find machines by visiting bachelor and master labs. Bachelor machines can also be found using:

Find name of bachelor labs:
$ ls /hom/peder/opt/termstue/share/maps/
Show machines on a lab:
$ ~termvakt/bin/termstue <ROMNAVN>
Example:
$ ~termvakt/bin/termstue assembler

Note: The bachelor machines runs the Idle Job Killer script. This script kills any processes on machines where you are not logged in. One solution is to be logged in and do something on every machine, another is to set a nice value (this is explained in the emails Idle Job Killer sends you when it kills a process).

 

Assignment

Solve the task in a group of two (or alone). Present your experiences and results orally in the course on November 19. It is mandatory to write a report that in the specified format and deliver it by November 19, using Devilry. Keep in mind that it is possible to update the report until November.

It is mandatory to present your group's results on November 19. You do not have to prepare a formal presentation (like a Powerpoint foilset); however, you must at least show the measurement results that are included in your report and that you discuss in class. The discussions in class are supposed to help you improve your report for final delivery. It is recommended that you have a web page or a PDF document that is web-accessible from an arbitrary computer.

Report

The written report has up to 4 pages in ACM format (see right column). It is expect that such a report includes: a description of the assignment, a description of the testbed, an explanation of the metrics that were chosen to present the measurement results visually, graphs showing the results, an interpretation of the graphs.

The results must be based on the own tests.

The report is evaluated by writing quality, by the trustworthiness and correctness of the results. The evaluation does not consider whether related work (citations of other papers) is included. It is not necessary to cite existing work in this report.

Published Aug. 24, 2012 8:24 AM - Last modified Nov. 5, 2012 8:09 AM

Log in to comment