first bit of MR project
132
concurrency-mapreduce/README.md
Normal file
@@ -0,0 +1,132 @@
# Map Reduce

In 2004, engineers at Google introduced a new paradigm for large-scale
parallel data processing known as MapReduce (see the original paper
[here](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf),
and make sure to look in the citations at the end). One key aspect of
MapReduce is that it makes programming such tasks on large-scale clusters easy
for developers; instead of worrying about how to manage parallelism, handle
machine crashes, and deal with the many other complexities common within
clusters of machines, the developer can just focus on writing little bits of
code (described below) while the infrastructure handles the rest.

In this project, you'll be building a simplified version of MapReduce for just
a single machine. While this is somewhat easier than building it for a cluster
of machines, there are still numerous challenges, mostly in building the
correct concurrency support. Thus, you'll have to think a bit about how to
build the MapReduce implementation, and then make it work efficiently and
correctly.

## Background

To understand how to make progress on this project, you should understand the
basics of thread creation, mutual exclusion (with locks), and
signaling/waiting (with condition variables). These are described in the
following book chapters:

- [Intro to Threads](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf)
- [Threads API](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-api.pdf)
- [Locks](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks.pdf)
- [Using Locks](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks-usage.pdf)
- [Condition Variables](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf)

Read these chapters carefully in order to prepare yourself for this project.
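
As a quick refresher on those three ideas, here is a minimal sketch that
creates some threads, protects a shared counter with a lock, and has the
parent wait on a condition variable until every thread has checked in. The
names `tracker_t`, `worker`, and `run_workers` are made up for this
illustration; they are not part of the project, and your own design may look
quite different.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

// Shared state (hypothetical). `done` is only touched while holding `lock`.
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t all_done;
    int done;   // number of workers that have checked in
    int total;  // number of workers we expect
} tracker_t;

// Each worker checks in under the lock; the last one signals the waiter.
static void *worker(void *arg) {
    tracker_t *t = (tracker_t *)arg;
    int rc = pthread_mutex_lock(&t->lock);
    assert(rc == 0);
    t->done++;
    if (t->done == t->total) {
        rc = pthread_cond_signal(&t->all_done);
        assert(rc == 0);
    }
    rc = pthread_mutex_unlock(&t->lock);
    assert(rc == 0);
    return NULL;
}

// Starts n workers, waits until all have checked in, returns the count.
int run_workers(int n) {
    assert(n > 0);
    tracker_t t = { .done = 0, .total = n };
    int rc = pthread_mutex_init(&t.lock, NULL);
    assert(rc == 0);
    rc = pthread_cond_init(&t.all_done, NULL);
    assert(rc == 0);

    pthread_t *threads = malloc(n * sizeof(*threads));
    assert(threads != NULL);
    for (int i = 0; i < n; i++) {
        rc = pthread_create(&threads[i], NULL, worker, &t);
        assert(rc == 0);
    }

    // Re-check the condition in a loop: a wakeup alone proves nothing.
    pthread_mutex_lock(&t.lock);
    while (t.done < t.total)
        pthread_cond_wait(&t.all_done, &t.lock);
    int result = t.done;
    pthread_mutex_unlock(&t.lock);

    for (int i = 0; i < n; i++) {
        rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
    }
    free(threads);
    pthread_cond_destroy(&t.all_done);
    pthread_mutex_destroy(&t.lock);
    return result;
}
```

Calling `run_workers(4)` returns once all four threads have incremented the
counter. Note the `while` loop around `pthread_cond_wait()`: the waiter always
re-checks the shared state after waking up.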

## General Idea

Let's now get into the exact code you'll have to build. The MapReduce
infrastructure you will build supports the execution of user-defined `Map()`
and `Reduce()` functions.

As described in the original paper: "`Map()`, written by the user, takes an
input pair and produces a set of intermediate key/value pairs. The MapReduce
library groups together all intermediate values associated with the same
intermediate key K and passes them to the `Reduce()` function."

"The `Reduce()` function, also written by the user, accepts an intermediate
key K and a set of values for that key. It merges together these values to
form a possibly smaller set of values; typically just zero or one output value
is produced per `Reduce()` invocation. The intermediate values are supplied to
the user's reduce function via an iterator."


A classic example, written here in pseudocode, shows how to count the number
of occurrences of each word in a set of documents:

```
map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    print key, result;
```
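
To make the pseudocode a bit more concrete, here is a rough single-threaded C
sketch of the same word count. It is only an illustration: `emit_intermediate`,
`map_document`, `reduce_word`, `free_pairs`, and the fixed-size `pairs` array
are all hypothetical names invented here, and the "grouping by key" step is
just a linear scan rather than the iterator the paper describes.

```c
#define _POSIX_C_SOURCE 200809L  // for strdup() and strtok_r()
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_PAIRS 1024  // hypothetical fixed capacity, chosen for brevity

typedef struct { char *key; char *value; } pair_t;

static pair_t pairs[MAX_PAIRS];  // all intermediate key/value pairs
static int num_pairs = 0;

// Stand-in for EmitIntermediate(): stores a private copy of one pair.
static void emit_intermediate(const char *key, const char *value) {
    assert(num_pairs < MAX_PAIRS);
    pairs[num_pairs].key = strdup(key);
    pairs[num_pairs].value = strdup(value);
    assert(pairs[num_pairs].key != NULL && pairs[num_pairs].value != NULL);
    num_pairs++;
}

// map(): emit (word, "1") for each whitespace-separated word in contents.
void map_document(const char *contents) {
    char *copy = strdup(contents);  // strtok_r() modifies its input
    assert(copy != NULL);
    char *save = NULL;
    for (char *w = strtok_r(copy, " \t\n", &save); w != NULL;
         w = strtok_r(NULL, " \t\n", &save))
        emit_intermediate(w, "1");
    free(copy);
}

// reduce(): add up every count that was emitted for this key.
int reduce_word(const char *key) {
    int result = 0;
    for (int i = 0; i < num_pairs; i++)
        if (strcmp(pairs[i].key, key) == 0)
            result += atoi(pairs[i].value);
    return result;
}

// Release all stored pairs so the sketch does not leak memory.
void free_pairs(void) {
    for (int i = 0; i < num_pairs; i++) {
        free(pairs[i].key);
        free(pairs[i].value);
    }
    num_pairs = 0;
}
```

For instance, after `map_document("the quick fox")` and
`map_document("the lazy dog the end")`, calling `reduce_word("the")` returns 3.
Nothing here is thread-safe; making the shared pair storage safe for
concurrent mappers is exactly the kind of problem this project asks you to
solve.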

What's fascinating about MapReduce is that so many different kinds of relevant
computations can be mapped onto this framework. The original paper lists many
examples, including word counting (as above), a distributed grep, a count of
URL access frequency, a reverse web-link graph application, a term-vector
per host analysis, and others.

What's also quite interesting is how easy a MapReduce computation is to
parallelize: many mappers can be running at the same time, and later, many
reducers can be running at the same time. Users don't have to worry about how
to parallelize their application; rather, they just write `Map()` and
`Reduce()` functions and the infrastructure does the rest.
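
To give a feel for what "many mappers running at the same time" can look
like, here is a small sketch of the map side only. It splits a list of
in-memory documents across a fixed number of threads, lets each thread count
words into its own private slot, and combines the results after joining. The
names `mapper_arg_t`, `mapper_thread`, and `parallel_word_total` are
hypothetical, and the round-robin split is just one possible way to divide
the work.

```c
#include <assert.h>
#include <ctype.h>
#include <pthread.h>
#include <stdlib.h>

// Per-thread arguments (hypothetical). Each thread writes only its own
// `words` field, so no lock is needed for this particular sketch.
typedef struct {
    const char **docs;
    int num_docs;
    int id;           // which mapper this is: 0 .. num_mappers-1
    int num_mappers;
    long words;       // output: words counted by this mapper
} mapper_arg_t;

// Count whitespace-separated words in one string.
static long count_words(const char *s) {
    long n = 0;
    int in_word = 0;
    for (; *s != '\0'; s++) {
        if (isspace((unsigned char)*s)) {
            in_word = 0;
        } else if (!in_word) {
            in_word = 1;
            n++;
        }
    }
    return n;
}

// Mapper i handles documents i, i + num_mappers, i + 2*num_mappers, ...
static void *mapper_thread(void *p) {
    mapper_arg_t *a = (mapper_arg_t *)p;
    for (int i = a->id; i < a->num_docs; i += a->num_mappers)
        a->words += count_words(a->docs[i]);
    return NULL;
}

// Run num_mappers threads over docs and return the total word count.
long parallel_word_total(const char **docs, int num_docs, int num_mappers) {
    assert(num_mappers > 0);
    pthread_t *threads = malloc(num_mappers * sizeof(*threads));
    mapper_arg_t *args = malloc(num_mappers * sizeof(*args));
    assert(threads != NULL && args != NULL);

    for (int i = 0; i < num_mappers; i++) {
        args[i] = (mapper_arg_t){docs, num_docs, i, num_mappers, 0};
        int rc = pthread_create(&threads[i], NULL, mapper_thread, &args[i]);
        assert(rc == 0);
    }

    long total = 0;
    for (int i = 0; i < num_mappers; i++) {
        int rc = pthread_join(threads[i], NULL);
        assert(rc == 0);
        total += args[i].words;  // safe to read: the thread has finished
    }
    free(args);
    free(threads);
    return total;
}
```

Because each thread only touches its own slot, this sketch sidesteps locking
entirely. A real MapReduce library cannot do that so easily, since mappers
emit into shared structures that reducers later read.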

## Details

Here is the `.h` file that specifies exactly what you must build in
your MapReduce library:

```
#ifndef __mapreduce_h__
#define __mapreduce_h__

// Various function pointers
typedef char *(*Getter)();
typedef void (*Mapper)(char *file_name);
typedef void (*Reducer)(char *key, Getter get_func, int get_index);
typedef unsigned long (*Partitioner)(char *key, int num_buckets);

// Key functions exported by MapReduce
void MR_Emit(char *key, char *value);
unsigned long MR_DefaultHashPartition(char *key, int num_buckets);
void MR_Run(int argc, char *argv[],
            Mapper map, int num_mappers,
            Reducer reduce, int num_reducers,
            Partitioner partition);

#endif // __mapreduce_h__
```

## Considerations

- **xxx**. yyy.

## Grading

You should turn in `mapreduce.c`, which implements the above functions
correctly and efficiently. It will be compiled with test applications using
the `-Wall -Werror -pthread -O` flags; it will also be run under valgrind to
check for memory errors.

Your code will first be measured for correctness, ensuring that it performs
the maps and reductions correctly. If you pass the correctness tests, your
code will be tested for performance; higher performance will lead to better
scores.