More details
@@ -17,6 +17,14 @@ still numerous challenges, mostly in building the correct concurrency
 support. Thus, you'll have to think a bit about how to build the MapReduce
 implementation, and then make it work efficiently and correctly.

+There are three specific objectives to this assignment:
+
+- To learn about the general nature of the MapReduce paradigm.
+- To implement a correct and efficient MapReduce framework using threads and
+  related functions.
+- To gain more experience writing concurrent code.
+
+
 ## Background

 To understand how to make progress on this project, you should understand the
@@ -82,8 +90,8 @@ infrastructure does the rest.

 ## Details

-We give you here a `.h` file that specifies exactly what you must build in
-your MapReduce library:
+We give you here a `mapreduce.h` file that specifies exactly what you must build
+in your MapReduce library:

 ```
 #ifndef __mapreduce_h__
@@ -92,7 +100,7 @@ your MapReduce library:
 // Various function pointers
 typedef char *(*Getter)();
 typedef void (*Mapper)(char *file_name);
-typedef void (*Reducer)(char *key, Getter get_func, int get_index);
+typedef void (*Reducer)(char *key, Getter get_func, int partition_number);
 typedef unsigned long (*Partitioner)(char *key, int num_buckets);

 // Key functions exported by MapReduce
@@ -104,6 +112,85 @@ void MR_Run(int argc, char *argv[],
             Partitioner partition);
 ```

+The most important function is `MR_Run`, which takes the command line
+parameters of a given program, a pointer to a Map function (type `Mapper`,
+called `map`), the number of mapper threads your library should create
+(`num_mappers`), a pointer to a Reduce function (type `Reducer`, called
+`reduce`), the number of reducers (`num_reducers`), and finally, a pointer to
+a Partition function (`partition`, described below).
+
+Thus, when a user is writing a MapReduce computation with your library, they
+will implement a Map function, implement a Reduce function, possibly implement
+a Partition function, and then call `MR_Run()`. The infrastructure will then
+create threads as appropriate and run the computation.
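
One way the infrastructure might hand files to its mapper threads is sketched below. Only the `Mapper` typedef comes from the header; `map_work_t`, `mapper_thread`, `run_mappers`, and the toy `count_file` mapper are hypothetical names used for illustration. The scheduling choice (each thread claims the next unprocessed file under a mutex) is an assumption, not something the assignment prescribes.

```c
#include <pthread.h>
#include <stdlib.h>

typedef void (*Mapper)(char *file_name);   /* same shape as in mapreduce.h */

/* Hypothetical shared work queue: the input files plus a cursor. */
typedef struct {
    char **files;            /* argv[1] .. argv[argc - 1] */
    int num_files;
    int next;                /* index of the next unclaimed file */
    Mapper map;
    pthread_mutex_t lock;    /* protects `next` */
} map_work_t;

/* Each mapper thread repeatedly claims one file and maps it. */
static void *mapper_thread(void *arg) {
    map_work_t *w = arg;
    for (;;) {
        pthread_mutex_lock(&w->lock);
        int i = (w->next < w->num_files) ? w->next++ : -1;
        pthread_mutex_unlock(&w->lock);
        if (i < 0)
            return NULL;     /* no files left */
        w->map(w->files[i]);
    }
}

/* Hypothetical helper: returns 0 if all num_mappers threads were started. */
int run_mappers(int argc, char *argv[], Mapper map, int num_mappers) {
    if (num_mappers <= 0)
        return -1;
    pthread_t *tids = malloc(sizeof(*tids) * (size_t)num_mappers);
    if (tids == NULL)
        return -1;

    map_work_t w;
    w.files = argv + 1;
    w.num_files = argc - 1;
    w.next = 0;
    w.map = map;
    pthread_mutex_init(&w.lock, NULL);

    int started = 0;
    while (started < num_mappers &&
           pthread_create(&tids[started], NULL, mapper_thread, &w) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);   /* wait for the map phase to finish */

    pthread_mutex_destroy(&w.lock);
    free(tids);
    return (started == num_mappers) ? 0 : -1;
}

/* Toy mapper for demonstration only: counts how many files it was given. */
static int files_seen;
static pthread_mutex_t seen_lock = PTHREAD_MUTEX_INITIALIZER;

static void count_file(char *file_name) {
    (void)file_name;
    pthread_mutex_lock(&seen_lock);
    files_seen++;
    pthread_mutex_unlock(&seen_lock);
}
```

With this shape, a call such as `run_mappers(argc, argv, Map, 10)` would invoke `Map()` exactly once per input file, even when there are more files than threads. Joining every mapper before returning ensures that the reduce phase cannot start early.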
+
+Here is a simple (but functional) wordcount program, written to use this
+infrastructure:
+
+```
+#include <assert.h>
+#include <stdio.h>
+#include <string.h>
+#include "mapreduce.h"
+
+void Map(char *file_name) {
+    FILE *fp = fopen(file_name, "r");
+    assert(fp != NULL);
+
+    char *line = NULL;
+    size_t size = 0;
+    while (getline(&line, &size, fp) != -1) {
+        char *token, *dummy = line;
+        while ((token = strsep(&dummy, " \t\n\r")) != NULL) {
+            MR_Emit(token, "1");
+        }
+    }
+
+    fclose(fp);
+}
+
+void Reduce(char *key, Getter get_next, int partition_number) {
+    int count = 0;
+    char *value;
+    while ((value = get_next(partition_number)) != NULL)
+        count++;
+    printf("%s %d\n", key, count);
+}
+
+int main(int argc, char *argv[]) {
+    MR_Run(argc, argv, Map, 10, Reduce, 10, MR_DefaultHashPartition);
+}
+```
+
+Let's walk through this code, in order to see what it is doing. First, notice
+that `Map()` is called with a file name. In general, we assume that this type
+of computation is being run over many files; each invocation of `Map()` is
+thus handed one file name and is expected to process that file in its
+entirety.
+
+In this example, the code above just reads through the file, one line at a
+time, and uses `strsep()` to chop the line into tokens. Each token is then
+emitted using the `MR_Emit()` function, which takes two strings as input: a
+key and a value. The key here is the word itself, and the value is just a
+count, in this case, 1 (as a string). It then closes the file.
+
+The `MR_Emit()` function is thus another key part of your library; it needs to
+take key/value pairs from the many different mappers and store them in a way
+that lets later reducers access them, given constraints described
+below. Designing and implementing this data structure is thus a central
+challenge of the project.
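
As a starting point for thinking about that data structure, here is a minimal sketch of one possible design: a fixed array of partitions, each an unsorted linked list guarded by its own mutex. All names (`pair_t`, `partition_t`, `store_init`, `store_emit`, `store_count`, `hash_key`, `NUM_PARTITIONS`) are hypothetical and not part of `mapreduce.h`. The djb2-style hash merely stands in for whatever Partition function the user supplies.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define NUM_PARTITIONS 4          /* assumption: fixed for this sketch */

typedef struct pair {
    char *key;
    char *value;
    struct pair *next;
} pair_t;

typedef struct {
    pair_t *head;                 /* unsorted list of emitted pairs */
    pthread_mutex_t lock;         /* one lock per partition */
} partition_t;

static partition_t partitions[NUM_PARTITIONS];

void store_init(void) {
    for (int i = 0; i < NUM_PARTITIONS; i++) {
        partitions[i].head = NULL;
        pthread_mutex_init(&partitions[i].lock, NULL);
    }
}

/* Stand-in for the user's Partition function (djb2-style string hash). */
static unsigned long hash_key(const char *key) {
    unsigned long h = 5381;
    for (; *key != '\0'; key++)
        h = h * 33 + (unsigned char)*key;
    return h;
}

static char *copy_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy != NULL)
        memcpy(copy, s, n);
    return copy;
}

/* Stores a private copy of (key, value). Returns 0 on success, -1 on failure. */
int store_emit(const char *key, const char *value) {
    pair_t *p = malloc(sizeof(*p));
    if (p == NULL)
        return -1;
    p->key = copy_string(key);
    p->value = copy_string(value);
    if (p->key == NULL || p->value == NULL) {
        free(p->key);
        free(p->value);
        free(p);
        return -1;
    }
    partition_t *part = &partitions[hash_key(key) % NUM_PARTITIONS];
    pthread_mutex_lock(&part->lock);      /* many mappers may emit at once */
    p->next = part->head;
    part->head = p;
    pthread_mutex_unlock(&part->lock);
    return 0;
}

/* Number of pairs currently stored in one partition. */
size_t store_count(int partition_number) {
    partition_t *part = &partitions[partition_number];
    size_t n = 0;
    pthread_mutex_lock(&part->lock);
    for (pair_t *p = part->head; p != NULL; p = p->next)
        n++;
    pthread_mutex_unlock(&part->lock);
    return n;
}
```

The copies matter for the wordcount example: each `token` points into the buffer that `getline()` reuses on the next iteration, so storing the caller's pointer directly would leave dangling keys. Locking per partition rather than globally lets mappers that emit into different partitions proceed without contending.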
+
+After the mappers are finished, your library should have stored the key/value
+pairs in such a way that the `Reduce()` function can be called. `Reduce()` is
+invoked once per key, and is passed the key along with a function that enables
+iteration over all of the values that were emitted with that key. To iterate,
+the code just calls `get_next()` repeatedly until a NULL value is returned;
+`get_next` returns a pointer to the value passed in by the `MR_Emit()`
+function above.
+
+
+
+



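
The once-per-key calling pattern and the `get_next()` iteration described in this hunk could be driven roughly as follows. This is a sketch under stated assumptions, not the required design: it assumes each partition's pairs are gathered into an array and sorted by key before reducing, and that exactly one reducer thread works on a given partition (so the cursor needs no lock). `kv_t`, `reduce_state_t`, `sketch_get_next`, `reduce_partition`, `getter_fn`, `reducer_fn`, and the toy `tally_reduce` are hypothetical names; the two typedefs mirror `Getter` and `Reducer` with an explicit one-argument prototype.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_PARTITIONS 8      /* assumption: small fixed bound for the sketch */

typedef char *(*getter_fn)(int partition_number);
typedef void (*reducer_fn)(char *key, getter_fn get_next, int partition_number);

typedef struct {
    char *key;
    char *value;
} kv_t;

typedef struct {
    kv_t *pairs;              /* pairs of this partition, sorted by key */
    size_t count;
    size_t cursor;            /* next pair to hand out */
    char *current;            /* key the running Reduce call iterates over */
} reduce_state_t;

static reduce_state_t states[MAX_PARTITIONS];

static int compare_keys(const void *a, const void *b) {
    return strcmp(((const kv_t *)a)->key, ((const kv_t *)b)->key);
}

/* Returns the next value for the current key, or NULL when the key is done. */
char *sketch_get_next(int partition_number) {
    reduce_state_t *s = &states[partition_number];
    if (s->cursor < s->count &&
        strcmp(s->pairs[s->cursor].key, s->current) == 0)
        return s->pairs[s->cursor++].value;
    return NULL;
}

/* Sorts one partition, then calls `reduce` exactly once per distinct key. */
void reduce_partition(int partition_number, kv_t *pairs, size_t count,
                      reducer_fn reduce) {
    if (count == 0)
        return;
    reduce_state_t *s = &states[partition_number];
    qsort(pairs, count, sizeof(*pairs), compare_keys);
    s->pairs = pairs;
    s->count = count;
    s->cursor = 0;
    while (s->cursor < s->count) {
        s->current = s->pairs[s->cursor].key;
        reduce(s->current, sketch_get_next, partition_number);
        /* Skip values the reducer left unread so the loop always advances. */
        while (s->cursor < s->count &&
               strcmp(s->pairs[s->cursor].key, s->current) == 0)
            s->cursor++;
    }
}

/* Toy reducer for demonstration only: tallies keys and values seen. */
static int tally_keys, tally_values, last_count;

static void tally_reduce(char *key, getter_fn get_next, int partition_number) {
    (void)key;
    tally_keys++;
    last_count = 0;
    while (get_next(partition_number) != NULL) {
        tally_values++;
        last_count++;
    }
}
```

Sorting first means all values for a key sit next to each other, so `sketch_get_next` only has to compare the key at the cursor with the key currently being reduced. It also means keys within a partition are reduced in sorted order.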
@@ -111,7 +198,7 @@ void MR_Run(int argc, char *argv[],
 ## Considerations


-- **xxx**. yyy.
+- **Memory Management**. yyy.


