More details

Remzi Arpaci-Dusseau
2018-04-11 11:50:22 -05:00
parent 02e5ee18ef
commit a3186d9383


@@ -17,6 +17,14 @@ still numerous challenges, mostly in building the correct concurrency
support. Thus, you'll have to think a bit about how to build the MapReduce
implementation, and then build it to work efficiently and correctly.
There are three specific objectives to this assignment:
- To learn about the general nature of the MapReduce paradigm.
- To implement a correct and efficient MapReduce framework using threads and
related functions.
- To gain more experience writing concurrent code.
## Background
To understand how to make progress on this project, you should understand the
@@ -82,8 +90,8 @@ infrastructure does the rest.
## Details
-We give you here a `.h` file that specifies exactly what you must build in
-your MapReduce library:
+We give you here a `mapreduce.h` file that specifies exactly what you must build
+in your MapReduce library:
```
#ifndef __mapreduce_h__
@@ -92,7 +100,7 @@ your MapReduce library:
// Various function pointers
typedef char *(*Getter)();
typedef void (*Mapper)(char *file_name);
-typedef void (*Reducer)(char *key, Getter get_func, int get_index);
+typedef void (*Reducer)(char *key, Getter get_func, int partition_number);
typedef unsigned long (*Partitioner)(char *key, int num_buckets);
// Key functions exported by MapReduce
@@ -104,6 +112,85 @@ void MR_Run(int argc, char *argv[],
            Partitioner partition);
```
The most important function is `MR_Run`, which takes the command line
parameters of a given program, a pointer to a Map function (type `Mapper`,
called `map`), the number of mapper threads your library should create
(`num_mappers`), a pointer to a Reduce function (type `Reducer`, called
`reduce`), the number of reducers (`num_reducers`), and finally, a pointer to
a Partition function (`partition`, described below).
Thus, when a user is writing a MapReduce computation with your library, they
will implement a Map function, implement a Reduce function, possibly implement
a Partition function, and then call `MR_Run()`. The infrastructure will then
create threads as appropriate and run the computation.
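
As a rough illustration of the "create threads as appropriate" step, here is a minimal sketch of how a library might fan file names out to mapper threads. Everything except the `Mapper` type is hypothetical (`struct map_work`, `mapper_worker`, `run_mappers`, `demo_map`), and it assumes each input file is handed to exactly one thread through a shared, lock-protected index; the real library is free to schedule work differently.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

typedef void (*Mapper)(char *file_name);

/* Hypothetical shared state: the files to process and the next one to hand out. */
struct map_work {
    char **files;
    int num_files;
    int next;               /* index of the next unclaimed file */
    Mapper map;
    pthread_mutex_t lock;   /* protects next */
};

/* Each mapper thread claims one file at a time until none remain. */
static void *mapper_worker(void *arg) {
    struct map_work *w = arg;
    for (;;) {
        pthread_mutex_lock(&w->lock);
        int i = (w->next < w->num_files) ? w->next++ : -1;
        pthread_mutex_unlock(&w->lock);
        if (i < 0)
            return NULL;
        w->map(w->files[i]);
    }
}

/* Run map() once per file using num_mappers threads, then wait for all of them. */
void run_mappers(char **files, int num_files, Mapper map, int num_mappers) {
    assert(num_mappers > 0);
    struct map_work w = { files, num_files, 0, map };
    pthread_mutex_init(&w.lock, NULL);

    pthread_t *threads = malloc(sizeof(pthread_t) * num_mappers);
    assert(threads != NULL);
    for (int i = 0; i < num_mappers; i++) {
        int rc = pthread_create(&threads[i], NULL, mapper_worker, &w);
        assert(rc == 0);
    }
    for (int i = 0; i < num_mappers; i++)
        pthread_join(threads[i], NULL);

    free(threads);
    pthread_mutex_destroy(&w.lock);
}

/* Demo mapper (illustration only): counts how many files were handed out. */
static int demo_files_seen = 0;
static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

void demo_map(char *file_name) {
    (void)file_name;
    pthread_mutex_lock(&demo_lock);
    demo_files_seen++;
    pthread_mutex_unlock(&demo_lock);
}
```

Under these assumptions, `MR_Run()` could call something like `run_mappers(&argv[1], argc - 1, map, num_mappers)` before starting the reducers. Handing out files dynamically, rather than splitting them evenly up front, keeps threads busy when file sizes differ.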
Here is a simple (but functional) wordcount program, written to use this
infrastructure:
```
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "mapreduce.h"

void Map(char *file_name) {
    FILE *fp = fopen(file_name, "r");
    assert(fp != NULL);

    char *line = NULL;
    size_t size = 0;
    while (getline(&line, &size, fp) != -1) {
        // strsep() modifies the buffer, so walk it with a separate pointer
        char *token, *dummy = line;
        while ((token = strsep(&dummy, " \t\n\r")) != NULL) {
            MR_Emit(token, "1");
        }
    }
    free(line);  // getline() allocated this buffer
    fclose(fp);
}

void Reduce(char *key, Getter get_next, int partition_number) {
    int count = 0;
    char *value;
    // Each emitted value is "1", so counting values counts occurrences
    while ((value = get_next(partition_number)) != NULL)
        count++;
    printf("%s %d\n", key, count);
}

int main(int argc, char *argv[]) {
    MR_Run(argc, argv, Map, 10, Reduce, 10, MR_DefaultHashPartition);
}
```
Let's walk through this code, in order to see what it is doing. First, notice
that `Map()` is called with a file name. In general, we assume that this type
of computation is being run over many files; each invocation of `Map()` is
thus handed one file name and is expected to process that file in its
entirety.
In this example, the code above just reads through the file, one line at a
time, and uses `strsep()` to chop the line into tokens. Each token is then
emitted using the `MR_Emit()` function, which takes two strings as input: a
key and a value. The key here is the word itself, and the value is just a
count, in this case 1 (as a string). The code then closes the file.
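
To make the tokenizing step concrete, here is a small sketch of how `strsep()` walks a line with the same delimiter set the example uses. The helper `count_tokens` is hypothetical and exists only for illustration. One detail it makes visible: `strsep()` returns an empty string between adjacent delimiters and after a trailing newline, so a mapper written this way would also emit empty keys unless it skips them.

```c
#define _DEFAULT_SOURCE   /* ask glibc to declare strsep() */
#include <string.h>

/*
 * Hypothetical helper: split line in place on the delimiters used by the
 * wordcount example. Returns the total number of tokens strsep() yields
 * and stores the number of non-empty ones in *non_empty.
 */
int count_tokens(char *line, int *non_empty) {
    int total = 0;
    char *token, *cursor = line;

    *non_empty = 0;
    while ((token = strsep(&cursor, " \t\n\r")) != NULL) {
        total++;
        if (token[0] != '\0')
            (*non_empty)++;
    }
    return total;
}
```

For the input `"a  b\n"` (two spaces between the words), `strsep()` yields `"a"`, `""`, `"b"`, and `""`: four tokens, two of them non-empty.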
The `MR_Emit()` function is another key part of your library; it needs to
take key/value pairs from the many different mappers and store them so that
reducers can later access them, given the constraints described
below. Designing and implementing this data structure is a central
challenge of the project.
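
One possible shape for that data structure, offered only as a sketch: an array of partitions, each holding a linked list of pairs guarded by its own mutex, so mappers writing to different partitions do not contend. All names here (`struct pair`, `struct partition`, `hash_partition`, `store_init`, `store_emit`) are hypothetical, and the hash is a common string hash chosen for illustration, not necessarily what the library's default partitioner does. The sketch copies both strings because, in the wordcount example, `token` points into a `getline()` buffer that is overwritten on the next read.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct pair {
    char *key;
    char *value;
    struct pair *next;
};

struct partition {
    struct pair *head;      /* unsorted list of emitted pairs */
    int count;
    pthread_mutex_t lock;   /* protects head and count */
};

static struct partition *parts;
static int num_parts;

/* Illustrative string hash; maps a key to a partition index. */
unsigned long hash_partition(char *key, int num_partitions) {
    unsigned long hash = 5381;
    int c;
    while ((c = (unsigned char)*key++) != '\0')
        hash = hash * 33 + c;
    return hash % num_partitions;
}

/* Heap-allocate a private copy of s. */
static char *copy_string(const char *s) {
    size_t len = strlen(s) + 1;
    char *copy = malloc(len);
    assert(copy != NULL);
    memcpy(copy, s, len);
    return copy;
}

/* Allocate n empty partitions; call once before any mapper runs. */
void store_init(int n) {
    parts = calloc(n, sizeof(struct partition));
    assert(parts != NULL);
    num_parts = n;
    for (int i = 0; i < n; i++)
        pthread_mutex_init(&parts[i].lock, NULL);
}

/* Thread-safe insert of one key/value pair into the key's partition. */
void store_emit(char *key, char *value) {
    struct pair *p = malloc(sizeof(*p));
    assert(p != NULL);
    p->key = copy_string(key);
    p->value = copy_string(value);

    struct partition *part = &parts[hash_partition(key, num_parts)];
    pthread_mutex_lock(&part->lock);
    p->next = part->head;
    part->head = p;
    part->count++;
    pthread_mutex_unlock(&part->lock);
}
```

Because every pair with the same key hashes to the same partition, a reducer that owns one partition sees all values for the keys it handles. The copies made here must eventually be freed by the library.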
After the mappers are finished, your library should have stored the key/value
pairs in such a way that the `Reduce()` function can be called. `Reduce()` is
invoked once per key, and is passed the key along with a function that enables
iteration over all of the values that produced that same key. To iterate, the
code just calls `get_next()` repeatedly until a NULL value is returned;
`get_next` returns a pointer to the value passed in by the `MR_Emit()`
function above.
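
The iteration contract described above can be sketched as follows. This is one hypothetical way to back `get_next`, assuming each partition's pairs are sorted by key before reducing and that the library tracks a per-partition cursor and current key. The types `DemoGetter` and `DemoReducer` and the functions `demo_get_next`, `reduce_partition`, and `demo_count_reduce` are all invented for this sketch; the getter type spells out the single `int` argument that the header's `Getter` leaves unspecified.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins for the header's Getter and Reducer types. */
typedef char *(*DemoGetter)(int partition_number);
typedef void (*DemoReducer)(char *key, DemoGetter get_next, int partition_number);

struct kv {
    char *key;
    char *value;
};

struct sorted_partition {
    struct kv *pairs;    /* sorted by key */
    int count;
    int cursor;          /* next pair to hand out */
    char *current_key;   /* key currently being reduced */
};

#define MAX_PARTITIONS 16
static struct sorted_partition sorted[MAX_PARTITIONS];

static int compare_keys(const void *a, const void *b) {
    return strcmp(((const struct kv *)a)->key, ((const struct kv *)b)->key);
}

/* Return the next value for the current key, or NULL when its run ends. */
char *demo_get_next(int partition_number) {
    struct sorted_partition *sp = &sorted[partition_number];
    if (sp->cursor < sp->count &&
        strcmp(sp->pairs[sp->cursor].key, sp->current_key) == 0)
        return sp->pairs[sp->cursor++].value;
    return NULL;
}

/* Sort one partition, then invoke reduce() once per distinct key. */
void reduce_partition(int partition_number, struct kv *pairs, int count,
                      DemoReducer reduce) {
    assert(partition_number >= 0 && partition_number < MAX_PARTITIONS);
    struct sorted_partition *sp = &sorted[partition_number];
    qsort(pairs, count, sizeof(struct kv), compare_keys);
    sp->pairs = pairs;
    sp->count = count;
    sp->cursor = 0;

    while (sp->cursor < sp->count) {
        sp->current_key = sp->pairs[sp->cursor].key;
        reduce(sp->current_key, demo_get_next, partition_number);
        /* Skip any values the reducer chose not to consume. */
        while (demo_get_next(partition_number) != NULL)
            ;
    }
}

/* Demo reducer (illustration only): tallies keys and values seen. */
static int demo_keys_reduced = 0;
static int demo_values_seen = 0;

void demo_count_reduce(char *key, DemoGetter get_next, int partition_number) {
    (void)key;
    demo_keys_reduced++;
    while (get_next(partition_number) != NULL)
        demo_values_seen++;
}
```

Sorting groups equal keys into contiguous runs, which is what lets a cursor-based getter return NULL at the end of each key's values without any per-key lookup.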
@@ -111,7 +198,7 @@ void MR_Run(int argc, char *argv[],
## Considerations
-- **xxx**. yyy.
+- **Memory Management**. yyy.