From a3186d938338c4dcfbcf25b9579a8554adec6e4c Mon Sep 17 00:00:00 2001
From: Remzi Arpaci-Dusseau
Date: Wed, 11 Apr 2018 11:50:22 -0500
Subject: [PATCH] More details

---
 concurrency-mapreduce/README.md | 95 +++++++++++++++++++++++++++++++--
 1 file changed, 91 insertions(+), 4 deletions(-)

diff --git a/concurrency-mapreduce/README.md b/concurrency-mapreduce/README.md
index 81b43ef..0c10842 100644
--- a/concurrency-mapreduce/README.md
+++ b/concurrency-mapreduce/README.md
@@ -17,6 +17,14 @@ still numerous challenges, mostly in building the correct concurrency
 support. Thus, you'll have to think a bit about how to build the MapReduce
 implementation, and then build it work efficiently and correctly.
 
+There are three specific objectives to this assignment:
+
+- To learn about the general nature of the MapReduce paradigm.
+- To implement a correct and efficient MapReduce framework using threads and
+  related functions.
+- To gain more experience writing concurrent code.
+
+
 ## Background
 
 To understand how to make progress on this project, you should understand the
@@ -82,8 +90,8 @@ infrastructure does the rest.

 ## Details
 
-We give you here a `.h` file that specifies exactly what you must build in
-your MapReduce library:
+We give you here a `mapreduce.h` file that specifies exactly what you must build
+in your MapReduce library:
 
 ```
 #ifndef __mapreduce_h__
@@ -92,7 +100,7 @@ your MapReduce library:
 // Various function pointers
 typedef char *(*Getter)();
 typedef void (*Mapper)(char *file_name);
-typedef void (*Reducer)(char *key, Getter get_func, int get_index);
+typedef void (*Reducer)(char *key, Getter get_func, int partition_number);
 typedef unsigned long (*Partitioner)(char *key, int num_buckets);
 
 // Key functions exported by MapReduce
@@ -104,6 +112,85 @@ void MR_Run(int argc, char *argv[],
             Partitioner partition);
 ```
 
+The most important function is `MR_Run`, which takes the command line
+parameters of a given program, a pointer to a Map function (type `Mapper`,
+called `map`), the number of mapper threads your library should create
+(`num_mappers`), a pointer to a Reduce function (type `Reducer`, called
+`reduce`), the number of reducers (`num_reducers`), and finally, a pointer to
+a Partition function (`partition`, described below).
+
+Thus, when a user is writing a MapReduce computation with your library, they
+will implement a Map function, implement a Reduce function, possibly implement
+a Partition function, and then call `MR_Run()`. The infrastructure will then
+create threads as appropriate and run the computation.
+
+Here is a simple (but functional) wordcount program, written to use this
+infrastructure:
+
+```
+#include <assert.h>
+#include <stdio.h>
+#include <string.h>
+#include "mapreduce.h"
+
+void Map(char *file_name) {
+    FILE *fp = fopen(file_name, "r");
+    assert(fp != NULL);
+
+    char *line = NULL;
+    size_t size = 0;
+    while (getline(&line, &size, fp) != -1) {
+        char *token, *dummy = line;
+        while ((token = strsep(&dummy, " \t\n\r")) != NULL) {
+            MR_Emit(token, "1");
+        }
+    }
+
+    fclose(fp);
+}
+
+void Reduce(char *key, Getter get_next, int partition_number) {
+    int count = 0;
+    char *value;
+    while ((value = get_next(partition_number)) != NULL)
+        count++;
+    printf("%s %d\n", key, count);
+}
+
+int main(int argc, char *argv[]) {
+    MR_Run(argc, argv, Map, 10, Reduce, 10, MR_DefaultHashPartition);
+}
+```
+
+Let's walk through this code, in order to see what it is doing. First, notice
+that `Map()` is called with a file name. In general, we assume that this type
+of computation is being run over many files; each invocation of `Map()` is
+thus handed one file name and is expected to process that file in its
+entirety.
+
+In this example, the code above just reads through the file, one line at a
+time, and uses `strsep()` to chop the line into tokens. Each token is then
+emitted using the `MR_Emit()` function, which takes two strings as input: a
+key and a value. The key here is the word itself, and the value is just a
+count, in this case, 1 (as a string). It then closes the file.
+
+The `MR_Emit()` function is thus another key part of your library; it needs to
+take key/value pairs from the many different mappers and store them in a way
+that later reducers can access them, given constraints described
+below. Designing and implementing this data structure is thus a central
+challenge of the project.
+
+After the mappers are finished, your library should have stored the key/value
+pairs in such a way that the `Reduce()` function can be called. `Reduce()` is
+invoked once per key, and is passed the key along with a function that enables
+iteration over all of the values that produced that same key. To iterate, the
+code just calls `get_next()` repeatedly until a NULL value is returned;
+`get_next` returns a pointer to the value passed in by the `MR_Emit()`
+function above.
+
+
+
+
 
 
 
@@ -111,7 +198,7 @@ void MR_Run(int argc, char *argv[],
 
 ## Considerations
 
-- **xxx**. yyy.
+- **Memory Management**. yyy.
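
The README text in this patch says `MR_Emit()` must collect key/value pairs from many mapper threads so that reducers can read them later, but it does not prescribe a data structure. The following is only a minimal sketch of one possible shape, not the assignment's reference design. Every name in it (`sketch_init`, `sketch_emit`, `sketch_partition`, `sketch_count`, `SKETCH_PARTITIONS`) is hypothetical, and the fixed partition count and the simple string hash are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() */
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical fixed size; a real library would size this from num_reducers. */
#define SKETCH_PARTITIONS 4

/* One emitted key/value pair in a partition's singly linked list. */
struct kv_node {
    char *key;
    char *value;
    struct kv_node *next;
};

/* One bucket per partition, each guarded by its own lock so that
 * mappers emitting into different partitions do not contend. */
struct kv_partition {
    pthread_mutex_t lock;
    struct kv_node *head;
};

static struct kv_partition sketch_store[SKETCH_PARTITIONS];

/* Must be called once before any thread emits. */
void sketch_init(void) {
    for (int i = 0; i < SKETCH_PARTITIONS; i++) {
        pthread_mutex_init(&sketch_store[i].lock, NULL);
        sketch_store[i].head = NULL;
    }
}

/* Simple multiplicative string hash, standing in for a Partitioner (assumption). */
unsigned long sketch_partition(char *key, int num_partitions) {
    unsigned long hash = 5381;
    for (unsigned char *p = (unsigned char *)key; *p != '\0'; p++)
        hash = hash * 33 + *p;
    return hash % (unsigned long)num_partitions;
}

/* Roughly what an emit function has to do: copy the pair (the mapper may
 * reuse its buffers) and link it into the right partition under that lock.
 * Returns 0 on success, -1 if memory runs out. */
int sketch_emit(char *key, char *value) {
    struct kv_node *node = malloc(sizeof(*node));
    if (node == NULL)
        return -1;
    node->key = strdup(key);
    node->value = strdup(value);
    if (node->key == NULL || node->value == NULL) {
        free(node->key);
        free(node->value);
        free(node);
        return -1;
    }

    unsigned long p = sketch_partition(key, SKETCH_PARTITIONS);
    assert(p < SKETCH_PARTITIONS);

    pthread_mutex_lock(&sketch_store[p].lock);
    node->next = sketch_store[p].head;
    sketch_store[p].head = node;
    pthread_mutex_unlock(&sketch_store[p].lock);
    return 0;
}

/* Count the values stored for one key, as a wordcount reducer would. */
int sketch_count(char *key) {
    unsigned long p = sketch_partition(key, SKETCH_PARTITIONS);
    int count = 0;
    pthread_mutex_lock(&sketch_store[p].lock);
    for (struct kv_node *n = sketch_store[p].head; n != NULL; n = n->next)
        if (strcmp(n->key, key) == 0)
            count++;
    pthread_mutex_unlock(&sketch_store[p].lock);
    return count;
}
```

Copying the key and value matters because the `Map()` example passes `token`, which points into a `getline()` buffer that is overwritten on the next line. The sketch leaves out how each key's values would be handed to `Reduce()` through `get_next`, and it never frees the stored pairs; both are left to the real design.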