first bit of MR project

2018-04-11 11:27:03 -05:00
parent c05a2e3482
commit 02e5ee18ef
1 changed files with 132 additions and 0 deletions
--- a/concurrency-mapreduce/README.md
+++ b/concurrency-mapreduce/README.md
@@ -0,0 +1,132 @@
+
+# Map Reduce
+
+In 2004, engineers at Google introduced a new paradigm for large-scale
+parallel data processing known as MapReduce (see the original paper
+[here](https://static.googleusercontent.com/media/research.google.com/en//archive/mapreduce-osdi04.pdf),
+and make sure to look in the citations at the end). One key aspect of
+MapReduce is that it makes programming such tasks on large-scale clusters easy
+for developers; instead of worrying about how to manage parallelism, handle
+machine crashes, and many other complexities common within clusters of
+machines, the developer can instead just focus on writing little bits of code
+(described below) and the infrastructure handles the rest.
+
+In this project, you'll be building a simplified version of MapReduce for just
+a single machine. While somewhat easier than building it for a cluster of
+machines, there are still numerous challenges, mostly in building the correct
+concurrency support. Thus, you'll have to think a bit about how to build the
+MapReduce implementation, and then make it work efficiently and correctly.
+
+## Background
+
+To understand how to make progress on this project, you should understand the
+basics of thread creation, mutual exclusion (with locks), and
+signaling/waiting (with condition variables). These are described in the
+following book chapters:
+
+- [Intro to Threads](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-intro.pdf)
+- [Threads API](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-api.pdf)
+- [Locks](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks.pdf)
+- [Using Locks](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-locks-usage.pdf)
+- [Condition Variables](http://pages.cs.wisc.edu/~remzi/OSTEP/threads-cv.pdf)
+
+Read these chapters carefully in order to prepare yourself for this project.
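+
+As a quick refresher on those three ideas, here is a minimal sketch (not part
+of the project code) that creates threads, protects shared state with a lock,
+and waits on a condition variable. The names `worker`, `run_workers`,
+`NUM_WORKERS`, and `INCREMENTS` are made up for illustration.
+
+```c
+#include <assert.h>
+#include <pthread.h>
+#include <stddef.h>
+
+#define NUM_WORKERS 4      // hypothetical number of worker threads
+#define INCREMENTS  10000  // hypothetical amount of work per worker
+
+static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
+static pthread_cond_t all_done = PTHREAD_COND_INITIALIZER;
+static long counter = 0;   // shared state, protected by lock
+static int finished = 0;   // workers that have completed, protected by lock
+
+// Each worker bumps the shared counter under the lock, then signals that
+// it has finished.
+static void *worker(void *arg) {
+    (void)arg;
+    for (int i = 0; i < INCREMENTS; i++) {
+        pthread_mutex_lock(&lock);
+        counter++;
+        pthread_mutex_unlock(&lock);
+    }
+    pthread_mutex_lock(&lock);
+    finished++;
+    pthread_cond_signal(&all_done);
+    pthread_mutex_unlock(&lock);
+    return NULL;
+}
+
+// Starts the workers, waits until all have finished, returns the final count.
+long run_workers(void) {
+    pthread_t threads[NUM_WORKERS];
+    pthread_mutex_lock(&lock);
+    counter = 0;
+    finished = 0;
+    pthread_mutex_unlock(&lock);
+    for (int i = 0; i < NUM_WORKERS; i++) {
+        int rc = pthread_create(&threads[i], NULL, worker, NULL);
+        assert(rc == 0);
+        (void)rc;
+    }
+    pthread_mutex_lock(&lock);
+    while (finished < NUM_WORKERS)  // re-check: wakeups can be spurious
+        pthread_cond_wait(&all_done, &lock);
+    long result = counter;
+    pthread_mutex_unlock(&lock);
+    for (int i = 0; i < NUM_WORKERS; i++)
+        pthread_join(threads[i], NULL);
+    return result;
+}
+```
+
+Calling `run_workers()` should return `NUM_WORKERS * INCREMENTS`; without the
+lock around `counter++`, the result could come up short.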
+
+## General Idea
+
+Let's now get into the exact code you'll have to build. The MapReduce
+infrastructure you will build supports the execution of user-defined `Map()`
+and `Reduce()` functions.
+
+As described in the original paper: "`Map()`, written by the user, takes an input pair
+and produces a set of intermediate key/value pairs. The MapReduce library
+groups together all intermediate values associated with the same intermediate
+key K and passes them to the `Reduce()` function."
+
+"The `Reduce()` function, also written by the user, accepts an intermediate
+key K and a set of values for that key. It merges together these values to
+form a possibly smaller set of values; typically just zero or one output value
+is produced per `Reduce()` invocation. The intermediate values are supplied to
+the user's reduce function via an iterator."
+
+A classic example, written here in pseudocode, shows how to count the number
+of occurrences of each word in a set of documents:
+
+```
+map(String key, String value):
+  // key: document name
+  // value: document contents
+  for each word w in value:
+    EmitIntermediate(w, "1");
+
+reduce(String key, Iterator values):
+  // key: a word
+  // values: a list of counts
+  int result = 0;
+  for each v in values:
+    result += ParseInt(v);
+  print key, result;
+```
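+
+To make the pseudocode above concrete, here is a rough single-threaded sketch
+of the same word count in C. It is not the library interface you will build:
+the names `emit_intermediate`, `map_doc`, and `count_word`, and the fixed-size
+`pairs` table, are assumptions made for illustration. Grouping by key is done
+here with a simple sort.
+
+```c
+#include <assert.h>
+#include <ctype.h>
+#include <stdlib.h>
+#include <string.h>
+
+#define MAX_PAIRS 1024  // hypothetical capacity of the intermediate table
+#define MAX_WORD  64    // hypothetical maximum key length (with terminator)
+
+typedef struct { char key[MAX_WORD]; int value; } Pair;
+
+static Pair pairs[MAX_PAIRS];
+static int num_pairs = 0;
+
+// Stand-in for EmitIntermediate: record one (key, value) pair.
+static void emit_intermediate(const char *key, int value) {
+    size_t n = strlen(key);
+    assert(num_pairs < MAX_PAIRS && n < MAX_WORD);
+    memcpy(pairs[num_pairs].key, key, n + 1);
+    pairs[num_pairs].value = value;
+    num_pairs++;
+}
+
+// Map: split the contents on non-letters and emit (word, 1) for each word.
+// Words are lowercased; letters beyond MAX_WORD - 1 are dropped.
+static void map_doc(const char *contents) {
+    char word[MAX_WORD];
+    int len = 0;
+    for (const char *p = contents; ; p++) {
+        if (isalpha((unsigned char)*p)) {
+            if (len < MAX_WORD - 1)
+                word[len++] = (char)tolower((unsigned char)*p);
+        } else {
+            if (len > 0) {
+                word[len] = '\0';
+                emit_intermediate(word, 1);
+                len = 0;
+            }
+            if (*p == '\0')
+                break;
+        }
+    }
+}
+
+static int compare_pairs(const void *a, const void *b) {
+    return strcmp(((const Pair *)a)->key, ((const Pair *)b)->key);
+}
+
+// Runs map over every document, groups pairs by key (by sorting), and
+// reduces each group by summing. Returns the count for target, or 0.
+int count_word(const char *docs[], int num_docs, const char *target) {
+    num_pairs = 0;
+    for (int d = 0; d < num_docs; d++)
+        map_doc(docs[d]);
+    qsort(pairs, (size_t)num_pairs, sizeof(Pair), compare_pairs);
+    int i = 0;
+    while (i < num_pairs) {
+        int j = i, sum = 0;
+        while (j < num_pairs && strcmp(pairs[j].key, pairs[i].key) == 0)
+            sum += pairs[j++].value;  // reduce: add up this key's values
+        if (strcmp(pairs[i].key, target) == 0)
+            return sum;
+        i = j;
+    }
+    return 0;
+}
+```
+
+For example, given the two documents `"the quick brown fox"` and
+`"The lazy dog and the fox"`, `count_word(docs, 2, "the")` returns 3.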
+
+What's fascinating about MapReduce is that so many different kinds of useful
+computations can be mapped onto this framework. The original paper lists many
+examples, including word counting (as above), a distributed grep, a count of
+URL access frequency, a reverse web-link graph, a term-vector-per-host
+analysis, and others.
+
+What's also quite interesting is how easy it is to parallelize: many mappers
+can be running at the same time, and later, many reducers can be running at
+the same time. Users don't have to worry about how to parallelize their
+application; rather, they just write `Map()` and `Reduce()` functions and the
+infrastructure does the rest.
+
+## Details
+
+Here is the `.h` file that specifies exactly what you must build in
+your MapReduce library:
+
+```
+#ifndef __mapreduce_h__
+#define __mapreduce_h__
+
+// Various function pointers
+typedef char *(*Getter)();
+typedef void (*Mapper)(char *file_name);
+typedef void (*Reducer)(char *key, Getter get_func, int get_index);
+typedef unsigned long (*Partitioner)(char *key, int num_buckets);
+
+// Key functions exported by MapReduce
+void MR_Emit(char *key, char *value);
+unsigned long MR_DefaultHashPartition(char *key, int num_buckets);
+void MR_Run(int argc, char *argv[],
+            Mapper map, int num_mappers,
+            Reducer reduce, int num_reducers,
+            Partitioner partition);
+
+#endif // __mapreduce_h__
+```
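+
+The header declares `MR_DefaultHashPartition()` but does not say how it should
+behave. Assuming its job is to map a key to a bucket index in the range
+`[0, num_buckets)`, one plausible body is the well-known djb2 string hash
+reduced modulo the number of buckets. Treat this as a sketch under that
+assumption, not as the required implementation:
+
+```c
+#include <assert.h>
+#include <stddef.h>
+
+// One possible body for MR_DefaultHashPartition, ASSUMING it should return
+// a bucket index in [0, num_buckets). Uses the djb2 hash: hash * 33 + c.
+unsigned long MR_DefaultHashPartition(char *key, int num_buckets) {
+    assert(key != NULL && num_buckets > 0);
+    unsigned long hash = 5381;
+    int c;
+    while ((c = (unsigned char)*key++) != '\0')
+        hash = hash * 33 + (unsigned long)c;
+    return hash % (unsigned long)num_buckets;
+}
+```
+
+Because the result depends only on the key and `num_buckets`, every pair with
+the same key lands in the same bucket, which is what grouping values by key
+requires.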
+
+
+
+
+
+## Considerations
+
+
+- **xxx**. yyy.
+
+
+
+## Grading
+
+You should turn in `mapreduce.c`, which implements the above functions
+correctly and efficiently. It will be compiled with test applications using the
+`-Wall -Werror -pthread -O` flags; it will also be run under valgrind to check
+for memory errors.
+
+Your code will first be measured for correctness, ensuring that it performs
+the maps and reductions correctly. If you pass the correctness tests, your
+code will be tested for performance; higher performance will lead to better
+scores.
+
+
+
+