Final edits for v1.0

2018-04-11 13:16:32 -05:00
parent 6a39d8acf3
commit e036ab31a2
1 changed files with 39 additions and 7 deletions
--- a/concurrency-mapreduce/README.md
+++ b/concurrency-mapreduce/README.md
@@ -88,7 +88,7 @@ the same time. Users don't have to worry about how to parallelize their
 application; rather, they just write `Map()` and `Reduce()` functions and the
 infrastructure does the rest.

-## Details
+## Code Overview

 We give you here `mapreduce.h` file that specifies exactly what you must build
 in your MapReduce library:
@@ -124,6 +124,14 @@ will implement a Map function, implement a Reduce function, possibly implement
 a Partition function, and then call `MR_Run()`. The infrastructure will then
 create threads as appropriate and run the computation.

+One basic assumption is that the library will create `num_mappers` threads
+(in a thread pool) that perform the map tasks. Another is that your library 
+will create `num_reducers` threads to perform the reduction tasks. Finally,
+your library will create some kind of internal data structure to pass
+keys and values from mappers to reducers; more on this below.
+
+## Simple Example: Wordcount
+
 Here is a simple (but functional) wordcount program, written to use this
 infrastructure: 

@@ -210,19 +218,43 @@ unsigned long MR_DefaultHashPartition(char *key, int num_buckets) {
 ```

 The function's role is to take a given `key` and map it to a number, from `0`
-to `num_buckets - 1`. Its use is internal to the MapReduce library; 
-
-
-
+to `num_buckets - 1`. Its use is internal to the MapReduce library, but
+critical. Specifically, your MR library should use this function to decide
+which Reduce thread gets a particular key/list of values to process.  For some
+applications, which Reducer thread processes a particular key is not
+important (and thus the default function above should be passed in to
+`MR_Run()`); for others, it is, and this is why the user can even pass in
+their own partitioning function as need be.

+One last requirement: For each partition, keys (and the value list associated
+with said keys) should be *sorted* in ascending key order; thus, when a
+particular reducer thread (and its associated partition) are working, the
+`Reduce()` function should be called on each key in order for that partition.

 ## Considerations

- **Thread Management**. 
+Here are a few things to consider in your implementation:

- **Memory Management**. yyy.
+- **Thread Management**. This part is fairly straightforward. You should
+    create `num_mappers` mapping threads, and assign a file to each `Map()`
+    invocation in some manner you think is best (e.g., Round Robin,
+    Shortest-File-First, etc.). Which way might lead to best performance?  
+    You should also create `num_reducers` reducer threads at some point, to
+    work on the map'd output. 

+- **Partitioning and Sorting**. Your central data structure should be
+    concurrent, allowing mappers to each put values into different
+    partitions correctly and efficiently. Once the mappers have completed, a
+    sorting phase should order the key/value-lists. Then, finally, each
+    reducer thread should start calling the user-defined `Reduce()` function
+    on the keys in sorted order per partition. You should think about what
+    type of locking is needed throughout this process for correctness.

+- **Memory Management**. One last concern is memory management. The
+    `MR_Emit()` function is passed a key/value pair; it is the responsibility
+    of the MR library to make copies of each of these. Then, when the entire
+    mapping and reduction is complete, it is the responsibility of the MR
+    library to free everything.

 ## Grading