Final edits for v1.0
This commit is contained in:
@@ -88,7 +88,7 @@ the same time. Users don't have to worry about how to parallelize their
|
|||||||
application; rather, they just write `Map()` and `Reduce()` functions and the
|
application; rather, they just write `Map()` and `Reduce()` functions and the
|
||||||
infrastructure does the rest.
|
infrastructure does the rest.
|
||||||
|
|
||||||
## Details
|
## Code Overview
|
||||||
|
|
||||||
We give you here `mapreduce.h` file that specifies exactly what you must build
|
We give you here `mapreduce.h` file that specifies exactly what you must build
|
||||||
in your MapReduce library:
|
in your MapReduce library:
|
||||||
@@ -124,6 +124,14 @@ will implement a Map function, implement a Reduce function, possibly implement
|
|||||||
a Partition function, and then call `MR_Run()`. The infrastructure will then
|
a Partition function, and then call `MR_Run()`. The infrastructure will then
|
||||||
create threads as appropriate and run the computation.
|
create threads as appropriate and run the computation.
|
||||||
|
|
||||||
|
One basic assumption is that the library will create `num_mappers` threads
|
||||||
|
(in a thread pool) that perform the map tasks. Another is that your library
|
||||||
|
will create `num_reducers` threads to perform the reduction tasks. Finally,
|
||||||
|
your library will create some kind of internal data structure to pass
|
||||||
|
keys and values from mappers to reducers; more on this below.
|
||||||
|
|
||||||
|
## Simple Example: Wordcount
|
||||||
|
|
||||||
Here is a simple (but functional) wordcount program, written to use this
|
Here is a simple (but functional) wordcount program, written to use this
|
||||||
infrastructure:
|
infrastructure:
|
||||||
|
|
||||||
@@ -210,19 +218,43 @@ unsigned long MR_DefaultHashPartition(char *key, int num_buckets) {
|
|||||||
```
|
```
|
||||||
|
|
||||||
The function's role is to take a given `key` and map it to a number, from `0`
|
The function's role is to take a given `key` and map it to a number, from `0`
|
||||||
to `num_buckets - 1`. Its use is internal to the MapReduce library;
|
to `num_buckets - 1`. Its use is internal to the MapReduce library, but
|
||||||
|
critical. Specifically, your MR library should use this function to decide
|
||||||
|
which Reduce thread gets a particular key/list of values to process. For some
|
||||||
|
applications, which Reducer thread processes a particular key is not
|
||||||
|
important (and thus the default function above should be passed in to
|
||||||
|
`MR_Run()`); for others, it is, and this is why the user can even pass in
|
||||||
|
their own partitioning function as need be.
|
||||||
|
|
||||||
|
One last requirement: For each partition, keys (and the value list associated
|
||||||
|
with said keys) should be *sorted* in ascending key order; thus, when a
|
||||||
|
particular reducer thread (and its associated partition) are working, the
|
||||||
|
`Reduce()` function should be called on each key in order for that partition.
|
||||||
|
|
||||||
## Considerations
|
## Considerations
|
||||||
|
|
||||||
- **Thread Management**.
|
Here are a few things to consider in your implementation:
|
||||||
|
|
||||||
- **Memory Management**. yyy.
|
- **Thread Management**. This part is fairly straightforward. You should
|
||||||
|
create `num_mappers` mapping threads, and assign a file to each `Map()`
|
||||||
|
invocation in some manner you think is best (e.g., Round Robin,
|
||||||
|
Shortest-File-First, etc.). Which way might lead to best performance?
|
||||||
|
You should also create `num_reducers` reducer threads at some point, to
|
||||||
|
work on the map'd output.
|
||||||
|
|
||||||
|
- **Partitioning and Sorting**. Your central data structure should be
|
||||||
|
concurrent, allowing mappers to each put values into different
|
||||||
|
partitions correctly and efficiently. Once the mappers have completed, a
|
||||||
|
sorting phase should order the key/value-lists. Then, finally, each
|
||||||
|
reducer thread should start calling the user-defined `Reduce()` function
|
||||||
|
on the keys in sorted order per partition. You should think about what
|
||||||
|
type of locking is needed throughout this process for correctness.
|
||||||
|
|
||||||
|
- **Memory Management**. One last concern is memory management. The
|
||||||
|
`MR_Emit()` function is passed a key/value pair; it is the responsibility
|
||||||
|
of the MR library to make copies of each of these. Then, when the entire
|
||||||
|
mapping and reduction is complete, it is the responsibility of the MR
|
||||||
|
library to free everything.
|
||||||
|
|
||||||
## Grading
|
## Grading
|
||||||
|
|
||||||
|
|||||||
Reference in New Issue
Block a user