Kyle Edwards

MapReduce: Simplified Data Processing on Large Clusters

MapReduce is a distributed programming model for running simple computational workloads over datasets so large that the work must be parallelized to complete in a reasonable amount of time. Breaking the workload apart and merging the results back together in a way that is resilient to failure is what makes this a challenging problem.

At its core, MapReduce takes key-value pairs and distributes them across a cluster of machines. Each machine runs the same map function to transform each pair into a set of intermediate key-value pairs. That data is then supplied to a reduce function that condenses the intermediate representation into final results. Each step can be as complex or as simple as necessary (either may even be an identity function).
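
To make that concrete, here is a minimal sketch in Go of the paper's canonical word-count example (the paper's real interface is C++; the KeyValue type and these signatures are my own):

    package main

    import (
        "fmt"
        "strconv"
        "strings"
    )

    // KeyValue is an illustrative pair type, not the paper's interface.
    type KeyValue struct {
        Key, Value string
    }

    // Map emits ("word", "1") for every word in the document contents.
    func Map(filename, contents string) []KeyValue {
        var out []KeyValue
        for _, w := range strings.Fields(contents) {
            out = append(out, KeyValue{w, "1"})
        }
        return out
    }

    // Reduce receives every value emitted for one key and sums the counts.
    func Reduce(key string, values []string) string {
        return strconv.Itoa(len(values))
    }

    func main() {
        fmt.Println(Map("doc.txt", "the quick the lazy the"))
        fmt.Println(Reduce("the", []string{"1", "1", "1"})) // "3"
    }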

Example Applications

Implementation

Shared memory vs. NUMA vs. networked clusters…

The input is split into configurably sized chunks, usually between 16 and 64 MB, and the map workload is distributed across a number of nodes to be processed in parallel. Intermediate keys are then partitioned among the reduce tasks using a deterministic hashing function (ideally so that the work is evenly distributed across the nodes).
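
A sketch of that partitioning step, assuming the paper's hash(key) mod R scheme (FNV here is an arbitrary choice of hash function):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // partition deterministically maps an intermediate key to one of
    // nReduce reduce tasks, mirroring the paper's hash(key) mod R idea.
    func partition(key string, nReduce int) int {
        h := fnv.New32a()
        h.Write([]byte(key))
        return int(h.Sum32()) % nReduce
    }

    func main() {
        // The same key always lands on the same reduce task.
        fmt.Println(partition("the", 5), partition("the", 5))
    }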

A primary node manages this workload, assigning each worker node either a map or a reduce task.

Each map task iterates over its input key-value pairs and stores intermediate pairs on local disk. Periodically the node informs the primary of where those pairs live on disk, and the primary in turn directs reduce tasks to pull that data from the map node (via RPC).
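
Loosely, the map side's bucketing might look like the following (the mr-M-R file naming is hypothetical; the paper only says intermediate data is partitioned into R regions on local disk):

    package main

    import (
        "encoding/json"
        "fmt"
        "os"
    )

    type KeyValue struct{ Key, Value string }

    // writeIntermediate buckets one map task's output into nReduce local
    // files, one per reduce partition, and returns their paths so the
    // worker can report them to the primary.
    func writeIntermediate(mapID, nReduce int, pairs []KeyValue, partition func(string, int) int) ([]string, error) {
        files := make([]*os.File, nReduce)
        encs := make([]*json.Encoder, nReduce)
        paths := make([]string, nReduce)
        for r := range files {
            paths[r] = fmt.Sprintf("mr-%d-%d", mapID, r) // hypothetical naming scheme
            f, err := os.Create(paths[r])
            if err != nil {
                return nil, err
            }
            files[r], encs[r] = f, json.NewEncoder(f)
        }
        for _, kv := range pairs {
            if err := encs[partition(kv.Key, nReduce)].Encode(kv); err != nil {
                return nil, err
            }
        }
        for _, f := range files {
            f.Close()
        }
        return paths, nil
    }

    func main() {
        part := func(k string, n int) int { return len(k) % n } // stand-in partitioner
        paths, err := writeIntermediate(0, 3, []KeyValue{{"the", "1"}, {"quick", "1"}}, part)
        if err != nil {
            panic(err)
        }
        fmt.Println(paths) // [mr-0-0 mr-0-1 mr-0-2]
    }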

The reduce nodes sort the pairs by key and perform their calculations, producing one output file per reduce partition.
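
A sketch of that sort-and-group step (the paper leaves the exact mechanics abstract, so the grouping of adjacent equal keys below is my own):

    package main

    import (
        "fmt"
        "sort"
    )

    type KeyValue struct{ Key, Value string }

    // runReduce sorts the intermediate pairs by key, then hands each
    // group of equal keys to the user's reduce function.
    func runReduce(pairs []KeyValue, reduce func(string, []string) string) map[string]string {
        sort.Slice(pairs, func(i, j int) bool { return pairs[i].Key < pairs[j].Key })
        out := make(map[string]string)
        for i := 0; i < len(pairs); {
            j := i
            var values []string
            for j < len(pairs) && pairs[j].Key == pairs[i].Key {
                values = append(values, pairs[j].Value)
                j++
            }
            out[pairs[i].Key] = reduce(pairs[i].Key, values)
            i = j
        }
        return out
    }

    func main() {
        pairs := []KeyValue{{"the", "1"}, {"quick", "1"}, {"the", "1"}}
        sum := func(k string, vs []string) string { return fmt.Sprint(len(vs)) }
        fmt.Println(runReduce(pairs, sum)) // map[quick:1 the:2]
    }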

These output files can either feed further MapReduce runs or be consumed elsewhere.

The primary/master keeps track of every map and reduce task and whether it is idle, in progress, or completed, along with the identity of each worker machine.
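
One plausible shape for that bookkeeping (the names and fields are mine, not the paper's):

    package main

    import "fmt"

    // TaskState mirrors the three states the primary tracks.
    type TaskState int

    const (
        Idle TaskState = iota
        InProgress
        Completed
    )

    // Task is the primary's record for one map or reduce task: its state,
    // the worker holding it, and (for completed map tasks) where the
    // intermediate output lives.
    type Task struct {
        State     TaskState
        Worker    string   // identity of the assigned worker machine
        Locations []string // intermediate file locations, reported on completion
    }

    func main() {
        t := Task{State: InProgress, Worker: "worker-7"}
        fmt.Printf("%+v\n", t)
    }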

Fault Tolerance

The primary node pings each worker periodically. If a worker is unresponsive, its tasks are reset to idle so they can be re-assigned to other workers. Reduce tasks already marked as complete can be safely ignored, since their output is saved to a globally accessible file system; completed map tasks, however, must be re-executed, because their intermediate output lives on the failed machine's local disk.
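
A rough sketch of that failure detector (the timeout value and the map-based bookkeeping are assumptions):

    package main

    import (
        "fmt"
        "time"
    )

    // findDead scans the most recent successful ping per worker. If a
    // worker hasn't responded within the timeout, every task it holds
    // should be reset to idle so another worker can pick it up.
    func findDead(lastSeen map[string]time.Time, timeout time.Duration, now time.Time) []string {
        var dead []string
        for worker, t := range lastSeen {
            if now.Sub(t) > timeout {
                dead = append(dead, worker)
            }
        }
        return dead
    }

    func main() {
        now := time.Now()
        lastSeen := map[string]time.Time{
            "worker-1": now.Add(-5 * time.Second),
            "worker-2": now.Add(-90 * time.Second),
        }
        fmt.Println(findDead(lastSeen, time.Minute, now)) // [worker-2]
    }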

When a map task is re-executed on a new worker, all reduce workers are notified of the replacement. Any reduce workers that have not yet read from the original map task will read from the new worker instead.

This makes MapReduce resilient to large-scale worker failures.

In the event of a master node failure, either the entire process is aborted, or the master can be recreated from the last checkpoint of its internal data structures.

Would it be possible to replicate the primary node using leadership election or chain replication?

MapReduce also installs a signal handler on each worker. Each invocation of a map or reduce function writes the current record’s sequence number into a global variable. If the worker crashes on a segmentation fault or some other deterministic error, the handler sends a UDP message to the primary containing that sequence number. After repeated failures on the same record, the primary stops scheduling it.
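
The paper's mechanism is a C++ signal handler plus a UDP packet; as a loose Go analogue, with recover standing in for the signal handler:

    package main

    import "fmt"

    // processRecord runs mapFn on one record, converting a panic (the
    // stand-in here for the paper's segfault-and-signal-handler path)
    // into a report of the failing sequence number.
    func processRecord(seq int, record string, mapFn func(string), failed chan<- int) {
        defer func() {
            if r := recover(); r != nil {
                failed <- seq // the paper sends this to the primary via UDP
            }
        }()
        mapFn(record)
    }

    func main() {
        failures := make(map[int]int) // sequence number -> failure count, kept by the primary
        failed := make(chan int, 1)
        bad := func(rec string) { panic("deterministic bug on this record") }
        processRecord(42, "some record", bad, failed)
        failures[<-failed]++
        // After repeated failures on the same record, the primary
        // tells workers to skip it on the next re-execution.
        fmt.Println(failures) // map[42:1]
    }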

Networking

Google’s implementation of MapReduce relies on the Google File System (GFS) to keep map tasks close to their input data (see the “active disks” technique). GFS stores files in 64 MB chunks replicated on (usually) three machines at once, so the implementation attempts to run each map task on one of those machines. If it can’t, it attempts to run on a machine attached to the same network switch.
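
A sketch of that preference order (the rack lookup is an assumed helper, not GFS’s real API):

    package main

    import "fmt"

    // pickWorker prefers a machine that already holds a replica of the
    // input chunk, then any machine on the same switch/rack, then anyone.
    func pickWorker(replicaHosts []string, rackOf func(string) string, idle []string) string {
        holds := make(map[string]bool)
        racks := make(map[string]bool)
        for _, h := range replicaHosts {
            holds[h] = true
            racks[rackOf(h)] = true
        }
        for _, w := range idle {
            if holds[w] {
                return w // data is on local disk: no network transfer at all
            }
        }
        for _, w := range idle {
            if racks[rackOf(w)] {
                return w // same switch: transfer stays off the core network
            }
        }
        if len(idle) > 0 {
            return idle[0]
        }
        return ""
    }

    func main() {
        rackOf := func(h string) string { return h[:1] } // toy rack lookup
        fmt.Println(pickWorker([]string{"a1", "b2"}, rackOf, []string{"b9", "c3"})) // b9: same rack as b2
    }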

The paper states that Google commonly splits workloads into roughly 200,000 map tasks and 5,000 reduce tasks across 2,000 worker machines.

When a worker “straggles” (is slow to complete a task), MapReduce runs backup copies of the task on separate machines and takes whichever finishes first.
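
A sketch of the backup-task idea using goroutines (channels stand in for the real scheduling machinery):

    package main

    import (
        "fmt"
        "time"
    )

    // firstOf runs the same task on two workers and takes whichever
    // result arrives first; the slower duplicate is simply ignored.
    func firstOf(task func() string) string {
        results := make(chan string, 2) // buffered so the loser doesn't block forever
        go func() { results <- task() }()
        go func() { results <- task() }() // the backup execution
        return <-results
    }

    func main() {
        task := func() string {
            time.Sleep(10 * time.Millisecond)
            return "done"
        }
        fmt.Println(firstOf(task))
    }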

Using custom reader and writer interfaces, it’s possible to forgo the file system and consume or produce data in memory or in databases.
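
One way such interfaces might be shaped in Go (these names are assumptions; the point is that input and output sources are pluggable):

    package main

    import "fmt"

    type KeyValue struct{ Key, Value string }

    // Reader and Writer abstract the input and output so a job can pull
    // records from files, memory, or a database without changing the
    // map and reduce functions.
    type Reader interface {
        Next() (KeyValue, bool) // false when the input is exhausted
    }

    type Writer interface {
        Write(KeyValue) error
    }

    // sliceReader serves records from memory, bypassing the file system.
    type sliceReader struct {
        pairs []KeyValue
        i     int
    }

    func (r *sliceReader) Next() (KeyValue, bool) {
        if r.i >= len(r.pairs) {
            return KeyValue{}, false
        }
        kv := r.pairs[r.i]
        r.i++
        return kv, true
    }

    func main() {
        var r Reader = &sliceReader{pairs: []KeyValue{{"k", "v"}}}
        for kv, ok := r.Next(); ok; kv, ok = r.Next() {
            fmt.Println(kv)
        }
    }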

Performance Demo

Google Uses

MapReduce is an abstraction that allows developers to exploit large distributed systems with relatively little development effort.

Google rewrote its web search indexing code as a series of MapReduce operations. Abstracting away parallelization, distribution, locality optimization, load balancing, and fault tolerance made this code much easier to reason about and change over time.

Other Notes

Andrea C. Arpaci-Dusseau and Remzi H. Arpaci-Dusseau, who wrote the OSTEP book, are cited multiple times in the references.

Other topics to research: