Apache Spark

#big-data

Apache Spark is a big data processing platform that expresses a workload as a directed acyclic graph (DAG) of transformation and aggregation operations and distributes it across a cluster. It is similar to the MapReduce model but includes built-in relational operators (such as join, filter, and group-by) and can keep intermediate results in memory, which can substantially improve performance.

Resilient Distributed Datasets (RDDs) are abstractions of the data flowing through the system, whether input or intermediate state, and each one represents the result of a particular computation. Evaluation is lazy: a computation is not performed until a later operation in the DAG actually needs its data.
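
Lazy evaluation can be illustrated with a small toy model in plain Python. This is a minimal sketch, not Spark's API: the `LazyDataset` class and its `map`, `filter`, and `collect` methods are hypothetical stand-ins that only borrow the names of the corresponding RDD operations. Each transformation records a step, and nothing runs until `collect` is called.

```python
from typing import Any, Callable, Iterable, List


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical; not part of Spark's API)."""

    def __init__(self, source: Callable[[], Iterable[Any]]) -> None:
        # Store a recipe for producing the data, not the data itself.
        self._source = source

    @classmethod
    def from_list(cls, items: List[Any]) -> "LazyDataset":
        return cls(lambda: iter(items))

    def map(self, fn: Callable[[Any], Any]) -> "LazyDataset":
        # Transformation: returns a new dataset that describes one more step.
        parent = self._source
        return LazyDataset(lambda: (fn(x) for x in parent()))

    def filter(self, pred: Callable[[Any], bool]) -> "LazyDataset":
        # Transformation: also deferred until an action runs.
        parent = self._source
        return LazyDataset(lambda: (x for x in parent() if pred(x)))

    def collect(self) -> List[Any]:
        # Action: the only point at which the recorded steps execute.
        return list(self._source())


calls: List[int] = []  # records every element that square() actually processes


def square(x: int) -> int:
    calls.append(x)
    return x * x


numbers = LazyDataset.from_list([1, 2, 3, 4])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)

print(calls)   # [] because no element has been processed yet
result = even_squares.collect()
print(result)  # [4, 16]
print(calls)   # [1, 2, 3, 4]
```

Calling `collect` a second time in this sketch would run `square` over every element again, which loosely mirrors how an uncached RDD is recomputed each time an action needs it.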

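Spark’s architecture consists of a driver, a master, and multiple worker nodes. The driver compiles the program into stages of tasks and requests resources from the master, which allocates capacity on the worker nodes to run those tasks. Shuffle steps mark the boundaries between stages: they redistribute records across the workers by sorting them into buckets by key, so that all records with the same key end up in the same place.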
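
The role of a shuffle between two stages can be sketched with a word count in plain Python. This is a simplified illustration under stated assumptions, not Spark code: `NUM_BUCKETS`, `bucket_for`, `map_stage`, `shuffle`, and `reduce_stage` are hypothetical names, everything runs in one process, and a CRC32 hash stands in for whatever partitioner a real cluster would use.

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Tuple

NUM_BUCKETS = 2  # hypothetical number of downstream partitions


def bucket_for(key: str, num_buckets: int) -> int:
    # Deterministic hash, so every map task routes a given key to the same bucket.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def map_stage(partition: List[str]) -> List[Tuple[str, int]]:
    # First-stage task: processes one input partition independently of the others.
    return [(word, 1) for line in partition for word in line.split()]


def shuffle(
    map_outputs: List[List[Tuple[str, int]]], num_buckets: int
) -> List[List[Tuple[str, int]]]:
    # Stage boundary: regroup records from all map outputs into buckets by key.
    buckets: List[List[Tuple[str, int]]] = [[] for _ in range(num_buckets)]
    for output in map_outputs:
        for key, value in output:
            buckets[bucket_for(key, num_buckets)].append((key, value))
    return buckets


def reduce_stage(bucket: List[Tuple[str, int]]) -> Dict[str, int]:
    # Second-stage task: aggregates one bucket; each key lives in only one bucket.
    totals: Dict[str, int] = defaultdict(int)
    for key, value in bucket:
        totals[key] += value
    return dict(totals)


input_partitions = [["spark shuffle spark"], ["shuffle stage"]]
map_outputs = [map_stage(p) for p in input_partitions]
buckets = shuffle(map_outputs, NUM_BUCKETS)

counts: Dict[str, int] = {}
for bucket in buckets:
    counts.update(reduce_stage(bucket))

print(counts == {"spark": 2, "shuffle": 2, "stage": 1})  # True
```

Because every occurrence of a key is routed to the same bucket, each second-stage task can aggregate its bucket without consulting any other task.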