Parallel vs Distributed Computing
- Coupling
- Parallel computing: tight coupling between tasks that cooperate via shared memory
- Distributed computing: loose coupling between nodes that cooperate by message passing
- Algorithm implementation
- In classical algorithms, the problem is encoded and given to a single processing element.
- In parallel algorithms, the problem is encoded and partitioned, and its chunks are given to the processing elements.
- In distributed algorithms, the problem is posed by the network itself and solved by coordination protocols.
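The classical-vs-parallel distinction above can be sketched as follows; summing a list stands in for "the problem", and the chunking scheme is illustrative, not from the notes:

```python
from concurrent.futures import ThreadPoolExecutor

def classical_sum(data):
    # classical: a single processing element handles the whole encoded problem
    return sum(data)

def parallel_sum(data, n_workers=4):
    # parallel: partition the problem, hand chunks to processing elements
    chunks = [data[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)  # combine partial results

data = list(range(1000))
print(classical_sum(data), parallel_sum(data))  # 499500 499500
```

A distributed version would differ again: no shared `data` list exists, and the nodes would have to coordinate by exchanging messages.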
Distributed systems are groups of networked computers which share a common goal for their work.
Lynch’s Discussion of how Distributed Algorithms differ
There are many different kinds of distributed algorithms. Some of the attributes by which they differ include:
The Inter-process Communication (IPC) Method
Distributed algorithms run on a collection of processors, which need to communicate somehow. They could communicate through:
- Accessing Shared Memory
- Point-to-Point or Broadcast Messaging
- Executing Remote Procedure Calls
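Point-to-point messaging can be imitated inside one process with a queue per directed channel; this is a minimal sketch, and the channel and message names (`ping`, `pong`) are illustrative:

```python
import queue
import threading

# one queue per directed channel: a -> b and b -> a
channel_ab = queue.Queue()
channel_ba = queue.Queue()
results = []

def node_a():
    channel_ab.put("ping")        # send a message to b
    reply = channel_ba.get()      # receive (blocks until a message arrives)
    results.append(reply)

def node_b():
    msg = channel_ab.get()        # receive from a
    channel_ba.put(msg + " -> pong")  # send a reply

ta = threading.Thread(target=node_a)
tb = threading.Thread(target=node_b)
ta.start(); tb.start()
ta.join(); tb.join()
print(results)  # ['ping -> pong']
```

Shared-memory IPC would instead have both threads read and write common variables; remote procedure calls would wrap the send/receive pair behind a function-call interface.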
Timing Model
Timing of events in the system, reflecting the different types of timing information that might be used by algorithms.
- Completely synchronous: performing communication and computation in lock-step synchrony
- Completely asynchronous: each node can take steps at arbitrary speeds and in arbitrary orders.
- Partially asynchronous: in between, some timing restrictions but not completely lock-step. For example, each node could
- have bounds on their relative speeds, or
- have access to approximately synchronized clocks.
Failure Model
- The algorithm can assume that there will be no node failures
- The algorithm can also be designed to tolerate a limited amount of faulty behavior, such as:
- Processors might just stop, with or without warning;
- Processors might fail transiently (fail briefly, then recover);
- Processors might exhibit more severe Byzantine failures, where a failed processor can behave arbitrarily.
- Failures of the communication mechanisms, including
- message loss
- duplication
Problem Addressed
The typical problems that are considered are those that arise in the application areas mentioned above.
Typical problems:
- resource allocation,
- communication,
- consensus among distributed processors,
- database concurrency control,
- deadlock detection,
- global snapshots,
- synchronization
Typical scenario in distributed computing
- computing nodes have local memories and unique IDs
- nodes are arranged in a network: graph, digraph, ring…
- neighbouring nodes communicate by message passing (This is a typical IPC method)
- independent failures are possible: nodes, communication channels (Failure model)
- the network topology (size, diameter, neighbourhoods) and other characteristics (e.g. latencies) are often unknown to individual nodes
- the network may or may not dynamically change
- nodes solve parts of a bigger problem or have individual problems but still need some sort of coordination - which is achieved by distributed algorithms.
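The scenario above can be illustrated with a toy coordination protocol: nodes with unique IDs arranged in a ring learn the maximum ID by forwarding what they know, round by round. This is a hypothetical sketch simulated in one process; a real distributed run would use message passing between machines:

```python
def max_id_on_ring(ids):
    # each node i knows only its own ID initially
    n = len(ids)
    known = list(ids)  # each node's current best guess of the maximum
    for _ in range(n - 1):  # after n-1 rounds, information has circled the ring
        # receive: each node reads the message from its left neighbour
        incoming = [known[(i - 1) % n] for i in range(n)]
        # process: keep the larger of own guess and received value
        known = [max(k, m) for k, m in zip(known, incoming)]
        # send: implicit here -- next round's `incoming` reads from `known`
    return known

print(max_id_on_ring([3, 7, 1, 5]))  # every node decides 7: [7, 7, 7, 7]
```

Note how the nodes need no knowledge of the ring's size to run each round, only of their immediate neighbours; the bound n-1 on the number of rounds is used here solely to stop the simulation.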
Rounds and Steps
Nodes work in rounds (macro-steps), which have three sub-steps:
- 1 Receive sub-step: receive incoming messages
- 2 Process sub-step: change local node state
- 3 Send sub-step: send outgoing messages
Note: some algorithms expect explicit null or empty messages as confirmation that nothing was sent
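The three sub-steps can be sketched as a small round simulator, assuming the synchronous model where all nodes act in lock-step; the function names are illustrative:

```python
def run_rounds(n_nodes, init_state, process, n_rounds):
    states = [init_state(i) for i in range(n_nodes)]
    inboxes = [[] for _ in range(n_nodes)]
    for _ in range(n_rounds):
        outboxes = []
        for i in range(n_nodes):
            received = inboxes[i]                             # 1. receive sub-step
            states[i], out = process(i, states[i], received)  # 2. process sub-step
            outboxes.append(out)                              # out: list of (dest, msg)
        inboxes = [[] for _ in range(n_nodes)]
        for out in outboxes:                                  # 3. send sub-step
            for dest, msg in out:
                inboxes[dest].append(msg)
    return states

# example: each of 4 nodes keeps the max value seen and forwards it rightwards
def process(i, state, received):
    state = max([state] + received)
    return state, [((i + 1) % 4, state)]

print(run_rounds(4, init_state=lambda i: i, process=process, n_rounds=4))
# [3, 3, 3, 3]
```

Messages sent in round r are only visible in round r+1's receive sub-step, which is exactly the lock-step discipline of the synchronous model described next.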
The synchronous model as a particular case of the asynchronous model
We describe the timing models in terms of the rounds and sub-steps above.
- Synchronous model, version 1
- all nodes: process takes 1 time unit
- all messages: transit time (send -> receive) takes 0 time units
- Synchronous model, version 2
- all nodes: process takes 0 time units
- all messages: transit time (send -> receive) takes 1 time unit
The second (equivalent) version makes the synchronous model a particular case of the asynchronous model
Asynchronous model
- all nodes: process takes 0 time units
- each message (individually): transit time (send -> receive) takes any real number of time units (in the synchronous model it is exactly 1 time unit)
- or, normalised: any real number in the [0, 1] interval
- often, a FIFO guarantee is assumed, which however may affect the above timing assumption (see below)
async time complexity >= sync time complexity
- Worst-case time complexity = supremum over all possible normalised asynchronous runs.
- A synchronous run is just one of these possible runs: the one where every transit takes exactly 1 time unit.
Asynchronous model with FIFO channels
- The FIFO guarantee: faster messages cannot overtake earlier slower messages sent over the same channel
- Congestion (pileup) may occur and this should be accounted for
- The timing assumption (transit time of any real number of time units, or, normalised, any real number in the [0, 1] interval) applies only to the top (oldest) message in the FIFO queue
- After the top message is delivered, the next top message is delivered after an additional arbitrary delay.
- Thus, a sequence of n messages may take n time units until the last one is delivered
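The pileup bound can be checked with a toy calculation, assuming the normalised model where the head-of-queue message is delivered within at most 1 time unit of reaching the head; the function name is illustrative:

```python
def worst_case_delivery_times(send_times, head_delays):
    # head_delays[i] in [0, 1]: delay of message i after it reaches the head
    deliveries = []
    prev = 0.0
    for s, d in zip(send_times, head_delays):
        head_at = max(s, prev)  # message reaches the head of the FIFO queue
        prev = head_at + d      # delivered after its own head delay
        deliveries.append(prev)
    return deliveries

# five messages sent at time 0, each taking the maximal head delay of 1:
print(worst_case_delivery_times([0] * 5, [1] * 5))
# [1.0, 2.0, 3.0, 4.0, 5.0]
```

The last of the n = 5 messages is delivered only at time 5, matching the claim that a sequence of n messages may take n time units.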
Non-determinism
Both sync and async timing models are non-deterministic.
- Choice of which request to serve (sync and async)
- Message delay may vary (async)
However, all executions must arrive at a valid decision (not necessarily the same one, but a valid one).
Confluent system
In computer science, confluence is a property of rewriting systems: a term that can be rewritten in more than one way still yields the same final result.
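A toy confluent system can illustrate this: rewrite any adjacent pair (a, b) in a list of numbers to a + b. Different rewrite orders pass through different intermediate terms, but because addition is associative they reach the same normal form. This sketch is illustrative and not from the notes:

```python
def reduce_in_order(terms, order):
    # apply one rewrite per position in `order`: replace the adjacent pair
    # at that index with its sum
    terms = list(terms)
    for pos in order:
        terms[pos:pos + 2] = [terms[pos] + terms[pos + 1]]
    return terms

t = [1, 2, 3, 4]
left = reduce_in_order(t, [0, 0, 0])   # always rewrite the leftmost pair
right = reduce_in_order(t, [2, 1, 0])  # always rewrite the rightmost pair
print(left, right)  # [10] [10]
```

This mirrors the non-determinism discussion above: like the different executions of a distributed algorithm, the different rewrite orders all arrive at a valid (here, identical) result.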