Parallel vs Distributed Computing

  • Coupling
    • Parallel computing: tight coupling between tasks, which cooperate through shared memory
    • Distributed computing: loose coupling between nodes, which cooperate by message passing
  • Algorithm implementation
    • In classical algorithms, the problem is encoded and given to a single processing element
    • In parallel algorithms, the problem is encoded and partitioned, and its chunks are given to the processing elements
    • In distributed algorithms, the problem is given by the network itself and solved by coordination protocols

Distributed systems are groups of networked computers which share a common goal for their work.

Lynch’s Discussion of how Distributed Algorithms differ

There are many different kinds of distributed algorithms. Some of the attributes by which they differ include:

The Inter-process Communication (IPC) Method

Distributed algorithms run on a collection of processors, which need to communicate somehow. They could communicate through:

  • Accessing Shared Memory
  • Point-to-Point or Broadcast Messaging
  • Executing Remote Procedure Calls

Timing Model

The timing of events in the system reflects the different types of timing information that algorithms might use:

  • Completely synchronous: communication and computation proceed in lock-step synchrony
  • Completely asynchronous: each node can take steps at arbitrary speeds and in arbitrary orders.
  • Partially synchronous: in between; some timing restrictions, but not complete lock-step. For example, each node could
    • have bounds on their relative speeds, or
    • have access to approximately synchronized clocks.

Failure Model

  • The algorithm can assume that there will be no node failures
  • The algorithm can also be designed to tolerate a limited amount of faulty behavior, such as:
    • Processors might just stop, with or without warning;
    • Processors might fail transiently (a momentary fault, after which they resume);
    • Processors might exhibit more severe Byzantine failures, where a failed processor can behave arbitrarily.
    • Failures of the communication mechanisms, including
      • message loss
      • duplication

Problem Addressed

The typical problems considered are those that arise in the application areas mentioned above, including:

  • resource allocation,
  • communication,
  • consensus among distributed processors,
  • database concurrency control,
  • deadlock detection,
  • global snapshots,
  • synchronization

Typical scenario in distributed computing

  • computing nodes have local memories and unique IDs
  • nodes are arranged in a network: graph, digraph, ring…
  • neighbouring nodes communicate by message passing (a typical IPC method)
  • independent failures are possible: nodes, communication channels (Failure model)
  • the network topology (size, diameter, neighbourhoods) and other characteristics (e.g. latencies) are often unknown to individual nodes
  • the network may or may not dynamically change
  • nodes solve parts of a bigger problem, or have individual problems but still need some sort of coordination, which is achieved by distributed algorithms (a sketch of such a node follows below)
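
A minimal Python sketch of this scenario; the Node class, its field names, and the ring example are illustrative assumptions, not part of any particular framework:

```python
import itertools

_next_id = itertools.count(1)

class Node:
    """One computing node: local memory, a unique ID, and links to neighbours.

    A node knows only its own neighbourhood; global properties such as the
    network size or diameter are deliberately not stored on the node.
    """
    def __init__(self):
        self.uid = next(_next_id)   # unique ID
        self.memory = {}            # local memory, invisible to other nodes
        self.neighbours = {}        # neighbour uid -> that neighbour's inbox
        self.inbox = []             # incoming messages: (sender_uid, payload)

def link(a, b):
    """Create a bidirectional communication channel between two nodes."""
    a.neighbours[b.uid] = b.inbox
    b.neighbours[a.uid] = a.inbox

# A ring of four nodes: each node sees exactly two neighbours and nothing else.
ring = [Node() for _ in range(4)]
for i in range(4):
    link(ring[i], ring[(i + 1) % 4])
```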

Rounds and Steps

Nodes work in rounds (macro-steps), each of which has three sub-steps:

  1. Receive sub-step: receive incoming messages
  2. Process sub-step: change local node state
  3. Send sub-step: send outgoing messages

Note: some algorithms expect null or empty messages as an explicit confirmation that nothing was sent.
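
The three sub-steps can be written down as a small Python skeleton; the class and method names below are assumptions for illustration:

```python
class RoundBasedNode:
    """Skeleton of one macro-step (round) with its three sub-steps.

    `neighbours` maps a neighbour's uid to that neighbour's inbox list.
    """
    def __init__(self, uid, neighbours):
        self.uid = uid
        self.neighbours = neighbours
        self.inbox = []
        self.round = 0

    def run_round(self):
        # 1. Receive sub-step: drain all messages that have arrived.
        received = self.inbox
        self.inbox = []
        # 2. Process sub-step: change local node state only.
        self.round += 1
        outgoing = self.compute(received)
        # 3. Send sub-step: push outgoing messages to the neighbours.
        # A None payload plays the role of the explicit null message that
        # some algorithms require as confirmation that nothing was sent.
        for uid, inbox in self.neighbours.items():
            inbox.append((self.uid, outgoing.get(uid)))

    def compute(self, received):
        # Algorithm-specific part; here: report how many messages arrived.
        return {uid: len(received) for uid in self.neighbours}
```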

Synchronous models as a particular case of asynchronous models

We describe the timing models in terms of the rounds and sub-steps above.

  • Synchronous model, version 1
    • all nodes: process takes 1 time unit
    • all messages: transit time (send -> receive) takes 0 time units
  • Synchronous model, version 2
    • all nodes: process takes 0 time units
    • all messages: transit time (send -> receive) takes 1 time unit

The second (and equivalent) version makes synchronous models a particular case of asynchronous models; a lock-step scheduler with these semantics is sketched below.
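
A minimal sketch of such a lock-step scheduler in Python, assuming nodes expose a process(received) method that returns (destination uid, payload) pairs; all names are illustrative:

```python
def run_synchronous(nodes, num_rounds):
    """Lock-step scheduler, a sketch of version 2 above: processing takes
    0 time units and every transit takes exactly 1 time unit, i.e. a message
    sent in round t is received at the start of round t + 1."""
    inboxes = {node.uid: [] for node in nodes}
    for _ in range(num_rounds):
        # Swap in fresh inboxes first, so a message sent in this round
        # cannot be seen by a node that happens to run later in it.
        arrived = inboxes
        inboxes = {node.uid: [] for node in nodes}
        for node in nodes:
            for dest_uid, payload in node.process(arrived[node.uid]):
                inboxes[dest_uid].append((node.uid, payload))

class CountingNode:
    """Toy node for the scheduler: counts and floods received messages."""
    def __init__(self, uid, neighbour_uids):
        self.uid = uid
        self.neighbour_uids = neighbour_uids
        self.total = 0

    def process(self, received):
        self.total += len(received)
        return [(uid, self.total) for uid in self.neighbour_uids]

# Two nodes in a 2-ring exchange messages for three lock-step rounds:
a, b = CountingNode(1, [2]), CountingNode(2, [1])
run_synchronous([a, b], 3)
print(a.total, b.total)   # 2 2: one message received per round from round 2 on
```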

Asynchronous model

  • all nodes: process takes 0 time units

  • each message (individually): transit time (send -> receive) takes an arbitrary non-negative real number of time units (for the synchronous model it is exactly 1 time unit)

    • or, normalised: any real number in the [0; 1] interval
  • often, a FIFO guarantee, which however may affect the above timing assumption (see the FIFO section below)

async time complexity >= sync time complexity

Time complexity (worst-case) = the supremum of the durations of all possible normalised async runs.

A sync run is just one among all possible runs: the one where every transit takes exactly 1 time unit. The worked example below shows the comparison on a simple chain.
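
A small worked example, assuming a token forwarded hop by hop along a chain of nodes with normalised per-hop delays; the function name is illustrative:

```python
def chain_completion_time(delays):
    """Time for a token forwarded along a chain to reach the last node, when
    hop i has normalised transit time delays[i] in [0, 1] and processing
    takes 0 time units: the total is simply the sum of the delays."""
    assert all(0 <= d <= 1 for d in delays)
    return sum(delays)

# The synchronous run is the single run where every transit takes 1 unit:
print(chain_completion_time([1.0] * 5))                    # 5.0
# Any other normalised async run of the same chain finishes no later:
print(chain_completion_time([0.2, 1.0, 0.7, 0.1, 0.9]))    # 2.9
# The supremum over all delays in [0, 1]^5 is 5.0, attained by the sync run,
# so here the async worst case equals the sync time; in general it can only
# be greater or equal.
```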

Asynchronous model with FIFO channels

  • The FIFO guarantee: faster messages cannot overtake earlier slower messages sent over the same channel
  • Congestion (pileup) may occur and this should be accounted for
  • The timing assumption above (each individual transit takes an arbitrary delay; normalised, a delay in the [0; 1] interval) applies only to the top (oldest) message in the FIFO queue
  • After the top message is delivered, the next top message is delivered after an additional arbitrary delay.
  • Thus, a sequence of n messages may take n time units until the last one is delivered, as the sketch below shows
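
A sketch of this head-of-line blocking rule in Python; the FifoChannel class and its method names are illustrative assumptions:

```python
class FifoChannel:
    """One-directional FIFO channel, a sketch of the timing rule above:
    only the oldest undelivered message is 'in transit'; the delay of the
    next message starts counting only after its predecessor is delivered."""
    def __init__(self):
        self.sent = []   # (payload, transit_delay, send_time), in send order

    def send(self, payload, transit_delay, send_time):
        self.sent.append((payload, transit_delay, send_time))

    def delivery_times(self):
        """Delivery time of each message under head-of-line blocking: each
        delay counts from the later of its send time and the previous
        delivery."""
        times, ready = [], 0.0
        for payload, delay, sent_at in self.sent:
            ready = max(ready, sent_at) + delay
            times.append((payload, ready))
        return times

# n = 5 messages sent at time 0, each with delay 1: the last one is
# delivered only at time 5, matching the 'n time units' bound above.
ch = FifoChannel()
for i in range(5):
    ch.send(f"m{i}", transit_delay=1.0, send_time=0.0)
print(ch.delivery_times()[-1])   # ('m4', 5.0)
```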

Non-determinism

Both sync and async timing models are non-deterministic.

  • Choice of which pending request or message to handle first (sync and async)
  • Message delay may vary (async)

However, all executions must arrive at a valid decision (not necessarily the same one, but a valid one); the toy run below illustrates this.
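
A toy sketch of this idea in Python, assuming a consensus-like task where any proposed value counts as a valid decision; all names are illustrative:

```python
import random

def one_execution(proposals, rng):
    """Each run shuffles the arrival order of proposals (modelling varying
    message delays), and every node adopts whichever proposal arrives first.
    The decided value varies between executions, but it is always one of
    the proposed values, hence always valid."""
    order = list(proposals)
    rng.shuffle(order)
    return order[0]

rng = random.Random()
outcomes = {one_execution(["a", "b", "c"], rng) for _ in range(100)}
assert outcomes <= {"a", "b", "c"}   # every decision is valid
print(outcomes)                      # typically more than one outcome
```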

Confluent system

In computer science, confluence is a property of rewriting systems: whenever a term can be rewritten in more than one way, the diverging results can be rewritten further to the same final result.
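
A toy string-rewriting system in Python illustrates the idea; the rules are chosen for the example and the helper names are assumptions:

```python
def one_step_rewrites(term, rules):
    """All terms reachable from `term` by one application of one rule."""
    results = set()
    for lhs, rhs in rules:
        start = term.find(lhs)
        while start != -1:
            results.add(term[:start] + rhs + term[start + len(lhs):])
            start = term.find(lhs, start + 1)
    return results

def normal_forms(term, rules):
    """All irreducible terms reachable from `term`, over every possible
    order of rule applications."""
    steps = one_step_rewrites(term, rules)
    if not steps:
        return {term}
    return set().union(*(normal_forms(t, rules) for t in steps))

# Starting from "aba" the two rules diverge ("ba" vs "ab"), yet every
# rewriting order ends in the same normal form "b": the system is
# confluent on this term.
rules = [("ab", "b"), ("ba", "b")]
print(normal_forms("aba", rules))   # {'b'}
```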