Maximilian Stark (mail@dakror.de), WS2019
As of: 01.02.2019
Distributed Systems
- Definitions
- Coulouris et al.: Hardware & software components on networked computers that coordinate solely by message passing
- Tanenbaum & van Steen: Independent computers appearing as single coherent system
- Google Code University: Coordination of networked processes cooperating on a single task
- Ilities
- Reliability
- Fault tolerance
- High availability
- Recoverability
- Consistency
- Scalability (horizontal)
- Performance predictability
- Security
- Heterogeneity
- Openness
Data Storage
- Centralized: BigTable / HBase
- Key-Value Store
- Tablet (data chunk) based processing
- Lock service: Chubby / Zookeeper
- Write-Ahead-Log
- Decentralized: Dynamo / Cassandra
- Gossip-based information propagation
- Hash based key-range designation
- Replication to other nodes
Time
- Physical Clocks
- Cristian's Algorithm
- External synchronization with time server
- $t' = t + t_{rtt} / 2$
- Accuracy $\pm(t_{rtt}/2 - min)$, where $min$ is the minimum one-way transmission time
- Multiple requests: take lowest $t_{rtt}$
- Berkeley Algorithm
- Internal synchronization
- Average of the other nodes' times, adjusted for RTT
- Correction using relative delta messages
- NTP
- Design features
- Authentication
- Redundancy
- UTC-Synchronization over Internet
- Scalability through stratum hierarchy
- Usage in Internet context
- Strata: Time Server Hierarchy
- Modes of Operation (in increasing order of accuracy)
- Multicast: Usage in High-speed LAN
- Procedure-call: Analog to Cristian's Algorithm
- Symmetric: Prolonged pairwise synchronization
- Process
- A sends to B at $T_{i-3}$ ($t, t'$: transit times; $\theta$: offset of B's clock relative to A)
- B receives at $T_{i-2} = T_{i-3} + t + \theta$
- B replies at $T_{i-1}$
- A receives at $T_{i} = T_{i-1} + t' - \theta$
- $\delta_i = t+t' = T_{i-2}-T_{i-3}+T_i - T_{i-1}$
- $\theta_i = \frac{T_{i-2} - T_{i-3} + T_{i-1} - T_i}{2}$
- $\theta_i - \delta_i/2 \leq \theta \leq \theta_i + \delta_i/2$
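A minimal sketch of the timestamp arithmetic above, together with the Cristian adjustment; function and variable names are mine, not from the lecture:

```python
def ntp_offset_delay(t_send_a, t_recv_b, t_send_b, t_recv_a):
    """Estimate clock offset and total delay from the four NTP timestamps.

    t_send_a = T_{i-3}, t_recv_b = T_{i-2}, t_send_b = T_{i-1}, t_recv_a = T_i
    """
    # delta_i = t + t' (total transit time)
    delay = (t_recv_b - t_send_a) + (t_recv_a - t_send_b)
    # theta_i; the true offset lies in [theta_i - delay/2, theta_i + delay/2]
    offset = ((t_recv_b - t_send_a) + (t_send_b - t_recv_a)) / 2
    return offset, delay

def cristian_time(server_time, rtt):
    """Cristian's algorithm: adjust the server's time by half the RTT."""
    return server_time + rtt / 2
```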
- Logical Clocks
- Happened-Before relation
- Assignment of unique ids to events
- Lamport timestamp
- Incremented before each event
- Piggybacked on messages; on receive: $newtime = \max(oldtime, receivedtime)$, then increment for the receive event
- Total order enforced by including process id
- Vector clocks
- Vector of each process' Lamport timestamp
- Captures causality exactly: concurrent (unconnected) events remain incomparable (partial, not total, order)
- Comparison
- $V \leq V' \iff \forall ~i~V_i \leq V'_i$
- $V ~||~ V' \iff V \nleq V' \land V' \nleq V$
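A compact sketch of the Lamport rules and the vector-clock comparison above; class and function names are illustrative:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                    # incremented before each local event
        self.time += 1
        return self.time

    def receive(self, received_time):  # piggybacked timestamp on a message
        self.time = max(self.time, received_time)
        return self.tick()             # the receive itself is an event

def vc_leq(v, w):
    """V <= V'  iff  for all i: V_i <= V'_i"""
    return all(a <= b for a, b in zip(v, w))

def vc_concurrent(v, w):
    """V || V'  iff  neither V <= V' nor V' <= V"""
    return not vc_leq(v, w) and not vc_leq(w, v)

assert vc_concurrent([1, 0], [0, 1])   # unconnected events are incomparable
```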
Coordination
Mutual Exclusion
Requirements
- Safety: Only one process in CS at a time
- Liveness: Every entry/exit request eventually succeeds
- Fairness: No starvation; requests granted in happened-before order
Methods
- Centralized
- Token-Ring
- Lamport Algorithm
- Priority Queue of entry requests with Lamport timestamps
- Broadcast on entry request
- Reply from all others with their timestamps (head of queues)
- Entry if lowest timestamp
- Broadcast release
- Ricart & Agrawala
- State based (Released, Wanted, Held)
- Entry on positive reply from all others
- If held: withholding of reply until released
- Resolution of concurrent requests by timestamp comparison
- Can be reduced to Leader Election
- Key properties of LE algorithms
- Safety: Eventual consensus on the selection of a leader
- Liveness: Every process eventually participates in the election (or crashes)
- Chang-Roberts Ring-based LE
- First round: election messages circulate, each node forwarding the highest process id seen so far
- Second round: the overall highest id is propagated around the ring to announce the elected leader
- Obsolete (lower-id) messages are dropped along the way (see the sketch after this list)
- Bully algorithm
- Election of highest process id from alive processes
- Total knowledge of other processes in the system
- Allows crashing of processes in election
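A toy simulation of the Chang-Roberts election above, assuming a static ring, unique ids, and no crashes; all names are illustrative:

```python
def chang_roberts(ring_ids, initiator=0):
    """One election pass: the initiator sends its id clockwise; each node
    forwards the higher of (incoming id, own id) and drops lower ids.
    A node that receives its own id back is the leader; a second round
    (omitted here) announces the winner around the ring."""
    n = len(ring_ids)
    msg, i = ring_ids[initiator], initiator
    while True:
        i = (i + 1) % n                 # deliver to the clockwise neighbor
        if msg == ring_ids[i]:          # own id returned: this node wins
            return msg
        msg = max(msg, ring_ids[i])     # forward the higher id, drop the lower

assert chang_roberts([3, 7, 2, 9, 4]) == 9
```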
Group communication
- B-Multicast: one-to-one sends to every group member; delivery guaranteed unless the sender crashes mid-multicast
- R-Multicast
- Via B-Multicast: each member re-multicasts every newly received message to the rest of the group before delivering it
- Via IP-Multicast
- Detect missing multicasts via gaps in piggybacked sequence numbers
- Store incoming messages in hold-back queue
- Ordered Multicast
- FIFO ordering
- If A multicasts m before m', every process delivers m before m'
- Implementation using per-sender sequence numbers piggybacked on each message
- Hold back messages until all lower sequence numbers have been delivered
- Causal ordering: multicast m → m' => delivery m → delivery m'
- order via happened-before relation
- implies FIFO ordering
- Implementation using vector timestamps
- Total ordering
- If any process delivers m before m', every other process delivers m before m'
- Sequencer algorithm
- Single node in group to enforce global sequence
- Receives messages as well and multicasts global sequence number to other receivers
- ISIS Algorithm: collect proposed sequence numbers from potential receivers, pick highest for new message
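A sketch of the sequencer algorithm above with a hold-back queue on the receiver side; networking is omitted and all names are illustrative:

```python
class Sequencer:
    """Dedicated group member that assigns the global sequence numbers."""
    def __init__(self):
        self.next_seq = 0

    def on_b_multicast(self, msg_id):
        """Assign the next global number; the <msg_id, seq> order message
        would then be multicast to the group (omitted here)."""
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return msg_id, seq

class Receiver:
    """Group member: holds messages back until deliverable in global order."""
    def __init__(self):
        self.expected = 0      # next global sequence number to deliver
        self.holdback = {}     # seq -> msg_id (hold-back queue)
        self.delivered = []

    def on_order(self, msg_id, seq):
        self.holdback[seq] = msg_id
        while self.expected in self.holdback:       # deliver in sequence
            self.delivered.append(self.holdback.pop(self.expected))
            self.expected += 1

# Messages arrive out of order; delivery is still totally ordered.
r, s = Receiver(), Sequencer()
orders = [s.on_b_multicast(m) for m in ("a", "b", "c")]
for msg_id, seq in reversed(orders):    # simulate out-of-order arrival
    r.on_order(msg_id, seq)
assert r.delivered == ["a", "b", "c"]
```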
- Consensus & Byzantine Generals problem
- Requirements
- Agreement
- Validity
- Integrity
- Termination
- Synchronous system
- at least $f+1$ rounds (at most $f$ crashed processes)
- In each round, B-multicast all values received so far that have not yet been sent in a previous round (see the sketch after this list)
- round duration determined by maximum timeout setting
- Eventual selection of minimum value
- Majority function to determine final result
- Impossibility in synchronous system of size $n \leq 3f$ (Pease et al)
- Impossibility of guarantee in any asynchronous system
- Solution for $n \geq 3f + 1$
- Followers not voting for their own preference
- Followers repeating Leader value
- Follower internal consensus by majority of received data
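A simulation of the $f+1$-round crash-fault consensus above; for simplicity no process actually crashes here, so the rounds only illustrate the value flooding (names are illustrative):

```python
def sync_consensus(values, f):
    """f+1 rounds tolerate up to f crash faults: every round, each process
    B-multicasts the values it has not forwarded yet; afterwards all correct
    processes share the same set and decide on the minimum value."""
    n = len(values)
    known = [{v} for v in values]          # values known to each process
    sent = [set() for _ in range(n)]       # values already forwarded
    for _ in range(f + 1):
        outgoing = [known[p] - sent[p] for p in range(n)]
        for p in range(n):
            sent[p] |= outgoing[p]
        for p in range(n):
            for q in range(n):
                known[q] |= outgoing[p]    # B-multicast delivery
    return [min(k) for k in known]         # eventual selection of the minimum

assert sync_consensus([3, 1, 2], f=1) == [1, 1, 1]
```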
Replication
- Active: Client sends request to every replica
- Passive
- Primary-Backup
- Leader Election to select primary
- Client communicates only with primary
- Primary synchronizes backups
- Eager replication: Synchronization before reply to client
- - not scalable
- Multi-primary (MPR)
- Every replica can interact with client
- Consensus based synchronization
- Conflict-free replicated data types (CRDTs) (see the counter sketch after this section)
- Growth-only counter
- PN-Counter: pair of g-counter for plus and minus, returning difference
- Growth-only set
- 2P-Set: pair of G-Sets for added and removed elements (removal is final)
- Replicated state-based object
- Max-register
- Significant system slowdown in eager replication
- Optimistic Lazy MPR
- Immediate response to client
- Asynchronous update propagation
- Optimistic about minimal divergence of the global state
- Gossiping
- Chain
- Replicas arranged in linked list
- Updates sent to head and propagated along list, response sent by tail
- Queries sent to tail for immediate response
- Propagation
- Speculative history: updates propagating head→tail (not yet acknowledged)
- Stable history: acknowledgements propagating tail→head
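A sketch of the state-based G-Counter and PN-Counter CRDTs mentioned above; merge is element-wise max, and all names are illustrative:

```python
class GCounter:
    """Grow-only counter: one slot per replica; merge = element-wise max."""
    def __init__(self, n_replicas, replica_id):
        self.counts = [0] * n_replicas
        self.id = replica_id

    def increment(self):
        self.counts[self.id] += 1          # each replica only writes its slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):                # commutative, associative, idempotent
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

class PNCounter:
    """Pair of G-Counters for plus and minus; value is the difference."""
    def __init__(self, n_replicas, replica_id):
        self.p = GCounter(n_replicas, replica_id)
        self.n = GCounter(n_replicas, replica_id)

    def increment(self): self.p.increment()
    def decrement(self): self.n.increment()
    def value(self): return self.p.value() - self.n.value()
    def merge(self, other):
        self.p.merge(other.p); self.n.merge(other.n)
```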
Paxos (Consensus Algorithm)
- Roles (independent of processes; one process may take several roles)
- Proposer: Client Input, seeks consensus
- Acceptor: votes on proposals and propagates the latest accepted value
- Learner: Client Output, reaches consensus
- Properties
- Non-triviality (safety): only proposed values can be learned
- Safety: At most one value can be learned
- No liveness: Possibility of livelock
- Process
- Phase 1
- Proposer -> Acceptor: ProposeRequest with globally totally ordered identifier $n$
- Acceptor -> Proposer
- Current (higher-id) proposal if a newer propose was already seen (rejection)
- Promise otherwise
- Phase 2
- Proposer: 1st consensus
- Awaits a majority of Acceptor responses
- Calculates its consensus value
- Value of the previous propose with the highest id among the promises
- Own value if no previous proposes were reported
- Sent as AcceptRequest to all Acceptors
- Acceptor -> Learner: Forwarding of AcceptRequests
- Learners: 2nd consensus
- Awaiting of majority of AcceptRequests
- Calculation of final value
- Reply to client
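A sketch of the acceptor's decision logic in the process above; message names follow this summary (classic Paxos calls them Prepare/Accept), everything else is illustrative and networking is omitted:

```python
class Acceptor:
    """Single-decree Paxos acceptor (state machine only)."""
    def __init__(self):
        self.promised_id = None    # highest proposal id promised so far
        self.accepted = None       # (id, value) of the last accepted proposal

    def on_propose_request(self, n):
        if self.promised_id is not None and n <= self.promised_id:
            # A newer propose was already seen: report the current proposal id.
            return ("reject", self.promised_id)
        self.promised_id = n
        # Promise includes the previously accepted proposal, if any, so the
        # proposer can adopt the value of the highest-id previous propose.
        return ("promise", self.accepted)

    def on_accept_request(self, n, value):
        if self.promised_id is not None and n < self.promised_id:
            return ("reject", self.promised_id)
        self.promised_id = n
        self.accepted = (n, value)         # forwarded to the learners
        return ("accepted", n, value)
```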
Replication & Consistency
Consistency Models
- Global ordering of messages too expensive
- Weaker consistency requirements depending on application
Data-centric Consistency
- Definition of results for concurrent reads / writes
- Strict consistency
- Any read must return most recent write value
- Assumes absolute global time
- Impossible to realize: (Unpredictable) latency > 0
- Sequential consistency
- Logical time
- Each process's operations appear in the order they were issued at that process
- Final result equal to result of some sequential order of each process's operations
- Usage: System-wide consistent reads
- Linearizable consistency
- Stronger than sequential consistency
- Assumes global time
- Ordering of operations by timestamp of execution
- Overlapping operations may take effect in either order; the outcome must match one of them
- Usage: Strongest distributed solutions, HBase, Bigtable
- Causal consistency
- Weaker than sequential consistency
- Potentially causally related writes must be seen in same order
- Concurrent writes can be seen in any order
- Usage: Seeing posts before replies
- FIFO consistency
- Writes in single process seen in same order globally
- Writes from different processes seen in any order
- Read all messages from each friend in order but not across friends
- Weak consistency
- Explicit synchronization call to create consensus on current state
- Any order before synchronization
- Responsibility delegation to developer for explicit synchronization
Client-centric Consistency
- Definition of results for single client communicating with multiple replicas
- Assumption: No write-write conflicts
- DNS: Single authority
- Key-Value Stores: Partitioned key range
- WWW: heavy client side caching
- Maintain consistent view for client
- Eventual consistency
- Eventual convergence once writes stop
- Lazy replication using gossiping
- Very weak consistency, highly available (but stale)
- "Weird" results when talking to multiple replicas
- Usage: Few concurrent writes, Dynamo, Cassandra
- Read-Your-Writes consistency
- Writes always completed before subsequent reads on the same process
- Monotonic-Reads consistency
- After reading some data, any subsequent reads will always return the same or newer data
- Writes-Follow-Reads consistency
- Writing data after reading it operates on the same or newer version of the read data
- Monotonic-Writes consistency
- FIFO consistency model
- Preserving a process's local write order at any other replica
CAP-Theorem
- Consistency
- Availability
- Partition-Tolerance
- Desired properties of any DS; at most two fully achievable at once
- Reality check
- Varying degrees of C,A,P (not seen as binary constraints)
- AP: Best effort consistency
- Web caching
- Network file system
- Characteristics
- Optimistic (lazy) replication
- Time-to-live
- Conflict resolution (CRDT)
- Cassandra, Dynamo
- CP: Best effort availability
- Distributed lock services
- Eager replication
- Pessimistic locking
- Paxos, BigTable, HBase
- Dynamic systems
- Change prioritized properties during runtime
- Extended Model: PACELC
- AC trade-off in case of Partition
- L (latency) C trade-off in normal operation (Else)
- Examples
- PA/EL: Low C, low L, high A
- Dynamo
- Cassandra
- Web Caching
- PC/EC: High C, high L, low A
- PA/EC: MongoDB
- PC/EL: Yahoo! PNUTS
Consistent Hashing
- Hash function mapping both nodes and objects onto a ring
- Objects assigned to the closest node hash point clockwise
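A minimal ring sketch, assuming MD5 as the hash and string node names; both choices are mine, not from the lecture:

```python
import bisect
import hashlib

def h(key, m=32):
    """Hash a string onto the ring [0, 2^m)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (1 << m)

class HashRing:
    def __init__(self, nodes):
        # Each node owns the arc ending at its hash point.
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        """Closest node clockwise from the key's hash point (with wrap-around)."""
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("some-object-key"))
```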
File Systems
- POSIX
- Superblock
- Inodes
- Data blocks
- Network File System (NFS)
- User-centric
- Few concurrent accesses
- More common reads than writes
- - Undefined behavior on concurrent writes
- Designed for local LANs
- Access layer on top of virtual files
- RPC client-server communication
- - Bottleneck
- - Server overload
- + Guaranteed consistency
- Caching (=multi-primary replication)
- Crash-Handling & Robustness
- Usage of stateless RPC
- Time-bounded consistency: cache entries refreshed at regular intervals
- Flush-on-close: synchronous update at file close
- Google File System (GFS)
- Application in big data
- Commodity hardware with low stability
- Single master server with meta data
- File & chunk namespaces
- File-chunk mapping
- Replica locations
- Operation log (metadata diff log)
- Shadow master replica
- Chunk servers with data
- Leases
- Time-bound resource ownership
- optional renewal
- early release
- basically auto-release locks (in case of crash)
- Append-driven: Append to file, then create new file from collected total file data
- Hadoop Distributed File System (HDFS): practically the same as GFS
- Ceph Storage Platform
- Unified Storage platform (Block-, Object- & File-Storage)
- Self-healing
- No single point of failure
- Object Storage Daemon (OSD): data management
- Monitors: Metadata management & consensus
Erasure Coding
- Reed Solomon code
- Linear block code
- Data blocks
- Error-correction blocks
- Recovery of flipped bits or loss
- Pre-shared parity encoding matrix
- Restoration of lost data using inverse encoding sub-matrix of intact data
- Luby Transform codes (LT)
- Fountain code
- Limitless number of encoding symbols
- Recovery of missing chunks
- Encoding
- Data split into equal sized chunks
- Select degree from distribution (Soliton distribution)
- Selection of uniformly random distinct input symbols
- Droplet construction: XOR of symbols
- Droplets
- Header: Information about used source symbols
- Data
- Degree = 1: Just data
- Degree > 1: XOR of data
- Decoder
- Data recovery using XOR properties
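A sketch of droplet construction and the peeling decoder above; for simplicity the degree is drawn uniformly instead of from the Soliton distribution (an assumption, not the real LT design), and chunks are modeled as integers XOR-ed together:

```python
import random

def lt_droplet(chunks, rng=random):
    """Build one droplet: pick a degree d, choose d distinct input symbols
    uniformly at random, XOR them; the header records the used symbols."""
    d = rng.randint(1, len(chunks))            # stand-in for the Soliton draw
    indices = rng.sample(range(len(chunks)), d)
    data = 0
    for i in indices:
        data ^= chunks[i]                      # XOR of the selected symbols
    return indices, data                       # header + payload

def peel_decode(droplets, n_chunks):
    """Recover chunks by repeatedly resolving droplets that have exactly one
    unknown symbol left (degree 1 after reduction), using XOR properties."""
    known = {}
    droplets = [(set(ix), data) for ix, data in droplets]
    progress = True
    while progress and len(known) < n_chunks:
        progress = False
        for ix, data in droplets:
            unknown = ix - known.keys()
            if len(unknown) == 1:              # solvable: XOR out known symbols
                i = unknown.pop()
                for j in ix - {i}:
                    data ^= known[j]
                known[i] = data
                progress = True
    return known

drops = [({0}, 0b1010), ({0, 1}, 0b1010 ^ 0b0110), ({1, 2}, 0b0110 ^ 0b1111)]
assert peel_decode(drops, 3) == {0: 0b1010, 1: 0b0110, 2: 0b1111}
```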
P2P
- Benefits
- Efficient resource usage
- Scalability: All consumers also donors
- Reliability (in aggregate)
- Ease of administration
- Use cases
- Large-Scale systems
- Scientific data analysis: SETI@home
- File Sharing
- P2P as the only management strategy that scales sufficiently
- Startups face the danger of "death by success" (demand outgrowing central capacity)
- Overlay networks: P2P network connection topology
- Network types
- Unstructured
- 1st generation
- Centralized: Napster
- Perfect knowledge
- Bottleneck
- Pure: Gnutella, Freenet
- Query flooding
- Imperfect knowledge
- 2nd generation: Hybrid
- Skype, BitTorrent
- Hierarchy of peers
- Dynamic "supernodes" with more capacity
- Participation in search protocol
- Indexing and data caching
- Reduction of message load
- Structured
- Controlled topology
- Based on distributed hashtable abstraction
- Location independent data
- Data distribution among nodes
Distributed hashtable (DHT)
- Desirable properties
- Load balancing: even key distribution
- Small network effects on node change
- Each node stores data about only a few other nodes
- Fast lookup of target nodes
- Implementations
- Chord
- hash IDs of bit size $m$
- Organization of nodes in identifier circle based on node ID
- Keys inserted at successor node (comparison by key- and node-ID)
- Even distribution
- Network change (node join/leave)
- Local knowledge of adjacent ring segment
- High probability of data recovery
- Redistribution on node join
- Search
- At most linear (if only knowledge about single successor)
- Finger table of node $n$ $FT_n$ with up to $m$ entries
- $FT_n(i) = \mathrm{successor}((n+2^{i-1}) ~\mathrm{mod}~ 2^m)$
- Address lookup in $\mathcal{O}(\log n)$
- Each jump halves the distance to the destination
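A sketch of the finger-table formula above, assuming a static, globally known node list (real Chord discovers successors incrementally); names are illustrative:

```python
def finger_table(n, m, nodes):
    """FT_n(i) = successor((n + 2^(i-1)) mod 2^m) for i = 1..m.

    nodes: sorted ids of the nodes present in the identifier circle."""
    def successor(k):
        # First node id >= k, wrapping around the ring if necessary.
        return next((node for node in nodes if node >= k), nodes[0])
    return [successor((n + (1 << (i - 1))) % (1 << m))
            for i in range(1, m + 1)]

# m = 4 (ring of size 16), nodes {1, 4, 9, 11}: node 1's fingers point at
# the successors of 2, 3, 5, 9.
assert finger_table(1, 4, [1, 4, 9, 11]) == [4, 4, 9, 9]
```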
Content addressable network (CAN)
- Virtual multidimensional Cartesian coordinate space
- Mapping of nodes onto a $d$-dimensional torus
- Dynamic partitioning of coordinate space among all nodes
- Assignment of single zone (subspace) per node
- Hashing of all keys to point in space
- Local storage of IP-addresses of touching neighbor zone's nodes
- Average path length
- $d$-dimensional space, $n$ equally-sized zones: average path length $= (d/4) \cdot n^{1/d}$
- Local storage of $2d$ neighbors
- Path length grows as $\mathcal{O}(n^{1/d})$ with the number of nodes
- $d = 2$: $(1/2) \cdot n^{1/2}$
- $d = 3$: $(3/4) \cdot n^{1/3}$
- Node joining: Split of existing zone by picking random point
P2P Examples
- Spotify (ex)
- P2P with server fallback in case of poor download performance
- Caching
- Security by obscurity
- Sequential order
- Supports search
- BitTorrent
- Swarm overlay
- Exchange of downloaded blocks with peers
- Out-of-band search
- Centralized trackers
- Tit-for-tat
- Self-balancing node request protocol
- Random first policy (first piece)
- Rarest first policy (middle pieces)
- Coupon-collector mode: Global broadcast for missing piece
PubSub
- Matching based communication
- No host-to-host communication
- Filtering based on matching model between published data and subscribed interests
- Optimized for scalability and performance
- Large number of publishers and subscribers
- High rate of publications
- Fast (stateless) & simple (topic) matching
- Decoupling properties
- Time: no need for simultaneous availability
- Space: free physical distribution of clients
- Synchronization: no blocking of control flow
- Matching models
- Topic-based: topic as metadata
- Content-based: properties as metadata
- Type-based
Content based matching algorithms
- Counting algorithm: Counting of matched predicates in phase 2
- Clustering algorithm
- General principle
- Phase 1: Matching of all predicates
- Decomposition and sorting of predicates by operator into Predicate Index (linked list per operator of value and predicate ID)
- Creation of hash table linking attribute name to Predicate Index
- Phase 2: Matching of subscriptions to results from Phase 1
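A toy version of the two phases above; the operators, attribute names, and data layout are illustrative, not the lecture's exact structures:

```python
from collections import defaultdict

# Subscriptions as conjunctions of predicates (attribute, operator, value).
subs = {
    "s1": [("price", "<", 100), ("symbol", "=", "GOOG")],
    "s2": [("price", "<", 50)],
}
OPS = {"<": lambda a, b: a < b, "=": lambda a, b: a == b}

# Phase 1 structure: attribute -> operator -> list of (value, predicate owner),
# i.e., a hash table of per-operator predicate indexes.
index = defaultdict(lambda: defaultdict(list))
for sid, preds in subs.items():
    for i, (attr, op, val) in enumerate(preds):
        index[attr][op].append((val, sid))

def match(event):
    """Counting algorithm: Phase 1 matches individual predicates via the
    index; Phase 2 counts hits per subscription, which fires when all of
    its predicates matched."""
    counts = defaultdict(int)
    for attr, ev_val in event.items():
        for op, entries in index[attr].items():
            for val, sid in entries:
                if OPS[op](ev_val, val):
                    counts[sid] += 1
    return [sid for sid, c in counts.items() if c == len(subs[sid])]

print(match({"price": 42, "symbol": "GOOG"}))  # ['s1', 's2']
```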
Routing
- Rendezvous-based
- Partitioning of publication space
- Management of subscriptions and publications meeting at same node
- Example: Scribe
- Infrastructure-less, P2P-DHT based
- Topics hashed in DHT
- Overlay-based
- Client connection to any broker
- simple routing over overlay
- Matching and dissemination performed inside overlay
- Rendezvous-based overlay: Hermes
- content-based matching
- Single broker as rendezvous point: matching and dissemination
- Local broker storage on how to reach rendezvous and subscribers
- Optimizations
- Multiple brokers: increases state complexity of forwarding brokers
- Bloom filters using Link IDs
- Reduce state complexity
- Introduction of false positive disseminations
- Increased processing time at RP
- Filtering-based: PADRES
- Filtering at each broker
- Flooding of subs to initialize routing paths
- Support of any matching model
- Optimizations
- Subscription covering
- Advertising-based routing
- Advertisement-based
- Flooding of advertisements by publishers before publish
- Storage of advertisements at each broker in Subscription Routing Table (SRT)
- No flooding of subscriptions
- Useful when publishers $\ll$ subscribers
Bloom filters
- Space-efficient probabilistic set-membership structure
- Representation as fixed size bit vector
- Membership tests only; the stored elements cannot be enumerated
- Operation
- Insertion
- $k$ hashes of the inserted object
- Setting of bit at position of each resulting hash
- Retrieval
- Checking of set bits for all hashes
- False positives on hash collision
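A minimal Bloom filter sketch; the bit-vector size, $k$, and the salting scheme used to derive the $k$ hashes are my choices, not from the lecture:

```python
import hashlib

class BloomFilter:
    """Fixed-size bit vector with k hash functions; lookups may yield
    false positives (on hash collisions) but never false negatives."""
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, item):
        for i in range(self.k):    # k "independent" hashes via salting
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos              # set bit at each hash position

    def might_contain(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.insert("chunk-42")
assert bf.might_contain("chunk-42")            # always true once inserted
```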
Blockchains
Nothing new.
Cloud
Virtualization
- Type 1: Hypervisor on hardware
- Type 2: Hypervisor on OS
Namespaces
- Wrapping of a global system resource in an abstraction
- Creating a seemingly exclusive context
- Hiding of changes across processes
- Examples: IPC, UTS