Maximilian Stark (mail@dakror.de), WS2019
As of: 01.02.2019
Distributed Systems
- Definitions
- Coulouris et al.: Hardware & software components on networked computers that coordinate solely by message passing
- Tanenbaum & van Steen: Independent computers appearing as single coherent system
- Google Code University: Coordination of networked processes cooperating on a single task
- Ilities
- Reliability
- Fault tolerance
- High availability
- Recoverability
- Consistency
- Scalability (horizontal)
- Performance predictability
- Security
- Heterogeneity
- Openness
Data Storage
- Centralized: BigTable / HBase
- Key-Value Store
- Tablet (data chunk) based processing
- Lock service: Chubby / Zookeeper
- Write-Ahead-Log
- Decentralized: Dynamo / Cassandra
- Gossip-based information propagation
- Hash based key-range designation
- Replication to other nodes
Time
- Physical Clocks
- Cristian's Algorithm
- External synchronization with time server
- $t' = t + t_{rtt} / 2$
- Accuracy $\pm(t_{rtt}/2 - min)$, where $min$ is the minimum one-way transmission time
- Multiple requests: take lowest $t_{rtt}$
- Berkeley Algorithm
- Internal synchronization
- Average of the other nodes' times, adjusted for RTT
- Correction using relative delta messages
- NTP
- Design features
- Authentication
- Redundancy
- UTC-Synchronization over Internet
- Scalability through stratum hierarchy
- Usage in Internet context
- Strata: Time Server Hierarchy
- Modes of Operation (in increasing order of accuracy)
- Multicast: Usage in High-speed LAN
- Procedure-call: Analog to Cristian's Algorithm
- Symmetric: Prolonged pairwise synchronization
- Process
- A sends to B at $T_{i-3}$ ($t, t'$: transit times; $\theta$: offset of B's clock relative to A)
- B receives at $T_{i-2} = T_{i-3} + t + \theta$
- B replies at $T_{i-1}$
- A receives at $T_{i} = T_{i-1} + t' - \theta$
- $\delta_i = t+t' = T_{i-2}-T_{i-3}+T_i - T_{i-1}$
- $\theta_i = \frac{T_{i-2} - T_{i-3} + T_{i-1} - T_i}{2}$
- $\theta_i - \delta_i/2 \leq \theta \leq \theta_i + \delta_i/2$
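A minimal sketch of the timestamp arithmetic above, together with the Cristian adjustment; function and variable names are mine, not from the lecture:

```python
def ntp_offset_delay(t_send_a, t_recv_b, t_send_b, t_recv_a):
    """Estimate clock offset and total delay from the four NTP timestamps.

    t_send_a = T_{i-3}, t_recv_b = T_{i-2}, t_send_b = T_{i-1}, t_recv_a = T_i
    """
    # delta_i = t + t' (total transit time)
    delay = (t_recv_b - t_send_a) + (t_recv_a - t_send_b)
    # theta_i; the true offset lies in [theta_i - delay/2, theta_i + delay/2]
    offset = ((t_recv_b - t_send_a) + (t_send_b - t_recv_a)) / 2
    return offset, delay

def cristian_time(server_time, rtt):
    """Cristian's algorithm: adjust the server's time by half the RTT."""
    return server_time + rtt / 2
```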
- Logical Clocks
- Happened-Before relation
- Assignment of unique ids to events
- Lamport timestamp
- Incremented before each event
- Piggybacked on messages; on receive: $newtime = \max(oldtime, receivedtime)$, then increment for the receive event
- Total order enforced by including process id
- Vector clocks
- Vector of each process' Lamport timestamp
- Captures causality exactly: concurrent (unconnected) events remain incomparable (partial, not total, order)
- Comparison
- $V \leq V' \iff \forall ~i~V_i \leq V'_i$
- $V ~||~ V' \iff V \nleq V' \land V' \nleq V$
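A compact sketch of the Lamport rules and the vector-clock comparison above; class and function names are illustrative:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                    # incremented before each local event
        self.time += 1
        return self.time

    def receive(self, received_time):  # piggybacked timestamp on a message
        self.time = max(self.time, received_time)
        return self.tick()             # the receive itself is an event

def vc_leq(v, w):
    """V <= V'  iff  for all i: V_i <= V'_i"""
    return all(a <= b for a, b in zip(v, w))

def vc_concurrent(v, w):
    """V || V'  iff  neither V <= V' nor V' <= V"""
    return not vc_leq(v, w) and not vc_leq(w, v)

assert vc_concurrent([1, 0], [0, 1])   # unconnected events are incomparable
```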
Coordination
Mutual Exclusion
Requirements
- Safety: Only one process in CS at a time
- Liveness: Every entry/exit request eventually succeeds
- Fairness: No starvation; requests granted in happened-before order
Methods
- Centralized
- Token-Ring
- Lamport Algorithm
- Priority Queue of entry requests with Lamport timestamps
- Broadcast on entry request
- Reply from all others with their timestamps (head of queues)
- Entry if lowest timestamp
- Broadcast release
- Ricart & Agrawala
- State based (Released, Wanted, Held)
- Entry on positive reply from all others
- If held: withholding of reply until released
- Resolution of concurrent requests by timestamp comparison
- Can be reduced to Leader Election
- Key properties of LE algorithms
- Safety: Eventual consensus on the selection of a leader
- Liveness: Every process eventually participates in the election (or crashes)
- Chang-Roberts Ring-based LE
- First round: election messages circulate, each node forwarding the highest process id seen so far
- Second round: the overall highest id is propagated around the ring to announce the elected leader
- Obsolete (lower-id) messages are dropped along the way (see the sketch after this list)
- Bully algorithm
- Election of highest process id from alive processes
- Total knowledge of other processes in the system
- Allows crashing of processes in election
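A toy simulation of the Chang-Roberts election above, assuming a static ring, unique ids, and no crashes; all names are illustrative:

```python
def chang_roberts(ring_ids, initiator=0):
    """One election pass: the initiator sends its id clockwise; each node
    forwards the higher of (incoming id, own id) and drops lower ids.
    A node that receives its own id back is the leader; a second round
    (omitted here) announces the winner around the ring."""
    n = len(ring_ids)
    msg, i = ring_ids[initiator], initiator
    while True:
        i = (i + 1) % n                 # deliver to the clockwise neighbor
        if msg == ring_ids[i]:          # own id returned: this node wins
            return msg
        msg = max(msg, ring_ids[i])     # forward the higher id, drop the lower

assert chang_roberts([3, 7, 2, 9, 4]) == 9
```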
Group communication
- B-Multicast: one-to-one sends to every group member; delivery guaranteed unless the sender crashes mid-multicast
- R-Multicast
- Via B-Multicast: each member re-multicasts every newly received message to the rest of the group before delivering it
- Via IP-Multicast
- Detect missing multicasts via gaps in piggybacked sequence numbers
- Store incoming messages in hold-back queue
- Ordered Multicast
- FIFO ordering
- If A multicasts m before m', every process delivers m before m'
- Implementation using per-sender sequence numbers piggybacked on each message
- Hold back messages until all lower sequence numbers have been delivered
- Causal ordering: multicast m → m' => delivery m → delivery m'
- order via happened-before relation
- implies FIFO ordering
- Implementation using vector timestamps
- Total ordering
- If any process delivers m before m', every other process delivers m before m'
- Sequencer algorithm
- Single node in group to enforce global sequence
- Receives messages as well and multicasts global sequence number to other receivers
- ISIS Algorithm: collect proposed sequence numbers from potential receivers, pick highest for new message
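A sketch of the sequencer algorithm above with a hold-back queue on the receiver side; networking is omitted and all names are illustrative:

```python
class Sequencer:
    """Dedicated group member that assigns the global sequence numbers."""
    def __init__(self):
        self.next_seq = 0

    def on_b_multicast(self, msg_id):
        """Assign the next global number; the <msg_id, seq> order message
        would then be multicast to the group (omitted here)."""
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        return msg_id, seq

class Receiver:
    """Group member: holds messages back until deliverable in global order."""
    def __init__(self):
        self.expected = 0      # next global sequence number to deliver
        self.holdback = {}     # seq -> msg_id (hold-back queue)
        self.delivered = []

    def on_order(self, msg_id, seq):
        self.holdback[seq] = msg_id
        while self.expected in self.holdback:       # deliver in sequence
            self.delivered.append(self.holdback.pop(self.expected))
            self.expected += 1

# Messages arrive out of order; delivery is still totally ordered.
r, s = Receiver(), Sequencer()
orders = [s.on_b_multicast(m) for m in ("a", "b", "c")]
for msg_id, seq in reversed(orders):    # simulate out-of-order arrival
    r.on_order(msg_id, seq)
assert r.delivered == ["a", "b", "c"]
```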
- Consensus & Byzantine Generals problem
- Requirements
- Agreement
- Validity
- Integrity
- Termination
- Synchronous system
- at least $f+1$ rounds (at most $f$ crashed processes)
- In each round, B-multicast all values received so far that have not yet been sent in a previous round (see the sketch after this list)
- round duration determined by maximum timeout setting
- Eventual selection of minimum value
- Majority function to determine final result
- Impossibility in synchronous system of size $n \leq 3f$ (Pease et al)
- Impossibility of guarantee in any asynchronous system
- Solution for $n \geq 3f + 1$
- Followers not voting for their own preference
- Followers repeating Leader value
- Follower internal consensus by majority of received data
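A simulation of the $f+1$-round crash-fault consensus above; for simplicity no process actually crashes here, so the rounds only illustrate the value flooding (names are illustrative):

```python
def sync_consensus(values, f):
    """f+1 rounds tolerate up to f crash faults: every round, each process
    B-multicasts the values it has not forwarded yet; afterwards all correct
    processes share the same set and decide on the minimum value."""
    n = len(values)
    known = [{v} for v in values]          # values known to each process
    sent = [set() for _ in range(n)]       # values already forwarded
    for _ in range(f + 1):
        outgoing = [known[p] - sent[p] for p in range(n)]
        for p in range(n):
            sent[p] |= outgoing[p]
        for p in range(n):
            for q in range(n):
                known[q] |= outgoing[p]    # B-multicast delivery
    return [min(k) for k in known]         # eventual selection of the minimum

assert sync_consensus([3, 1, 2], f=1) == [1, 1, 1]
```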
Replication
- Active: Client sends request to every replica
- Passive
- Primary-Backup
- Leader Election to select primary
- Client communicates only with primary
- Primary synchronizes backups
- Eager replication: Synchronization before reply to client
- - not scalable
- Multi-primary (MPR)
- Every replica can interact with client
- Consensus based synchronization
- Conflict-free replicated data types (CRDTs) (see the counter sketch after this section)
- Growth-only counter
- PN-Counter: pair of g-counter for plus and minus, returning difference
- Growth-only set
- 2P-Set: pair of G-Sets for added and removed elements (removal is final)
- Replicated state-based object
- Max-register
- Significant system slowdown in eager replication
- Optimistic Lazy MPR
- Immediate response to client
- Asynchronous update propagation
- Optimistic about minimal divergence of the global state
- Gossiping
- Chain
- Replicas arranged in linked list
- Updates sent to head and propagated along list, response sent by tail
- Queries sent to tail for immediate response
- Propagation
- Speculative history: updates propagating head→tail (not yet acknowledged)
- Stable history: acknowledgements propagating tail→head
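A sketch of the state-based G-Counter and PN-Counter CRDTs mentioned above; merge is element-wise max, and all names are illustrative:

```python
class GCounter:
    """Grow-only counter: one slot per replica; merge = element-wise max."""
    def __init__(self, n_replicas, replica_id):
        self.counts = [0] * n_replicas
        self.id = replica_id

    def increment(self):
        self.counts[self.id] += 1          # each replica only writes its slot

    def value(self):
        return sum(self.counts)

    def merge(self, other):                # commutative, associative, idempotent
        self.counts = [max(a, b) for a, b in zip(self.counts, other.counts)]

class PNCounter:
    """Pair of G-Counters for plus and minus; value is the difference."""
    def __init__(self, n_replicas, replica_id):
        self.p = GCounter(n_replicas, replica_id)
        self.n = GCounter(n_replicas, replica_id)

    def increment(self): self.p.increment()
    def decrement(self): self.n.increment()
    def value(self): return self.p.value() - self.n.value()
    def merge(self, other):
        self.p.merge(other.p); self.n.merge(other.n)
```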
Paxos (Consensus Algorithm)
- Roles (independent of processes; one process may take several roles)
- Proposer: Client Input, seeks consensus
- Acceptor: votes on proposals and propagates the latest accepted value
- Learner: Client Output, reaches consensus
- Properties
- Non-triviality (safety): only proposed values can be learned
- Safety: At most one value can be learned
- No liveness: Possibility of livelock
- Process
- Phase 1
- Proposer -> Acceptor: ProposeRequest with globally totally ordered identifier $n$
- Acceptor -> Proposer
- Current (higher-id) proposal if a newer propose was already seen (rejection)
- Promise otherwise
- Phase 2
- Proposer: 1st consensus
- Awaits a majority of Acceptor responses
- Calculates its consensus value
- Value of the previous propose with the highest id among the promises
- Own value if no previous proposes were reported
- Sent as AcceptRequest to all Acceptors
- Acceptor -> Learner: Forwarding of AcceptRequests
- Learners: 2nd consensus
- Awaiting of majority of AcceptRequests
- Calculation of final value
- Reply to client
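A sketch of the acceptor's decision logic in the process above; message names follow this summary (classic Paxos calls them Prepare/Accept), everything else is illustrative and networking is omitted:

```python
class Acceptor:
    """Single-decree Paxos acceptor (state machine only)."""
    def __init__(self):
        self.promised_id = None    # highest proposal id promised so far
        self.accepted = None       # (id, value) of the last accepted proposal

    def on_propose_request(self, n):
        if self.promised_id is not None and n <= self.promised_id:
            # A newer propose was already seen: report the current proposal id.
            return ("reject", self.promised_id)
        self.promised_id = n
        # Promise includes the previously accepted proposal, if any, so the
        # proposer can adopt the value of the highest-id previous propose.
        return ("promise", self.accepted)

    def on_accept_request(self, n, value):
        if self.promised_id is not None and n < self.promised_id:
            return ("reject", self.promised_id)
        self.promised_id = n
        self.accepted = (n, value)         # forwarded to the learners
        return ("accepted", n, value)
```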
Replication & Consistency
Consistency Models
- Global ordering of messages too expensive
- Weaker consistency requirements depending on application
Data-centric Consistency
- Definition of results for concurrent reads / writes
- Strict consistency
- Any read must return most recent write value
- Assumes absolute global time
- Impossible to realize: (Unpredictable) latency > 0
- Sequential consistency
- Logical time
- Each process's operations appear in the order they were issued at that process
- Final result equal to result of some sequential order of each process's operations
- Usage: System-wide consistent reads
- Linearizable consistency
- Stronger than sequential consistency
- Assumes global time
- Ordering of operations by timestamp of execution
- Overlapping operations may take effect in either order; the outcome must match one of them
- Usage: Strongest distributed solutions, HBase, Bigtable
- Causal consistency
- Weaker than sequential consistency
- Potentially causally related writes must be seen in same order
- Concurrent writes can be seen in any order
- Usage: Seeing posts before replies
- FIFO consistency
- Writes in single process seen in same order globally
- Writes from different processes seen in any order
- Read all messages from each friend in order but not across friends
- Weak consistency
- Explicit synchronization call to create consensus on current state
- Any order before synchronization
- Responsibility delegation to developer for explicit synchronization
Client-centric Consistency
- Definition of results for single client communicating with multiple replicas
- Assumption: No write-write conflicts
- DNS: Single authority
- Key-Value Stores: Partitioned key range
- WWW: heavy client side caching
- Maintain consistent view for client
- Eventual consistency
- Eventual convergence once writes stop
- Lazy replication using gossiping
- Very weak consistency, highly available (but stale)
- "Weird" results when talking to multiple replicas
- Usage: Few concurrent writes, Dynamo, Cassandra
- Read-Your-Writes consistency
- Writes always completed before subsequent reads on the same process
- Monotonic-Reads consistency
- After reading some data, any subsequent reads will always return the same or newer data
- Writes-Follow-Reads consistency
- Writing data after reading it operates on the same or newer version of the read data
- Monotonic-Writes consistency
- FIFO consistency model
- Preserving a process's local write order at any other replica
CAP-Theorem
- Consistency
- Availability
- Partition-Tolerance
- Desired properties of any DS; at most two fully achievable at once
- Reality check
- Varying degrees of C,A,P (not seen as binary constraints)
- AP: Best effort consistency
- Web caching
- Network file system
- Characteristics
- Optimistic (lazy) replication
- Time-to-live
- Conflict resolution (CRDT)
- Cassandra, Dynamo
- CP: Best effort availability
- Distributed lock services
- Eager replication
- Pessimistic locking
- Paxos, BigTable, HBase
- Dynamic systems
- Change prioritized properties during runtime
- Extended Model: PACELC
- AC trade-off in case of Partition
- L (latency) C trade-off in normal operation (Else)
- Examples
- PA/EL: Low C, low L, high A
- Dynamo
- Cassandra
- Web Caching
- PC/EC: High C, high L, low A
- PA/EC: MongoDB
- PC/EL: Yahoo! PNUTS
Consistent Hashing
- Hash function mapping both nodes and objects onto a ring
- Objects assigned to the closest node hash point clockwise
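A minimal ring sketch, assuming MD5 as the hash and string node names; both choices are mine, not from the lecture:

```python
import bisect
import hashlib

def h(key, m=32):
    """Hash a string onto the ring [0, 2^m)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (1 << m)

class HashRing:
    def __init__(self, nodes):
        # Each node owns the arc ending at its hash point.
        self.points = sorted((h(n), n) for n in nodes)

    def node_for(self, key):
        """Closest node clockwise from the key's hash point (with wrap-around)."""
        i = bisect.bisect(self.points, (h(key), ""))
        return self.points[i % len(self.points)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("some-object-key"))
```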
File Systems
- POSIX
- Superblock
- Inodes
- Data blocks
- Network File System (NFS)
- User-centric
- Few concurrent accesses
- More common reads than writes
- - Undefined behavior on concurrent writes
- Designed for local LANs
- Access layer on top of virtual files
- RPC client-server communication
- - Bottleneck
- - Server overload
- + Guaranteed consistency
- Caching (=multi-primary replication)
- Crash-Handling & Robustness
- Usage of stateless RPC
- Time-bounded consistency: cache entries refreshed at regular intervals
- Flush-on-close: synchronous update at file close
- Google File System (GFS)
- Application in big data
- Commodity hardware with low stability
- Single master server with meta data
- File & chunk namespaces
- File-chunk mapping
- Replica locations
- Operation log (metadata diff log)
- Shadow master replica
- Chunk servers with data
- Leases
- Time-bound resource ownership
- optional renewal
- early release
- basically auto-release locks (in case of crash)
- Append-driven: Append to file, then create new file from collected total file data
- Hadoop Distributed File System (HDFS): practically the same as GFS
- Ceph Storage Platform
- Unified Storage platform (Block-, Object- & File-Storage)
- Self-healing
- No single point of failure
- Object Storage Daemon (OSD): data management
- Monitors: Metadata management & consensus
Erasure Coding
- Reed Solomon code
- Linear block code
- Data blocks
- Error-correction blocks
- Recovery of flipped bits or loss
- Pre-shared parity encoding matrix
- Restoration of lost data using inverse encoding sub-matrix of intact data
- Luby Transform codes (LT)
- Fountain code
- Limitless number of encoding symbols
- Recovery of missing chunks
- Encoding
- Data split into equal sized chunks
- Select degree from distribution (Soliton distribution)
- Selection of uniformly random distinct input symbols
- Droplet construction: XOR of symbols
- Droplets
- Header: Information about used source symbols
- Data
- Degree = 1: Just data
- Degree > 1: XOR of data
- Decoder
- Data recovery using XOR properties
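A sketch of droplet construction and the peeling decoder above; for simplicity the degree is drawn uniformly instead of from the Soliton distribution (an assumption, not the real LT design), and chunks are modeled as integers XOR-ed together:

```python
import random

def lt_droplet(chunks, rng=random):
    """Build one droplet: pick a degree d, choose d distinct input symbols
    uniformly at random, XOR them; the header records the used symbols."""
    d = rng.randint(1, len(chunks))            # stand-in for the Soliton draw
    indices = rng.sample(range(len(chunks)), d)
    data = 0
    for i in indices:
        data ^= chunks[i]                      # XOR of the selected symbols
    return indices, data                       # header + payload

def peel_decode(droplets, n_chunks):
    """Recover chunks by repeatedly resolving droplets that have exactly one
    unknown symbol left (degree 1 after reduction), using XOR properties."""
    known = {}
    droplets = [(set(ix), data) for ix, data in droplets]
    progress = True
    while progress and len(known) < n_chunks:
        progress = False
        for ix, data in droplets:
            unknown = ix - known.keys()
            if len(unknown) == 1:              # solvable: XOR out known symbols
                i = unknown.pop()
                for j in ix - {i}:
                    data ^= known[j]
                known[i] = data
                progress = True
    return known

drops = [({0}, 0b1010), ({0, 1}, 0b1010 ^ 0b0110), ({1, 2}, 0b0110 ^ 0b1111)]
assert peel_decode(drops, 3) == {0: 0b1010, 1: 0b0110, 2: 0b1111}
```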
P2P
- Benefits
- Efficient resource usage
- Scalability: All consumers also donors
- Reliability (in aggregate)
- Ease of administration
- Use cases
- Large-Scale systems
- Scientific data analysis: SETI@home
- File Sharing
- P2P as the only management strategy that scales sufficiently
- Startups face the danger of "death by success" (demand outgrowing central capacity)
- Overlay networks: P2P network connection topology
- Network types
- Unstructured
- 1st generation
- Centralized: Napster
- Perfect knowledge
- Bottleneck
- Pure: Gnutella, Freenet
- Query flooding
- Imperfect knowledge
- 2nd generation: Hybrid
- Skype, BitTorrent
- Hierarchy of peers
- Dynamic "supernodes" with more capacity
- Participation in search protocol
- Indexing and data caching
- Reduction of message load
- Structured
- Controlled topology
- Based on distributed hashtable abstraction
- Location independent data
- Data distribution among nodes
Distributed hashtable (DHT)
- Desirable properties
- Load balancing: even key distribution
- Small network effects on node change
- Each node stores data about only a few other nodes
- Fast lookup of target nodes
- Implementations
- Chord
- hash IDs of bit size $m$
- Organization of nodes in identifier circle based on node ID
- Keys inserted at successor node (comparison by key- and node-ID)
- Even distribution
- Network change (node join/leave)
- Local knowledge of adjacent ring segment
- High probability of data recovery
- Redistribution on node join
- Search
- At most linear (if only knowledge about single successor)
- Finger table of node $n$ $FT_n$ with up to $m$ entries
- $FT_n(i) = \mathrm{successor}((n+2^{i-1}) ~\mathrm{mod}~ 2^m)$
- Address lookup in $\mathcal{O}(\log n)$
- Each jump halves the distance to the destination
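A sketch of the finger-table formula above, assuming a static, globally known node list (real Chord discovers successors incrementally); names are illustrative:

```python
def finger_table(n, m, nodes):
    """FT_n(i) = successor((n + 2^(i-1)) mod 2^m) for i = 1..m.

    nodes: sorted ids of the nodes present in the identifier circle."""
    def successor(k):
        # First node id >= k, wrapping around the ring if necessary.
        return next((node for node in nodes if node >= k), nodes[0])
    return [successor((n + (1 << (i - 1))) % (1 << m))
            for i in range(1, m + 1)]

# m = 4 (ring of size 16), nodes {1, 4, 9, 11}: node 1's fingers point at
# the successors of 2, 3, 5, 9.
assert finger_table(1, 4, [1, 4, 9, 11]) == [4, 4, 9, 9]
```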
Content addressable network (CAN)
- Virtual multidimensional Cartesian coordinate space
- Mapping of nodes onto a $d$-dimensional torus
- Dynamic partitioning of coordinate space among all nodes
- Assignment of single zone (subspace) per node
- Hashing of all keys to point in space
- Local storage of IP-addresses of touching neighbor zone's nodes
- Average path length
- $d$-dimensional space, $n$ equally-sized zones: average path length $= (d/4) \cdot n^{1/d}$
- Local storage of $2d$ neighbors
- Path length grows as $\mathcal{O}(n^{1/d})$ with the number of nodes
- $d = 2$: $(1/2) \cdot n^{1/2}$
- $d = 3$: $(3/4) \cdot n^{1/3}$
- Node joining: Split of existing zone by picking random point
P2P Examples
- Spotify (ex)
- P2P with server fallback in case of poor download performance
- Caching
- Security by obscurity
- Sequential order
- Supports search
- BitTorrent
- Swarm overlay
- Exchange of downloaded blocks with peers
- Out-of-band search
- Centralized trackers
- Tit-for-tat
- Self-balancing node request protocol
- Random first policy (first piece)
- Rarest first policy (middle pieces)
- Coupon-collector mode: Global broadcast for missing piece
PubSub
- Matching based communication
- No host-to-host communication
- Filtering based on matching model between published data and subscribed interests
- Optimized for scalability and performance
- Large number of publishers and subscribers
- High rate of publications
- Fast (stateless) & simple (topic) matching
- Decoupling properties
- Time: no need for simultaneous availability
- Space: free physical distribution of clients
- Synchronization: no blocking of control flow
- Matching models
- Topic-based: topic as metadata
- Content-based: properties as metadata
- Type-based
Content based matching algorithms
- Counting algorithm: Counting of matched predicates in phase 2
- Clustering algorithm
- General principle
- Phase 1: Matching of all predicates
- Decomposition and sorting of predicates by operator into Predicate Index (linked list per operator of value and predicate ID)
- Creation of hash table linking attribute name to Predicate Index
- Phase 2: Matching of subscriptions to results from Phase 1
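A toy version of the two phases above; the operators, attribute names, and data layout are illustrative, not the lecture's exact structures:

```python
from collections import defaultdict

# Subscriptions as conjunctions of predicates (attribute, operator, value).
subs = {
    "s1": [("price", "<", 100), ("symbol", "=", "GOOG")],
    "s2": [("price", "<", 50)],
}
OPS = {"<": lambda a, b: a < b, "=": lambda a, b: a == b}

# Phase 1 structure: attribute -> operator -> list of (value, predicate owner),
# i.e., a hash table of per-operator predicate indexes.
index = defaultdict(lambda: defaultdict(list))
for sid, preds in subs.items():
    for i, (attr, op, val) in enumerate(preds):
        index[attr][op].append((val, sid))

def match(event):
    """Counting algorithm: Phase 1 matches individual predicates via the
    index; Phase 2 counts hits per subscription, which fires when all of
    its predicates matched."""
    counts = defaultdict(int)
    for attr, ev_val in event.items():
        for op, entries in index[attr].items():
            for val, sid in entries:
                if OPS[op](ev_val, val):
                    counts[sid] += 1
    return [sid for sid, c in counts.items() if c == len(subs[sid])]

print(match({"price": 42, "symbol": "GOOG"}))  # ['s1', 's2']
```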
Routing
- Rendezvous-based
- Partitioning of publication space
- Management of subscriptions and publications meeting at same node
- Example: Scribe
- Infrastructure-less, P2P-DHT based
- Topics hashed in DHT
- Overlay-based
- Client connection to any broker
- simple routing over overlay
- Matching and dissemination performed inside overlay
- Rendezvous-based overlay: Hermes
- content-based matching
- Single broker as rendezvous point: matching and dissemination
- Local broker storage on how to reach rendezvous and subscribers
- Optimizations
- Multiple brokers: increases state complexity of forwarding brokers
- Bloom filters using Link IDs
- Reduce state complexity
- Introduction of false positive disseminations
- Increased processing time at RP
- Filtering-based: PADRES
- Filtering at each broker
- Flooding of subs to initialize routing paths
- Support of any matching model
- Optimizations
- Subscription covering
- Advertising-based routing
- Advertisement-based
- Flooding of advertisements by publishers before publish
- Storage of advertisements at each broker in Subscription Routing Table (SRT)
- No flooding of subscriptions
- Useful when publishers $\ll$ subscribers
Bloom filters
- Space-efficient probabilistic set-membership structure
- Representation as fixed size bit vector
- Membership tests only; the stored elements cannot be enumerated
- Operation
- Insertion
- $k$ hashes of the inserted object
- Setting of bit at position of each resulting hash
- Retrieval
- Checking of set bits for all hashes
- False positives on hash collision
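A minimal Bloom filter sketch; the bit-vector size, $k$, and the salting scheme used to derive the $k$ hashes are my choices, not from the lecture:

```python
import hashlib

class BloomFilter:
    """Fixed-size bit vector with k hash functions; lookups may yield
    false positives (on hash collisions) but never false negatives."""
    def __init__(self, m_bits=1024, k=3):
        self.m, self.k, self.bits = m_bits, k, 0

    def _positions(self, item):
        for i in range(self.k):    # k "independent" hashes via salting
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest, "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos              # set bit at each hash position

    def might_contain(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

bf = BloomFilter()
bf.insert("chunk-42")
assert bf.might_contain("chunk-42")            # always true once inserted
```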
Blockchains
Nothing new.
Cloud
Virtualization
- Type 1: Hypervisor on hardware
- Type 2: Hypervisor on OS
Namespaces
- Wrapping of a global system resource in an abstraction
- Creating a seemingly exclusive context
- Hiding of changes across processes
- Examples: IPC, UTS